# singer-tap-development
m
Splitting off @pat_nadolny’s message https://meltano.slack.com/archives/CMN8HELB0/p1682534466101709?thread_ts=1682484019.956399&cid=CMN8HELB0 into a new thread: What’s the way to implement this behavior? Is there a good example of implementing this kind of select/catalog functionality?
I think my big question here is: if I have a `select` set in meltano.yml:
select:
  - stream_name.field_name
  - stream_name_two.*
etc, how can I get those values in my tap? I can’t find any sort of “get_selected_fields” type method in the SDK
I want to be building the catalog dynamically, so don’t want to just get “all” possible streams and then filter them on “selected = true”
it appears that if I am building the catalog_dict manually (which I am at the moment), I could set these selection breadcrumbs (example test: https://github.com/meltano/sdk/blob/02be0d610650fe28ae3c05925bc82d171ffa9b2e/tests/_singerlib/test_catalog.py#L153-L166) but that assumes I have those entities+fields available and I don’t know how to get them 😕
@pat_nadolny would you be able to provide any direction on this?
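For reference, here is a minimal sketch of what those selection breadcrumbs look like in a hand-built catalog entry. It uses plain dicts in the Singer catalog layout rather than any SDK class; `build_catalog_entry` and the `other_field` property are hypothetical names invented for illustration.

```python
from typing import Any


def build_catalog_entry(
    stream_name: str,
    schema: dict,
    selected_fields: set,
) -> dict:
    """Build one Singer catalog stream entry with selection metadata.

    Hypothetical helper: not part of the Meltano SDK.
    """
    metadata: list = [
        # An empty breadcrumb addresses the stream itself.
        {"breadcrumb": [], "metadata": {"selected": bool(selected_fields)}},
    ]
    for field_name in schema.get("properties", {}):
        metadata.append(
            {
                # A ["properties", <name>] breadcrumb addresses one field.
                "breadcrumb": ["properties", field_name],
                "metadata": {"selected": field_name in selected_fields},
            }
        )
    return {
        "tap_stream_id": stream_name,
        "stream": stream_name,
        "schema": schema,
        "metadata": metadata,
    }


entry: Any = build_catalog_entry(
    "stream_name",
    {
        "type": "object",
        "properties": {
            "field_name": {"type": "string"},
            "other_field": {"type": "integer"},  # illustrative extra field
        },
    },
    selected_fields={"field_name"},
)
```

This only shows the shape of the metadata; it still assumes the stream and field names are already known, which is the open question here.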
p
@Matt Menzenski I'm in the process of learning the best way to do this as well 😅. My understanding of how this works generally:
1. When you run the tap with meltano, behind the scenes it runs a discover command equivalent to
tap-mongodb --discover --config config.json > .meltano/run/tap-mongodb/properties.json
to generate the full catalog. When you run
meltano select tap-mongodb --list
it shows you all streams available in your catalog file.
2. Then it updates the catalog streams to be
"selected": true
or false based on your select criteria in your meltano.yml.
3. When you run the tap
meltano run tap-mongodb target-jsonl
it passes the updated catalog into the tap equivalent to
tap-mongodb --config config.json --catalog .meltano/run/tap-mongodb/properties.json
4. The tap receives the updated catalog as input and only syncs the appropriate streams.
In step 1 the tap receives no input catalog, so it should run the dynamic schema generation code to build the schema for all available streams (in the mongodb case, filtered using only the databases in the config). Then in steps 3/4 the tap receives a catalog as input, so it should not dynamically generate the schema; it should use what was provided and sync the appropriate streams that are selected. Does that make sense?
With all of that said...I'm still in the process of learning the best way to do that as it relates to the SDK
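Step 2 above can be sketched roughly as follows. This is not Meltano's actual implementation: `apply_select` is a hypothetical function, the glob matching via `fnmatchcase` is an assumption about how patterns behave, and `stream_three` and `other_field` are made-up names. It only illustrates the idea that selection is applied to a catalog the tap has already produced.

```python
import copy
from fnmatch import fnmatchcase


def apply_select(catalog: dict, patterns: list) -> dict:
    """Return a copy of `catalog` with `selected` flags set from `patterns`.

    Patterns look like "stream.field"; a leading "!" excludes a match.
    """
    includes = [p for p in patterns if not p.startswith("!")]
    excludes = [p[1:] for p in patterns if p.startswith("!")]
    result = copy.deepcopy(catalog)  # leave the discovered catalog untouched

    for stream in result["streams"]:
        stream_selected = False
        stream_entry = None
        for entry in stream["metadata"]:
            if not entry["breadcrumb"]:
                stream_entry = entry  # stream-level entry, set afterwards
                continue
            path = f'{stream["tap_stream_id"]}.{entry["breadcrumb"][-1]}'
            selected = any(fnmatchcase(path, p) for p in includes) and not any(
                fnmatchcase(path, p) for p in excludes
            )
            entry["metadata"]["selected"] = selected
            stream_selected = stream_selected or selected
        if stream_entry is not None:
            # In this sketch a stream is selected if any of its fields is.
            stream_entry["metadata"]["selected"] = stream_selected
    return result


def _stream(name: str, fields: list) -> dict:
    """Build a bare discovered stream entry (illustrative helper)."""
    metadata = [{"breadcrumb": [], "metadata": {}}]
    metadata += [{"breadcrumb": ["properties", f], "metadata": {}} for f in fields]
    return {"tap_stream_id": name, "metadata": metadata}


discovered = {
    "streams": [
        _stream("stream_name", ["field_name", "other_field"]),
        _stream("stream_name_two", ["id"]),
        _stream("stream_three", ["id"]),
    ]
}
updated = apply_select(discovered, ["stream_name.field_name", "stream_name_two.*"])
```

The point of the sketch is that the patterns never reach the tap directly; the tap only ever sees the resulting `selected` flags in the catalog it is handed.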
m
Does that make sense?
Yes, although I think there’s a piece I’m still missing. It seems like a tap should be able to access the “selected entities and fields” catalog information without first having to do something like scan all tables in the database first and then filter them.
p
Do you mean in step 1 when no catalog is provided?
m
maybe? I need to do more testing - I think I am going to spin up a new tap from the SDK and test this behavior without any customization
I may be just totally misunderstanding how the default behavior works
I stood up a new tap from the SDK following these options:
$ cookiecutter https://github.com/meltano/sdk --directory="cookiecutter/tap-template"
You've downloaded /Users/matt/.cookiecutters/sdk before. Is it okay to delete and re-download it? [yes]: y
source_name [MySourceName]: sdk-testing
admin_name [FirstName LastName]:
tap_id [tap-sdk-testing]:
library_name [tap_sdk_testing]:
variant [None (Skip)]:
Select stream_type:
1 - REST
2 - GraphQL
3 - SQL
4 - Other
Choose from 1, 2, 3, 4 [1]: 4
Select auth_method:
1 - API Key
2 - Bearer Token
3 - Basic Auth
4 - OAuth2
5 - JWT
6 - Custom or N/A
Choose from 1, 2, 3, 4, 5, 6 [1]: 6
Select include_cicd_sample_template:
1 - GitHub
2 - None (Skip)
Choose from 1, 2 [1]: 1
I got it runnable (the hyphen in `sdk-testing` needed to be replaced with an underscore). Then I added a `select` block. Here, `users.age` is a stream + property that is defined in the tap, while `documents` is a stream not defined in the tap at all.
select:
      - '!users.age'
      - 'documents.*'
Then I ran `meltano select`:
$ meltano select tap-sdk-testing --list --all
2023-04-27T18:39:11.375589Z [info     ] The default environment 'test' will be ignored for `meltano select`. To configure a specific environment, please use the option `--environment=<environment name>`.
Legend:
	SelectionType.SELECTED
	SelectionType.EXCLUDED
	SelectionType.AUTOMATIC

Enabled patterns:
	!users.age
	documents.*

Selected attributes:
	[SelectionType.EXCLUDED] groups.id
	[SelectionType.EXCLUDED] groups.modified
	[SelectionType.EXCLUDED] groups.name
	[SelectionType.EXCLUDED] users.age
	[SelectionType.EXCLUDED] users.city
	[SelectionType.EXCLUDED] users.email
	[SelectionType.EXCLUDED] users.id
	[SelectionType.EXCLUDED] users.name
	[SelectionType.EXCLUDED] users.state
	[SelectionType.EXCLUDED] users.street
	[SelectionType.EXCLUDED] users.zip
I can see the `documents.*` in the response there, which is really encouraging to me.
I’m now wondering if this is something that the `meltano` CLI can pull from a tap but which the tap can’t pull itself?
p
I’m now wondering if this is something that the `meltano` CLI can pull from a tap but which the tap can’t pull itself?
Right - meltano is pushing this info to the "dumb" tap that has no understanding of the fact that it's being called by meltano
I think what's happening in https://meltano.slack.com/archives/C01PKLU5D1R/p1682621405808209?thread_ts=1682543443.459519&cid=C01PKLU5D1R is that meltano is asking the tap "give me a catalog of all streams that you can sync" and the tap returns a catalog file, which meltano stores locally. The `select` CLI calls then all read or manipulate the locally stored catalog. Then once we ask meltano to run a sync, it sends that updated catalog to the tap and says "run a sync with this catalog now that I've updated it with only my certain streams selected"
m
so maybe the path forward is to make my tap more reliant on a provided catalog? :thinkspin:
Thanks for the feedback @pat_nadolny! I think I’m getting there
p
Yeah the tap shouldn't do any dynamic schema generation stuff if a catalog is provided. It should take the catalog's schema as source of truth.
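A rough sketch of that rule, using plain dicts instead of the SDK's own classes: `resolve_catalog`, `selected_stream_ids`, and `fake_discover` are hypothetical names, and how a real tap receives its input catalog depends on the SDK version in use.

```python
from typing import Callable, Optional


def resolve_catalog(
    input_catalog: Optional[dict],
    discover: Callable[[], dict],
) -> dict:
    """Use the provided catalog as-is; only run discovery when none is given."""
    if input_catalog is not None:
        return input_catalog
    return discover()


def selected_stream_ids(catalog: dict) -> list:
    """List streams whose stream-level metadata entry has `selected: true`."""
    selected = []
    for stream in catalog["streams"]:
        for entry in stream.get("metadata", []):
            if not entry["breadcrumb"] and entry["metadata"].get("selected"):
                selected.append(stream["tap_stream_id"])
    return selected


discover_calls = []


def fake_discover() -> dict:
    """Stand-in for an expensive scan of the source (illustrative only)."""
    discover_calls.append("called")
    return {
        "streams": [
            {"tap_stream_id": "users", "metadata": [{"breadcrumb": [], "metadata": {}}]},
            {"tap_stream_id": "groups", "metadata": [{"breadcrumb": [], "metadata": {}}]},
        ]
    }


# Catalog as it might come back after selection was applied outside the tap.
provided = {
    "streams": [
        {"tap_stream_id": "users", "metadata": [{"breadcrumb": [], "metadata": {"selected": True}}]},
        {"tap_stream_id": "groups", "metadata": [{"breadcrumb": [], "metadata": {"selected": False}}]},
    ]
}

sync_catalog = resolve_catalog(provided, fake_discover)   # no discovery happens
discovery_catalog = resolve_catalog(None, fake_discover)  # discovery runs once
```

With a catalog provided, the expensive discovery step is skipped entirely and only `users` would be synced.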
m
I got this working https://github.com/menzenski/tap-mongodb/pull/2 CC @pat_nadolny @matt_elgazar