Hey looking for a quick answer that I might know already but Meltano #singer-tap-development

alexander_butler

02/06/2023, 6:54 PM

Hey, looking for a quick answer that I might know already but want to confirm. In the SDK, there is no catalog "caching" between runs unless you override `discover_streams` to defer to/leverage the `input_catalog`? So basically, if a catalog is provided, it doesn't do anything except "register raw streams" in the mapper.

visch

02/06/2023, 7:14 PM

Yes. Another way to say it is: `discover` generates the `catalog`, and the tap uses the `catalog` to run in a "normal" run.

alexander_butler

02/06/2023, 7:16 PM

Right, but every time you run an SDK tap, `run_discovery` is called. No matter what. Even if a catalog is passed. So for someone who has a discovery that consumes a ton of API calls or is time intensive, it will run every single time.

visch

02/06/2023, 7:17 PM

https://github.com/meltano/sdk/blob/main/singer_sdk/tap_base.py#L500-L511 that's not true

alexander_butler

02/06/2023, 7:17 PM

It is true, 1 sec

visch

02/06/2023, 7:18 PM

You're going to blow up my mind if it isn't 😄

alexander_butler

02/06/2023, 7:18 PM

I said the method name wrong like a dummy, it's `discover_streams` that is called every time

alexander_butler

02/06/2023, 7:18 PM

which is what generates the catalog

alexander_butler

02/06/2023, 7:18 PM

during discovery

alexander_butler

02/06/2023, 7:19 PM

but when passed a catalog POST-discovery, it still runs

visch

02/06/2023, 7:20 PM

I think I understand what you're saying now. I thought it was something more core to "singer" taps.

alexander_butler

02/06/2023, 7:21 PM

I have a 10-minute-plus discovery that eats a lot of API calls, and it runs every time because the underlying catalog-generating function is `discover_streams`. But it isn't apparent to the end user that they should leverage the `input_catalog` attribute inside `discover_streams` to avoid dynamically hitting the API. This is only relevant for people who use `discover_streams` dynamically, like for the Salesforce SDK tap I am working on.
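Editor's sketch of the idea described in this message: have `discover_streams` reuse the passed catalog and only fall back to the API when none was given. This is a toy model in plain Python, not the real `singer_sdk.Tap`; `ToyTap`, `ToyStream`, `_describe_objects_via_api` and the `api_calls` counter are hypothetical names used only for illustration, and `input_catalog` is modeled here as a plain dict.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class ToyStream:
    """Stand-in for an SDK stream: just a name and a JSON schema."""
    name: str
    schema: Dict[str, Any]


@dataclass
class ToyTap:
    """Toy model of a tap with dynamic discovery (not the real singer_sdk.Tap)."""
    # Parsed contents of a catalog file, or None when no catalog was passed.
    input_catalog: Optional[Dict[str, Any]] = None
    api_calls: int = field(default=0, init=False)

    def _describe_objects_via_api(self) -> List[Dict[str, Any]]:
        """Hypothetical expensive step, e.g. describing every remote object."""
        self.api_calls += 1
        id_only = {"properties": {"Id": {"type": "string"}}}
        return [
            {"tap_stream_id": "Account", "schema": id_only},
            {"tap_stream_id": "Contact", "schema": id_only},
        ]

    def discover_streams(self) -> List[ToyStream]:
        # Reuse the passed catalog when there is one; hit the API only otherwise.
        if self.input_catalog is not None:
            entries = self.input_catalog["streams"]
        else:
            entries = self._describe_objects_via_api()
        return [ToyStream(e["tap_stream_id"], e["schema"]) for e in entries]
```

In this model a discovery run (no catalog) pays for the API call once, and a later run constructed with that catalog builds the same streams without touching the API.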

alexander_butler

02/06/2023, 7:21 PM

so SDK discovery isn't useful in that sense if you really think about the code path

alexander_butler

02/06/2023, 7:22 PM

from the perspective of a passed catalog 🙂

visch

02/06/2023, 7:30 PM

Trying to give good context: I do this for a lot of jobs, and when providing a `catalog`, discover doesn't run. I'm trying to find some examples, though, and I haven't dug into this exact path / issue.

visch

02/06/2023, 7:36 PM

Is your end user a tap developer here? Or someone running the tap via meltano? The best example I have where I'm pretty sure I'm doing this right is `tap-postgres`: https://github.com/MeltanoLabs/tap-postgres/blob/main/tap_postgres/tap.py#L38-L49 and https://github.com/meltano/sdk/blob/main/singer_sdk/tap_base.py#L551-L570 is the caching mechanism that I think you're asking about too.

alexander_butler

02/06/2023, 7:45 PM

Here is some pseudocode that maybe helps show what is run every time. I think my problem is that somewhere along the line (I could've sworn I read it somewhere) it was recommended to override `discover_streams` if your tap was dynamic. I think therein lies my mistake.

alexander_butler

02/06/2023, 7:45 PM

I think I should override what you linked, right? EDIT: instead of what I am overriding

visch

02/06/2023, 7:47 PM

You're right that you have to override that function, but I think how it gets queried is the key. In the https://github.com/MeltanoLabs/tap-postgres/blob/main/tap_postgres/tap.py#L38-L49 I posted, it's querying over `catalog_dict`, which gets generated if a catalog wasn't passed to the tap. The interface is a bit tricky (not intuitive, unfortunately), so I totally understand where you're coming from. (I haven't had an HTTP tap yet where I needed to dynamically generate schemas, so I haven't had to do exactly what you're after right now, but the postgres example is close.)

visch

02/06/2023, 7:48 PM

So I think if you move most of that logic to overriding `catalog_dict` you may be in a good spot.

visch

02/06/2023, 7:49 PM

The key is: if you have a catalog already, we don't need to do the expensive operation
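Editor's sketch of the pattern suggested here: put the expensive work behind a `catalog_dict` property that prefers a passed catalog and caches whatever it computes, and let `discover_streams` only read from that property. This is a toy stand-in written under those assumptions, not the SDK's or tap-postgres's actual code; `CachedCatalogTap`, `_expensive_discovery` and `expensive_calls` are hypothetical names, and streams are reduced to their names to keep it short.

```python
from typing import Any, Dict, List, Optional


class CachedCatalogTap:
    """Toy stand-in for a tap; not the real singer_sdk.Tap."""

    def __init__(self, input_catalog: Optional[Dict[str, Any]] = None) -> None:
        # Catalog passed by the user (as a dict), or None for a discovery run.
        self.input_catalog = input_catalog
        self._catalog_dict: Optional[Dict[str, Any]] = None
        self.expensive_calls = 0

    def _expensive_discovery(self) -> List[Dict[str, Any]]:
        """Hypothetical slow step (many API calls in a real tap)."""
        self.expensive_calls += 1
        return [{"tap_stream_id": "Account", "schema": {"type": "object"}}]

    @property
    def catalog_dict(self) -> Dict[str, Any]:
        # 1. Already resolved during this run: reuse it.
        if self._catalog_dict is not None:
            return self._catalog_dict
        # 2. A catalog was passed in: trust it and skip discovery.
        if self.input_catalog is not None:
            self._catalog_dict = self.input_catalog
        # 3. Otherwise pay for discovery once and remember the result.
        else:
            self._catalog_dict = {"streams": self._expensive_discovery()}
        return self._catalog_dict

    def discover_streams(self) -> List[str]:
        # Streams are built from catalog_dict, never from the API directly.
        return [entry["tap_stream_id"] for entry in self.catalog_dict["streams"]]
```

With this split, calling `discover_streams` repeatedly costs at most one expensive discovery per run, and none at all when a catalog is supplied.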

alexander_butler

02/06/2023, 7:50 PM

This is super helpful. I wasn't blocked per se, as I was kind of sleuthing my way around why the expensive op was happening when passing a catalog after discover. But seeing your example really shined a light on the correct place to put the logic. It should probably be called out in the SDK cuz there are plenty of people who will override the same function as me and (perhaps optimistically) expect it to "just work" lol
