# singer-tap-development
a
Hey, looking for a quick answer that I might already know but want to confirm. In the SDK, there is no catalog "caching" between runs unless you override `discover_streams` to defer to the `input_catalog`? So basically, if a catalog is provided, it doesn't do anything except "register raw streams" in the mapper.
v
Yes. Another way to say it is: `discover` generates the `catalog`, and the tap uses the `catalog` to run in a "normal" run.
a
Right, but every time you run an SDK tap, `run_discovery` is called. No matter what. Even if a catalog is passed. So for someone whose discovery consumes a ton of API calls or is time intensive, it will run every single time.
a
It is true, 1 sec
v
You're going to blow up my mind if it isn't 😄
a
I said the method name wrong like a dummy, it's `discover_streams` that is called every time
which is what generates the catalog
during discovery
but when passed a catalog POST-discovery, it still runs
v
I think I understand what you're saying now. I thought it was something more core to "singer" taps.
a
I have a 10-minute-plus discovery that eats a lot of API calls, and it runs every time because the underlying catalog-generating function is `discover_streams`, but it isn't apparent to the end user that they should leverage the `input_catalog` attribute inside `discover_streams` to avoid dynamically hitting the API. This is only relevant for people who use `discover_streams` dynamically, like in the Salesforce SDK tap I am working on
so SDK discovery isn't useful in that sense if you really think about the code path
from the perspective of a passed catalog 🙂
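A minimal sketch of the idea above, to make the code path concrete. `SketchTap` is a stand-in written for illustration and is not the real `singer_sdk.Tap`; the `_expensive_api_discovery` helper, the `api_calls` counter, and the plain-dict catalog shape are all assumptions. Only the names `discover_streams` and `input_catalog` come from the discussion.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SketchTap:
    """Stand-in for an SDK tap; NOT the real singer_sdk.Tap class."""

    # Assumption: the catalog passed by the user, held as a plain dict (or None).
    input_catalog: Optional[dict] = None
    # Counts how often the slow path runs, purely for demonstration.
    api_calls: int = 0

    def _expensive_api_discovery(self) -> list:
        """Hypothetical helper standing in for slow, API-heavy discovery."""
        self.api_calls += 1
        return [{"tap_stream_id": "Account", "schema": {"type": "object"}}]

    def discover_streams(self) -> list:
        # If a catalog was passed in, reuse its stream entries and skip the API.
        if self.input_catalog is not None:
            return list(self.input_catalog["streams"])
        # No catalog yet: this is a genuine discovery run, so pay the cost once.
        return self._expensive_api_discovery()


# First run: no catalog, so the expensive path executes.
first = SketchTap()
discovered = first.discover_streams()

# Later run: the saved catalog is passed back in, so no API calls are made.
later = SketchTap(input_catalog={"streams": discovered})
reused = later.discover_streams()
```

In a real tap the entries would be turned into stream objects rather than returned as dicts; the point here is only the early return when a catalog is present.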
v
Trying to give good context: I do this for a lot of jobs, and when providing a `catalog`, discover doesn't run, but I'm trying to find some examples and I haven't dived into this exact path / issue.
Is your end user a tap developer here? Or someone running the tap via Meltano? The best example I have where I'm pretty sure I'm doing this right is `tap-postgres`: https://github.com/MeltanoLabs/tap-postgres/blob/main/tap_postgres/tap.py#L38-L49
https://github.com/meltano/sdk/blob/main/singer_sdk/tap_base.py#L551-L570 is the caching mechanism that I think you're asking about too.
a
Here is some pseudocode that maybe helps show what is run every time. I think my problem is that somewhere along the line (I could've sworn I read it somewhere) it was recommended to override `discover_streams` if your tap was dynamic. I think therein lies my mistake
I think I should override what you linked, right? EDIT: instead of what I am overriding
v
You're right that you have to override that function, but I think how it gets queried is the key. In the https://github.com/MeltanoLabs/tap-postgres/blob/main/tap_postgres/tap.py#L38-L49 example I posted, it's querying `catalog_dict`, which gets generated if a catalog wasn't passed to the tap. The interface is a bit tricky (not intuitive, unfortunately), so I totally understand where you're coming from. (I haven't had an HTTP tap yet where I needed to dynamically generate schemas, so I haven't had to do exactly what you're after right now, but the postgres example is close.)
So I think if you move most of that logic into an override of `catalog_dict`, you may be in a good spot.
The key is that if you have a catalog already, we don't need to do the expensive operation.
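Roughly, the shape of that `catalog_dict` approach could look like the sketch below. It is loosely modeled on the pattern described for `tap-postgres`, not copied from it, and `CatalogSketchTap` is a stand-in class rather than the SDK's `Tap`. The `_discover_catalog_entries` helper, the `expensive_calls` counter, and treating `input_catalog` as a plain dict are assumptions for illustration.

```python
from typing import Optional


class CatalogSketchTap:
    """Stand-in class; it mimics the pattern only, NOT singer_sdk.Tap."""

    def __init__(self, input_catalog: Optional[dict] = None) -> None:
        # Assumption: a user-supplied catalog arrives as a plain dict or None.
        self.input_catalog = input_catalog
        self._catalog_dict: Optional[dict] = None
        self.expensive_calls = 0  # demonstration-only counter

    def _discover_catalog_entries(self) -> list:
        """Hypothetical stand-in for the slow schema lookup."""
        self.expensive_calls += 1
        return [{"tap_stream_id": "public-users", "schema": {"type": "object"}}]

    @property
    def catalog_dict(self) -> dict:
        # 1. Already computed during this run: return the cached value.
        if self._catalog_dict is not None:
            return self._catalog_dict
        # 2. A catalog was passed in: use it and skip the expensive operation.
        if self.input_catalog is not None:
            self._catalog_dict = self.input_catalog
            return self._catalog_dict
        # 3. Otherwise discover once and cache the result for later accesses.
        self._catalog_dict = {"streams": self._discover_catalog_entries()}
        return self._catalog_dict


# Discovery run: repeated property access triggers only one expensive call.
discovering = CatalogSketchTap()
catalog = discovering.catalog_dict
catalog_again = discovering.catalog_dict

# Normal run with the saved catalog: the expensive call never happens.
normal_run = CatalogSketchTap(input_catalog=catalog)
passed_through = normal_run.catalog_dict
```

The cache in step 1 only lives for one process, so skipping the cost between runs still depends on the catalog being passed back in, which is step 2.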
a
This is super helpful. I wasn't blocked per se, as I was kind of sleuthing my way around why the expensive op was happening when passing a catalog after discover. But seeing your example really shone a light on the correct place to put the logic. It should probably be called out in the SDK cuz there are plenty of people who will override the same function as me and (perhaps optimistically) expect it to "just work" lol