# singer-tap-development
e
Question: I have a tap.
1. Pipeline A queries an API endpoint and pushes out a relational data result. It runs every weekend and right now can take anywhere from 45 minutes to 12 hours.
2. Pipeline B would run daily and use the last known fresh results from pipeline A, which in my case are being pushed to PostgreSQL.

How do I write the tap logic for pipeline B? I guess I can hardcode it at first: go back into the Postgres DB populated by pipeline A (the metadata for each is in the same tap's YAML file), use the SQLAlchemy PyPI library to pull the last known run of pipeline A out of DB X, table Y, and then use that result to query and fill up the results for pipeline B. Am I tapping things correctly, or?
I'm also starting to look up how to select which stream to pull in pipeline A vs pipeline B: https://github.com/singer-io/getting-started/blob/master/docs/DISCOVERY_MODE.md#metadata. Still reading up, but if anyone can point to an example tap I can get inspired by, that'd really help. Thanks.
It'd really, really help to somehow link these bits of documentation with code examples.
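Roughly what I'm picturing for the pipeline B seeding step, as a minimal sketch (the connection URL, the `pipeline_a_results` table, and the `_sdc_extracted_at` column are assumptions here; the real names would come from whatever pipeline A's target actually writes):

```python
# Sketch only: read the last batch pipeline A loaded into Postgres, so pipeline B
# can use those rows as the keys for its own API queries. Table/column names are
# assumptions, not the real schema.
from sqlalchemy import create_engine, text


def fetch_last_known_run(db_url: str) -> list:
    """Return the ids from the most recent load of the (hypothetical) pipeline_a_results table."""
    engine = create_engine(db_url)  # e.g. "postgresql+psycopg2://user:pass@host/db"
    with engine.connect() as conn:
        rows = conn.execute(text(
            "SELECT id FROM pipeline_a_results "
            "WHERE _sdc_extracted_at = ("
            "  SELECT MAX(_sdc_extracted_at) FROM pipeline_a_results)"
        ))
        return [row.id for row in rows]
```

Pipeline B's stream would call something like this once at the start of a sync and then loop over the ids.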
v
Number 2 is normally referred to as the T in ELT, so a lot of people would say use DBT (including me 😄).
When we say use DBT, all that really means is: run SQL code (dbt just organizes it well and handles all the things that software developers hate about SQL).
Good example, and this is run via Meltano (it populates MeltanoHub 😄).
e
Ah hah... okay, now I finally have a reason to use DBT (I've been reading about it a lot but never had a clear use case). I refactored the Python code I was previously using to pull the data out and was about to plug that into the tap, but I can definitely take a look at this instead.
thanks @visch!
So the ELT mindset would say: run pipeline A and push the data into table X, then get the exact set of "latest" good data into a second table, say table Y. Pipeline B then just runs against table Y, with no logic other than pulling the entries and using them as primary keys to query the second stream.
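Concretely, the table X to table Y step is just SQL, which is why dbt fits; here's a minimal sketch with SQLAlchemy and made-up table/column names (a dbt model would essentially be the inner SELECT saved as a .sql file):

```python
# Sketch only: rebuild table_y as the "latest good run" slice of table_x.
# table_x, table_y, and batch_id are made-up names for illustration.
from sqlalchemy import create_engine, text


def refresh_latest_table(db_url: str) -> None:
    engine = create_engine(db_url)
    with engine.begin() as conn:  # one transaction for the drop + rebuild
        conn.execute(text("DROP TABLE IF EXISTS table_y"))
        conn.execute(text(
            "CREATE TABLE table_y AS "
            "SELECT * FROM table_x "
            "WHERE batch_id = (SELECT MAX(batch_id) FROM table_x)"
        ))
```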
v
I'm not following so much on pipeline B. Generally it's something like: load your source data into something (a raw schema in a data warehouse is a common pattern), and now you're onto the transformation step, which is everything after your raw data is loaded. There are a lot of conventions; the getting-started tutorial at dbt really helped me grok modern data warehouses, what dbt is, and some general patterns. Basically anything you want to do happens in the transform step. Normally in analytics work that's building staging models and analytical models (dimensions, wide tables, anything you want). I think about it more as: transformation gets the data into the format I need to actually do things with.
e
I think the issue here is that most people would use a child stream, which as I understand it would sync "everything": I'd have one pipeline run query A and, as it spits out its values, use some or all of them for query B, since B is a child of A. But in my case I cannot run query A every time; as I said, it takes 45 minutes to 12 hours right now, depending on how deep the search is set to go. So I was thinking I'd simply use the results of A prior to the call, by going back into the target and fetching the "last good result" of that long, long query. It feels like this is perhaps an "anti-pattern" compared to what most good practices would suggest? But another option seemed to be to use DBT after I've run A: it pushes most of its results to a single simple table, and DBT could then collect the last known run and push it to a single child table, thereby simplifying what I have to do when I go to run query B (I'll sketch what I mean below).
A has the raw data and is a parent stream for basically ALL the other endpoints, but I do not want to query it every time I go to query B, C, etc...
It will be run only on a weekend-to-weekend cadence, due to the risk that it takes hours and hours
depending on how deep a search you give it
and my C++ code 😄
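To make that concrete: instead of declaring the slow stream as `parent_stream_type` and letting the SDK re-run it on every sync, the seeding idea could look roughly like this (a rough sketch against the Meltano Singer SDK; the stream name, the `pipeline_a_results` table, and the `postgres_url` config key are all made up):

```python
# Sketch only: a child-like stream that builds its partitions from pipeline A's
# last load in Postgres instead of re-running the slow parent stream.
from typing import Iterable, Optional

from singer_sdk import Stream
from sqlalchemy import create_engine, text


class NewsDetailStream(Stream):
    name = "news_detail"
    primary_keys = ["id"]
    schema = {
        "type": "object",
        "properties": {
            "id": {"type": "integer"},
            "headline": {"type": ["string", "null"]},
        },
    }

    @property
    def partitions(self) -> Optional[list]:
        # One partition per key found in pipeline A's table; the SDK calls
        # get_records() once per partition, passing it in as `context`.
        engine = create_engine(self.config["postgres_url"])  # made-up config key
        with engine.connect() as conn:
            rows = conn.execute(text("SELECT id FROM pipeline_a_results"))  # made-up table
            return [{"parent_id": row.id} for row in rows]

    def get_records(self, context: Optional[dict]) -> Iterable[dict]:
        # Placeholder: the real version would hit the fast per-item endpoint
        # for context["parent_id"] and yield its rows.
        yield {"id": context["parent_id"], "headline": None}
```

Whether that belongs in the tap or in a dbt transform is the open question above; this is just the tap-side version of the same idea.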
I'll start with maybe a bad implementation and share it with the community as I then get into DBT for downstream transformations and maybe creating sub-tables. I have this book on data warehouses I've only briefly peeked at: https://www.adlibris.com/se/bok/agile-data-warehouse-design-9780956817204
Maybe I need to pick a tap I can more easily open source, to get these concepts out into the open, like the Yahoo Finance data perhaps.
IBKR is very strict about who can see their data.. so a tap for it is not the most helpful
here's my `catalog.json`
I am now looking up how to invoke my tap, both from the CLI and from a Meltano pipeline, to tap ONLY one stream (in my case, the news).
Ah, now it's clicking: I see I pass this as a signal to my tap for which stream I want, yes? https://hub.meltano.com/singer/spec
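If I'm reading the spec right, selecting only the `news` stream just means the stream's metadata entry with the empty breadcrumb gets `"selected": true` in the catalog I pass with `--catalog`, something like this (stream name is mine, everything else trimmed down):

```json
{
  "streams": [
    {
      "tap_stream_id": "news",
      "schema": { "type": "object", "properties": {} },
      "metadata": [
        { "breadcrumb": [], "metadata": { "selected": true } }
      ]
    }
  ]
}
```

And with Meltano I'd let it generate this from a select rule (something like `meltano select tap-mytap news "*"`, tap name made up, or the `select:` extra in meltano.yml) instead of hand-editing the catalog, if I understand it right.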
Now the challenge is how do I get that logic in `client.py` to work, since as of now I see only one API entry point, `get_records`, and strangely no other example tap out there has this `client.py`.
Ah, I think I see it now, after inspecting this example: https://github.com/dataops-tk/tap-powerbi-metadata/blob/main/tap_powerbi_metadata/streams.py#L35. Most taps are doing OAuth, REST, and other pre-existing API paradigms, and in my case I am unable to do that, so I must write the client myself, which was throwing me off a bit. But okay, now I get it.
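So the shape I'm aiming for is roughly this: subclass the generic `singer_sdk.Stream` instead of `RESTStream` and implement `get_records` on top of my own client (a minimal sketch; the `news` schema and the placeholder fetch are made up, and the real version would call whatever ends up in my `client.py`):

```python
# Sketch only: a non-REST stream built on the generic singer_sdk.Stream base class.
from typing import Iterable, Optional

from singer_sdk import Stream


class NewsStream(Stream):
    name = "news"
    primary_keys = ["id"]
    replication_key = "published_at"  # lets the SDK track incremental state
    schema = {
        "type": "object",
        "properties": {
            "id": {"type": "integer"},
            "headline": {"type": ["string", "null"]},
            "published_at": {"type": ["string", "null"], "format": "date-time"},
        },
    }

    def get_records(self, context: Optional[dict]) -> Iterable[dict]:
        # Yield one dict per record matching the schema above; the SDK wraps this
        # with the SCHEMA/RECORD/STATE message handling.
        for item in self._fetch_news():
            yield {
                "id": item["id"],
                "headline": item.get("headline"),
                "published_at": item.get("published_at"),
            }

    def _fetch_news(self) -> Iterable[dict]:
        # Placeholder for the custom (non-REST) client call; returns no rows here.
        return []
```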