# singer-tap-development
e
Question: I have a tap.
1. Pipeline A queries an API endpoint and pushes out a relational data result. It runs every weekend and right now can take anywhere from 45 minutes to 12 hours.
2. Pipeline B would run daily and use the last known fresh results from pipeline A, which in my case are being pushed to PostgreSQL.

How do I write the tap logic for pipeline B? I guess I can hardcode it at first: go back into the Postgres DB populated by pipeline A (the metadata for each is in the same tap's YAML file), use the SQLAlchemy PyPI library to pull the last known run of pipeline A out of DB X, table Y, and then use that result to query and fill up the results for pipeline B. Am I tapping things correctly, or?
I'm also starting to look up how to select which stream to pull in pipeline A vs pipeline B: https://github.com/singer-io/getting-started/blob/master/docs/DISCOVERY_MODE.md#metadata. Still reading up, but if anyone can point to an example tap I can get inspired by, that'd really help. Thanks.
It'd really, really help to somehow link these bits of documentation with code examples.
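Roughly what I'm picturing for the pipeline B seeding step, as a minimal sketch (the connection URL, the `pipeline_a_results` table, and the `_sdc_extracted_at` column are assumptions here; the real names would come from whatever pipeline A's target actually writes):

```python
# Sketch only: read the last batch pipeline A loaded into Postgres, so pipeline B
# can use those rows as the keys for its own API queries. Table/column names are
# assumptions, not the real schema.
from sqlalchemy import create_engine, text


def fetch_last_known_run(db_url: str) -> list:
    """Return the ids from the most recent load of the (hypothetical) pipeline_a_results table."""
    engine = create_engine(db_url)  # e.g. "postgresql+psycopg2://user:pass@host/db"
    with engine.connect() as conn:
        rows = conn.execute(text(
            "SELECT id FROM pipeline_a_results "
            "WHERE _sdc_extracted_at = ("
            "  SELECT MAX(_sdc_extracted_at) FROM pipeline_a_results)"
        ))
        return [row.id for row in rows]
```

Pipeline B's stream would call something like this once at the start of a sync and then loop over the ids.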
v
Number 2 is normally referred to as the T in ELT, so a lot of people would say use DBT (including me 😄).
When we say use DBT, all that really means is: run SQL code (dbt just organizes it well and handles all the things that software developers hate about SQL).
Good example, and this is run via Meltano (it populates MeltanoHub 😄).
e
Ah hah... okay, now I finally have a reason to use DBT (I've been reading about it a lot but never had a clear use case). I refactored the Python code I was previously using to pull the data out and was about to plug that into the tap, but I can definitely take a look at this instead.
thanks @visch!
So the ELT mindset would say: run pipeline A and push the data into table X, then get the exact set of "latest" good data into a second table, say table Y. Pipeline B then just runs against table Y, with no logic other than pulling the entries and using them as primary keys to query the second stream.
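Concretely, the table X to table Y step is just SQL, which is why dbt fits; here's a minimal sketch with SQLAlchemy and made-up table/column names (a dbt model would essentially be the inner SELECT saved as a .sql file):

```python
# Sketch only: rebuild table_y as the "latest good run" slice of table_x.
# table_x, table_y, and batch_id are made-up names for illustration.
from sqlalchemy import create_engine, text


def refresh_latest_table(db_url: str) -> None:
    engine = create_engine(db_url)
    with engine.begin() as conn:  # one transaction for the drop + rebuild
        conn.execute(text("DROP TABLE IF EXISTS table_y"))
        conn.execute(text(
            "CREATE TABLE table_y AS "
            "SELECT * FROM table_x "
            "WHERE batch_id = (SELECT MAX(batch_id) FROM table_x)"
        ))
```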
v
I'm not following so much on pipeline B. Generally it's something like: load your source data into something (a raw schema in a data warehouse is a common pattern), and now you're onto the transformation step, which is everything after your raw data is loaded. There are a lot of conventions; the getting-started tutorial at dbt really helped me grok modern data warehouses, what dbt is, and some general patterns. Basically anything you want to do happens in the transform step. Normally in analytics work that's building staging models and analytical models (dimensions, wide tables, anything you want). I think about it more as: transformation gets the data into the format I need to actually do things with.
e
I think the issue here is that most people would use a child stream, which as I understand it would sync "everything": I'd have one pipeline run query A and, as it spits out its values, use some or all of them for query B, since B is a child of A. But in my case I cannot run query A every time; as I said, it takes 45 minutes to 12 hours right now, depending on how deep the search is set to go. So I was thinking I'd simply use the results of A prior to the call, by going back into the target and fetching the "last good result" of that long, long query. It feels like this is perhaps an "anti-pattern" compared to what most good practices would suggest? But another option seemed to be to use DBT after I've run A: it pushes most of its results to a single simple table, and DBT could then collect the last known run and push it to a single child table, thereby simplifying what I have to do when I go to run query B (I'll sketch what I mean below).
A has the raw data and is a parent stream for basically ALL the other endpoints, but I do not want to query it every time I go to query B, C, etc...
It will be run only on a weekend-to-weekend cadence, due to the risk that it takes hours and hours
depending on how deep a search you give it
and my C++ code 😄
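To make that concrete: instead of declaring the slow stream as `parent_stream_type` and letting the SDK re-run it on every sync, the seeding idea could look roughly like this (a rough sketch against the Meltano Singer SDK; the stream name, the `pipeline_a_results` table, and the `postgres_url` config key are all made up):

```python
# Sketch only: a child-like stream that builds its partitions from pipeline A's
# last load in Postgres instead of re-running the slow parent stream.
from typing import Iterable, Optional

from singer_sdk import Stream
from sqlalchemy import create_engine, text


class NewsDetailStream(Stream):
    name = "news_detail"
    primary_keys = ["id"]
    schema = {
        "type": "object",
        "properties": {
            "id": {"type": "integer"},
            "headline": {"type": ["string", "null"]},
        },
    }

    @property
    def partitions(self) -> Optional[list]:
        # One partition per key found in pipeline A's table; the SDK calls
        # get_records() once per partition, passing it in as `context`.
        engine = create_engine(self.config["postgres_url"])  # made-up config key
        with engine.connect() as conn:
            rows = conn.execute(text("SELECT id FROM pipeline_a_results"))  # made-up table
            return [{"parent_id": row.id} for row in rows]

    def get_records(self, context: Optional[dict]) -> Iterable[dict]:
        # Placeholder: the real version would hit the fast per-item endpoint
        # for context["parent_id"] and yield its rows.
        yield {"id": context["parent_id"], "headline": None}
```

Whether that belongs in the tap or in a dbt transform is the open question above; this is just the tap-side version of the same idea.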
I'll start with maybe a bad implementation and share it with the community as I then get into DBT for downstream transformations and maybe creating sub-tables. I have this book on data warehouses I've only briefly peeked at: https://www.adlibris.com/se/bok/agile-data-warehouse-design-9780956817204
Maybe I need to pick a tap I can more easily open source, to get these concepts out into the open, like the Yahoo Finance data perhaps.
IBKR is very strict about who can see their data.. so a tap for it is not the most helpful
here's my `catalog.json`
I am now looking up how to invoke my tap, both from the CLI and from a Meltano pipeline, to tap ONLY one stream (in my case, the news).
Ah, now it's clicking: I see I pass this as a signal to my tap for which stream I want, yes? https://hub.meltano.com/singer/spec
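If I'm reading the spec right, selecting only the `news` stream just means the stream's metadata entry with the empty breadcrumb gets `"selected": true` in the catalog I pass with `--catalog`, something like this (stream name is mine, everything else trimmed down):

```json
{
  "streams": [
    {
      "tap_stream_id": "news",
      "schema": { "type": "object", "properties": {} },
      "metadata": [
        { "breadcrumb": [], "metadata": { "selected": true } }
      ]
    }
  ]
}
```

And with Meltano I'd let it generate this from a select rule (something like `meltano select tap-mytap news "*"`, tap name made up, or the `select:` extra in meltano.yml) instead of hand-editing the catalog, if I understand it right.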
Now the challenge is how do I get that logic in `client.py` to work, since as of now I see only one API entry point, `get_records`, and strangely no other example tap out there has this `client.py`.
Ah, I think I see it now, after inspecting this example: https://github.com/dataops-tk/tap-powerbi-metadata/blob/main/tap_powerbi_metadata/streams.py#L35. Most taps are doing OAuth, REST, and other pre-existing API paradigms, and in my case I am unable to do that, so I must write the client myself, which was throwing me off a bit. But okay, now I get it.
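So the shape I'm aiming for is roughly this: subclass the generic `singer_sdk.Stream` instead of `RESTStream` and implement `get_records` on top of my own client (a minimal sketch; the `news` schema and the placeholder fetch are made up, and the real version would call whatever ends up in my `client.py`):

```python
# Sketch only: a non-REST stream built on the generic singer_sdk.Stream base class.
from typing import Iterable, Optional

from singer_sdk import Stream


class NewsStream(Stream):
    name = "news"
    primary_keys = ["id"]
    replication_key = "published_at"  # lets the SDK track incremental state
    schema = {
        "type": "object",
        "properties": {
            "id": {"type": "integer"},
            "headline": {"type": ["string", "null"]},
            "published_at": {"type": ["string", "null"], "format": "date-time"},
        },
    }

    def get_records(self, context: Optional[dict]) -> Iterable[dict]:
        # Yield one dict per record matching the schema above; the SDK wraps this
        # with the SCHEMA/RECORD/STATE message handling.
        for item in self._fetch_news():
            yield {
                "id": item["id"],
                "headline": item.get("headline"),
                "published_at": item.get("published_at"),
            }

    def _fetch_news(self) -> Iterable[dict]:
        # Placeholder for the custom (non-REST) client call; returns no rows here.
        return []
```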