# singer-tap-development
j
I'm dipping my toes into tap development, currently looking at Branch.io. The way it works there is that you enable daily data dumps, and then they provide an API to get your data. You don't get the data itself; instead they provide links to an S3 bucket where you can fetch some gzipped CSVs they generate daily. I'm wondering if I can somehow "chain" the code I'd be writing that interacts with the Branch.io API to
tap-csv
so I don't have to write all the code for dealing with CSVs etc.
Or even
tap-s3-csv
!
This is the API btw: https://help.branch.io/developers-hub/reference/daily-exports-api So I'd use a key/secret pair to make one API call. That returns a JSON object where each element represents a stream. Its value is an HTTP URL linking to a gzipped CSV file.
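For illustration, a minimal sketch of that single export call, assuming `requests` and made-up endpoint/field names (the real contract is in the linked docs):

```python
import requests

# Hypothetical endpoint and auth field names; the real ones are in the
# Daily Exports API docs linked above.
EXPORT_URL = "https://api2.branch.io/v3/export"

resp = requests.post(
    EXPORT_URL,
    json={
        "branch_key": "key_live_...",        # assumed field name
        "branch_secret": "secret_live_...",  # assumed field name
        "export_date": "2024-08-20",
    },
    timeout=30,
)
resp.raise_for_status()

# One JSON object: each key is a stream, each value an HTTP URL
# pointing at a gzipped CSV.
for stream_name, csv_url in resp.json().items():
    print(stream_name, csv_url)
```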
a
You could chain meltano commands to do api -> s3 then s3 -> postgres
Ah, but your tap returns links to S3. I don't think there's a way to 'capture' one tap's output and pass it to another tap
j
Nah, OK. And no sanctioned way of using another tap's internals as a library, like (fantasizing here...)
from tap_csv import csv_parser
I just want to avoid having to reimplement a bog-standard (but at the same time probably pretty hard to get exactly right) thing
a
Could your tap also download and persist the data to s3 directly? And then just emit the s3 url as part of its output? You could run that in combination with an arbitrary target?
meltano run tap-branchio target-postgres tap-s3 target-snowflake
or similar?
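A rough sketch of that re-hosting idea, assuming `boto3` credentials in the environment and a hypothetical bucket name:

```python
import boto3
import requests

def rehost(source_url: str, key: str, bucket: str = "my-landing-bucket") -> str:
    """Stream the vendor's gzipped CSV into a bucket we own, return its S3 URL."""
    s3 = boto3.client("s3")
    with requests.get(source_url, stream=True, timeout=60) as resp:
        resp.raise_for_status()
        # upload_fileobj streams the body without buffering it all in memory
        s3.upload_fileobj(resp.raw, bucket, key)
    return f"s3://{bucket}/{key}"
```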
j
I don't think I'd be able to persist anything in a bucket I don't own. Maybe I don't understand your question...
I have an API for getting URLs for day-fresh CSV data. Maybe it doesn't actually matter that they are in S3 for this case, because I access them over HTTP.
So, yeah, if I could somehow get that logic working, and then instrument
tap-csv
by pointing it to these URLs I'd be good. Doesn't seem like that's the case.
but you know what...
tap-csv
is 250 lines of code. I could include that as a submodule in my
tap-branchio
repo and go to town with the `import`s
Wonder what @Edgar RamĂ­rez (Arch.dev) would think?
a
I was thinking you download the data from the s3 url you are given, and then persist to a bucket you do own (rather than parsing / reading the csv fully).
j
Ah.
Myeah, I guess I could. That just doesn't scratch my itch... I kinda want my tap to do the whole job
a
Not sure if copying the whole tap-csv is the way to go, but you could examine the parsing code and lift it into your tap. I would have thought it's just wrapping Python's
csv
j
It is
a
You could emit everything as text, or, if you have well-defined streams/schemas, do some type casting in your individual streams.
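A sketch of what typed streams could look like with the singer-sdk; the stream name, columns, and helper below are made up for illustration:

```python
from singer_sdk import Stream
from singer_sdk import typing as th


class InstallsStream(Stream):  # hypothetical stream
    name = "eo_install"
    # A declared schema lets the target type the columns instead of
    # everything arriving as text.
    schema = th.PropertiesList(
        th.Property("id", th.StringType),
        th.Property("timestamp", th.DateTimeType),
        th.Property("revenue", th.NumberType),
    ).to_dict()

    def get_records(self, context):
        for row in self._read_csv_rows():  # hypothetical CSV helper
            # Cast raw CSV strings to the declared types.
            row["revenue"] = float(row["revenue"] or 0)
            yield row
```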
j
yeah. Maybe that is a better approach overall. Besides,
tap-csv
needs to handle different styles of CSV, whereas I (hopefully) only have to deal with one. And you're right about the schemas; hadn't thought of that.
That'd also give me a little more freedom in terms of the interface. Would be nifty to use BytesIO or somesuch for streaming so I won't have to persist blocks of data to disk and can do gunzipping on the fly
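That on-the-fly approach is doable with just the stdlib plus `requests`; a minimal sketch:

```python
import csv
import gzip
import io

import requests

def iter_rows(url: str):
    """Stream a gzipped CSV over HTTP: nothing is persisted to disk."""
    with requests.get(url, stream=True, timeout=60) as resp:
        resp.raise_for_status()
        gz = gzip.GzipFile(fileobj=resp.raw)           # gunzip as bytes arrive
        text = io.TextIOWrapper(gz, encoding="utf-8")  # bytes -> str stream
        yield from csv.DictReader(text)                # one dict per CSV row
```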
Thanks for the rubberducking session 🙂
a
If memory use is an issue
j
I'm not sure I can access it with boto3. Guess I could try. But what I'm given by the API is an HTTP link looking like
<https://branch-exports-web.s3.us-west-1.amazonaws.com/api_export/y%3D2024/m%3D08/d%3D20/app_id%3D4DC3025>...
That means bucket name
branch-exports-web
, region
us-west-1
and I got the object path too... Maybe I can use boto3, but otherwise I can just use HTTP and range queries if the object is yuge
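Pulling those pieces out of a virtual-hosted-style S3 URL, plus a Range request as the boto3-free fallback (URL truncated as in the message above):

```python
from urllib.parse import unquote, urlsplit

import requests

url = "https://branch-exports-web.s3.us-west-1.amazonaws.com/api_export/y%3D2024/m%3D08/d%3D20/..."

parts = urlsplit(url)
bucket, _, rest = parts.netloc.partition(".s3.")
region = rest.removesuffix(".amazonaws.com")
key = unquote(parts.path.lstrip("/"))  # %3D -> '='
print(bucket, region, key)

# If the object is huge, fetch it in slices via HTTP range requests:
first_mib = requests.get(url, headers={"Range": "bytes=0-1048575"}, timeout=60)
```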
e
You could certainly add tap-csv as a dependency 🙂. It's probably gonna be pretty stable for the foreseeable future, but I'd still pin it to a specific tag, since there's no contract on the Python API. In the long term, I'd like to get file encoding and decoding capabilities built into the singer-sdk, so it's easier to write, for example, Parquet files that are then inserted into a DWH, or, as in your example, extract records from an API response in CSV format.
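If tap-csv does get pinned as a dependency, the reuse might look something like this; the module path is a guess, since (as noted) there's no contract on the Python API:

```python
# Pin tap-csv to an exact tag in your dependencies before relying on this.
from tap_csv.client import CSVStream  # assumed location of the stream class


class BranchExportStream(CSVStream):
    """Reuse tap-csv's CSV handling for the downloaded Branch exports."""
```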