# singer-tap-development
j
I'm dipping my toes into tap development, currently looking at Branch.io. The way it works there is that you enable daily data dumps, and then they provide an API to get your data. You don't get the data itself; instead they provide links to an S3 bucket where you can fetch some gzipped CSVs they generate daily. I'm wondering if I can somehow "chain" the code I'd be writing that interacts with the Branch.io API to
tap-csv
so I don't have to write all the code for dealing with CSVs etc.
Or even
tap-s3-csv
!
This is the API btw: https://help.branch.io/developers-hub/reference/daily-exports-api So I'd use a key/secret pair to make one API call. That returns a JSON object where each element represents a stream. Its value is an HTTP URL linking to a gzipped CSV file.
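For illustration, a minimal sketch of that single export call, assuming `requests` and made-up endpoint/field names (the real contract is in the linked docs):

```python
import requests

# Hypothetical endpoint and auth field names; the real ones are in the
# Daily Exports API docs linked above.
EXPORT_URL = "https://api2.branch.io/v3/export"

resp = requests.post(
    EXPORT_URL,
    json={
        "branch_key": "key_live_...",        # assumed field name
        "branch_secret": "secret_live_...",  # assumed field name
        "export_date": "2024-08-20",
    },
    timeout=30,
)
resp.raise_for_status()

# One JSON object: each key is a stream, each value an HTTP URL
# pointing at a gzipped CSV.
for stream_name, csv_url in resp.json().items():
    print(stream_name, csv_url)
```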
a
You could chain meltano commands to do api -> s3 then s3 -> postgres
Ah, but your tap returns links to S3. I don't think there's a way to 'capture' one tap's output and pass it to another tap
j
Nah, OK. And no sanctioned way of using another tap's internals as a library, like (fantasizing here...)
from tap_csv import csv_parser
I just want to avoid having to reimplement a bog-standard (but at the same time probably pretty hard to get exactly right) thing
a
Could your tap also download and persist the data to s3 directly? And then just emit the s3 url as part of its output? You could run that in combination with an arbitrary target?
meltano run tap-branchio target-postgres tap-s3 target-snowflake
or similar?
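A rough sketch of that re-hosting idea, assuming `boto3` credentials in the environment and a hypothetical bucket name:

```python
import boto3
import requests

def rehost(source_url: str, key: str, bucket: str = "my-landing-bucket") -> str:
    """Stream the vendor's gzipped CSV into a bucket we own, return its S3 URL."""
    s3 = boto3.client("s3")
    with requests.get(source_url, stream=True, timeout=60) as resp:
        resp.raise_for_status()
        # upload_fileobj streams the body without buffering it all in memory
        s3.upload_fileobj(resp.raw, bucket, key)
    return f"s3://{bucket}/{key}"
```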
j
I don't think I'd be able to persist anything in a bucket I don't own. Maybe I don't understand your question...
I have an API for getting URLs for day-fresh CSV data. Maybe it doesn't actually matter that they are in S3 for this case, because I access them over HTTP.
So, yeah, if I could somehow get that logic working, and then instrument
tap-csv
by pointing it to these URLs I'd be good. Doesn't seem like that's the case.
but you know what...
tap-csv
is 250 lines of code. I could include that as a submodule in my
tap-branchio
repo and go to town with the `import`s
Wonder what @Edgar RamĂ­rez (Arch.dev) would think?
a
I was thinking you download the data from the s3 url you are given, and then persist to a bucket you do own (rather than parsing / reading the csv fully).
j
Ah.
Myeah, I guess I could. That just doesn't scratch my itch... I kinda want my tap to do the whole job
a
Not sure if copying the whole tap-csv is the way to go, but you could examine the parsing code and lift it into your tap. I would have thought it's just wrapping Python's
csv
j
It is
a
You could emit everything as text, or, if you have well-defined streams/schemas, do some type casting in your individual streams.
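A sketch of what typed streams could look like with the singer-sdk; the stream name, columns, and helper below are made up for illustration:

```python
from singer_sdk import Stream
from singer_sdk import typing as th


class InstallsStream(Stream):  # hypothetical stream
    name = "eo_install"
    # A declared schema lets the target type the columns instead of
    # everything arriving as text.
    schema = th.PropertiesList(
        th.Property("id", th.StringType),
        th.Property("timestamp", th.DateTimeType),
        th.Property("revenue", th.NumberType),
    ).to_dict()

    def get_records(self, context):
        for row in self._read_csv_rows():  # hypothetical CSV helper
            # Cast raw CSV strings to the declared types.
            row["revenue"] = float(row["revenue"] or 0)
            yield row
```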
j
yeah. Maybe that is a better approach overall. Besides,
tap-csv
needs to handle different styles of CSV, whereas I (hopefully) only have to deal with one. And you're right about the schemas; hadn't thought of that.
That'd also give me a little more freedom in terms of the interface. Would be nifty to use BytesIO or somesuch for streaming so I won't have to persist blocks of data to disk and can do gunzipping on the fly
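That on-the-fly approach is doable with just the stdlib plus `requests`; a minimal sketch:

```python
import csv
import gzip
import io

import requests

def iter_rows(url: str):
    """Stream a gzipped CSV over HTTP: nothing is persisted to disk."""
    with requests.get(url, stream=True, timeout=60) as resp:
        resp.raise_for_status()
        gz = gzip.GzipFile(fileobj=resp.raw)           # gunzip as bytes arrive
        text = io.TextIOWrapper(gz, encoding="utf-8")  # bytes -> str stream
        yield from csv.DictReader(text)                # one dict per CSV row
```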
Thanks for the rubberducking session 🙂
a
If memory use is an issue
j
I'm not sure I can access it with boto3. Guess I could try. But what I'm given by the API is an HTTP link looking like
<https://branch-exports-web.s3.us-west-1.amazonaws.com/api_export/y%3D2024/m%3D08/d%3D20/app_id%3D4DC3025>...
That means bucket name
branch-exports-web
, region
us-west-1
and I got the object path too... Maybe I can use boto3, but otherwise I can just use HTTP and range queries if the object is yuge
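Pulling those pieces out of a virtual-hosted-style S3 URL, plus a Range request as the boto3-free fallback (URL truncated as in the message above):

```python
from urllib.parse import unquote, urlsplit

import requests

url = "https://branch-exports-web.s3.us-west-1.amazonaws.com/api_export/y%3D2024/m%3D08/d%3D20/..."

parts = urlsplit(url)
bucket, _, rest = parts.netloc.partition(".s3.")
region = rest.removesuffix(".amazonaws.com")
key = unquote(parts.path.lstrip("/"))  # %3D -> '='
print(bucket, region, key)

# If the object is huge, fetch it in slices via HTTP range requests:
first_mib = requests.get(url, headers={"Range": "bytes=0-1048575"}, timeout=60)
```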
e
You could certainly add tap-csv as a dependency 🙂. It's probably gonna be pretty stable for the foreseeable future, but I'd still pin it to a specific tag, since there's no contract on the Python API. In the long term, I'd like to get file encoding and decoding capabilities built into the singer-sdk, so it's easier to write, for example, Parquet files that are then inserted into a DWH, or, as in your example, extract records from an API response in CSV format.
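If tap-csv does get pinned as a dependency, the reuse might look something like this; the module path is a guess, since (as noted) there's no contract on the Python API:

```python
# Pin tap-csv to an exact tag in your dependencies before relying on this.
from tap_csv.client import CSVStream  # assumed location of the stream class


class BranchExportStream(CSVStream):
    """Reuse tap-csv's CSV handling for the downloaded Branch exports."""
```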