# singer-tap-development
j
Is using S3 or other blob storage as a sink something that is on the roadmap for the singer-sdk? Has anyone already implemented this themselves?
p
@jam I know we have a Singer "most wanted" issue for an all-in-one S3 target, but there's a variety of options already out there. I'm working on getting these registered on MeltanoHub:
• target-s3-parquet (gupy-io) - SDK based
• target-s3-csv (pipelinewise)
• target-s3-jsonl (ome9ax)
What file format are you looking to write to? Would any of those do what you're looking for?
j
We're currently loading into S3 as CSV or Parquet, and then reading from S3 into Snowflake. We could technically go directly from source to Snowflake, but we want to keep the files in S3 too. It sounded like a great place to use a sink, but it looks like S3 isn't supported yet.
Is there a way to load from tap -> multiple targets with a single meltano command?
e
> Is there a way to load from tap -> multiple targets with a single meltano command?
@jam There isn't at the moment. As far as I can tell, the main impediment is the coupling of tap and target created by the state. Even if you could create a fan-out that delivers Singer messages to multiple targets, if one of the targets fails and the other continues, they'll end up with different states, and the user would be forced to run separate invocations of the tap on the next run to pick up from the different starting states.
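(A minimal sketch of the workaround this implies: run the same tap against each target as its own pipeline, so each tap/target pair keeps its own state. Plugin names below are placeholders.)

```sh
# Each invocation is an independent pipeline with its own state bookmark,
# so a failure in one run does not affect the other's starting point.
meltano run tap-postgres target-s3-parquet
meltano run tap-postgres target-snowflake
```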
j
Thanks Edgar, that makes perfect sense. We do have a need to keep the files that we load into S3. Would your recommendation be to run a tap|target from source to S3 and then, upon successful completion of that, load from S3|Snowflake? Would this not be a good use case for an S3 sink (if it existed)?
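(A minimal sketch of that two-step approach, assuming placeholder plugin names; the second step only runs if the first succeeds, and tap-s3-csv here stands in for whatever reader would pick the landed files back up.)

```sh
# Step 1: land raw CSV files in S3.
# Step 2: only on success, read those files back and load them into Snowflake.
meltano run tap-mysource target-s3-csv && \
  meltano run tap-s3-csv target-snowflake
```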
This is what I'm referencing, for those interested:
> This capability is currently in development by the Singer Working Group to align on a consistent approach for supporting batch sync messages in the Spec, or as part of its officially supported patterns. The capability, which was inspired by Wise's Fast Sync feature, has the goal of bypassing some of the performance limitations inherent to the Singer Spec by using the optimized import/export features of modern data warehouses (Snowflake, Redshift, etc.). For example, if a tap and target both had batch support, the tap could theoretically know to write records directly to CSV files somewhere (i.e. S3) in an ideal format for the target to import, ultimately skipping most of the piping step. Taking this approach allows connectors to improve performance by using the source's and/or target's most optimized technique for importing and exporting data, while also avoiding the additional costs of serializing, deserializing, and piping each record. Join the issue conversation here!
https://hub.meltano.com/singer/docs/#batch
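(For reference, a BATCH message as the Meltano SDK implements it looks roughly like this; field names follow the SDK docs, but treat the exact shape as illustrative.)

```json
{
  "type": "BATCH",
  "stream": "users",
  "encoding": {
    "format": "jsonl",
    "compression": "gzip"
  },
  "manifest": [
    "file:///tmp/batches/users-0001.jsonl.gz",
    "file:///tmp/batches/users-0002.jsonl.gz"
  ]
}
```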
p
@jam - Edgar actually implemented the batch message type feature, so he can correct me, but that's primarily to take advantage of the fast import/export functionality of sources/destinations that have it, like most warehouses. Your use case sounds more like a backup strategy, right? You want a copy of your raw data stored in S3 as a backup, in addition to loading it into Snowflake where it's used.
j
I would say that's accurate, with the caveat that our current process does use the import functionality that Snowflake provides. So it's for both reasons.
p
@edgar_ramirez_mondragon I'm starting to recall a thread or office hours discussion about having this as a feature of the batch message type - like choosing not to delete the intermediate files. Do you know if we have an issue for this (I couldn't find one), or if there's a reason it's not feasible?
j
Since the scope was local data storage, I imagine memory was the constraint.
e
@pat_nadolny yeah, I recall that discussion about whether or not to delete the batch files. Currently the implementation does not delete batch files after they're used, and there's no issue logged to change that. We do plan to support other intermediate storage locations, and adding support for S3 should be relatively easy, though there's no issue for that either.
a
@jam - I contributed a feature like this to pipelinewise-target-snowflake a while back, specifically because we wanted to `retain_s3_files` for backup purposes. The nice thing about this approach is that the target still gets full fidelity of the upstream tap's catalog metadata, whereas some data types could morph if you did this as a two-step process. (Scanning metadata like data types on the S3 files might have less fidelity than what you'd get from the tap itself.)
p
Is there an issue for the S3 intermediate location? I think that solves this use case: tap-x to target-snowflake using batch message mode with S3 intermediate storage would result in all the data being loaded into Snowflake, along with a backup raw copy in the S3 intermediate bucket.
a
@pat_nadolny - Yes, as you say, I think we were thinking of this as two parts:
1. Add `s3://` as a supported storage root for batch files.
2. Add a `cleanup_files=false` option for batch config on the target side. (This is the default behavior as of now, per Edgar's comment above.)
I don't know if we have issues open for those yet
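(A rough sketch of what those two parts might look like together in a loader's batch settings, loosely following the SDK's batch_config shape; the s3:// root and the cleanup_files flag are the proposed, not-yet-implemented pieces.)

```yaml
# Hypothetical meltano.yml fragment - the s3:// storage root and the cleanup_files
# flag are the proposed additions from the two points above, not existing settings.
plugins:
  loaders:
    - name: target-snowflake
      config:
        batch_config:
          encoding:
            format: jsonl
            compression: gzip
          storage:
            root: "s3://my-raw-backup-bucket/batches/"  # part 1: S3 as a storage root
            prefix: "orders-"
          cleanup_files: false  # part 2: keep batch files after loading
```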
j
I'll look into the issue creation process and see if I can get them created. Any guidance you all can provide on how best to do this would be appreciated. I'd also love to contribute in any way that I can!
a
Issue tracker is here, in case it helps: https://github.com/Meltano/SDK/issues
You can do a search for "batch" and "S3" - that would surface the issue if it's already logged.
j
Following up: I went ahead and added what we discussed as an issue here. Let me know if I missed anything!
a
Thanks, @jam! Much appreciated.