# random
h
This is in some ways tangential to Meltano, but I have been working on a Snowflake -> SQL Server pipeline, and when sketching the SQL Server target it appeared that bulk load via Azure storage would be the most efficient approach. Since Snowflake can dump to Azure storage, it would be inefficient to pipe each record in this case. But I like the Singer/Meltano scaffolding. I think it would be feasible to write a tap + target that uses records of references to the written blobs instead of records of data. Has anyone done anything like this? I’m a little stuck on the metadata part. The “record” would be something like
{
  "type": "RECORD",
  "stream": "ECON_DATASETS",
  "record": {
    "FILE_NAME": "<azure://abc.blob.core.windows.net/export/<...>.csv>",
    "SIZE_BYTES": 851,
    "MD5": "b67e6ac5d0e68ffb3b4feb192bcd58f7",
    "LAST_MODIFIED": "Sun, 25 Sep 2022 14:00:52 GMT"
  },
  "version": 1,
  "time_extracted": "2022-09-25T15:31:53.209274Z"
}
But I would like to append the actual data schema too, especially since plain SQL Server seems to prefer CSVs.
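One way to sketch that idea with plain Singer messages: send a SCHEMA message that describes the metadata stream, with a hypothetical extra property carrying the schema of the exported files themselves, followed by RECORD messages that only reference the blobs. The stream name follows the example above; the `DATA_SCHEMA` field, file name, and column names are invented here purely for illustration.

```python
import json

# Hypothetical sketch: a SCHEMA message for the metadata stream, extended with
# a DATA_SCHEMA property describing the contents of the exported files, so the
# target knows how to CREATE TABLE before bulk loading.
schema_msg = {
    "type": "SCHEMA",
    "stream": "ECON_DATASETS",
    "key_properties": [],
    "schema": {
        "type": "object",
        "properties": {
            "FILE_NAME": {"type": "string"},
            "SIZE_BYTES": {"type": "integer"},
            "MD5": {"type": "string"},
            "LAST_MODIFIED": {"type": "string"},
            # Invented extension: the schema of the CSV contents.
            "DATA_SCHEMA": {"type": "object"},
        },
    },
}

# A RECORD that carries only a blob reference plus the per-file data schema.
# The file path and columns are placeholders, not real resources.
record_msg = {
    "type": "RECORD",
    "stream": "ECON_DATASETS",
    "record": {
        "FILE_NAME": "azure://example.blob.core.windows.net/export/part0.csv",
        "SIZE_BYTES": 851,
        "MD5": "b67e6ac5d0e68ffb3b4feb192bcd58f7",
        "LAST_MODIFIED": "Sun, 25 Sep 2022 14:00:52 GMT",
        "DATA_SCHEMA": {"REGION": {"type": "string"}, "GDP": {"type": "number"}},
    },
}

# Singer messages are newline-delimited JSON on stdout.
for msg in (schema_msg, record_msg):
    print(json.dumps(msg))
```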
a
This is very timely! On Friday we released support for the BATCH message type in our SDK. This allows taps and targets to send files, pretty much like you describe, except under a new message type. Initially the feature supports .jsonl and .jsonl.gz formats. We went with jsonl as our first implementation because it does not require extensive configuration like CSV does. Jsonl is slightly less performant, but more interoperable, and the performance/size trade-off is somewhat compensated by having support for gzip out of the gate, reducing the I/O and network costs of the more verbose jsonl format. We will be announcing this on Monday (tomorrow)!
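For a rough picture of the shape involved: instead of one RECORD per row, a BATCH message points the target at files of newline-delimited JSON records. The sketch below approximates that shape; the exact field names are the SDK's to define (check its docs), and the file URIs here are placeholders.

```python
import json

# Approximate shape of a BATCH message: the tap hands the target a manifest of
# .jsonl.gz files rather than streaming individual RECORD messages.
batch_msg = {
    "type": "BATCH",
    "stream": "ECON_DATASETS",
    "encoding": {"format": "jsonl", "compression": "gzip"},
    # Placeholder URIs; in practice these point at local or cloud storage.
    "manifest": [
        "file:///tmp/batches/ECON_DATASETS-0.jsonl.gz",
        "file:///tmp/batches/ECON_DATASETS-1.jsonl.gz",
    ],
}

print(json.dumps(batch_msg))
```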
h
So basically the same tap or target can potentially receive both record and batch data? This is amazing, and quite the timing.
a
Yes! The way it works is basically:
1. The user configures in the tap a preference for batch messaging.
2. The target will either accept the batch messages, or fail if batch is not supported.
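In target terms, that accept-or-fail behavior amounts to dispatching on the message type. Here is a minimal stand-alone sketch (no singer_sdk dependency; `SUPPORTS_BATCH` is a made-up capability flag for illustration):

```python
import json

SUPPORTS_BATCH = False  # hypothetical capability flag for this sketch

def handle_message(line: str) -> str:
    """Dispatch one Singer message line; fail fast on unsupported BATCH."""
    msg = json.loads(line)
    msg_type = msg["type"]
    if msg_type == "BATCH" and not SUPPORTS_BATCH:
        raise NotImplementedError("This target does not support BATCH messages")
    # A real target would route RECORD/SCHEMA/STATE/BATCH to their handlers here.
    return msg_type

handle_message('{"type": "RECORD", "stream": "s", "record": {}}')  # accepted
try:
    handle_message('{"type": "BATCH", "stream": "s", "manifest": []}')
except NotImplementedError as exc:
    print(exc)
```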
Based on your message above, it looks like you'd want tap-snowflake to target-mssql. As it turns out, @ken_payne is already working on native batch message support on the Snowflake side. Would you be interested in building the target-mssql side? We're happy to assist, and we'll have a reference implementation on target-snowflake coming soon, if helpful.
h
He is working on batch on the tap-snowflake side? If so, I’d be happy to help with a target-mssql. As mentioned, I have been sketching a little based on transferwise’s target-postgres, which is quite readable. It might make sense to start with a primitive record-type implementation while the batch type matures a little.
a
He's working simultaneously on the snowflake tap and target together as a pair. 👍
When you do fork, I recommend starting with an SDK-based implementation, so that you get free updates when new features and fixes are added, plus built-in processing for most common functions. MeltanoLabs/target-snowflake might be a good one to fork. It's still a WIP, but there's a lot less code to maintain and you get built-in SQLAlchemy handlers for record processing and most generic functions.