Hi everyone Maybe someone can advise on the already existing Meltano #getting-started

anna_antypenko

06/14/2022, 7:57 PM

Hi everyone! Maybe someone can advise on an already existing loader? I need to write data to S3 and split a large JSON file into smaller ones for better performance. I see that we have this option (it loads one big file), and our team is thinking of creating a custom loader for our needs. However, the problem described above seems common, so maybe something similar already exists? Thanks for any thoughts on this!

visch

06/22/2022, 12:35 PM

The performance reason is curious. If the file is getting up there you should be good, but I'm curious what performance issues you're hitting, and with what tool exactly? Lots of times people take S3 data and then load it into a DW, or maybe hit it with Athena. The hard question here is: what do you want to divide the file up by? DBs do a great job at this kind of thing 🤷

anna_antypenko

06/22/2022, 1:02 PM

Why this came up: during backfilling we ended up with JSONL files of more than 1 GB (though we have use cases other than backfilling). The external table runs really slowly because of this, and splitting the files into multiple smaller ones (~100 MB each) helps with speed if we query it from Athena directly. Generally, we have an S3 --> Glue --> external table --> staging table (Redshift) --> public table (Redshift) flow, plus some incremental setup if needed. In this case, writing the staging table took a really long time and used a lot of resources.

visch

06/22/2022, 1:11 PM

Got it, so you have a "Data Lake" that you query with multiple tools (Glue sometimes, Athena sometimes). Another way to handle this is to load directly to Redshift, but I get that wouldn't work for you all. From here, I think adding a feature to https://github.com/ome9ax/target-s3-jsonl to split files up based on whatever you'd like to split them on (file size could work) would do it 🤷 I don't know of any targets offhand that do that. I also posted an issue for you (https://github.com/meltano/hub/issues/617). Having a bunch of different targets for S3 (s3-csv, s3-json, s3-parquet) is kinda silly; we should probably have a generic target for files, similar to tap-spreadsheetsanywhere.
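
For illustration, the size-based splitting suggested above could look roughly like the sketch below. It is not taken from target-s3-jsonl; `split_jsonl`, `part_key`, and the key layout are hypothetical names, and the 100 MB default simply mirrors the file size mentioned earlier in the thread.

```python
import json
from typing import Iterable, Iterator


def split_jsonl(records: Iterable[dict], max_bytes: int = 100 * 1024 * 1024) -> Iterator[bytes]:
    """Yield JSONL payloads of at most max_bytes each.

    A single record larger than max_bytes is still emitted, alone in its own chunk.
    """
    buffer = []
    size = 0
    for record in records:
        # One compact JSON document per line, UTF-8 encoded.
        line = json.dumps(record, separators=(",", ":")).encode("utf-8") + b"\n"
        # Flush the current chunk before it would exceed the limit.
        if buffer and size + len(line) > max_bytes:
            yield b"".join(buffer)
            buffer, size = [], 0
        buffer.append(line)
        size += len(line)
    if buffer:
        yield b"".join(buffer)


def part_key(prefix: str, stream: str, index: int) -> str:
    """Build a hypothetical object key for the index-th chunk of a stream."""
    return f"{prefix}/{stream}/part-{index:05d}.jsonl"
```

Each yielded chunk could then be uploaded as its own object, for example with boto3's `put_object(Bucket=..., Key=part_key(...), Body=chunk)`, so that Athena or Glue sees many smaller files instead of one large one.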
