# getting-started
j
So I'm saving data to a Parquet file and then loading it back up and sending it out to a database. The reason I'm doing it in two steps is that the extraction process takes forever and I don't want to hold a database connection open overnight. I saved it out with target-parquet. Cool. Got the file I need. Now I have a Parquet file and I add this extractor
meltano add extractor tap-parquet
I get this error - how do I configure the tap-parquet virtual environment to fulfill this requirement?
ERROR: No matching distribution found for singer-sdk<0.2.0,>=0.1.0
v
Don't have a computer in front of me, but this smells like a setting on your machine, since https://pypi.org/project/singer-sdk/0.1.0/ does exist. Maybe a different default PyPI repo? Not certain though.
Very, very old version though, so getting that dependency updated would be best.
j
Yeah, I definitely have had trouble round-tripping data between taps and targets when using files as an intermediate stage. My Mailchimp data takes forever to download, and it's got commas everywhere: JSON data, names (because people wrote LASTNAME, FIRSTNAME). So I'm like sweet, I'll use tap-csv and swap the delimiter to be ^ because CSVs can have anything. But then the next tap I picked up didn't support custom delimiters correctly and wrote garbage tables out. I'm like OK, it's got JSON data in there, maybe I should write to a more structured format. Tried DuckDB. It crashed writing out the file. I'm like OK, the file is fundamentally JSON and I'm converting it, so let's stop doing that. So I wrote out to JSONL. Except I couldn't install tap-singer-jsonl because it wants a whole other version of Python. I'm like OK, DuckDB is all new and experimental, maybe I should go for an older standard, like Parquet. So I write the file out with target-parquet. And then I try to install tap-parquet, but I can't. So then I'm like fine, Snowflake has instructions for manually loading a Parquet file from a stage. So I write a bit of SQL to PUT the Parquet file to an internal stage, and when I go to COPY INTO, it tells me target-parquet wrote out a file without magic bytes. SO, what do you think my next option is? Local Postgres instance? Keep the tap-csv output and manually set the delimiter in Snowflake to be ^? Am I trying too hard?
In the end, what I've done is make a three-line Python script to turn the JSONL into JSON and then put it into an internal stage with SnowSQL. Feels a bit like cheating; fortunately with this one I only have to run it a couple of times.
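(For reference, that kind of JSONL-to-JSON conversion might look something like the sketch below; it assumes the file fits in memory, and the file names are just placeholders, not ones from the thread.)
import json

# Read newline-delimited records and re-emit them as a single JSON array
# (which is why STRIP_OUTER_ARRAY comes up later). File names are illustrative.
with open("mailchimp.jsonl") as src, open("mailchimp.json", "w") as dst:
    json.dump([json.loads(line) for line in src if line.strip()], dst)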
u
@jessica_fosler oh no, that sounds like a pain 😢. A new Mailchimp SDK-based tap was added to the hub last week; I wonder if that would be faster and potentially handle your data issues too: https://hub.meltano.com/extractors/tap-mailchimp. Worth a try 🤷
j
OOO cool beans - @pat_nadolny I am learning Meltano and Snowflake at the same time, so it's all coming at me like a firehose. One thing I did learn is that I was silly to convert it to JSON: Snowflake itself supports NDJSON, which is practically the same as JSONL, so the files the JSONL target dumps out I could have uploaded to an internal Snowflake stage with the JSON file format directly. I did use the strip outer array option too, which I don't know was actually necessary. The current tap is choking on dumping out report activity; I do wonder if this one handles it better. I don't have to dump again for a few months, but I might give that new connector a shot. I can't wait to update the Python in this Docker image from 3.9, but I've got a few other things blocking it from being updated. It must be that 3.10 is like the magic minimum version.
But I guess the takeaway for you might be that Snowflake is calling JSONL "NDJSON" in their docs - might be worth double listing the tap in hub.meltano.com and/or updating the readme so it shows up in search
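(As an illustration of that point: Snowflake's JSON file format already reads newline-delimited records one per row, so the JSONL output could be loaded without converting it to a JSON array first. A rough sketch using the Python Snowflake connector, where the credentials, file path, stage path, and table name are all made-up placeholders and the target table is assumed to have a single VARIANT column:)
import snowflake.connector

# Placeholder credentials and object names - not from the thread.
conn = snowflake.connector.connect(
    account="my_account",
    user="my_user",
    password="...",
    warehouse="my_wh",
    database="my_db",
    schema="raw",
)
cur = conn.cursor()

# Upload the JSONL/NDJSON file to the user stage; PUT gzips it by default.
cur.execute("PUT file:///data/mailchimp.jsonl @~/mailchimp/")

# TYPE = 'JSON' treats each newline-delimited object as one row, so no
# conversion to a single array (and no STRIP_OUTER_ARRAY) is needed.
cur.execute(
    "COPY INTO raw_mailchimp FROM @~/mailchimp/ FILE_FORMAT = (TYPE = 'JSON')"
)
conn.close()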
u
might be worth double listing the tap in hub.meltano.com and/or updating the readme so it shows up in search
@jessica_fosler can you say more about what you mean by this? Most users of Meltano and target-snowflake should never have to even consider how all of this works to get it functioning; unfortunately I think your use case with Mailchimp made it more complicated due to bugs. The meltanolabs variant of target-snowflake writes JSONL data to an internal stage and loads it into tables for you, so you shouldn't ever have to create stages/file formats/etc. manually.
j
The Mailchimp initial sync took like 17h to pull down the data, and what I was probably overly optimising for was not wanting to hold a warehouse connection open for that long. So I wanted to save to disk and then push to Snowflake. But I was a bit stuck on step 2, reading the JSONL files back, because my Docker container is stuck on Python 3.9. So I could write out to JSONL on Python 3.9 but not read from JSONL on Python 3.9, which meant I had to flip to manually uploading the data.
u
Another option you can try is saving the raw tap output to a file, then later piping it into the target once it's done:
meltano invoke tap-mailchimp > output.json
then later
cat output.json | meltano invoke target-snowflake
It's not necessarily a long-term solution since you can't use invoke for schedules, but it could be good for a one-time backfill.
j
ooooo. Thanks heaps!
Also we love your work @pat_nadolny - have been learning a lot from meltano-squared