# getting-started
j
So I'm saving data to a Parquet file and then loading it back up and sending it out to a database. The reason I'm doing it in two steps is that the extraction process takes forever and I don't want to hold a database connection open overnight. I saved it out with target-parquet. Cool. Got the file I need. Now I have a Parquet file and I add this extractor
meltano add extractor tap-parquet
I get this error - how do I configure the tap-parquet virtual environment to fulfill this requirement?
ERROR: No matching distribution found for singer-sdk<0.2.0,>=0.1.0
v
Don't have a computer in front of me, but this smells like a setting on your machine, since https://pypi.org/project/singer-sdk/0.1.0/ does exist. Maybe a different default PyPI repo? Not certain though.
Very, very old version though, so getting that dependency updated would be best.
j
Yeah, I definitely have had trouble round-tripping data between taps and targets when using files as an intermediate stage. My Mailchimp data takes forever to download, and it's got commas everywhere: JSON data, names (because people wrote LASTNAME, FIRSTNAME). So I'm like sweet, I'll use tap-csv and swap the delimiter to be ^ because CSVs can have anything. But then the next tap I picked up didn't support custom delimiters correctly and wrote garbage tables out. I'm like OK, it's got JSON data in there, maybe I should write to a more structured format. Tried DuckDB. It crashed writing out the file. I'm like OK, the file is fundamentally JSON and I'm converting it, so let's stop doing that. So I wrote out to JSONL. Except I couldn't install tap-singer-jsonl because it wants a whole other version of Python. I'm like OK, DuckDB is all new and experimental, maybe I should go for an older standard, like Parquet. So I write the file out with target-parquet. And then I try to install tap-parquet, but I can't. So then I'm like fine, Snowflake has instructions for manually loading a Parquet file from a stage. So I write a bit of SQL to PUT the Parquet file to an internal stage, and when I go to COPY INTO, it tells me target-parquet wrote out a file without magic bytes. SO, what do you think my next option is? Local Postgres instance? Keep the tap-csv output and manually set the delimiter in Snowflake to be ^? Am I trying too hard?
In the end, what I've done is make a three-line Python script to turn the JSONL into JSON and then put it into an internal stage with SnowSQL. Feels a bit like cheating; fortunately with this one I only have to run it a couple of times.
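(For reference, that kind of JSONL-to-JSON conversion might look something like the sketch below; it assumes the file fits in memory, and the file names are just placeholders, not ones from the thread.)
import json

# Read newline-delimited records and re-emit them as a single JSON array
# (which is why STRIP_OUTER_ARRAY comes up later). File names are illustrative.
with open("mailchimp.jsonl") as src, open("mailchimp.json", "w") as dst:
    json.dump([json.loads(line) for line in src if line.strip()], dst)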
u
@jessica_fosler oh no, that sounds like a pain 😢. A new Mailchimp SDK-based tap was added to the hub last week; I wonder if that would be faster and potentially handle your data issues too: https://hub.meltano.com/extractors/tap-mailchimp. Worth a try 🤷
j
OOO cool beans - @pat_nadolny I am learning Meltano and Snowflake at the same time, so it's all coming at me like a firehose. One thing I did learn is that I was silly to convert it to JSON: Snowflake itself supports NDJSON, which is practically the same as JSONL, so the files the JSONL target dumps out I could have uploaded to an internal Snowflake stage with the JSON file format directly. I did use the strip outer array option too, which I don't know was actually necessary. The current tap is choking on dumping out report activity; I do wonder if this one handles it better. I don't have to dump again for a few months, but I might give that new connector a shot. I can't wait to update the Python in this Docker image from 3.9, but I've got a few other things blocking it from being updated. It must be that 3.10 is like the magic minimum version.
But I guess the takeaway for you might be that Snowflake is calling JSONL "NDJSON" in their docs - might be worth double listing the tap in hub.meltano.com and/or updating the readme so it shows up in search
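(As an illustration of that point: Snowflake's JSON file format already reads newline-delimited records one per row, so the JSONL output could be loaded without converting it to a JSON array first. A rough sketch using the Python Snowflake connector, where the credentials, file path, stage path, and table name are all made-up placeholders and the target table is assumed to have a single VARIANT column:)
import snowflake.connector

# Placeholder credentials and object names - not from the thread.
conn = snowflake.connector.connect(
    account="my_account",
    user="my_user",
    password="...",
    warehouse="my_wh",
    database="my_db",
    schema="raw",
)
cur = conn.cursor()

# Upload the JSONL/NDJSON file to the user stage; PUT gzips it by default.
cur.execute("PUT file:///data/mailchimp.jsonl @~/mailchimp/")

# TYPE = 'JSON' treats each newline-delimited object as one row, so no
# conversion to a single array (and no STRIP_OUTER_ARRAY) is needed.
cur.execute(
    "COPY INTO raw_mailchimp FROM @~/mailchimp/ FILE_FORMAT = (TYPE = 'JSON')"
)
conn.close()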
u
might be worth double listing the tap in hub.meltano.com and/or updating the readme so it shows up in search
@jessica_fosler can you say more about what you mean by this? Most users of Meltano and target-snowflake should never have to even consider how all of this works to get it functioning; unfortunately I think your use case with Mailchimp made it more complicated due to bugs. The meltanolabs variant of target-snowflake writes JSONL data to an internal stage and loads it into tables for you, so you shouldn't ever have to create stages/file formats/etc. manually.
j
The Mailchimp initial sync took like 17h to pull down the data, and what I was probably overly optimising for was not wanting to hold a warehouse connection open for that long. So I wanted to save to disk and then push to Snowflake. But I was a bit stuck on step 2, reading the JSONL files back, because my Docker container is stuck on Python 3.9. So I could write out to JSONL on Python 3.9 but not read from JSONL on Python 3.9, which meant I had to flip to manually uploading the data.
u
Another option you can try is saving the raw tap output to a file, then later piping it into the target once it's done:
meltano invoke tap-mailchimp > output.json
then later
cat output.json | meltano invoke target-snowflake
It's not necessarily a long-term solution since you can't use invoke for schedules, but it could be good for a one-time backfill.
j
ooooo. Thanks heaps!
Also we love your work @pat_nadolny - have been learning a lot from meltano-squared