# troubleshooting
k
So I've been playing with Meltano for a few days. It seems to work fine with smaller test data. I've tested a couple of different loaders (jsonl and Redshift) and tried to reduce the gap between the two by changing the loader settings. For the production data set, which is only a few hundred times bigger and has a few thousand times more ongoing activity, it seems to be pretty slow (I haven't tested whether the delay is from the extractor or the loader). I will continue experimenting by setting up cloud state backends etc., but are there any other suggestions I can try this weekend? Thank you.
v
Can you give more information? Your meltano.yml, which commands are slow, what is fast, what is slow, etc.
k
I'm extracting from HubSpot to Redshift. The entire ETL seems to take forever in the production environment. I plan to do more testing like I did in staging, i.e. extractor to jsonl, to compare the speed etc.
So I will be spending time investigating this weekend, and wanted to get general suggestions people have tried in the past to speed up ETL.
I'm not doing any transformation.
Slow as in, new data doesn't show up in Redshift even after a day, even though I have Airflow running. It does seem to be extracting data from the source, since I see API activity pulling data, so it's possible it hit some kind of max cap somewhere? There is free disk space and memory isn't at 100%, but I may also just try a beefier machine.
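(For reference, a rough sketch of the staging-style comparison described above, assuming the plugins are named `tap-hubspot`, `target-jsonl`, and `target-redshift` in meltano.yml:)

```bash
# Time the extractor alone by writing to a local JSONL loader.
time meltano run tap-hubspot target-jsonl

# Then time the same stream selection going into Redshift.
time meltano run tap-hubspot target-redshift
```

If the first command is already slow, the bottleneck is the tap / source API; if only the second is slow, it's the Redshift loading side.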
c
for the S3 data dumped from the tap, how many files are created? How many machines are in your Redshift cluster? If you see only a single file in S3, unless things have changed over the past couple of years, you'll get no parallelism loading that data into Redshift
k
I only have 1 cluster. Maybe that's a factor with the larger DB. I need to figure out how to keep the files in S3 so I can see how many it creates at a time. From the logs the names seem different, but it's possible it's only writing them one by one, I guess.
c
the number of machines in your Redshift cluster will determine what level of parallelism you can attain - I believe it is 2 per machine in the cluster
if you end up with a single 1 TB CSV file… guess what? Heh.
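(To illustrate the parallelism point: Redshift's COPY spreads the files matching an S3 prefix across slices, roughly one file per slice, while a single huge file is read by a single slice. A minimal sketch with made-up bucket, schema, and IAM role names:)

```sql
-- COPY from a prefix so each slice can pick up its own file;
-- one 1 TB CSV means one slice does all the work.
COPY analytics.hubspot_contacts
FROM 's3://example-bucket/hubspot/contacts/'
IAM_ROLE 'arn:aws:iam::123456789012:role/example-redshift-copy'
FORMAT AS CSV
GZIP;
```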
I think if you enable bucket versioning you’ll be able to see deleted files in an S3 bucket, so that’s an easy way to debug this (caveat: never tried this, but I suspect that would work)
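(One way to do that with the AWS CLI; a sketch assuming a hypothetical bucket name and prefix:)

```bash
# Turn on versioning so objects deleted by the loader leave a recoverable version behind.
aws s3api put-bucket-versioning \
  --bucket example-meltano-staging \
  --versioning-configuration Status=Enabled

# Later, list all versions (including deleted keys) to count how many files the target wrote.
aws s3api list-object-versions \
  --bucket example-meltano-staging \
  --prefix hubspot/ \
  --query 'Versions[].Key'
```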
v
https://github.com/meltano/meltano/issues/6613#issuecomment-1215074973 is what I point folks to for debugging performance issues, as it'll narrow things down quicker
🙏 1
If it's taking a day then I'd start at the second step (skip the meltano run)
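(The gist of that step, as a sketch rather than the exact commands from the linked comment: capture the tap's Singer output to a file, time that, then time replaying the file into the target. Plugin names assumed as before.)

```bash
# 1. Run just the extractor and capture its Singer messages.
time meltano invoke tap-hubspot > hubspot_singer_output.jsonl

# 2. Replay the captured output into the loader, bypassing the tap entirely.
time cat hubspot_singer_output.jsonl | meltano invoke target-redshift
```

Whichever half dominates the runtime is where to focus.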
c
if you’re using the pipelinewise-target-redshift (remembering you had problems with the ticketswap version) there is a parameter that controls the # of output files which defaults to 1: `slices`:
```
The number of slices to split files into prior to running COPY on Redshift. This should be set to the number of Redshift slices. The number of slices per node depends on the node size of the cluster - run SELECT COUNT(DISTINCT slice) slices FROM stv_slices to calculate this. Defaults to 1.
```
🙏 1
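(In meltano.yml that would look roughly like this; the variant name and the slice count of 4 are illustrative, so match the number to the stv_slices query above:)

```yaml
plugins:
  loaders:
    - name: target-redshift
      variant: transferwise   # pipelinewise-target-redshift
      config:
        slices: 4              # set to SELECT COUNT(DISTINCT slice) FROM stv_slices
```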
k
Thank you both for the suggestions. I will experiment over the weekend, and hopefully I can achieve an acceptable performance. 🙂
I'm going to try that slices thing now. Hopefully that's a quicker win.
c
match it to the number of slices in Redshift! (2 per node IIRC)
k
I don't think the performance cap was on the Redshift side. I think it's from the source, as their API is rate-limited as well. So maybe it is what it is. 😞 Well, at least I found a bunch of other issues, so I'll address those first and see. Thank you
c
yeah Redshift is fairly reasonable even with a single node and a couple GBs of data at a time, at least in terms of performance (cost… different question and answer entirely)
👍 1