# troubleshooting
k
So I've been playing with Meltano for a few days. It seems to work fine with smaller test data. I've tested a couple of different loaders (jsonl and Redshift) and tried to reduce the gap between the two by changing the loader settings. For the production data set, which is only a few hundred times bigger and has a few thousand times more ongoing activity, it seems to be pretty slow (I haven't tested whether the delay is from the extractor or the loader). I will continue experimenting by setting up cloud state backends etc., but are there any other suggestions I can try this weekend? Thank you.
v
Can you give more information? Your meltano.yml, which commands are slow, what is fast, what is slow, etc.
k
I'm extracting from HubSpot to Redshift. The entire ETL seems to take forever in the production environment. I plan to do more testing like I did in staging, i.e. extractor to jsonl, to compare the speed etc.
So I will be spending time investigating this weekend, and wanted to get general suggestions people have tried in the past to speed up ETL.
I'm not doing any transformation.
Slow as in, new data doesn't show up in Redshift even after a day, even though I have Airflow running. It does seem to be extracting data from the source, since I see API activity pulling data, so it's possible it hit some kind of max cap somewhere? There is free disk space and memory isn't at 100%, but I may also just try a beefier machine.
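(For reference, a rough sketch of the staging-style comparison described above, assuming the plugins are named `tap-hubspot`, `target-jsonl`, and `target-redshift` in meltano.yml:)

```bash
# Time the extractor alone by writing to a local JSONL loader.
time meltano run tap-hubspot target-jsonl

# Then time the same stream selection going into Redshift.
time meltano run tap-hubspot target-redshift
```

If the first command is already slow, the bottleneck is the tap / source API; if only the second is slow, it's the Redshift loading side.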
c
for the S3 data dumped from the tap, how many files are created? How many machines are in your Redshift cluster? If you see only a single file in S3, unless things have changed over the past couple of years, you'll get no parallelism loading that data into Redshift
k
I only have 1 cluster. Maybe that's a factor with the larger DB. I need to figure out how to keep the files in S3 so I can see how many it creates at a time. From the logs the names seem different, but it's possible it's only writing them one by one, I guess.
c
the number of machines in your Redshift cluster will determine what level of parallelism you can attain - I believe it is 2 per machine in the cluster
if you end up with a single 1 TB CSV file… guess what? Heh.
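(To illustrate the parallelism point: Redshift's COPY spreads the files matching an S3 prefix across slices, roughly one file per slice, while a single huge file is read by a single slice. A minimal sketch with made-up bucket, schema, and IAM role names:)

```sql
-- COPY from a prefix so each slice can pick up its own file;
-- one 1 TB CSV means one slice does all the work.
COPY analytics.hubspot_contacts
FROM 's3://example-bucket/hubspot/contacts/'
IAM_ROLE 'arn:aws:iam::123456789012:role/example-redshift-copy'
FORMAT AS CSV
GZIP;
```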
I think if you enable bucket versioning you’ll be able to see deleted files in an S3 bucket, so that’s an easy way to debug this (caveat: never tried this, but I suspect that would work)
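(One way to do that with the AWS CLI; a sketch assuming a hypothetical bucket name and prefix:)

```bash
# Turn on versioning so objects deleted by the loader leave a recoverable version behind.
aws s3api put-bucket-versioning \
  --bucket example-meltano-staging \
  --versioning-configuration Status=Enabled

# Later, list all versions (including deleted keys) to count how many files the target wrote.
aws s3api list-object-versions \
  --bucket example-meltano-staging \
  --prefix hubspot/ \
  --query 'Versions[].Key'
```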
v
https://github.com/meltano/meltano/issues/6613#issuecomment-1215074973 is what I point folks to for debugging performance issues, as it'll narrow things down quicker
🙏 1
If it's taking a day then I'd start at the second step (skip the meltano run)
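(The gist of that step, as a sketch rather than the exact commands from the linked comment: capture the tap's Singer output to a file, time that, then time replaying the file into the target. Plugin names assumed as before.)

```bash
# 1. Run just the extractor and capture its Singer messages.
time meltano invoke tap-hubspot > hubspot_singer_output.jsonl

# 2. Replay the captured output into the loader, bypassing the tap entirely.
time cat hubspot_singer_output.jsonl | meltano invoke target-redshift
```

Whichever half dominates the runtime is where to focus.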
c
if you’re using the pipelinewise-target-redshift (remembering you had problems with the ticketswap version) there is a parameter that controls the # of output files which defaults to 1: `slices`:
```
The number of slices to split files into prior to running COPY on Redshift. This should be set to the number of Redshift slices. The number of slices per node depends on the node size of the cluster - run SELECT COUNT(DISTINCT slice) slices FROM stv_slices to calculate this. Defaults to 1.
```
🙏 1
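(In meltano.yml that would look roughly like this; the variant name and the slice count of 4 are illustrative, so match the number to the stv_slices query above:)

```yaml
plugins:
  loaders:
    - name: target-redshift
      variant: transferwise   # pipelinewise-target-redshift
      config:
        slices: 4              # set to SELECT COUNT(DISTINCT slice) FROM stv_slices
```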
k
Thank you both for the suggestions. I will experiment over the weekend, and hopefully I can achieve an acceptable performance. 🙂
I'm going to try that slices thing now. Hopefully that's a quicker win.
c
match it to the number of slices in Redshift! (2 per node IIRC)
k
I don't think the performance cap was on the Redshift side. I think it's from the source, as their API is rate-limited as well. So maybe it is what it is. 😞 Well, at least I found a bunch of other issues, so I'll address those first and see. Thank you
c
yeah Redshift is fairly reasonable even with a single node and a couple GBs of data at a time, at least in terms of performance (cost… different question and answer entirely)
👍 1