# best-practices
m
Hi folks! I just tried Meltano and while the premise sounds great, I'm finding it to be rather slow? Specifically, I'm trying to move data from Stripe to Redshift, and even though it's just a few hundred thousand rows, the job took around 2-3 hours to complete. That's far too long. In comparison, we have an existing pipeline that uses scheduled queries in Stripe Sigma to generate the tables we want as CSVs; a webhook in AWS Lambda then downloads the CSVs into S3, from which they're loaded to Redshift. This is MUCH faster, but it requires a lot of legwork on my end (writing DML for new tables, etc.), and I'm wondering if there's something I can do to make Meltano faster. Any default settings I need to change?
a
Hi, @muhammad_jarir_kanji - I think the first thing to do is to see if you can identify whether it is the tap or the target that is the primary cause of the slowdown. We can help you through the details of the process, but the general approach is: 1. Execute just the tap with something like
meltano invoke tap-stripe > jobout.jsonl
a. Record timings. b. Debug any slowness. 2. Execute just the target with something like
cat jobout.jsonl | meltano invoke target-redshift
a. Record timings. b. Debug slowness. c. Repeat.
Since each connector is its own 'app', so to speak, generally the solution would be to figure out where the bottleneck is and then see if that part can be tuned.
cc @visch who has a lot of experience with this and might have other/better script examples.
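For reference, here's a rough sketch of how you could capture those timings from a script (it assumes the Meltano project, tap-stripe, and target-redshift are already installed and configured; it just wraps each step with a timer):
import subprocess
import time

def timed(label, cmd, stdin=None, stdout=None):
    # Run one step of the pipeline and report how long it took.
    start = time.perf_counter()
    subprocess.run(cmd, stdin=stdin, stdout=stdout, check=True)
    print(f"{label}: {time.perf_counter() - start:.1f}s")

# Step 1: run only the tap and capture its output to a file.
with open("jobout.jsonl", "w") as out:
    timed("tap-stripe extract", ["meltano", "invoke", "tap-stripe"], stdout=out)

# Step 2: replay the captured output through only the target.
with open("jobout.jsonl") as inp:
    timed("target-redshift load", ["meltano", "invoke", "target-redshift"], stdin=inp)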
i
I think it's just the Stripe API that's not great. The API doesn't allow you to order records by modified timestamp (e.g. invoices, charges), so the way it looks like it's been implemented is that there's an events stream that contains all events that happen to the account. The tap queries the events and then one-by-one queries each object that something has happened to. That's what makes it so slow, as far as I've understood it.
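To illustrate, the pattern is roughly like this (a simplified sketch of the events-then-fetch approach described above, not the tap's actual code; the event type and fields are just examples):
import stripe  # pip install stripe

stripe.api_key = "sk_test_..."  # placeholder key

# Listing events is a single paginated call...
events = stripe.Event.list(type="invoice.updated", limit=100)

# ...but each touched object is then re-fetched one at a time,
# which is where most of the time goes.
for event in events.auto_paging_iter():
    invoice = stripe.Invoice.retrieve(event.data.object.id)
    print(invoice.id, invoice.status)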
v
Everything AJ said here seems chef's kiss. If Sigma is better, maybe a tap is needed for Sigma, but I don't know Stripe's APIs at all, so no idea.
i
I have, and I've even asked Stripe to implement sorting by modified date on all objects, but they weren't interested. One idea for @muhammad_jarir_kanji could be to create a tap for the scheduled Sigma exports. That way, you could just schedule the Sigma exports, run Meltano daily to fetch all stored Sigma exports, and push them to a Redshift target. Meltano would take care of the tables.
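To sketch the idea (assuming the Sigma exports already land in S3 as CSVs, as in the original Lambda pipeline; the bucket, prefix, and stream name are made-up examples, and a real tap would also emit SCHEMA and STATE messages):
import csv
import io
import json
import boto3  # pip install boto3

BUCKET = "my-stripe-exports"   # hypothetical bucket
PREFIX = "sigma/invoices/"     # hypothetical prefix

s3 = boto3.client("s3")

# Emit one Singer RECORD message per CSV row; a target like
# target-redshift can then consume these and create/load the tables.
for obj in s3.list_objects_v2(Bucket=BUCKET, Prefix=PREFIX).get("Contents", []):
    body = s3.get_object(Bucket=BUCKET, Key=obj["Key"])["Body"].read().decode("utf-8")
    for row in csv.DictReader(io.StringIO(body)):
        print(json.dumps({"type": "RECORD", "stream": "invoices", "record": row}))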
a
Yeah, I don't know much about the Stripe API or Stripe Sigma, but just from the original post, I was about to recommend the same.
e
@aaronsteers we've determined a lot of the bottleneck happens with the Redshift loader. Are there any discussions about debugging it to speed it up?
a
The Redshift loader (and generally other 'big data' loaders as well) will be greatly affected by batch size and batch strategy. They are best suited to loading 10s or 100s of thousands of rows at a time, and smaller batches can cause a lot of slowdown. For instance, we saw a case where a target was flushing every 10 or 20 records, and sometimes batches of fewer than 10, due to a trigger based on STATE messages, which were coming very often. Definitely something that can be tuned, at least in theory. One way to mitigate this in the target is to provide a 'min batch size' optimization, where a flush of records to the target is held off until a minimum record count has been received. I am a bit rusty on my knowledge of that target, but we had similar conversations with @visch and another dev (I'm forgetting who) a while back.
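As a sketch of that 'min batch size' idea (not the actual target-redshift code; write_records_to_redshift is a hypothetical stand-in for the target's real flush/COPY logic):
import json
import sys

MIN_BATCH_SIZE = 50_000  # hold records until at least this many have accumulated
buffer = []

def write_records_to_redshift(records):
    # Hypothetical stand-in for the target's real flush/COPY logic.
    print(f"flushed {len(records)} records", file=sys.stderr)

for line in sys.stdin:
    message = json.loads(line)
    if message["type"] == "RECORD":
        buffer.append(message["record"])
    elif message["type"] == "STATE":
        # Instead of flushing on every STATE message, only flush once
        # the buffer has reached the minimum batch size.
        if len(buffer) >= MIN_BATCH_SIZE:
            write_records_to_redshift(buffer)
            buffer.clear()

# Flush whatever is left at the end of the stream.
if buffer:
    write_records_to_redshift(buffer)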
e
Perfect, thanks a lot AJ. Very helpful as always. We've been barking up the batch size tree (@connor_flynn can speak to this), but there's a lot of room for improvement.
c
Thanks for the insight, AJ! That is correct. We have tried batch sizes varying from 10 to 200k (the default), and our Recharge tap appears to load data for approximately seven hours until it triggers the target-redshift (datamill variant) write_batch() function, which produces the errors...
CRITICAL cursor already closed
psycopg2.OperationalError: SSL connection has been closed unexpectedly
psycopg2.InterfaceError: connection already