# troubleshooting
l
Hi all, wondering if anyone has run into performance "problems" with loading CSV files. I'm using `tap-spreadsheets-anywhere` to load a CSV with about 5 million rows (only 2 columns of fixed-length strings). The file weighs 180MB, but it takes over an hour to load into `target-postgres`, which seems impossibly long. My previous code uses `pandas` and loads the file into a dataframe in maybe 1 minute or less, so I'm pondering whether the `csv.DictReader` in the tap is too slow, or if there's something else at play. Everything happens on a single machine with 32GB RAM, so it's not a network issue, nor some sort of memory constraint. I'll try to do some testing over the weekend, but if anyone has any tips, happy to take them 🙂
t
That seems crazy long. Have you tried running with the debug log output to see if that gives any hints as to what’s slowing it down?
l
the debug output isn't actually much better than the normal output, it just has 200k lines of data in between useful stuff 🙂
Looking at its progress, it might actually be the target that's slow; I'm seeing lines like this at the end of batches of 200k records:
target-postgres                 | INFO METRIC: {"type": "timer", "metric": "job_duration", "value": 72.2939760684967, "tags": {"job_type": "table", "path":...
t
This problem should be addressed by adding the `batch` or fast sync ability within Meltano https://gitlab.com/meltano/meltano/-/issues/2364
l
I've tried some quick postgres tuning based on https://pgtune.leopard.in.ua but it makes no difference. I'll try to dive into the target's source code to see at what stage the bottleneck is.
I'm not sure how to understand your point about the `batch` ability. At first glance it looks like it's made for other taps and targets; did I miss something?
hmm, reading a bit further, it sounds like you mean we could implement something like that to speed things up between the tap and `target-postgres`. Too many weekend projects 😞
t
Sorry - missed the ping on this. Fast sync is a concept that exists in pipelinewise that we want to bring to Meltano https://transferwise.github.io/pipelinewise/concept/fastsync.html
But it is only for a certain combination of taps/targets
l
got it. thanks for explaining! sounds like it would be a great option. Would that be implemented at the tap/target level, or does it require code in meltano's core as well?
t
It does require some code on Meltano’s side to run it based on some of the comments in https://gitlab.com/meltano/meltano/-/issues/2364
l
alright. I'll stop bugging you with dumb questions. I'll look into the target code over the weekend, hopefully I can figure out why it's so slow for me. I'll report back! thanks again for your help Taylor!
t
I promise you they’re not dumb! I’m still ramping and onboarding to everything too so it’s good for me to dive in 🙂
v
What's the slow part about the Singer spec? Is it the validation of each record against the JSON schema? Outputting each record doesn't seem like much overhead, so validation is the only thing that makes sense to me. I'm doing a fairly large test from Oracle to a CSV, and watching top while it runs, it's curious how much CPU is being used by the tap/target. The other piece I was looking at is https://meltano.com/docs/settings.html#elt-buffer-size , but it all depends on where the bottleneck is, which is admittedly hard to find. The other idea I have is to run your tap alone, without piping to your target, to see if the tap is your limiting factor (rough sketch below).
I'm running into some similar things right now, but today it's not crucial that things happen quickly. It will be soon though
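To time the tap on its own, something like this throwaway consumer could count Singer messages from stdin without loading them anywhere (a sketch only; the script name and tap invocation are assumptions):

```python
# count_messages.py - drain a Singer tap and report message counts and elapsed time.
# Hypothetical usage: tap-spreadsheets-anywhere --config config.json | python count_messages.py
import json
import sys
import time

start = time.perf_counter()
counts = {}
for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    msg = json.loads(line)                 # SCHEMA, RECORD, or STATE message
    msg_type = msg.get("type", "UNKNOWN")
    counts[msg_type] = counts.get(msg_type, 0) + 1

elapsed = time.perf_counter() - start
print(f"{sum(counts.values())} messages in {elapsed:.1f}s by type: {counts}", file=sys.stderr)
```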
t
I would guess it’s the validation on each record
l
I've been doing some profiling on the tap and target. On a truncated version of my CSV with only 50k rows (2 cols of strings), the timings were roughly:
• tap: 5 seconds
• target: 20 seconds
Focusing on the target, I used `lru_cache` on a few calls:
• calculating the field name, called about num_rows * (num_cols + 5 metadata cols) times; caching works well because there seems to be only a limited combination of args
• formatting timestamps (the same timestamp is formatted once per row; with caching it's more or less 1 run/batch)
This dropped exec time to about 5s for the target. Then I replaced a deepcopy (of the default row value) with pickling/unpickling, and that shaved off another 1.5-2 seconds. Most of the time left now is actually spent in a postgres `COPY` operation, which seems fair. With these changes, my initial file (5 million rows) now loads in about 8 min, vs 50+ before 🎉
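In rough terms, the caching and copy changes described above would look something like this (a hedged sketch with made-up function names, not the actual target-postgres code):

```python
import pickle
from functools import lru_cache

@lru_cache(maxsize=None)
def canonical_field_name(stream: str, column: str) -> str:
    # Called roughly num_rows * (num_cols + metadata cols) times, but the
    # argument space is tiny, so caching collapses millions of calls into a few.
    return f"{stream}__{column}".lower()

@lru_cache(maxsize=None)
def format_timestamp(ts: str) -> str:
    # The same batch timestamp is formatted for every row; cache it per batch.
    return ts.replace(" ", "T") + "Z"

# Replacing copy.deepcopy of a constant default row with a pickle round-trip:
DEFAULT_ROW = {"col_a": None, "col_b": None, "_sdc_batched_at": None}
_DEFAULT_ROW_PICKLE = pickle.dumps(DEFAULT_ROW)

def new_row() -> dict:
    # pickle.loads of a pre-serialized template is typically faster than
    # copy.deepcopy for small, plain dicts.
    return pickle.loads(_DEFAULT_ROW_PICKLE)
```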
t
That’s awesome!!
v
Wow! That's interesting that it wasn't the jsonschema validation; I was looking at the fastjsonschema library to see if that might fit well. Looks like my guess was wrong! Sweet
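For what it's worth, the validation path that was suspected would look roughly like the following; a sketch comparing a reusable `jsonschema` validator with one compiled by `fastjsonschema` (the schema is a toy example, not the real stream schema):

```python
import fastjsonschema
from jsonschema import Draft7Validator

SCHEMA = {
    "type": "object",
    "properties": {"col_a": {"type": "string"}, "col_b": {"type": "string"}},
}

validator = Draft7Validator(SCHEMA)             # built once, reused per record
fast_validate = fastjsonschema.compile(SCHEMA)  # compiles the schema to a plain function

for record in ({"col_a": "x", "col_b": "y"} for _ in range(50_000)):
    validator.validate(record)   # standard jsonschema path
    fast_validate(record)        # fastjsonschema path, usually noticeably faster
```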
l
I've opened a PR on `target-postgres` to suggest these changes: https://github.com/datamill-co/target-postgres/pull/204
On the tap side, I've opened another PR that speeds up the tap by about 20% https://github.com/ets/tap-spreadsheets-anywhere/pull/18 and I think it's possible to do a bit better, as explained in there.
t
Those look great - really nice work Laurent
l
Thanks! I think there's room to boost things a bit further, but it would require someone who actually understands the broader code base.