# troubleshooting
l
Hi all, wondering if anyone has run into performance "problems" with loading CSV files. I'm using `tap-spreadsheets-anywhere` to load a CSV with about 5 million rows (only 2 columns of fixed-length strings). The file weighs 180MB, but it takes over an hour to load into `target-postgres`, which seems impossibly long. My previous code uses `pandas` and loads the file into a dataframe in maybe 1 minute or less, so I'm pondering whether the `csv.DictReader` in the tap is too slow, or if there's something else at play. Everything happens on a single machine with 32GB RAM, so it's not a network issue, nor some sort of memory constraint. I'll try to do some testing over the weekend, but if anyone has any tips, happy to take them 🙂
t
That seems crazy long. Have you tried running with the debug log output to see if that gives any hints as to what’s slowing it down?
l
the debug output isn't actually much better than the normal output, it just has 200k lines of data in between useful stuff 🙂
Looking at its progress, it might actually be the target that's slow; I'm seeing lines like this at the end of batches of 200k records:
target-postgres                 | INFO METRIC: {"type": "timer", "metric": "job_duration", "value": 72.2939760684967, "tags": {"job_type": "table", "path":...
t
This problem should be addressed by adding the `batch` or fast sync ability within Meltano https://gitlab.com/meltano/meltano/-/issues/2364
l
I've tried some quick postgres tuning based on https://pgtune.leopard.in.ua but it makes no difference. I'll try to dive into the target's source code to see at what stage the bottleneck is.
I'm not sure how to understand your point about the `batch` ability. At first glance it looks like it's made for other taps and targets; did I miss something?
hmm, reading a bit further, it sounds like you mean we could implement something like that to speed things up between the tap and `target-postgres`. Too many weekend projects 😞
t
Sorry - missed the ping on this. Fast sync is a concept that exists in pipelinewise that we want to bring to Meltano https://transferwise.github.io/pipelinewise/concept/fastsync.html
But it is only for a certain combination of taps/targets
l
got it. thanks for explaining! sounds like it would be a great option. Would that be implemented at the tap/target level, or does it require code in meltano's core as well?
t
It does require some code on Meltano’s side to run it based on some of the comments in https://gitlab.com/meltano/meltano/-/issues/2364
l
alright. I'll stop bugging you with dumb questions. I'll look into the target code over the weekend, hopefully I can figure out why it's so slow for me. I'll report back! thanks again for your help Taylor!
t
I promise you they’re not dumb! I’m still ramping and onboarding to everything too so it’s good for me to dive in 🙂
v
What's the slow part about the Singer spec? Is it the validation of each record against the JSON schema? Outputting each record doesn't seem like much overhead, so validation is the only thing that makes sense to me. I'm doing a fairly large test from Oracle to a CSV, and watching top while it runs, it's curious how much CPU is being used by the tap/target. The other piece I was looking at is https://meltano.com/docs/settings.html#elt-buffer-size , but it all depends on where the bottleneck is, which is admittedly hard to find. The other idea I have is to run your tap alone, without piping to your target, to see if the tap is your limiting factor (rough sketch below).
I'm running into some similar things right now, but today it's not crucial that things happen quickly. It will be soon though
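To time the tap on its own, something like this throwaway consumer could count Singer messages from stdin without loading them anywhere (a sketch only; the script name and tap invocation are assumptions):

```python
# count_messages.py - drain a Singer tap and report message counts and elapsed time.
# Hypothetical usage: tap-spreadsheets-anywhere --config config.json | python count_messages.py
import json
import sys
import time

start = time.perf_counter()
counts = {}
for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    msg = json.loads(line)                 # SCHEMA, RECORD, or STATE message
    msg_type = msg.get("type", "UNKNOWN")
    counts[msg_type] = counts.get(msg_type, 0) + 1

elapsed = time.perf_counter() - start
print(f"{sum(counts.values())} messages in {elapsed:.1f}s by type: {counts}", file=sys.stderr)
```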
t
I would guess it’s the validation on each record
l
I've been doing some profiling on the tap and target. On a truncated version of my CSV with only 50k rows (2 cols of strings), the timings were roughly:
• tap: 5 seconds
• target: 20 seconds
Focusing on the target, I used `lru_cache` on a few calls:
• calculating the field name, called about num_rows * (num_cols + 5 metadata cols) times; caching works well because there seems to be only a limited combination of args
• formatting timestamps (the same timestamp is formatted once per row; with caching it's more or less 1 run/batch)
This dropped exec time to about 5s for the target. Then I replaced a deepcopy (of the default row value) with pickling/unpickling, and that shaved off another 1.5-2 seconds. Most of the time left now is actually spent in a postgres `COPY` operation, which seems fair. With these changes, my initial file (5 million rows) now loads in about 8 min, vs 50+ before 🎉
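In rough terms, the caching and copy changes described above would look something like this (a hedged sketch with made-up function names, not the actual target-postgres code):

```python
import pickle
from functools import lru_cache

@lru_cache(maxsize=None)
def canonical_field_name(stream: str, column: str) -> str:
    # Called roughly num_rows * (num_cols + metadata cols) times, but the
    # argument space is tiny, so caching collapses millions of calls into a few.
    return f"{stream}__{column}".lower()

@lru_cache(maxsize=None)
def format_timestamp(ts: str) -> str:
    # The same batch timestamp is formatted for every row; cache it per batch.
    return ts.replace(" ", "T") + "Z"

# Replacing copy.deepcopy of a constant default row with a pickle round-trip:
DEFAULT_ROW = {"col_a": None, "col_b": None, "_sdc_batched_at": None}
_DEFAULT_ROW_PICKLE = pickle.dumps(DEFAULT_ROW)

def new_row() -> dict:
    # pickle.loads of a pre-serialized template is typically faster than
    # copy.deepcopy for small, plain dicts.
    return pickle.loads(_DEFAULT_ROW_PICKLE)
```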
t
That’s awesome!!
v
Wow! That's interesting that it wasn't the jsonschema validation; I was looking at the fastjsonschema library to see if that might fit well. Looks like my guess was wrong! Sweet
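For what it's worth, the validation path that was suspected would look roughly like the following; a sketch comparing a reusable `jsonschema` validator with one compiled by `fastjsonschema` (the schema is a toy example, not the real stream schema):

```python
import fastjsonschema
from jsonschema import Draft7Validator

SCHEMA = {
    "type": "object",
    "properties": {"col_a": {"type": "string"}, "col_b": {"type": "string"}},
}

validator = Draft7Validator(SCHEMA)             # built once, reused per record
fast_validate = fastjsonschema.compile(SCHEMA)  # compiles the schema to a plain function

for record in ({"col_a": "x", "col_b": "y"} for _ in range(50_000)):
    validator.validate(record)   # standard jsonschema path
    fast_validate(record)        # fastjsonschema path, usually noticeably faster
```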
l
I've opened a PR on `target-postgres` to suggest these changes: https://github.com/datamill-co/target-postgres/pull/204
On the tap side, I've opened another PR that speeds up the tap by about 20% https://github.com/ets/tap-spreadsheets-anywhere/pull/18 and I think it's possible to do a bit better, as explained in there.
t
Those look great - really nice work Laurent
l
Thanks! I think there's room to boost things a bit further, but it would require someone who actually understands the broader code base.