# troubleshooting
**j:**
Heya! I'm running a simple Meltano pipeline, `tap-oracle` (pipelinewise) → `target-snowflake` (pipelinewise), which is incrementally replicating a single table (~300M rows). The initial load is super slow, ~200k rows a minute, which would take a whole lot of time to complete. From my investigation, the slowness can be attributed to 100% CPU usage on the `target-snowflake` process, while `tap-oracle` is pretty chill (the EC2 this runs on has 2 CPUs). What is `target-snowflake` doing that makes it throttle so much from being starved of CPU?
```
    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
   2867 ubuntu    20   0 2443764   1.8g  52656 R 100.0  46.7  22:30.11 target-snowflak
   2855 ubuntu    20   0  283868 116956  16364 S   6.7   2.9   3:06.67 meltano
   2865 ubuntu    20   0   62424  42176  18564 R   6.3   1.1   3:41.81 tap-oracle
```
The massive CPU usage is visible only while data is coming in; when the Oracle query cursor is "starting up", there is no CPU usage at all on `target-snowflake`. Any ideas what I can do to speed this up? Do I need to throw bigger cores at the EC2 (as it seems single-thread bound)? 😞 Logs in thread 🧵
Log snippet showing the rather slow row "processing"?
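For context on where that CPU typically goes (a hedged sketch, not a measurement of this pipeline): a Singer target receives every row as a JSON line on stdin and decodes, flattens, and type-converts it in a single Python thread, so the target process is often the single-core bottleneck regardless of how fast the tap or the warehouse is. A minimal illustration of the unavoidable per-row decode cost, with a made-up row shape:

```python
import json
import time

# One synthetic Singer RECORD message; the row shape is illustrative only.
line = ('{"type": "RECORD", "stream": "t", '
        '"record": {"id": 1, "name": "x", "updated_at": "2023-01-01T00:00:00Z"}}')

n = 200_000  # roughly the rows/minute observed above

start = time.perf_counter()
for _ in range(n):
    msg = json.loads(line)   # one JSON decode per row, on one core
    record = msg["record"]   # real targets then flatten/validate/convert values
elapsed = time.perf_counter() - start
print(f"decoded {n:,} rows in {elapsed:.2f}s on one core")
```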
Switched to using parquet and the pipeline runs faster now, and `tap-oracle` is not so easy on the CPU anymore. This is not documented on the tap 🤦. Relevant PR is [AP-953] Add parquet support #149.
Will later test the meltanolabs one and hopefully will see some performance gains 🤞
**m:**
We are doing almost exactly the same thing as you, i.e. `tap-oracle` to `target-snowflake`, so it's interesting to see your results. We would also like to improve performance, but we've managed to get some pretty good results using the pipelinewise variant of `target-snowflake`: https://github.com/transferwise/pipelinewise-target-snowflake. I have a fork of it which adds some functionality and removes things like per-value timestamp adjustment, which you shouldn't need with a database source that provides timestamps in the same format each time: https://github.com/mjsqu/pipelinewise-target-snowflake
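To make the per-value timestamp point concrete, here's a hedged sketch of the kind of adjustment the fork strips out; the function name and logic are illustrative, not the actual pipelinewise code. The cost is one `dateutil` parse per date-time value per row, which adds up over ~300M rows:

```python
from dateutil import parser

def adjust_timestamps(record: dict, datetime_fields: set) -> dict:
    # Re-parse every date-time value on every row so malformed or
    # out-of-range values can be normalised. With a database source
    # that always emits timestamps in the same format, this is pure
    # per-row CPU overhead.
    for key in datetime_fields:
        value = record.get(key)
        if value is not None:
            record[key] = parser.parse(value).isoformat()
    return record

# Example: one row, one date-time column.
print(adjust_timestamps({"id": 1, "updated_at": "2023-01-01 00:00:00"},
                        {"updated_at"}))
```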
If you want to get down into the details of tap/target performance and are happy to make changes in the Python code, I recommend looking into `cProfile` to pick out any unnecessary or repeated function calls.
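A minimal sketch of that workflow: run the per-row handling under `cProfile` and dump the hottest functions by cumulative time. The record loop below is a hypothetical stand-in for the target's actual processing, just to show the profiling harness:

```python
import cProfile
import io
import json
import pstats

def handle_record(record: dict) -> None:
    # Hypothetical stand-in for the target's per-row work
    # (type coercion, flattening, timestamp handling, ...).
    json.dumps(record)

def run(lines) -> None:
    # Decode each Singer message and process RECORDs, as a target would.
    for line in lines:
        msg = json.loads(line)
        if msg.get("type") == "RECORD":
            handle_record(msg["record"])

# Synthetic input: 100k RECORD messages.
sample = ['{"type": "RECORD", "stream": "t", "record": {"id": %d}}' % i
          for i in range(100_000)]

profiler = cProfile.Profile()
profiler.enable()
run(sample)
profiler.disable()

# Print the ten most expensive calls by cumulative time.
out = io.StringIO()
pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats(10)
print(out.getvalue())
```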