# troubleshooting
**j:**
Heya! I'm running a simple Meltano pipeline, `tap-oracle` (pipelinewise) → `target-snowflake` (pipelinewise), which is incrementally replicating a single table (~300M rows). The initial load is super slow, ~200k rows a minute, which would take a whole lot of time to complete. From my investigation, the slowness can be attributed to 100% CPU usage on the `target-snowflake` process, while `tap-oracle` is pretty chill (the EC2 this runs on has 2 CPUs). What is `target-snowflake` doing that makes it throttle so much from being starved of CPU?
```
    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
   2867 ubuntu    20   0 2443764   1.8g  52656 R 100.0  46.7  22:30.11 target-snowflak
   2855 ubuntu    20   0  283868 116956  16364 S   6.7   2.9   3:06.67 meltano
   2865 ubuntu    20   0   62424  42176  18564 R   6.3   1.1   3:41.81 tap-oracle
```
The massive CPU usage is visible only while data is coming in; when the Oracle query cursor is "starting up", there is no CPU usage at all on `target-snowflake`. Any ideas what I can do to speed this up? Do I need to throw bigger cores at the EC2 (as it seems single-thread bound)? 😞 Logs in thread 🧵
Log snippet showing the rather slow row "processing"?
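For context on where that CPU typically goes (a hedged sketch, not a measurement of this pipeline): a Singer target receives every row as a JSON line on stdin and decodes, flattens, and type-converts it in a single Python thread, so the target process is often the single-core bottleneck regardless of how fast the tap or the warehouse is. A minimal illustration of the unavoidable per-row decode cost, with a made-up row shape:

```python
import json
import time

# One synthetic Singer RECORD message; the row shape is illustrative only.
line = ('{"type": "RECORD", "stream": "t", '
        '"record": {"id": 1, "name": "x", "updated_at": "2023-01-01T00:00:00Z"}}')

n = 200_000  # roughly the rows/minute observed above

start = time.perf_counter()
for _ in range(n):
    msg = json.loads(line)   # one JSON decode per row, on one core
    record = msg["record"]   # real targets then flatten/validate/convert values
elapsed = time.perf_counter() - start
print(f"decoded {n:,} rows in {elapsed:.2f}s on one core")
```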
Switched to using parquet and the pipeline runs faster now, and `tap-oracle` is not so easy on the CPU anymore. This is not documented on the tap 🤦. Relevant PR is [AP-953] Add parquet support #149.
Will later test the meltanolabs one and hopefully will see some performance gains 🤞
**m:**
We are doing almost exactly the same thing as you, i.e. `tap-oracle` to `target-snowflake`, so it's interesting to see your results. We would also like to improve performance, but we've managed to get some pretty good results using the pipelinewise variant of `target-snowflake`: https://github.com/transferwise/pipelinewise-target-snowflake. I have a fork of it which adds some functionality and removes things like per-value timestamp adjustment, which you shouldn't need with a database source that provides timestamps in the same format each time: https://github.com/mjsqu/pipelinewise-target-snowflake
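To make the per-value timestamp point concrete, here's a hedged sketch of the kind of adjustment the fork strips out; the function name and logic are illustrative, not the actual pipelinewise code. The cost is one `dateutil` parse per date-time value per row, which adds up over ~300M rows:

```python
from dateutil import parser

def adjust_timestamps(record: dict, datetime_fields: set) -> dict:
    # Re-parse every date-time value on every row so malformed or
    # out-of-range values can be normalised. With a database source
    # that always emits timestamps in the same format, this is pure
    # per-row CPU overhead.
    for key in datetime_fields:
        value = record.get(key)
        if value is not None:
            record[key] = parser.parse(value).isoformat()
    return record

# Example: one row, one date-time column.
print(adjust_timestamps({"id": 1, "updated_at": "2023-01-01 00:00:00"},
                        {"updated_at"}))
```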
If you want to get down into the details of tap/target performance and are happy to make changes in the Python code, I recommend looking into `cProfile` to pick out any unnecessary or repeated function calls.
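A minimal sketch of that workflow: run the per-row handling under `cProfile` and dump the hottest functions by cumulative time. The record loop below is a hypothetical stand-in for the target's actual processing, just to show the profiling harness:

```python
import cProfile
import io
import json
import pstats

def handle_record(record: dict) -> None:
    # Hypothetical stand-in for the target's per-row work
    # (type coercion, flattening, timestamp handling, ...).
    json.dumps(record)

def run(lines) -> None:
    # Decode each Singer message and process RECORDs, as a target would.
    for line in lines:
        msg = json.loads(line)
        if msg.get("type") == "RECORD":
            handle_record(msg["record"])

# Synthetic input: 100k RECORD messages.
sample = ['{"type": "RECORD", "stream": "t", "record": {"id": %d}}' % i
          for i in range(100_000)]

profiler = cProfile.Profile()
profiler.enable()
run(sample)
profiler.disable()

# Print the ten most expensive calls by cumulative time.
out = io.StringIO()
pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats(10)
print(out.getvalue())
```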