# plugins-general
c
hello, does anyone know how many threads will run by default in the `tap-postgres` extractor?
and is it possible to modify that through plugin config?
I ask because the loader `target-snowflake` has a config param for `parallelism`, but I can't find an equivalent in the extractor
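For reference, a minimal sketch of setting that loader knob through the Meltano CLI, assuming the Pipelinewise (transferwise) variant of target-snowflake mentioned above; the value 4 is just an example:

```sh
# Sketch: set the loader's `parallelism` setting via the Meltano CLI.
# Assumes the Pipelinewise variant of target-snowflake; pick a value
# that suits your warehouse.
meltano config target-snowflake set parallelism 4
```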
t
Assuming you're using the Pipelinewise variant... I don't think it supports parallelism.
c
I am, yes
so does it just run the streams in series?
for context, I'm trying to determine why we have a pipe that runs for much, much longer than I would expect, considering how much incremental data it's loading in each run
t
I believe it does. A quick skim of the code seems to confirm that. I haven't used tap-postgres but I believe it's pretty similar to tap-mysql (developed by the same people), which I do have plenty of experience with, so... I'm reasonably confident I'm correct there.
What replication method are you using? If it's primarily log_based I think it probably has to be single-threaded... I don't know that there's a way to parallelize that. That aside, do you know if the slowdown is on the tap or target side? i.e., if you run tap-postgres -> target-jsonl, how does that compare to tap-postgres -> target-snowflake?
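A rough way to run that comparison, sketched with the standard Meltano CLI (assumes target-jsonl is installed in the project; older Meltano versions use `meltano elt` instead of `meltano run`):

```sh
# Sketch: time the same extraction against two different loaders to see
# whether the tap or the Snowflake target dominates the runtime.
# Note: incremental state advances after the first run, so clear the
# state (or pass --full-refresh) before the second run to compare fairly.
time meltano run tap-postgres target-jsonl
time meltano run tap-postgres target-snowflake
```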
c
we're using incremental replication
I'm not sure of the best way to determine that, but if you look at the logs, the difference between the first and last messages emitted by the tap is ~20 minutes
the difference between the first and last messages from the target is ~11 minutes
I haven't actually tested this tap with any other targets
t
How many rows of data are produced in that 20 minutes? In my experience Python is not especially fast at serializing JSON, so if you're moving a lot of rows that could be the issue. I did some tuning on the MySQL tap and it was pretty slow at retrieving rows from the DB too, so I guess I wouldn't be shocked if the PG tap had the same issue.
If you run a query that pulls the same rows using psql and just dump it to a file, how long does it take?
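Two quick checks along those lines, as a sketch; the table name, replication-key column, and bookmark value are all hypothetical:

```sh
# Sketch: time pulling the same rows with psql alone, bypassing the tap.
# public.my_table, updated_at, and the bookmark timestamp are hypothetical.
time psql "$DATABASE_URL" -c \
  "\copy (SELECT * FROM public.my_table WHERE updated_at >= '2024-01-01') TO '/tmp/rows.csv' CSV"

# Sketch: rough json.dumps throughput on a million small records, to
# gauge how much of the runtime serialization alone could account for.
time python -c "import json; [json.dumps({'id': i, 'v': 'x' * 50}) for i in range(1_000_000)]"
```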
c
is there a better way to determine the total row count aside from just digging through the logs and adding the numbers together?
I'm still relatively new to meltano
t
When the tap starts processing does it write the replication-key value to the log? If so you can just query for rows with replication-key >= that value.
The MySQL tap actually writes the query used to retrieve rows to the log... I would expect the PG tap to do the same given their common heritage.
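As a sketch, with a hypothetical table, replication-key column, and bookmark value taken from the log:

```sh
# Sketch: count the rows an incremental run would pick up, using the
# replication-key value logged by the tap as the bookmark.
# public.my_table, updated_at, and the timestamp are hypothetical.
psql "$DATABASE_URL" -c \
  "SELECT count(*) FROM public.my_table WHERE updated_at >= '2024-01-01 00:00:00'"
```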
c
yeah it does... this might take a little doing, since this pipe is moving ~100 tables
t
Can you set up a separate pipeline that processes just one table for testing purposes? There are a couple of ways... you could create a new meltano environment with the select rules set to only grab the one table, then run tap-postgres -> target-jsonl just to pull the data for that one table out of the DB.
Though on the first run that would pull the whole table, I guess... 🤔
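A sketch of the simpler route, using select rules from the CLI rather than a full environment; the stream name is hypothetical, so list the real ones first:

```sh
# Sketch: restrict the tap to a single table for testing, then run it
# against target-jsonl. Check `meltano select tap-postgres --list --all`
# for the actual stream names; "public-my_table" is hypothetical.
meltano select tap-postgres "public-my_table" "*"
meltano run tap-postgres target-jsonl
```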
c
I do have the option to test this in a non-prod environment, I can try to do some tuning on one table
I've been thinking about this and chatting with some coworkers, and I think your point about Python serializing JSON is actually the problem here
I'm not sure I can get around that without implementing parallelization in the tap itself or just splitting the pipe up into chunks
t
I haven't had to do so myself, but separate pipelines are how I'd approach this. You get a little insulation that way too... if one table has an extra ton of changes one day it won't slow everything else down.
c
yeah that's also a great case for breaking up the pipeline
I think that's what I'm going to work on, I've been iterating on builds with one or two tables today and the runtime differences are basically negligible
thanks a lot for your answers and insight
t
Glad I could help 🙂