# plugins-general
c
hello, does anyone know how many threads will run by default in the `tap-postgres` extractor?
and is it possible to modify that through plugin config?
I ask because the loader `target-snowflake` has a config param for `parallelism`, but I can't find an equivalent in the extractor
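For reference, a minimal sketch of setting that loader knob through the Meltano CLI, assuming the Pipelinewise (transferwise) variant of target-snowflake mentioned above; the value 4 is just an example:

```sh
# Sketch: set the loader's `parallelism` setting via the Meltano CLI.
# Assumes the Pipelinewise variant of target-snowflake; pick a value
# that suits your warehouse.
meltano config target-snowflake set parallelism 4
```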
t
Assuming you're using the Pipelinewise variant... I don't think it supports parallelism.
c
I am, yes
so does it just run the streams in series?
for context, I'm trying to determine why we have a pipe that runs for much, much longer than I would expect, considering how much incremental data it's loading in each run
t
I believe it does. A quick skim of the code seems to confirm that. I haven't used tap-postgres but I believe it's pretty similar to tap-mysql (developed by the same people), which I do have plenty of experience with, so... I'm reasonably confident I'm correct there.
What replication method are you using? If it's primarily log_based I think it probably has to be single-threaded... I don't know that there's a way to parallelize that. That aside, do you know if the slowdown is on the tap or target side? i.e., if you run tap-postgres -> target-jsonl, how does that compare to tap-postgres -> target-snowflake?
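A rough way to run that comparison, sketched with the standard Meltano CLI (assumes target-jsonl is installed in the project; older Meltano versions use `meltano elt` instead of `meltano run`):

```sh
# Sketch: time the same extraction against two different loaders to see
# whether the tap or the Snowflake target dominates the runtime.
# Note: incremental state advances after the first run, so clear the
# state (or pass --full-refresh) before the second run to compare fairly.
time meltano run tap-postgres target-jsonl
time meltano run tap-postgres target-snowflake
```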
c
we're using incremental replication
I'm not sure of the best way to determine that, but if you look at the logs, the difference between the first and last messages emitted by the tap is ~20 minutes
the difference between the first and last messages from the target is ~11 minutes
I haven't actually tested this tap with any other targets
t
How many rows of data are produced in that 20 minutes? In my experience Python is not especially fast at serializing JSON, so if you're moving a lot of rows that could be the issue. I did some tuning on the MySQL tap and it was pretty slow at retrieving rows from the DB too, so I guess I wouldn't be shocked if the PG tap had the same issue.
If you run a query that pulls the same rows using psql and just dump it to a file, how long does it take?
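Two quick checks along those lines, as a sketch; the table name, replication-key column, and bookmark value are all hypothetical:

```sh
# Sketch: time pulling the same rows with psql alone, bypassing the tap.
# public.my_table, updated_at, and the bookmark timestamp are hypothetical.
time psql "$DATABASE_URL" -c \
  "\copy (SELECT * FROM public.my_table WHERE updated_at >= '2024-01-01') TO '/tmp/rows.csv' CSV"

# Sketch: rough json.dumps throughput on a million small records, to
# gauge how much of the runtime serialization alone could account for.
time python -c "import json; [json.dumps({'id': i, 'v': 'x' * 50}) for i in range(1_000_000)]"
```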
c
is there a better way to determine the total row count aside from just digging through the logs and adding the numbers together?
I'm still relatively new to meltano
t
When the tap starts processing does it write the replication-key value to the log? If so you can just query for rows with replication-key >= that value.
The MySQL tap actually writes the query used to retrieve rows to the log... I would expect the PG tap to do the same given their common heritage.
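As a sketch, with a hypothetical table, replication-key column, and bookmark value taken from the log:

```sh
# Sketch: count the rows an incremental run would pick up, using the
# replication-key value logged by the tap as the bookmark.
# public.my_table, updated_at, and the timestamp are hypothetical.
psql "$DATABASE_URL" -c \
  "SELECT count(*) FROM public.my_table WHERE updated_at >= '2024-01-01 00:00:00'"
```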
c
yeah it does... this might take a little doing, since this pipe is moving ~100 tables
t
Can you set up a separate pipeline that processes just one table for testing purposes? There are a couple of ways... you could create a new meltano environment with the select rules set to only grab the one table, then run tap-postgres -> target-jsonl just to pull the data for that one table out of the DB.
Though on the first run that would pull the whole table, I guess... 🤔
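A sketch of the simpler route, using select rules from the CLI rather than a full environment; the stream name is hypothetical, so list the real ones first:

```sh
# Sketch: restrict the tap to a single table for testing, then run it
# against target-jsonl. Check `meltano select tap-postgres --list --all`
# for the actual stream names; "public-my_table" is hypothetical.
meltano select tap-postgres "public-my_table" "*"
meltano run tap-postgres target-jsonl
```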
c
I do have the option to test this in a non-prod environment, I can try to do some tuning on one table
I've been thinking about this and chatting with some coworkers, and I think your point about Python serializing JSON is actually the problem here
I'm not sure I can get around that without implementing parallelization in the tap itself or just splitting the pipe up into chunks
t
I haven't had to do so myself, but separate pipelines are how I'd approach this. You get a little insulation that way too... if one table has an extra ton of changes one day it won't slow everything else down.
c
yeah that's also a great case for breaking up the pipeline
I think that's what I'm going to work on, I've been iterating on builds with one or two tables today and the runtime differences are basically negligible
thanks a lot for your answers and insight
t
Glad I could help 🙂