# troubleshooting
a
Not sure if this is the correct channel, but I'll try. I need to sync a really large dataset (100 GB+) from some tap to, say, BigQuery. I also like the idea of the file-based state store, so I'm using it. Getting data from the tap is no concern; it streams as fast as I need it to. Obviously I cannot fit the whole dataset in memory, so I set
`batch_size_rows: 500000`
Empirically, it takes about 4 GB of memory in target-bigquery before a flush. The issue is in the combination of target-bigquery (I use the pipelinewise variant) and the state store: data flows fast, I get 500K rows in ~10 seconds, another ~10 seconds are spent writing the batch to BigQuery, and then everything halts on writing state to AWS S3 for almost 2 minutes before the state is finally written and the process continues.
I did a little digging and it turns out that `BaseFilesystemStateStoreManager` does not distinguish between its own lock and someone else's lock. So it waits for `lock_timeout_seconds` even if it was the same process that updated the lock 10 seconds ago. It seems inefficient to me.
I would propose writing some unique identifier (UUID/pid/random number) into the lock so the state store manager can recognize that the lock was made by this very process and (1) avoid waiting, and (2) make the lock behavior more robust (ideally the process that holds the lock should refresh it before the timeout to avoid race conditions with other processes).
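A minimal sketch of that owner-aware lock idea, assuming a JSON lock file; the function name and file layout here are hypothetical, not Meltano's actual `BaseFilesystemStateStoreManager` code, and a real version would also need atomic creation and periodic refresh:

```python
import json
import time
import uuid
from pathlib import Path

# Hypothetical owner-aware lock file; not Meltano's actual implementation.
OWNER_ID = uuid.uuid4().hex  # generated once per process


def acquire_lock(lock_path: Path, lock_timeout_seconds: float) -> None:
    """Take the lock, skipping the wait if this process already owns it."""
    while lock_path.exists():
        try:
            lock = json.loads(lock_path.read_text())
        except (OSError, json.JSONDecodeError):
            break  # lock vanished or is corrupt; try to (re)claim it
        if lock.get("owner") == OWNER_ID:
            break  # our own lock from a previous flush: no need to wait
        if time.time() - lock.get("acquired_at", 0) > lock_timeout_seconds:
            break  # someone else's lock has expired
        time.sleep(1)
    # Note: not atomic/race-free; a real implementation would create the
    # file atomically and refresh "acquired_at" while it holds the lock.
    lock_path.write_text(json.dumps({"owner": OWNER_ID, "acquired_at": time.time()}))
```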
a
I am curious. You are getting 400 MB/s-ish throughput based on the above? Seems high, especially for upload/ingest on a single node?
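For context, the ~400 MB/s figure seems to follow from the numbers in the thread, assuming the 4 GB in-memory footprint roughly reflects record size (which it may not):

```python
# Back-of-the-envelope math behind the ~400 MB/s estimate, using the
# numbers from the thread; the 4 GB figure is in-memory footprint, so
# the real wire throughput is likely lower.
rows_per_batch = 500_000
seconds_to_receive = 10
batch_bytes_in_memory = 4 * 1024**3                      # ~4 GiB before flush
bytes_per_row = batch_bytes_in_memory / rows_per_batch   # ~8.6 KB per row
throughput_mib_s = rows_per_batch * bytes_per_row / seconds_to_receive / 1024**2
print(f"{throughput_mib_s:.0f} MiB/s")                   # ~410 MiB/s
```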
e
Thanks for sharing @andrey_tatarinov!
> I did a little digging and it turns out that `BaseFilesystemStateStoreManager` does not distinguish between its own lock and someone else's lock.
Yeah, that sounds like a 🐛 Would you mind logging an issue with the problem and your proposed solution? I can do that later too.
a
I think the actual throughput is less than that; Python dicts are a very inefficient way to store data
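A quick illustration of that per-record overhead (exact numbers vary by Python version, and `sys.getsizeof` only measures the container, not the keys and values it references):

```python
import sys

# One record held as a dict vs. a tuple of the same values.
row_dict = {"id": 1, "name": "example", "amount": 3.14, "created_at": "2023-01-01"}
row_tuple = (1, "example", 3.14, "2023-01-01")

print(sys.getsizeof(row_dict))   # dict container alone: roughly a couple hundred bytes
print(sys.getsizeof(row_tuple))  # tuple container: well under 100 bytes
```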
a
Agreed with @edgar_ramirez_mondragon above that the locking on file writes to the state store definitely sounds like a bug. If you have time to log this, it would be much appreciated.
Regarding holding records in memory, this is likely due to the dedupe algorithm that this specific version of target-bigquery chose to implement. With many targets, each batch's keys are deduped against the other keys in that batch prior to loading to the target system, and this deduping is often performed via Python dictionaries. It could be written another way: either let the user opt out of batch-level deduping, or run the dedupe operation in the target itself using a SQL-based operation (rough sketch below). Either of those optimization strategies could eliminate the need to hold all of the batch's records in memory together at all.
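A rough sketch of the SQL-side dedupe idea: load the batch into a staging table, then keep one row per key inside BigQuery. Table names, the `id` key column, and the `_sdc_extracted_at` ordering column are assumptions, not this target's actual schema:

```python
from google.cloud import bigquery

# Hypothetical sketch: dedupe inside BigQuery instead of in Python memory.
client = bigquery.Client()

dedupe_sql = """
CREATE OR REPLACE TABLE `my_project.my_dataset.my_table` AS
SELECT * EXCEPT (row_num)
FROM (
  SELECT
    *,
    ROW_NUMBER() OVER (
      PARTITION BY id                    -- primary key of the stream (assumed)
      ORDER BY _sdc_extracted_at DESC    -- keep the newest copy of each key (assumed column)
    ) AS row_num
  FROM `my_project.my_dataset.my_table_staging`
)
WHERE row_num = 1
"""

client.query(dedupe_sql).result()  # runs server-side; nothing is held in Python
```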
If you are sure your source doesn't send dupes, the simplest option might be to create a fork of the target with an option to opt out of deduping.
a
It's just an implementation detail of ppw target-bigquery: it does not flush received data to a file before writing to BQ (and BQ really loves large batches for ingest). A flush-to-file approach is sketched below.
I dream that some day I will have time to migrate to a meltano-sdk-based target-bigquery and utilize BATCH messages 🙂
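For illustration, one way a target could avoid holding a whole batch in memory is to spool incoming records to a newline-delimited JSON temp file and hand that file to a BigQuery load job. This is a sketch of the general idea, not how ppw target-bigquery actually works; the table ID is a placeholder:

```python
import json
import tempfile

from google.cloud import bigquery

# Sketch: spool records to disk as they arrive, then load the file into BQ.
client = bigquery.Client()
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)


def flush_batch(records, table_id="my_project.my_dataset.my_table"):
    """Write one batch to a temp file and load it, keeping memory flat."""
    with tempfile.TemporaryFile(mode="w+b") as spool:
        for record in records:  # records can be a generator; never all in RAM
            spool.write(json.dumps(record).encode("utf-8") + b"\n")
        spool.seek(0)
        load_job = client.load_table_from_file(spool, table_id, job_config=job_config)
        load_job.result()  # wait for the load job to finish
```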
a
You should try mine. I am super curious how the throughput will compare in this specific situation. It is SDK-based, supports BATCH, and has lots of knobs in general.
I understand time is a constraint, but you could write to a different dataset just for testing. Could be cool to see.
a
@alexander_butler which variant is yours? https://hub.meltano.com/loaders/target-bigquery