# troubleshooting
a
Not sure if this is the correct channel, but I'll try. I need to sync a really large dataset (100 GB+) from some tap to, say, BigQuery. I also like the idea of the file-based state store, so I'm using it. Getting data from the tap is no concern; it streams as fast as I need it to. Obviously I cannot fit the whole dataset in memory, so I set
`batch_size_rows: 500000`
Empirically, it takes about 4 GB of memory in target-bigquery before a flush. The issue is in the combination of target-bigquery (I use the pipelinewise variant) and the state store: data flows fast, I get 500K rows in ~10 seconds, another ~10 seconds are spent writing the batch to BigQuery, and then everything halts on writing state to AWS S3 for almost 2 minutes before the state is finally written and the process continues.
I did a little digging and it turns out that `BaseFilesystemStateStoreManager` does not distinguish between its own lock and someone else's lock. So it waits for `lock_timeout_seconds` even if it was the same process that updated the lock 10 seconds ago. It seems inefficient to me.
I would propose writing some unique identifier (UUID/pid/random number) into the lock so the state store manager can recognize that the lock was made by this very process and (1) avoid waiting, and (2) make the lock behavior more robust (ideally the process that holds the lock should refresh it before the timeout to avoid race conditions with other processes).
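A minimal sketch of that owner-aware lock idea, assuming a JSON lock file; the function name and file layout here are hypothetical, not Meltano's actual `BaseFilesystemStateStoreManager` code, and a real version would also need atomic creation and periodic refresh:

```python
import json
import time
import uuid
from pathlib import Path

# Hypothetical owner-aware lock file; not Meltano's actual implementation.
OWNER_ID = uuid.uuid4().hex  # generated once per process


def acquire_lock(lock_path: Path, lock_timeout_seconds: float) -> None:
    """Take the lock, skipping the wait if this process already owns it."""
    while lock_path.exists():
        try:
            lock = json.loads(lock_path.read_text())
        except (OSError, json.JSONDecodeError):
            break  # lock vanished or is corrupt; try to (re)claim it
        if lock.get("owner") == OWNER_ID:
            break  # our own lock from a previous flush: no need to wait
        if time.time() - lock.get("acquired_at", 0) > lock_timeout_seconds:
            break  # someone else's lock has expired
        time.sleep(1)
    # Note: not atomic/race-free; a real implementation would create the
    # file atomically and refresh "acquired_at" while it holds the lock.
    lock_path.write_text(json.dumps({"owner": OWNER_ID, "acquired_at": time.time()}))
```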
a
I am curious. You are getting 400 MB/s-ish throughput based on the above? Seems high, especially for upload/ingest on a single node?
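For context, the ~400 MB/s figure seems to follow from the numbers in the thread, assuming the 4 GB in-memory footprint roughly reflects record size (which it may not):

```python
# Back-of-the-envelope math behind the ~400 MB/s estimate, using the
# numbers from the thread; the 4 GB figure is in-memory footprint, so
# the real wire throughput is likely lower.
rows_per_batch = 500_000
seconds_to_receive = 10
batch_bytes_in_memory = 4 * 1024**3                      # ~4 GiB before flush
bytes_per_row = batch_bytes_in_memory / rows_per_batch   # ~8.6 KB per row
throughput_mib_s = rows_per_batch * bytes_per_row / seconds_to_receive / 1024**2
print(f"{throughput_mib_s:.0f} MiB/s")                   # ~410 MiB/s
```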
e
Thanks for sharing @andrey_tatarinov!
> I did a little digging and it turns out that `BaseFilesystemStateStoreManager` does not distinguish between its own lock and someone else's lock.
Yeah, that sounds like a 🐛 Would you mind logging an issue with the problem and your proposed solution? I can do that later too.
a
I think the actual throughput is less than that; Python dicts are a very inefficient way to store data
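A quick illustration of that per-record overhead (exact numbers vary by Python version, and `sys.getsizeof` only measures the container, not the keys and values it references):

```python
import sys

# One record held as a dict vs. a tuple of the same values.
row_dict = {"id": 1, "name": "example", "amount": 3.14, "created_at": "2023-01-01"}
row_tuple = (1, "example", 3.14, "2023-01-01")

print(sys.getsizeof(row_dict))   # dict container alone: roughly a couple hundred bytes
print(sys.getsizeof(row_tuple))  # tuple container: well under 100 bytes
```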
a
Agreed with @edgar_ramirez_mondragon above that the locking on file writes to the state store definitely sounds like a bug. If you have time to log this, it would be much appreciated.
Regarding holding records in memory, this is likely due to the dedupe algorithm that this specific version of target-bigquery chose to implement. With many targets, each batch's keys are deduped against the other keys in that batch prior to loading to the target system, and this deduping is often performed via Python dictionaries. It could be written another way: either let the user opt out of batch-level deduping, or run the dedupe operation in the target itself using a SQL-based operation (rough sketch below). Either of those optimization strategies could eliminate the need to hold all of the batch's records in memory together at all.
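A rough sketch of the SQL-side dedupe idea: load the batch into a staging table, then keep one row per key inside BigQuery. Table names, the `id` key column, and the `_sdc_extracted_at` ordering column are assumptions, not this target's actual schema:

```python
from google.cloud import bigquery

# Hypothetical sketch: dedupe inside BigQuery instead of in Python memory.
client = bigquery.Client()

dedupe_sql = """
CREATE OR REPLACE TABLE `my_project.my_dataset.my_table` AS
SELECT * EXCEPT (row_num)
FROM (
  SELECT
    *,
    ROW_NUMBER() OVER (
      PARTITION BY id                    -- primary key of the stream (assumed)
      ORDER BY _sdc_extracted_at DESC    -- keep the newest copy of each key (assumed column)
    ) AS row_num
  FROM `my_project.my_dataset.my_table_staging`
)
WHERE row_num = 1
"""

client.query(dedupe_sql).result()  # runs server-side; nothing is held in Python
```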
If you are sure your source doesn't send dupes, the simplest option might be to create a fork of the target with an option to opt out of deduping.
a
It's just an implementation detail of ppw target-bigquery: it does not flush received data to a file before writing to BQ (and BQ really loves large batches for ingest). A flush-to-file approach is sketched below.
I dream that some day I will have time to migrate to a meltano-sdk-based target-bigquery and utilize BATCH messages 🙂
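For illustration, one way a target could avoid holding a whole batch in memory is to spool incoming records to a newline-delimited JSON temp file and hand that file to a BigQuery load job. This is a sketch of the general idea, not how ppw target-bigquery actually works; the table ID is a placeholder:

```python
import json
import tempfile

from google.cloud import bigquery

# Sketch: spool records to disk as they arrive, then load the file into BQ.
client = bigquery.Client()
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)


def flush_batch(records, table_id="my_project.my_dataset.my_table"):
    """Write one batch to a temp file and load it, keeping memory flat."""
    with tempfile.TemporaryFile(mode="w+b") as spool:
        for record in records:  # records can be a generator; never all in RAM
            spool.write(json.dumps(record).encode("utf-8") + b"\n")
        spool.seek(0)
        load_job = client.load_table_from_file(spool, table_id, job_config=job_config)
        load_job.result()  # wait for the load job to finish
```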
a
You should try mine. I am super curious how the throughput will compare in this specific situation. It is SDK-based, supports BATCH, and has lots of knobs in general.
I understand time is a constraint, but you could write to a different dataset just for testing. Could be cool to see.
a
@alexander_butler which variant is yours? https://hub.meltano.com/loaders/target-bigquery