# getting-started
j
Heya! How do you normally handle massive table initial loads? My use case is pretty much a `tap-oracle target-snowflake` pipeline.

I was thinking of modifying the `tap-oracle` to basically run up to N records and call it done, then have the job run in a loop until it no longer hits the set "max/limit" (rough sketch at the end of this post).

Currently, to my understanding, the `tap-oracle` will do its best to download all 300M rows at once, then serialise them into JSONL and pass them to the `target-snowflake`, which supports batching, e.g. 100k rows at a time (S3 -> Stage -> MERGE INTO), until done. The problems with this are:
1. It takes a long time to download the data from the tap.
   a. If this fails, one needs to do it all again.
2. Compute/time is needed to serialise the 300M record messages into JSONL.
3. It would take a long time to then load the data into the target.
   a. If any batch fails for whatever reason, we again need to do everything from the beginning, because state is recorded only after a successful job (all or nothing).

Questions:
1. What do you think of the proposed approach?
2. Is this a tried-and-failed method?
3. There has been some talk since late last year about the SDK supporting some sort of "batch" mode. Is this something similar to what Stitch Data has been doing for a long time, or something different? Is there any info on how this may look for the taps/targets created with the SDK?

P.S. I can always transfer the initial "load" manually and then hack the Meltano state, but I would very much prefer not to have to figure out how to do that with Oracle, and not to batch it in a way that makes Snowflake spend massive amounts of compute time to digest it.
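Roughly, the loop I have in mind would look something like this (just an illustrative sketch, not tap-oracle code; the table, columns, state file, and `emit_record()` are made up):

```python
# Sketch of "run up to N rows, save a bookmark, repeat until the limit is no
# longer hit". Table/column names and the state handling are invented.
import json
import pathlib

import oracledb  # python-oracledb

STATE_FILE = pathlib.Path("state.json")
CHUNK_SIZE = 100_000

def load_bookmark() -> int:
    if STATE_FILE.exists():
        return json.loads(STATE_FILE.read_text())["last_id"]
    return 0

def save_bookmark(last_id: int) -> None:
    STATE_FILE.write_text(json.dumps({"last_id": last_id}))

def emit_record(row: tuple) -> None:
    # stand-in for serialising the row and handing it to the target
    print(json.dumps({"id": row[0], "payload": str(row[1])}))

with oracledb.connect(user="user", password="secret", dsn="db_host/service") as conn:
    while True:
        bookmark = load_bookmark()
        with conn.cursor() as cur:
            cur.execute(
                f"""
                SELECT id, payload FROM big_table
                WHERE id > :bookmark
                ORDER BY id
                FETCH FIRST {CHUNK_SIZE} ROWS ONLY
                """,
                bookmark=bookmark,
            )
            rows = cur.fetchall()
        for row in rows:
            emit_record(row)
        if rows:
            save_bookmark(rows[-1][0])
        if len(rows) < CHUNK_SIZE:  # no longer hitting the "max/limit" -> done
            break
```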
l
i haven’t used the oracle tap or the snowflake target, but
• it looks like it will use a cursor https://github.com/s7clarke10/pipelinewise-tap-oracle/blob/master/tap_oracle/sync_strategies/incremental.py#L27 so it won’t actually try to download all 300 million rows in one go
• as it makes progress, the data will be piped to the snowflake loader, which should then batch things up as you describe
• assuming you are doing incremental sync, meltano will periodically record the state of the last replication key, so that if the process crashes, it can pick up from where it left off. note that the import is (probably) not making use of atomic transactions, so an interrupt/crash can result in duplicate data in your destination/target. this can be mitigated either by using a merge strategy in the loader (assuming snowflake supports it), or by doing downstream deduplication (which is what i do, with dbt)
• i used meltano to extract a ~260 million row table from postgres -> bigquery using this approach, it took 4-5 days
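for reference, that state mechanism is just the tap interleaving singer STATE messages with its RECORD messages on stdout — roughly like this hand-written illustration (not actual tap-oracle output; the stream and key names are invented):

```python
# Hand-written illustration of RECORD messages followed by a STATE bookmark.
# Meltano persists the STATE value, so a crashed run can resume from it.
import json
import sys

def emit(message: dict) -> None:
    sys.stdout.write(json.dumps(message) + "\n")

rows = [{"id": i, "updated_at": f"2023-01-0{i}T00:00:00Z"} for i in range(1, 4)]

for row in rows:
    emit({"type": "RECORD", "stream": "big_table", "record": row})

# after a batch of records, the tap bookmarks the last replication key value
emit({
    "type": "STATE",
    "value": {
        "bookmarks": {
            "big_table": {
                "replication_key": "updated_at",
                "replication_key_value": rows[-1]["updated_at"],
            }
        }
    },
})
```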
k
Picking up on this approach, I have a doubt about automating the ELT once new data is added to the defined Postgres schema: can it automatically pick that up and run it through the ELT, or can Postgres support batch loading at the smallest possible time interval, just like scheduling with MongoDB?
j
Thank you, @luke_rodgers! I do have to admit that I wrote the post with a plethora of assumptions based on running a Meltano job on smaller tables. 😅 I will test it with one of the big ones and see how it behaves. If it behaves anything like you describe 🤞, then it will be glorious and will solve my described problem.
You are correct, Luke. From my experiments with the pipeline, it seems `tap-oracle` will only produce as many rows as the `target-snowflake` is able to take in. As soon as the `target` reaches the set limits, it will proceed to:
• write a "csv.gz" file to /tmp
• upload said file to S3 and remove it from the local fs
• issue a `MERGE INTO ... FROM @STAGE/...` command, which then loads the file
• remove the files from S3
It works just the way I want it!
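For anyone curious, that stage-and-merge flow maps to roughly the following (a simplified sketch, not the target's actual code; the bucket, stage, table, and credentials are all made up):

```python
# Simplified sketch of "write gzipped CSV, upload to S3, MERGE INTO from a
# stage, then clean up". Not target-snowflake's implementation.
import csv
import gzip

import boto3
import snowflake.connector

BATCH_FILE = "/tmp/big_table_batch.csv.gz"
BUCKET = "my-staging-bucket"
KEY = "meltano/big_table_batch.csv.gz"

def write_batch(rows: list[dict]) -> None:
    # buffer the batch to a gzipped CSV on the local filesystem
    with gzip.open(BATCH_FILE, "wt", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=["id", "payload"])
        writer.writeheader()
        writer.writerows(rows)

def stage_and_merge(rows: list[dict]) -> None:
    write_batch(rows)

    # upload to the bucket that the external stage points at
    boto3.client("s3").upload_file(BATCH_FILE, BUCKET, KEY)

    conn = snowflake.connector.connect(
        account="my_account", user="loader", password="secret",
        warehouse="LOAD_WH", database="RAW", schema="PUBLIC",
    )
    try:
        # in practice a FILE_FORMAT for the gzipped CSV would also be specified
        conn.cursor().execute(
            """
            MERGE INTO big_table AS t
            USING (SELECT $1 AS id, $2 AS payload
                   FROM @my_s3_stage/meltano/big_table_batch.csv.gz) AS s
            ON t.id = s.id
            WHEN MATCHED THEN UPDATE SET t.payload = s.payload
            WHEN NOT MATCHED THEN INSERT (id, payload) VALUES (s.id, s.payload)
            """
        )
    finally:
        conn.close()

    # remove the staged file once the merge has completed
    boto3.client("s3").delete_object(Bucket=BUCKET, Key=KEY)
```

Because it's a MERGE rather than a plain insert, re-loading a batch after a crash shouldn't create duplicates, which lines up with Luke's point about merge strategies.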
p
It sounds like you've got a solution already, but for others finding this, there are two other interesting things to add. There's a new BATCH message type https://sdk.meltano.com/en/latest/batch.html that's part of the SDK and is meant to help in situations like this; it's still relatively new, so please give feedback if you try it out. See this blog https://meltano.com/blog/6x-more-speed-for-your-data-pipelines-with-batch-messages/. I'd also add that we recently put a lot of work into the MeltanoLabs `target-snowflake` variant, which will support this batch message type. I use it in production, and this week I'll be converting one of my pipelines to use batch messages.
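Going off the SDK docs, a BATCH message is essentially a pointer to files the tap has already written to shared storage, instead of one RECORD message per row — roughly this shape (illustrative only; the paths are made up, and the exact schema is in the linked docs):

```python
# Approximate shape of a Singer SDK BATCH message; check the SDK docs linked
# above for the authoritative schema and supported encodings.
import json

batch_message = {
    "type": "BATCH",
    "stream": "big_table",
    "encoding": {"format": "jsonl", "compression": "gzip"},
    "manifest": [
        "file:///tmp/meltano-batches/big_table-0001.jsonl.gz",
        "file:///tmp/meltano-batches/big_table-0002.jsonl.gz",
    ],
}

# the tap emits this single line instead of one RECORD message per row, and
# the target loads the referenced files directly
print(json.dumps(batch_message))
```

The target reading those files directly, rather than parsing every row as JSON off stdin, is where the speedup described in the blog post comes from.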
j
@pat_nadolny! This is the blog post I was looking for 🙂 I've seen chatter about the batch messages, but had no clue what they actually solve. Thank you so much for linking this! After doing some more tests, I intend to experiment with more cores and to understand the Singer spec a little better, and then I'll do the same with the MeltanoLabs `target-snowflake`. Very exciting!