# getting-started
j
Heya! How do you normally handle massive table initial loads? My use case is pretty much a `tap-oracle target-snowflake` pipeline.

I was thinking of modifying the `tap-oracle` to basically run up to N records and call it done, then have the job run in a loop until it no longer hits the set "max/limit" (rough sketch at the end of this post).

Currently, to my understanding, the `tap-oracle` will do its best to download all 300M rows at once, then serialise them into JSONL and pass them to the `target-snowflake`, which supports batching, e.g. 100k rows at a time (S3 -> Stage -> MERGE INTO), until done. The problems with this are:
1. It takes a long time to download the data from the tap.
   a. If this fails, one needs to do it all again.
2. Compute/time is needed to serialise the 300M record messages into JSONL.
3. It would take a long time to then load the data into the target.
   a. If any batch fails for whatever reason, we again need to do everything from the beginning, because state is recorded only after a successful job (all or nothing).

Questions:
1. What do you think of the proposed approach?
2. Is this a tried-and-failed method?
3. There has been some talk since late last year about the SDK supporting some sort of "batch" mode. Is this something similar to what Stitch Data has been doing for a long time, or something different? Is there any info on how this may look for the taps/targets created with the SDK?

P.S. I can always transfer the initial "load" manually and then hack the Meltano state, but I would very much prefer not to have to figure out how to do that with Oracle, and not to batch it in a way that makes Snowflake spend massive amounts of compute time to digest it.
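Roughly, the loop I have in mind would look something like this (just an illustrative sketch, not tap-oracle code; the table, columns, state file, and `emit_record()` are made up):

```python
# Sketch of "run up to N rows, save a bookmark, repeat until the limit is no
# longer hit". Table/column names and the state handling are invented.
import json
import pathlib

import oracledb  # python-oracledb

STATE_FILE = pathlib.Path("state.json")
CHUNK_SIZE = 100_000

def load_bookmark() -> int:
    if STATE_FILE.exists():
        return json.loads(STATE_FILE.read_text())["last_id"]
    return 0

def save_bookmark(last_id: int) -> None:
    STATE_FILE.write_text(json.dumps({"last_id": last_id}))

def emit_record(row: tuple) -> None:
    # stand-in for serialising the row and handing it to the target
    print(json.dumps({"id": row[0], "payload": str(row[1])}))

with oracledb.connect(user="user", password="secret", dsn="db_host/service") as conn:
    while True:
        bookmark = load_bookmark()
        with conn.cursor() as cur:
            cur.execute(
                f"""
                SELECT id, payload FROM big_table
                WHERE id > :bookmark
                ORDER BY id
                FETCH FIRST {CHUNK_SIZE} ROWS ONLY
                """,
                bookmark=bookmark,
            )
            rows = cur.fetchall()
        for row in rows:
            emit_record(row)
        if rows:
            save_bookmark(rows[-1][0])
        if len(rows) < CHUNK_SIZE:  # no longer hitting the "max/limit" -> done
            break
```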
l
i haven’t used the oracle tap or the snowflake target, but
• it looks like it will use a cursor https://github.com/s7clarke10/pipelinewise-tap-oracle/blob/master/tap_oracle/sync_strategies/incremental.py#L27 so it won’t actually try to download all 300 million rows in one go
• as it makes progress, the data will be piped to the snowflake loader, which should then batch things up as you describe
• assuming you are doing incremental sync, meltano will periodically record the state of the last replication key, so that if the process crashes, it can pick up from where it left off. note that the import is (probably) not making use of atomic transactions, so an interrupt/crash can result in duplicate data in your destination/target. this can be mitigated either by using a merge strategy in the loader (assuming snowflake supports it), or by doing downstream deduplication (which is what i do, with dbt)
• i used meltano to extract a ~260 million row table from postgres -> bigquery using this approach, it took 4-5 days
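for reference, that state mechanism is just the tap interleaving singer STATE messages with its RECORD messages on stdout — roughly like this hand-written illustration (not actual tap-oracle output; the stream and key names are invented):

```python
# Hand-written illustration of RECORD messages followed by a STATE bookmark.
# Meltano persists the STATE value, so a crashed run can resume from it.
import json
import sys

def emit(message: dict) -> None:
    sys.stdout.write(json.dumps(message) + "\n")

rows = [{"id": i, "updated_at": f"2023-01-0{i}T00:00:00Z"} for i in range(1, 4)]

for row in rows:
    emit({"type": "RECORD", "stream": "big_table", "record": row})

# after a batch of records, the tap bookmarks the last replication key value
emit({
    "type": "STATE",
    "value": {
        "bookmarks": {
            "big_table": {
                "replication_key": "updated_at",
                "replication_key_value": rows[-1]["updated_at"],
            }
        }
    },
})
```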
k
Picking up on this approach, I have a doubt about automating the ELT once new data is added to the defined Postgres schema: can it automatically pick that up and run it through the ELT, or can Postgres support batch loading at the smallest possible time interval, just like scheduling with MongoDB?
j
Thank you, @luke_rodgers! I do have to admit that I wrote the post with a plethora of assumptions based on running a Meltano job on smaller tables. 😅 I will test it with one of the big ones and see how it behaves. If it behaves anything like you describe 🤞, then it will be glorious and will solve my described problem.
You are correct, Luke. From my experiments with the pipeline, it seems `tap-oracle` will only produce as many rows as the `target-snowflake` is able to take in. As soon as the `target` reaches the set limits, it will proceed to:
• write a "csv.gz" file to /tmp
• upload said file to S3 and remove it from the local fs
• issue a `MERGE INTO ... FROM @STAGE/...` command, which then loads the file
• remove the files from S3
It works just the way I want it!
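For anyone curious, that stage-and-merge flow maps to roughly the following (a simplified sketch, not the target's actual code; the bucket, stage, table, and credentials are all made up):

```python
# Simplified sketch of "write gzipped CSV, upload to S3, MERGE INTO from a
# stage, then clean up". Not target-snowflake's implementation.
import csv
import gzip

import boto3
import snowflake.connector

BATCH_FILE = "/tmp/big_table_batch.csv.gz"
BUCKET = "my-staging-bucket"
KEY = "meltano/big_table_batch.csv.gz"

def write_batch(rows: list[dict]) -> None:
    # buffer the batch to a gzipped CSV on the local filesystem
    with gzip.open(BATCH_FILE, "wt", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=["id", "payload"])
        writer.writeheader()
        writer.writerows(rows)

def stage_and_merge(rows: list[dict]) -> None:
    write_batch(rows)

    # upload to the bucket that the external stage points at
    boto3.client("s3").upload_file(BATCH_FILE, BUCKET, KEY)

    conn = snowflake.connector.connect(
        account="my_account", user="loader", password="secret",
        warehouse="LOAD_WH", database="RAW", schema="PUBLIC",
    )
    try:
        # in practice a FILE_FORMAT for the gzipped CSV would also be specified
        conn.cursor().execute(
            """
            MERGE INTO big_table AS t
            USING (SELECT $1 AS id, $2 AS payload
                   FROM @my_s3_stage/meltano/big_table_batch.csv.gz) AS s
            ON t.id = s.id
            WHEN MATCHED THEN UPDATE SET t.payload = s.payload
            WHEN NOT MATCHED THEN INSERT (id, payload) VALUES (s.id, s.payload)
            """
        )
    finally:
        conn.close()

    # remove the staged file once the merge has completed
    boto3.client("s3").delete_object(Bucket=BUCKET, Key=KEY)
```

Because it's a MERGE rather than a plain insert, re-loading a batch after a crash shouldn't create duplicates, which lines up with Luke's point about merge strategies.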
p
It sounds like you've got a solution already, but for others finding this, there are two other interesting things to add. There's a new BATCH message type https://sdk.meltano.com/en/latest/batch.html that's part of the SDK and is meant to help in situations like this; it's still relatively new, so please give feedback if you try it out. See this blog https://meltano.com/blog/6x-more-speed-for-your-data-pipelines-with-batch-messages/. I'd also add that we recently put a lot of work into the MeltanoLabs `target-snowflake` variant, which will support this batch message type. I use it in production, and this week I'll be converting one of my pipelines to use batch messages.
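Going off the SDK docs, a BATCH message is essentially a pointer to files the tap has already written to shared storage, instead of one RECORD message per row — roughly this shape (illustrative only; the paths are made up, and the exact schema is in the linked docs):

```python
# Approximate shape of a Singer SDK BATCH message; check the SDK docs linked
# above for the authoritative schema and supported encodings.
import json

batch_message = {
    "type": "BATCH",
    "stream": "big_table",
    "encoding": {"format": "jsonl", "compression": "gzip"},
    "manifest": [
        "file:///tmp/meltano-batches/big_table-0001.jsonl.gz",
        "file:///tmp/meltano-batches/big_table-0002.jsonl.gz",
    ],
}

# the tap emits this single line instead of one RECORD message per row, and
# the target loads the referenced files directly
print(json.dumps(batch_message))
```

The target reading those files directly, rather than parsing every row as JSON off stdin, is where the speedup described in the blog post comes from.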
j
@pat_nadolny! This is the blog post I was looking for 🙂 I've seen chatter about the batch messages, but had no clue what they actually solve. Thank you so much for linking this! After doing some more tests, I intend to experiment with more cores and to understand the Singer spec a little better, and then I'll do the same with the MeltanoLabs `target-snowflake`. Very exciting!