How to scale Meltano pipelines to transfer 1 TB of data from an Oracle table to BigQuery?
# troubleshooting
d
How do I scale Meltano pipelines if I have to transfer 1 TB of data from an Oracle table to BigQuery? Are there any best practices?
v
What issues are you running into?
d
Nope, not really any issues. But I want to know whether hosting Meltano on a Compute Engine instance and extracting a huge amount of data from Oracle to BigQuery will hit any performance challenges. Are there any performance benchmarks or best practices we can refer to for this kind of scenario, or should it just work?
a
Hi, @debasis_panda. Everything should work, but be aware the initial backfill may take a long time.
1. Before you start, you'll want to make sure you have incremental keys defined, since those aren't generally detectable on RDBMS sources.
2. Make sure you are using Meltano for the sync and specify a job ID at the CLI, which will ensure you capture state and can resume if interrupted.
3. If you can, run from a cloud instance in the same region as your source and destination. This will reduce latency for handshakes and orchestration.
4. Generally people find it helpful to bucketize source tables into groups, for instance putting all your smaller tables together or, conversely, moving the largest or more cumbersome tables into their own group. (You can use plugin inheritance syntax to create those distinct groupings; a sketch follows below.)
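For illustration, a minimal sketch of what those groupings might look like in meltano.yml, assuming Meltano's `inherit_from` plugin inheritance key and a pipelinewise-style tap-oracle where streams are named `SCHEMA-TABLE`. The schema, table, and replication key names are placeholders:

```yaml
plugins:
  extractors:
    - name: tap-oracle
      metadata:
        # Define the incremental key up front - it won't be auto-detected on an RDBMS source
        "MYSCHEMA-BIG_TABLE":
          replication-method: INCREMENTAL
          replication-key: UPDATED_AT

    # The large, cumbersome table gets its own group so it can run (and resume) independently
    - name: tap-oracle--big
      inherit_from: tap-oracle
      select:
        - "MYSCHEMA-BIG_TABLE.*"

    # Everything else grouped together
    - name: tap-oracle--rest
      inherit_from: tap-oracle
      select:
        - "*.*"
        - "!MYSCHEMA-BIG_TABLE.*"
```

Each group can then be run with its own job ID so state is captured and an interrupted backfill can resume. The flag below is the older `--job_id` spelling; newer Meltano releases use a state ID instead, so check `meltano elt --help` on your version:

```
meltano elt tap-oracle--big target-bigquery --job_id=oracle-big-to-bq
meltano elt tap-oracle--rest target-bigquery --job_id=oracle-rest-to-bq
```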
k
I think your example @debasis_panda is a great match for this issue. Would love your comments on it 🙏 In your position (with 1 large table of concern, making 'bucketing' tables in config less effective), I would be looking at ways to bulk export to CSV from Oracle before uploading to BQ, and then fabricating a bookmark for incremental replication thereafter. In an ideal world this could be done with writes to the source db paused, to avoid having to manage changes made during the migration, but with `updated_at` as well as `created_at` columns it is possible to do it 'live' by saving the max `updated_at` value at the beginning of the bulk export and using that as your replication key value in state for your first incremental run in Meltano.
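To make that fabrication concrete, a rough sketch of the Singer bookmark you would seed before the first incremental run, assuming a pipelinewise-style stream name and `UPDATED_AT` replication key (both placeholders; the exact bookmark shape depends on the tap-oracle variant). The timestamp is the max `updated_at` captured at the start of the bulk export:

```json
{
  "bookmarks": {
    "MYSCHEMA-BIG_TABLE": {
      "replication_key": "UPDATED_AT",
      "replication_key_value": "2021-06-01T00:00:00.000000Z"
    }
  }
}
```

On recent Meltano versions this can be seeded against the job's state ID with the `meltano state set` subcommand; on older versions you may need to write it into the state/job table directly.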
v
From the tap-oracle side, I've run into a couple of Oracle issues. https://gitlab.com/autoidm/tap-oracle has a couple of updates. The biggest is `turn_off_multipleof`, and I did a slight thing with `filter_schemas` due to some access issues with v$database. `multipleOf` in schemas causes all sorts of issues for me during JSON Schema validation that I didn't care about. The next few things I have as issues with Oracle are listed here: https://gitlab.com/autoidm/tap-oracle/-/issues
The biggest issues I have left are really CLOBs not being pulled, and retries due to connection issues.
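For reference, a hedged sketch of how those settings might be wired up in meltano.yml if you point the extractor at that fork. The setting names come from the fork mentioned above; the schema value is a placeholder, and the exact names and types are worth checking against the fork's README:

```yaml
plugins:
  extractors:
    - name: tap-oracle
      config:
        # Strip multipleOf from discovered schemas to avoid JSON Schema validation issues
        turn_off_multipleof: true
        # Limit discovery to specific schemas (works around v$database access issues)
        filter_schemas: MYSCHEMA
```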
d
Thank you all for your ideas. You are right, @ken_payne. The approach should be similar for moving bulk data into any cloud warehouse: Oracle ---> Cloud Storage ---> BQ.
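For anyone following the same route, a rough sketch of the Cloud Storage ---> BQ leg using standard Google Cloud tooling. The bucket, dataset, table, file, and schema names are all placeholders, and the Oracle-to-CSV export step is left to whichever bulk export tool you prefer (SQL*Plus spool, SQLcl, etc.):

```
# 1. Bulk export the table from Oracle to CSV with your preferred tool,
#    recording MAX(UPDATED_AT) first if you plan to fabricate a bookmark as described above.

# 2. Upload the CSV files to Cloud Storage (ideally a bucket in the same region as the BQ dataset)
gsutil cp big_table_*.csv gs://my-migration-bucket/oracle/big_table/

# 3. Load from Cloud Storage into BigQuery
bq load \
  --source_format=CSV \
  --skip_leading_rows=1 \
  my_dataset.big_table \
  "gs://my-migration-bucket/oracle/big_table/big_table_*.csv" \
  ./big_table_schema.json
```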