How to scale Meltano pipelines to transfer 1 TB of data from an Oracle table to BigQuery?
# troubleshooting
d
How do I scale Meltano pipelines if I have to transfer 1 TB of data from an Oracle table to BigQuery? Are there any best practices?
v
What issues are you running into?
d
Nope, not really any issues. But I want to know whether hosting Meltano on a Compute Engine instance and extracting a huge amount of data from Oracle to BigQuery will hit any performance challenges. Are there any performance benchmarks or best practices we can refer to for this kind of scenario, or should it just work?
a
Hi, @debasis_panda. Everything should work, but be aware the initial backfill may take a long time.
1. Before you start, you'll want to make sure you have incremental keys defined, since those aren't generally detectable on RDBMS sources.
2. Make sure you are using Meltano for the sync and specify a job ID at the CLI, which will ensure you capture state and can resume if interrupted.
3. If you can, run from a cloud instance in the same region as your source and destination. This will reduce latency for handshakes and orchestration.
4. Generally people find it helpful to bucketize source tables into groups, for instance putting all your smaller tables together or, conversely, moving the largest or more cumbersome tables into their own group. (You can use plugin inheritance syntax to create those distinct groupings; a sketch follows below.)
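For illustration, a minimal sketch of what those groupings might look like in meltano.yml, assuming Meltano's `inherit_from` plugin inheritance key and a pipelinewise-style tap-oracle where streams are named `SCHEMA-TABLE`. The schema, table, and replication key names are placeholders:

```yaml
plugins:
  extractors:
    - name: tap-oracle
      metadata:
        # Define the incremental key up front - it won't be auto-detected on an RDBMS source
        "MYSCHEMA-BIG_TABLE":
          replication-method: INCREMENTAL
          replication-key: UPDATED_AT

    # The large, cumbersome table gets its own group so it can run (and resume) independently
    - name: tap-oracle--big
      inherit_from: tap-oracle
      select:
        - "MYSCHEMA-BIG_TABLE.*"

    # Everything else grouped together
    - name: tap-oracle--rest
      inherit_from: tap-oracle
      select:
        - "*.*"
        - "!MYSCHEMA-BIG_TABLE.*"
```

Each group can then be run with its own job ID so state is captured and an interrupted backfill can resume. The flag below is the older `--job_id` spelling; newer Meltano releases use a state ID instead, so check `meltano elt --help` on your version:

```
meltano elt tap-oracle--big target-bigquery --job_id=oracle-big-to-bq
meltano elt tap-oracle--rest target-bigquery --job_id=oracle-rest-to-bq
```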
k
I think your example @debasis_panda is a great match for this issue. Would love your comments on it 🙏 In your position (with 1 large table of concern, making 'bucketing' tables in config less effective), I would be looking at ways to bulk export to CSV from Oracle before uploading to BQ, and then fabricating a bookmark for incremental replication thereafter. In an ideal world this could be done with writes to the source db paused, to avoid having to manage changes made during the migration, but with `updated_at` as well as `created_at` columns it is possible to do it 'live' by saving the max `updated_at` value at the beginning of the bulk export and using that as your replication key value in state for your first incremental run in Meltano.
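To make that fabrication concrete, a rough sketch of the Singer bookmark you would seed before the first incremental run, assuming a pipelinewise-style stream name and `UPDATED_AT` replication key (both placeholders; the exact bookmark shape depends on the tap-oracle variant). The timestamp is the max `updated_at` captured at the start of the bulk export:

```json
{
  "bookmarks": {
    "MYSCHEMA-BIG_TABLE": {
      "replication_key": "UPDATED_AT",
      "replication_key_value": "2021-06-01T00:00:00.000000Z"
    }
  }
}
```

On recent Meltano versions this can be seeded against the job's state ID with the `meltano state set` subcommand; on older versions you may need to write it into the state/job table directly.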
v
From the tap-oracle side, I've run into a couple of Oracle issues. https://gitlab.com/autoidm/tap-oracle has a couple of updates. The biggest is `turn_off_multipleof`, and I did a slight thing with `filter_schemas` due to some access issues with v$database. `multipleOf` in schemas causes all sorts of issues for me during JSON Schema validation that I didn't care about. The next few things I have as issues with Oracle are listed here: https://gitlab.com/autoidm/tap-oracle/-/issues
The biggest issues I have left are really CLOBs not being pulled, and retries due to connection issues.
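For reference, a hedged sketch of how those settings might be wired up in meltano.yml if you point the extractor at that fork. The setting names come from the fork mentioned above; the schema value is a placeholder, and the exact names and types are worth checking against the fork's README:

```yaml
plugins:
  extractors:
    - name: tap-oracle
      config:
        # Strip multipleOf from discovered schemas to avoid JSON Schema validation issues
        turn_off_multipleof: true
        # Limit discovery to specific schemas (works around v$database access issues)
        filter_schemas: MYSCHEMA
```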
d
Thank you all for your ideas. You are right, @ken_payne. The approach should be similar for moving bulk data into any cloud warehouse: Oracle ---> Cloud Storage ---> BQ.
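For anyone following the same route, a rough sketch of the Cloud Storage ---> BQ leg using standard Google Cloud tooling. The bucket, dataset, table, file, and schema names are all placeholders, and the Oracle-to-CSV export step is left to whichever bulk export tool you prefer (SQL*Plus spool, SQLcl, etc.):

```
# 1. Bulk export the table from Oracle to CSV with your preferred tool,
#    recording MAX(UPDATED_AT) first if you plan to fabricate a bookmark as described above.

# 2. Upload the CSV files to Cloud Storage (ideally a bucket in the same region as the BQ dataset)
gsutil cp big_table_*.csv gs://my-migration-bucket/oracle/big_table/

# 3. Load from Cloud Storage into BigQuery
bq load \
  --source_format=CSV \
  --skip_leading_rows=1 \
  my_dataset.big_table \
  "gs://my-migration-bucket/oracle/big_table/big_table_*.csv" \
  ./big_table_schema.json
```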