# best-practices
c
Exporting from a modest Postgres database (450 GB) and importing into BigQuery (foregoing transformation) is taking about 32 hours, but not because of insufficient resources... is there any foreseeable problem with splitting tables of the same DB between two taps (and concomitant pipelines) and then running them both in parallel?
a
No problem with that at all. It's actually a good practice. You can use the `inherit_from` option, as in our hub project here, to declare multiple instances of the same tap. The only thing to watch out for is that, if you later want to join them back together as a single job, you may need to manually merge their states. Also, make sure you use a job_id so that state is captured for each running instance. @douwe_maan - Do I have the above correct regarding a merge of states? I imagine you'd need distinct job_ids, because two jobs running simultaneously with the same job_id would probably clobber each other, but I haven't actually tested that myself.
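As a rough sketch, the `meltano.yml` could declare two inherited instances of the same tap, each selecting a disjoint subset of tables (the plugin variant, instance names, and table selections below are illustrative, not prescriptive):

```yaml
plugins:
  extractors:
  - name: tap-postgres
    variant: transferwise
    pip_url: pipelinewise-tap-postgres
  # Two inherited instances, each extracting a different subset of tables
  - name: tap-postgres--batch-a
    inherit_from: tap-postgres
    select:
    - public-orders.*
    - public-customers.*
  - name: tap-postgres--batch-b
    inherit_from: tap-postgres
    select:
    - public-events.*
```

Then each pipeline would run with its own job_id so their states are tracked separately (job_id values are made up):

```
meltano elt tap-postgres--batch-a target-bigquery --job_id=pg-to-bq-batch-a
meltano elt tap-postgres--batch-b target-bigquery --job_id=pg-to-bq-batch-b
```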
d
Yep, each parallel pipeline should have its own job_id, but you can manually merge the state JSON dicts later on. https://gitlab.com/meltano/meltano/-/issues/2727 will basically automate this “splitting up over multiple tap/target combinations” process.
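For what it's worth, merging the two captured states might look something like this minimal sketch. It assumes Singer-style state files with per-stream bookmarks under a `bookmarks` key, and the file names are hypothetical; the exact state shape depends on the tap:

```python
import json

# Load the state captured by each parallel job (file names are hypothetical).
with open("state-batch-a.json") as f:
    state_a = json.load(f)
with open("state-batch-b.json") as f:
    state_b = json.load(f)

# Each job extracted a disjoint set of tables, so the per-stream bookmark
# keys should not collide; a shallow merge of the two dicts is enough.
merged = {
    "bookmarks": {
        **state_a.get("bookmarks", {}),
        **state_b.get("bookmarks", {}),
    }
}

with open("merged-state.json", "w") as f:
    json.dump(merged, f, indent=2)
```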
c
Ah, great. Thanks for the advice regarding job states; it would have tripped me up.