# best-practices
s
Let's talk about best practices for initially setting up pipelines for DB => DB syncs that I don't want to keep running as full-table syncs later on. I'm genuinely interested in your opinions and input. What I think would be great to have:
1. Have whatever happens completely versioned.
2. Ability to retry initial loads that didn't work out.
3. Some way of parallelization for myself.
So what I mean by that is, I think I want to:
1. Define N full-table syncs on the same table, partitioned by hand in a way I know makes sense (creation date, e.g.).
2. Commit a manual run of them.
3. Then add 1 incremental sync, configured however I want.
=> That way I would get 1-3 (I think), and have in code a representation of "manual load until 2023-01 ran on ...; incremental load started with data from 2023-01 onward". Does that make sense? Is that easily doable with Meltano? Your thoughts? @visch @taylor @pat_nadolny @aaronsteers @thomas_briggs...
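To make that concrete: a rough sketch of the flow in shell. All names are illustrative, and it assumes each hand-made partition is exposed to the tap as its own stream (e.g. one view per creation-date range), since most taps can't filter rows directly:

```bash
# Backfill: one full-table run per hand-made partition, each with its own
# state ID so a failed partition can be retried in isolation.
# (Stream/view names are hypothetical; check your tap's stream naming.)
meltano el tap-postgres target-postgres --select=orders_upto_2022 --state-id=orders-backfill-upto-2022
meltano el tap-postgres target-postgres --select=orders_2022 --state-id=orders-backfill-2022

# Ongoing: a single incremental sync that owns everything from 2023-01 onward.
meltano el tap-postgres target-postgres --select=orders --state-id=orders-incremental
```

Committing these commands (or the meltano.yml jobs behind them) to git would cover points 1-3: versioned, retryable per partition, and parallelizable by running the backfill lines concurrently.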
v
> I don't want to keep running full-table syncs later on
I do this, but I do full-table syncs for all the tables we need right now. The problem with incremental streams for db->db is that you miss deletes. Some handle this by using a log-based approach instead.
> Have whatever happens completely versioned.
https://learn.microsoft.com/en-us/sql/relational-databases/tables/temporal-tables?view=sql-server-ver16 works pretty well for me
> Ability to retry initial loads that didn't work out.
Orchestrator, and this ties in with #3
> Some way of parallelization for myself.
Run each table individually
> Define N full-table syncs on the same table, partitioned by hand in a way I know makes sense (creation date, e.g.).
I'd automate it all personally; not sure why you wouldn't just full-table sync it. Are the tables >500M records or something? Other questions:
1. How important is the type mapping between the DBs to you? How accurate does it need to be? I'd look at this after you get data moving.
2. How important is speed?
3. How big is the data?
s
@visch, to your questions, I don't know 😄 I'm thinking about rewriting the replicate data guide and would really like to give some good advice. I'm leaning towards providing something, because I had a bunch of conversations inside Slack where the initial loading process just wasn't fun (taking multiple hours). And I kind of like the idea of having my "manual backfills" automatically executed & versioned.
p
@Sven Balnojan I think this is a good idea to include. I'm not sure if there's an easy answer for this right now though; maybe BATCH is the answer moving forward. I think when some people talk about manual backfills (for huge amounts of data), they end up backfilling outside of Meltano (pg_dump, a manual script, etc.), then set incremental bookmarks to the latest timestamps so Meltano handles the incremental changes following that date. To me that sounds like a huge hassle though.
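For reference, a minimal sketch of that "backfill outside Meltano" pattern, assuming a Postgres-to-Postgres pipeline (connection URIs are placeholders; seeding the bookmark with `state set` is sketched further down):

```bash
# Bulk-copy the history with native tooling instead of the tap, then let
# Meltano own only the changes after the cutoff date.
pg_dump --data-only --table=public.orders "$SOURCE_DB_URI" | psql "$TARGET_DB_URI"
```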
t
If I understand the goal here (which I'm not 100% sure I do), my first thought is not to go DB-to-DB but DB-to-jsonl-to-DB, i.e.:
1. Create views in the source DB that produce subsets of the table in question.
2. Define taps that pull from each view and write the data to jsonl.
3. Define taps that read each of the jsonl files.
4. Feed all those taps to the same target (using some aliasing or renaming or something to get all the data into the same table).
That way discrete sections can be easily restarted/reloaded if necessary, and the extraction and loading can happen in parallel.
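A rough sketch of step 1, assuming Postgres and illustrative names/dates:

```bash
# One view per slice of the big table; each slice can later be synced
# (and re-synced) independently.
psql "$SOURCE_DB_URI" <<'SQL'
CREATE VIEW orders_upto_2022 AS
  SELECT * FROM orders WHERE created_at < '2022-01-01';
CREATE VIEW orders_2022 AS
  SELECT * FROM orders WHERE created_at >= '2022-01-01' AND created_at < '2023-01-01';
SQL

# Steps 2-4 would then point the tap at each view, write to jsonl (e.g. with
# target-jsonl), and read the files back into the real target -- tap-singer-jsonl
# is one option for the read-back side.
```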
To @pat_nadolny's point, the real issue is that taps are too slow at producing large amounts of data. Fixing that is a better long-term solution than getting good at this, IMO.
v
Yeah, for me the biggest table is ~20 million records, so we get by fine.
t
I can already see some issues with the process I just described above but I think the idea is "directionally correct" 😛
p
I could also see a feature where a tap-postgres sync with a ton of tables gets fanned out to lots of parallel jobs, but I don't know how that would work today without custom orchestrator logic. I think one of the backfill challenges is that it's a single node trying to sync all tables vs. parallelizing it.
v
For my use case we don't control the source DB; it's hard to make changes, we just get access to query it, and that's about it. @pat_nadolny yep, that's what we do!
p
@visch do you mind sharing more details of how you do it? With a single job, fanned out by the orchestrator? With many inherited plugins in meltano.yml with different select criteria? Select criteria set via environment variables?
I guess figuring out what the different definitions of "manual backfill" are would help with coming up with a recommendation (@Sven Balnojan)
v
1 plugin, fanned out by the "orchestrator" (a lot of this is manual sometimes; it doesn't have to be, but folks like to do unique things). Select with env vars (we wrote a little Meltano runner that adds this for us, since you also have to select the right dbt models and doing that via Meltano was a bit too tricky).
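A rough sketch of what that kind of runner could look like, assuming the tap's `select` extra is settable per invocation via its environment variable (the exact variable name and JSON format follow Meltano's plugin-extra conventions; verify with `meltano config tap-postgres list`):

```bash
# Fan out one job per table, each with its own state, then wait for all of them.
# Table names and stream-name format ("public-<table>") are illustrative.
for table in orders customers invoices; do
  TAP_POSTGRES__SELECT="[\"public-${table}.*\"]" \
    meltano el tap-postgres target-postgres --state-id="sync-${table}" &
done
wait
```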
t
Fair point about not being able to change the source DB, @visch. What I described is intended to work without changes to the tap or target, but if taps allowed a WHERE clause to be defined per table, it could be done without changes to the source... though I think that's outside the scope of what we're talking about. 😉
p
@Sven Balnojan in terms of versioning/partitioning, I think including the sync metadata columns would be sufficient, if I'm understanding the requirement properly. The full sync and the incremental sync can be sent to the same raw table, and you'll know which subset came from the backfill and which is incremental.
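For SDK-based targets this is often a single setting; the name below is from MeltanoLabs-style `target-postgres` and may differ for other variants:

```bash
# Assumption: the target exposes `add_record_metadata`
# (check with `meltano config target-postgres list`).
meltano config target-postgres set add_record_metadata true
# Loaded rows then carry columns like _sdc_extracted_at / _sdc_received_at,
# which let you tell backfilled rows apart from incremental ones.
```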
Potentially also having recommendations on using `state get/set` to update your bookmarks. This could also be helpful if someone is switching from their home-grown scripts to Meltano: they don't need to start over if they start their incremental sync from the right point.
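A hedged sketch of that handoff (the state-ID format and payload shape below are illustrative; run `state get` against a real pipeline first and mirror whatever structure your tap actually writes):

```bash
# Inspect an existing bookmark; `meltano run` state IDs typically look like
# <environment>:<tap-name>-to-<target-name>.
meltano state get dev:tap-postgres-to-target-postgres

# Seed the bookmark so the incremental sync starts where the old scripts stopped.
meltano state set dev:tap-postgres-to-target-postgres \
  '{"singer_state": {"bookmarks": {"public-orders": {"replication_key": "updated_at", "replication_key_value": "2023-01-01T00:00:00+00:00"}}}}'
```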
s
Hm, so now I keep thinking about @thomas_briggs's solution 😄 The state get/set is an interesting point though, and true @pat_nadolny, I didn't think about the metadata columns; that's a good point as well. I feel like we should always recommend using them.
a
I agree with a lot of the above, and I also have some additional perspectives I want to add to the conversation. That said, I don't know that I could fit my recommendations in a Slack comment. Let me put some thoughts together and circle back to this thread...
A quick draft writeup. Be gentle, this was a rush job. 😄
s
@aaronsteers I love it! Thanks
t
Yeah - I have nothing to add to AJ’s write-up. Very detailed!