# best-practices
s
Let's talk about best practices for initially setting up pipelines for DB => DB syncs that I don't want to keep running as full-table syncs later on. I'm genuinely interested in your opinions and input. What I think would be great to have:
1. Have whatever happens completely versioned.
2. Ability to retry initial loads that didn't work out.
3. Some way of parallelization for myself.
So what I mean by that is, I think I want to:
1. Define N full-table syncs on the same table, partitioned by hand in a way I know makes sense (creation date, e.g.).
2. Commit a manual run of them.
3. Then add 1 incremental sync, configured however I want.
=> That way I would get 1-3 (I think), and have in code a representation of "manual load until 2023-01 ran on ...; incremental load started with data from 2023-01 onward". Does that make sense? Is that easily doable with Meltano? Your thoughts? @visch @taylor @pat_nadolny @aaronsteers @thomas_briggs...
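To make that concrete: a rough sketch of the flow in shell. All names are illustrative, and it assumes each hand-made partition is exposed to the tap as its own stream (e.g. one view per creation-date range), since most taps can't filter rows directly:

```bash
# Backfill: one full-table run per hand-made partition, each with its own
# state ID so a failed partition can be retried in isolation.
# (Stream/view names are hypothetical; check your tap's stream naming.)
meltano el tap-postgres target-postgres --select=orders_upto_2022 --state-id=orders-backfill-upto-2022
meltano el tap-postgres target-postgres --select=orders_2022 --state-id=orders-backfill-2022

# Ongoing: a single incremental sync that owns everything from 2023-01 onward.
meltano el tap-postgres target-postgres --select=orders --state-id=orders-incremental
```

Committing these commands (or the meltano.yml jobs behind them) to git would cover points 1-3: versioned, retryable per partition, and parallelizable by running the backfill lines concurrently.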
v
> I don't want to keep running full-table syncs later on
I do this, but I do full-table syncs for all the tables we need right now. The problem with incremental streams for db->db is that you miss deletes. Some handle this by using a log-based approach instead.
> Have whatever happens completely versioned.
https://learn.microsoft.com/en-us/sql/relational-databases/tables/temporal-tables?view=sql-server-ver16 works pretty well for me
> Ability to retry initial loads that didn't work out.
Orchestrator, and this ties in with #3
> Some way of parallelization for myself.
Run each table individually
> Define N full-table syncs on the same table, partitioned by hand in a way I know makes sense (creation date, e.g.).
I'd automate it all personally; not sure why you wouldn't just full-table sync it. Are the tables >500M records or something? Other questions:
1. How important is the type mapping between the DBs to you? How accurate does it need to be? I'd look at this after you get data moving.
2. How important is speed?
3. How big is the data?
s
@visch, to your questions, I don't know 😄 I'm thinking about rewriting the replicate data guide and would really like to give some good advice. I'm leaning towards providing something, because I had a bunch of conversations inside Slack where the initial loading process just wasn't fun (taking multiple hours). And I kind of like the idea of having my "manual backfills" automatically executed & versioned.
p
@Sven Balnojan I think this is a good idea to include. I'm not sure if there's an easy answer for this right now though; maybe BATCH is the answer moving forward. I think when some people talk about manual backfills (for huge amounts of data), they end up backfilling outside of Meltano (pg_dump, a manual script, etc.), then set incremental bookmarks to the latest timestamps so Meltano handles the incremental changes following that date. To me that sounds like a huge hassle though.
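For reference, a minimal sketch of that "backfill outside Meltano" pattern, assuming a Postgres-to-Postgres pipeline (connection URIs are placeholders; seeding the bookmark with `state set` is sketched further down):

```bash
# Bulk-copy the history with native tooling instead of the tap, then let
# Meltano own only the changes after the cutoff date.
pg_dump --data-only --table=public.orders "$SOURCE_DB_URI" | psql "$TARGET_DB_URI"
```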
t
If I understand the goal here (which I'm not 100% sure I do), my first thought is not to go DB-to-DB but DB-to-jsonl-to-DB, i.e.:
1. Create views in the source DB that produce subsets of the table in question.
2. Define taps that pull from each view and write the data to jsonl.
3. Define taps that read each of the jsonl files.
4. Feed all those taps to the same target (using some aliasing or renaming or something to get all the data into the same table).
That way discrete sections can be easily restarted/reloaded if necessary, and the extraction and loading can happen in parallel.
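A rough sketch of step 1, assuming Postgres and illustrative names/dates:

```bash
# One view per slice of the big table; each slice can later be synced
# (and re-synced) independently.
psql "$SOURCE_DB_URI" <<'SQL'
CREATE VIEW orders_upto_2022 AS
  SELECT * FROM orders WHERE created_at < '2022-01-01';
CREATE VIEW orders_2022 AS
  SELECT * FROM orders WHERE created_at >= '2022-01-01' AND created_at < '2023-01-01';
SQL

# Steps 2-4 would then point the tap at each view, write to jsonl (e.g. with
# target-jsonl), and read the files back into the real target -- tap-singer-jsonl
# is one option for the read-back side.
```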
To @pat_nadolny's point, the real issue is that taps are too slow at producing large amounts of data. Fixing that is a better long-term solution than getting good at this, IMO.
v
Yeah, for me the biggest table is ~20 million records, so we get by fine.
t
I can already see some issues with the process I just described above but I think the idea is "directionally correct" 😛
p
I could also see a feature where a tap-postgres sync with a ton of tables gets fanned out to lots of parallel jobs, but I don't know how that would work today without custom orchestrator logic. I think one of the backfill challenges is that it's a single node trying to sync all tables vs. parallelizing it.
v
For my use case we don't control the source DB; it's hard to make changes, we just get access to query it, and that's about it. @pat_nadolny yep, that's what we do!
p
@visch do you mind sharing more details of how you do it? With a single job, fanned out by the orchestrator? With many inherited plugins in meltano.yml with different select criteria? Select criteria set via environment variables?
I guess figuring out what the different definitions of "manual backfill" are would help with coming up with a recommendation (@Sven Balnojan)
v
1 plugin, fanned out by the "orchestrator" (a lot of this is manual sometimes; it doesn't have to be, but folks like to do unique things). Select with env vars (we wrote a little Meltano runner that adds this for us, since you also have to select the right dbt models and doing that via Meltano was a bit too tricky).
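A rough sketch of what that kind of runner could look like, assuming the tap's `select` extra is settable per invocation via its environment variable (the exact variable name and JSON format follow Meltano's plugin-extra conventions; verify with `meltano config tap-postgres list`):

```bash
# Fan out one job per table, each with its own state, then wait for all of them.
# Table names and stream-name format ("public-<table>") are illustrative.
for table in orders customers invoices; do
  TAP_POSTGRES__SELECT="[\"public-${table}.*\"]" \
    meltano el tap-postgres target-postgres --state-id="sync-${table}" &
done
wait
```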
t
Fair point about not being able to change the source DB, @visch. What I described is intended to work without changes to the tap or target, but if taps allowed a WHERE clause to be defined per table, it could be done without changes to the source... though I think that's outside the scope of what we're talking about. 😉
p
@Sven Balnojan in terms of versioning/partitioning, I think including the sync metadata columns would be sufficient, if I'm understanding the requirement properly. The full sync and the incremental sync can be sent to the same raw table, and you'll know which subset came from the backfill and which is incremental.
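For SDK-based targets this is often a single setting; the name below is from MeltanoLabs-style `target-postgres` and may differ for other variants:

```bash
# Assumption: the target exposes `add_record_metadata`
# (check with `meltano config target-postgres list`).
meltano config target-postgres set add_record_metadata true
# Loaded rows then carry columns like _sdc_extracted_at / _sdc_received_at,
# which let you tell backfilled rows apart from incremental ones.
```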
Potentially also having recommendations on using `state get/set` to update your bookmarks. This could also be helpful if someone is switching from their home-grown scripts to Meltano: they don't need to start over if they start their incremental sync from the right point.
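A hedged sketch of that handoff (the state-ID format and payload shape below are illustrative; run `state get` against a real pipeline first and mirror whatever structure your tap actually writes):

```bash
# Inspect an existing bookmark; `meltano run` state IDs typically look like
# <environment>:<tap-name>-to-<target-name>.
meltano state get dev:tap-postgres-to-target-postgres

# Seed the bookmark so the incremental sync starts where the old scripts stopped.
meltano state set dev:tap-postgres-to-target-postgres \
  '{"singer_state": {"bookmarks": {"public-orders": {"replication_key": "updated_at", "replication_key_value": "2023-01-01T00:00:00+00:00"}}}}'
```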
s
Hm, so now I keep thinking about @thomas_briggs's solution 😄 The state get/set is an interesting point though, and true @pat_nadolny, I didn't think about the metadata columns; that's a good point as well. I feel like we should always recommend using them.
a
I agree with a lot of the above, and I also have some additional perspectives I want to add to the conversation. That said, I don't know that I could fit my recommendations in a Slack comment. Let me put some thoughts together and circle back to this thread...
A quick draft writeup. Be gentle, this was a rush job. 😄
s
@aaronsteers I love it! Thanks
t
Yeah - I have nothing to add to AJ’s write-up. Very detailed!