# best-practices
l
Hey, we are facing a problem where we are trying to do an incremental load (using the Oracle connector https://github.com/s7clarke10/pipelinewise-tap-oracle, but I assume the issue would be the same for many connectors). The table has multiple columns that together act as the replication key; it looks a bit like this:

| process_start_time | process_end_time |
| --- | --- |
| 2024-07-02 | NULL |
| 2024-07-02 | 2024-07-03 |

Has anyone faced the same issue and/or knows how to solve it, preferably without having to fork the connector? Thanks!
v
what's your problem?
l
Sorry if that wasn't clear. The tap doesn't support multiple replication keys. So if I set `replication_key = process_end_time`, I'll miss some rows, as not all processes will end and write to the table. If I set `replication_key = process_start_time`, I won't get rows when they are updated.
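For context, the replication key in Meltano is set per stream in `meltano.yml`, and only a single column can be named there. A minimal sketch, assuming the s7clarke10 variant linked above and a hypothetical stream ID `MY_SCHEMA-PROCESS_LOG` (all schema, table, and column names here are illustrative):

```yaml
plugins:
  extractors:
    - name: tap-oracle
      variant: s7clarke10
      metadata:
        # Stream IDs for database taps are typically SCHEMA-TABLE;
        # this one is hypothetical.
        MY_SCHEMA-PROCESS_LOG:
          replication-method: INCREMENTAL
          # Only one column can go here, which is the crux of the
          # problem described above.
          replication-key: process_end_time
```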
v
Sorry, I'm not following exactly how to help you without spending a bunch of time. It'd be very helpful if you share your `meltano.yml` and give an example. Maybe someone else will chime in!
1
Generally, multiple replication keys "should" work; it's possible the tap doesn't support it and you'd have to fork it. On second thought, I haven't ever set up anything with multiple replication keys and I'm not sure exactly how that would play out; I was thinking of primary keys. Sorry I can't spend more cycles on this today, but I wanted to share what I know / don't know.
c
you could create a view that performs a `coalesce(process_end_time, process_start_time)` and tap that… which I know is not ideal. I had a similar pattern for a series of tables in my database, but figured out the tables were so small (<500 MB) that just performing a full table copy each day was acceptable, so I didn't need to create a view. If `coalesce` won't work, you could do some other sort of logic with a `case` statement, or something crazier like calculating a row version number composed from start_time/end_time (treating them as integers). These were the sort of heroics done in the dark early days of ETL, so don't do this if you don't have to. 🙂
2
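A minimal sketch of the view idea, assuming a source table `MY_SCHEMA.PROCESS_LOG` (all names here are illustrative):

```sql
-- Illustrative names throughout; adjust to the real schema/table.
CREATE OR REPLACE VIEW my_schema.process_log_v AS
SELECT
    t.*,
    -- Single bookmark column for the tap: while the process is still
    -- running (PROCESS_END_TIME is NULL), fall back to the start time;
    -- once the end time is written, the row's bookmark moves forward
    -- past the stored state and the row is extracted again.
    COALESCE(t.process_end_time, t.process_start_time) AS replication_ts
FROM my_schema.process_log t;
```

The tap's `replication-key` would then point at `replication_ts` instead of either raw column.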
l
@visch Thanks for your input! @Charles Feduke I think a coalesce would work, and I actually thought of a similar pattern, but I wasn't sure that A. it would go over well with the catalog discovery functionality of the tap (I'd have to make sure first that views are treated the same as tables there) and B. the database admin would give us that permission (this, I think, is the difficult part). But good to hear that someone thought about a similar pattern to use. My other nuclear option is to fork the tap, allow a "custom query" to be used, require it to return data in the same format as the table, and just feed that to the target.
c
Because I never went the view route, I didn't actually verify that it'd work with the tap's catalog routine; I'm operating on the assumption that it would. And I'm in the same place mentally: I do not like the idea of ever modifying the source system to facilitate an ELT process.
1
v
If you have fewer than ~50 million records, I wouldn't do anything but full table replication myself, unless there's some constraint on speed somewhere.
Networking, compute, and storage are cheap these days; human time isn't.
💯 3
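For completeness, switching a stream to full table replication is a one-line change in `meltano.yml` (same hypothetical stream name as in the sketch above):

```yaml
plugins:
  extractors:
    - name: tap-oracle
      metadata:
        MY_SCHEMA-PROCESS_LOG:
          # Re-extract everything on each run; no bookmark to get wrong.
          replication-method: FULL_TABLE
```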
l
I looked in the tap's source code just now (one great upside of open source) and it seems like it has support for views. But yeah, I probably won't go down that route. The tap has support for an offset, so I'll just reload with a huge negative offset; that should be the least time-consuming option on my side. Thanks for the help and input!
1
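As for the offset approach, a sketch of how it might be configured, with the caveat that the setting name `offset_value` is an assumption on my part; check the tap's README for the actual option name and its units:

```yaml
plugins:
  extractors:
    - name: tap-oracle
      config:
        # Hypothetical setting: shifts the incremental bookmark backwards
        # so rows that were already extracted are re-read on the next run.
        offset_value: -1000000
```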