# best-practices
l
Hey, we are facing a problem where we are trying to do an incremental load (using the Oracle connector https://github.com/s7clarke10/pipelinewise-tap-oracle, but I assume the issue would be the same for many connectors). The table has multiple columns that together act as the replication key; it looks a bit like this:

| process_start_time | process_end_time |
| --- | --- |
| 2024-07-02 | NULL |
| 2024-07-02 | 2024-07-03 |

Has anyone faced the same issue and/or knows how to solve it, preferably without having to fork the connector? Thanks!
v
what's your problem?
l
Sorry if that wasn't clear. The tap doesn't support multiple replication keys. So if I set `replication_key = process_end_time`, I'll miss some rows, as not all processes will end and write to the table. If I set `replication_key = process_start_time`, I won't get rows when they are updated.
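For context, the replication key in Meltano is set per stream in `meltano.yml`, and only a single column can be named there. A minimal sketch, assuming the s7clarke10 variant linked above and a hypothetical stream ID `MY_SCHEMA-PROCESS_LOG` (all schema, table, and column names here are illustrative):

```yaml
plugins:
  extractors:
    - name: tap-oracle
      variant: s7clarke10
      metadata:
        # Stream IDs for database taps are typically SCHEMA-TABLE;
        # this one is hypothetical.
        MY_SCHEMA-PROCESS_LOG:
          replication-method: INCREMENTAL
          # Only one column can go here, which is the crux of the
          # problem described above.
          replication-key: process_end_time
```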
v
Sorry, I'm not following exactly how to help you without spending a bunch of time. It'd be very helpful if you share your `meltano.yml` and give an example. Maybe someone else will chime in!
1
Generally, multiple replication keys "should" work; it's possible the tap doesn't support it and you'd have to fork it. On second thought, I haven't ever set up anything with multiple replication keys and I'm not sure exactly how that would play out; I was thinking of primary keys. Sorry I can't spend more cycles on this today, but I wanted to share what I know / don't know.
c
you could create a view that performs a `coalesce(process_end_time, process_start_time)` and tap that… which I know is not ideal. I had a similar pattern for a series of tables in my database, but figured out the tables were so small (<500 MB) that just performing a full table copy each day was acceptable, so I didn't need to create a view. If `coalesce` won't work, you could do some other sort of logic with a `case` statement, or something crazier like calculating a row version number composed from start_time/end_time (treating them as integers). These were the sort of heroics done in the dark early days of ETL, so don't do this if you don't have to. 🙂
2
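A minimal sketch of the view idea, assuming a source table `MY_SCHEMA.PROCESS_LOG` (all names here are illustrative):

```sql
-- Illustrative names throughout; adjust to the real schema/table.
CREATE OR REPLACE VIEW my_schema.process_log_v AS
SELECT
    t.*,
    -- Single bookmark column for the tap: while the process is still
    -- running (PROCESS_END_TIME is NULL), fall back to the start time;
    -- once the end time is written, the row's bookmark moves forward
    -- past the stored state and the row is extracted again.
    COALESCE(t.process_end_time, t.process_start_time) AS replication_ts
FROM my_schema.process_log t;
```

The tap's `replication-key` would then point at `replication_ts` instead of either raw column.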
l
@visch Thanks for your input! @Charles Feduke I think a coalesce would work, and I actually thought of a similar pattern, but I wasn't sure that A. it would go over well with the catalog discovery functionality of the tap (I'd have to make sure first that views are treated the same as tables there) and B. the database admin would give us that permission (this, I think, is the difficult part). But good to hear that someone thought about a similar pattern to use. My other nuclear option is to fork the tap, allow a "custom query" to be used, require it to return data in the same format as the table, and just feed that to the target.
c
Because I never went the view route, I didn't actually verify that it'd work with the tap's catalog routine; I'm operating on the assumption that it would. And I'm in the same place mentally: I do not like the idea of ever modifying the source system to facilitate an ELT process.
1
v
If you have fewer than ~50 million records, I wouldn't do anything but full table replication myself, unless there's some constraint on speed somewhere.
Networking, compute, and storage are cheap these days; human time isn't.
💯 3
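For completeness, switching a stream to full table replication is a one-line change in `meltano.yml` (same hypothetical stream name as in the sketch above):

```yaml
plugins:
  extractors:
    - name: tap-oracle
      metadata:
        MY_SCHEMA-PROCESS_LOG:
          # Re-extract everything on each run; no bookmark to get wrong.
          replication-method: FULL_TABLE
```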
l
I looked in the tap's source code just now (one great upside of open source) and it seems like it has support for views. But yeah, I probably won't go down that route. The tap has support for an offset, so I'll just reload with a huge negative offset; that should be the least time-consuming option on my side. Thanks for the help and input!
1
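As for the offset approach, a sketch of how it might be configured, with the caveat that the setting name `offset_value` is an assumption on my part; check the tap's README for the actual option name and its units:

```yaml
plugins:
  extractors:
    - name: tap-oracle
      config:
        # Hypothetical setting: shifts the incremental bookmark backwards
        # so rows that were already extracted are re-read on the next run.
        offset_value: -1000000
```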