I am using the Postgres tap on a couple of tables ...
# getting-started
w
I am using the Postgres tap on a couple of tables with millions of records (maybe 50 million in each). The replication key column for both is the updated_at column in the source database. The tap is selecting the records updated since the last run (obviously) and then ordering them by updated_at (not sure why it is doing this, unless the assumption is that you want records ordered sequentially). I don't have an index on updated_at currently, as it wasn't used by the source systems, but I suppose I should add one? The extraction is taking a number of hours to complete each day.
a
Hi @will_johnson, the tap will be selecting the records since the latest updated_at value it received, not since the last moment it was run. If the records were to arrive out of order and the extraction failed partway through, you could end up in a situation where the last updated_at received was later than some records it hadn't processed yet, and those records would be skipped on the next run. Make sense?
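To make the failure case concrete, here is a rough sketch in plain Python. It is not the tap's actual code: `extract`, `sync_with_one_failure`, and the integer timestamps are all made up for illustration. It simulates a run that crashes after two records, then a rerun that resumes from the highest updated_at seen so far.

```python
# Hypothetical illustration only; this is not how the tap is implemented.

def extract(rows, start, fail_after=None):
    """Emit ids of rows with updated_at > start, in the order given.

    Stops early after `fail_after` rows to simulate a crash.
    Returns (emitted_ids, bookmark), where bookmark is the highest
    updated_at value seen among the emitted rows.
    """
    emitted = []
    bookmark = start
    for row in rows:
        if row["updated_at"] <= start:
            continue  # already covered by the previous bookmark
        if fail_after is not None and len(emitted) == fail_after:
            break  # simulated failure partway through the extraction
        emitted.append(row["id"])
        bookmark = max(bookmark, row["updated_at"])
    return emitted, bookmark


def sync_with_one_failure(rows):
    """Run once with a crash after two rows, then rerun from the saved bookmark."""
    first, bookmark = extract(rows, start=0, fail_after=2)
    second, _ = extract(rows, start=bookmark)
    return sorted(first + second)


# Integer timestamps keep the example small; real values would be timestamps.
rows_unordered = [
    {"id": 1, "updated_at": 10},
    {"id": 2, "updated_at": 40},
    {"id": 3, "updated_at": 20},
    {"id": 4, "updated_at": 30},
]
rows_ordered = sorted(rows_unordered, key=lambda r: r["updated_at"])

lost = sync_with_one_failure(rows_unordered)   # ids 3 and 4 never arrive
complete = sync_with_one_failure(rows_ordered)  # every id arrives
```

With the unordered input the bookmark jumps to 40 before the crash, so the rerun never picks up the rows stamped 20 and 30. With the ordered input the bookmark only ever moves past rows that were already emitted.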
w
So, WHERE updated_at > (latest updated_at from last extraction). I got that. I see what you're saying about an extraction possibly failing. Thanks. I think I will add an index to updated_at.
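For reference, a minimal sketch of that query shape with an index on the replication key. It uses Python's built-in sqlite3 as a stand-in so it runs anywhere; the `orders` table, the index name, and the bookmark value are assumptions, and the exact SQL the tap generates against Postgres may differ.

```python
import sqlite3

# In-memory SQLite database as a stand-in for the Postgres source (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, updated_at TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO orders (id, updated_at) VALUES (?, ?)",
    [(i, f"2024-01-{i:02d}T00:00:00") for i in range(1, 11)],
)

# Index on the replication key, so both the filter and the sort can use it.
conn.execute("CREATE INDEX idx_orders_updated_at ON orders (updated_at)")

# Shape of the incremental query discussed above (hypothetical table name).
INCREMENTAL_QUERY = (
    "SELECT id, updated_at FROM orders "
    "WHERE updated_at > ? ORDER BY updated_at"
)

# Latest updated_at received in the previous extraction (made-up value).
bookmark = "2024-01-07T00:00:00"
new_rows = conn.execute(INCREMENTAL_QUERY, (bookmark,)).fetchall()

# Ask SQLite how it would run the query; the last column holds the description.
plan = " ".join(
    row[-1]
    for row in conn.execute("EXPLAIN QUERY PLAN " + INCREMENTAL_QUERY, (bookmark,))
)
```

On a large, live Postgres table it may be worth looking at `CREATE INDEX CONCURRENTLY`, which builds the index without blocking writes, and at `EXPLAIN` on the real query to confirm the planner actually uses the new index.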