# best-practices
e
I think I've asked this before but can't remember the exact answer and can't find the previous thread. I need to run a meltano full refresh for data for a specific time period. I know I can use `start_date`, but I believe `end_date` is not a thing? Does anyone have a recommendation for conditionally replicating data?
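For reference, roughly what I'm running today. This is only a sketch of my `meltano.yml`; the `start_date` lower bound depends on the extractor supporting that setting, and the plugin names just reflect my setup:

```yaml
# meltano.yml (sketch)
plugins:
  extractors:
    - name: tap-postgres
      config:
        # a lower bound works on taps that accept start_date...
        start_date: "2024-01-01T00:00:00Z"
        # ...but there is no matching end_date setting to cap the window
  loaders:
    - name: target-bigquery
```

and then something like `meltano run --full-refresh tap-postgres target-bigquery` for the refresh itself.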
a
This would be the kind of thing that individual taps might implement. What is your source? It might help also to understand why running the tap in full up to today's date isn't a good solution for you.
e
The source is postgres and destination is bigquery.
There are a few reasons not to run up to today's date:
1. Replicating data for only a specific time period. For example, I want to replicate only 2024 data to a destination for cold storage.
2. Records are updated but the timestamp is not. For example, data is updated to reflect a business change but the timestamp isn't touched.
3. Batching replication for very large datasets. For example, I would like to replicate the data for Jan, then Feb, then March, etc., or replicate X records at a time.
a
Thanks, I can't offer you a method for this kind of 'partitioned' refresh unfortunately, unless you are able to customise the tap to accept an `end_date` as a config item. Alternatively you could try a stream filter: https://sdk.meltano.com/en/latest/stream_maps.html#filtering-out-records-from-a-stream-using-filter-operation. This would probably involve extracting ALL the data from your tap only to discard most of it (wherever the replication key is greater than your `end_date`), which might not solve the issues listed above.
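Roughly what I have in mind, as a sketch only. It assumes a stream named `public-orders` with an ISO-formatted `updated_at` replication key (the string comparison works because ISO-8601 timestamps sort lexicographically); swap in your own stream and column names:

```yaml
plugins:
  extractors:
    - name: tap-postgres
      config:
        stream_maps:
          # stream and column names are illustrative
          public-orders:
            # keep only 2024 records; everything else is still extracted
            # from Postgres but dropped before it reaches the target
            __filter__: >-
              updated_at >= '2024-01-01' and updated_at < '2025-01-01'
```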
e
Thanks @Andy Carter, that is actually an interesting idea and I'll probably test it since I've got nothing to lose. I'd really hate to fork and maintain `tap-postgres` just for one small change.