Following up on an <earlier question> (thanks for ...
# singer-tap-development
j
Following up on an earlier question (thanks for pointing us to
ACTIVATE_VERSION
!) with a clearer understanding, I'm curious about singer best practice in this situation. (cc @ryan_bell) Context: we want to sync Klaviyo list members, which is set of emails belonging to an email list, and users can be added and removed from the list, making it not a typical
INCREMENTAL
-type replication but more like a many-to-many join table. We could do
FULL_TABLE
replication and then use
ACTIVATE_VERSION
to delete the version from older runs. However this would still require us to stream all of the data for all of the lists on every run. Some of these lists have millions of members, and many of them do not update frequently, so we were thinking of using state to keep track of last-updated-at for each list and only replicate lists that have changed. I see a few options here, want to get people's thoughts on best approach or if there's one we haven't considered: 1. Suck it up and replicate all the members of all the lists on every run. This may not be an option as we want these updated at a minimum-daily rate and replication takes several hours 2. Send each email list as it's own stream. Then
ACTIVATE_VERSION
will work and each list can track it's own state. However, this seems like bad practice as the streams are now dynamic 3. Don't send
ACTIVATE_VERSION
, instead use some custom python code after the regular EL pipline to do a more complex implementation of cleaning up old records 4. Maybe there's some kind of partitioning solution in Singer I'm not aware of that could handle this?
a
@julian_knight - Can you clarify slightly on the requirements? 1. Do you have a top-level date on each list when it has last added/dropped members? (Hopefully yes.) 2. Do you have list-membership-level update timestamps? (Assuming yes, but it's limited value anyway without #3) 3. Do you have soft-delete flags for dropped members? (I think you already said no, but that's the kicker so just want to confirm.)
Assuming "yes", to #1 and "no" to #3, the "ideal" situation would probably be a parent-child construct between list and list members - where a parent-level timestamp increment triggers a new full retrieval of just that list. In future, perhaps this could be combined with a partition-scoped ACTIVATE_VERSION message which doesn't yet exist. However, in append-only mode, you could use SQL to detect which older versions of the list memberships have been replaced by newer ones and filter out those older versions. Does that make sense? WDYT?
FWIW, I saw the same problem in the Pardot API, which similarly tracks marketing email list and list memberships - but without good update tracking and without soft delete flags.
j
answers: yes, no, and no
where a parent-level timestamp increment triggers a new full retrieval of just that list
That's what we were thinking. The problem is how do we clear out old versions of list members? oh wait, you mean using parent-child streams, right? That's a feature we haven't dug into yet but I'll check it out now Okay I read your message again and I see what you're saying now. I think your solution is the same as option 3 in the OP. Thanks for your help!
a
Yes - similar to #3 but without a soft-delete flag the only way I know to actually remove old members would be (1) sync the whole table each time and infer deletions globally or (2) sync each list as it changes so you can infer deletions locally to that list.
Either way, I wish they had soft-delete flags. 🙃