More fun with `tap-mssql` Is syntax such as this n...
# singer-taps
d
More fun with
tap-mssql
Is syntax such as this normal?
select_sql += " WHERE \"{}\" >= %(replication_key_value)s ORDER BY \"{}\" ASC".format(
This is from
sync_strategies.incremental.sync_table()
We can see that it grabs data greater than or equal to the current bookmark. Shouldn't it just be greater than? I'm finding that the same few records are getting inserted over and over again in my target. I can filter but it's annoying.
Found some doco from Stitch that says this is normal and expected (seems bizarre to me), but is usually taken care of in the targets upsert logic. Seems I need to fix my target config.
I'm still not sure I understand the rationale...just in case the most recent job fails to load but still updates the bookmark I guess? But if that happens, it could easily happen multiple times in a row and you still miss data...
v
Yeah this is a singer thing I've seen others chat about this. Should probably get added to https://hub.meltano.com/singer/spec at some point. My target I"m saying I only handle FULL_TABLE 's at the moment. I think the idea is that you could have multiple records with the same replication key. So even though your state ended on one record with that replication key it doesn't mean you synced all records with that replication key
I know it's bad what I'm doing with my target but I haven't migrated to the singersdk yet for it
d
This is discussed in https://gitlab.com/meltano/meltano/-/issues/2504:
incremental replication mode intentionally duplicates the last-emitted record to guard against race conditions.
a
I agree 100% with everyone else's comments here: (1) the
>=
logic is necessary to make sure we don't capture duplicates and (2) it sure would be nice if a dedupe algorithm existed that would make this easier to work with downstream. So.... here's a possible proposal which combines greater-than-or-equal-to logic with a dedupe algorithm focused just on keeping track of which ties have been seen so far for the max value. I've actually been meaning to log this for a while and was finally inspired through this thread. If anyone wants to review this spec proposal, more eyeballs are much appreciated. Spec proposal for duplicate-proof incremental replication (#162) · Issues · Meltano / Meltano SDK for Singer Taps and Targets · GitLab Standard disclaimer: no promises that this will be built - this is still in thought-experiment stage for review/discussion.
d
@aaronsteers I like the proposal. Feels quite clean and keeps the intended logic intact. Will really help for file based targets like CSV as well. The concept is similar to Data Vaults
hashdiff
field.
Along similar lines of thinking, it would be nice to have an option in targets to choose between upsert (if supported) and an enhancement to append-only that does a Data Vault style or Type 2 SCD style insert. e.g. in upsert mode the number of rows in the target will always match the source as the changing records are updated In append-mode it will write out a new record if changed, and skip those that are identical (for the dupes captured by
>=
)
I suppose with this proposal, we actually get this automatically as the records would be de-duped on the tap side. But I'm unsure if you can reliably influence the tap to choose between upsert/append modes
a
I'm away from my laptop right now so can't dig up the issue link, but there's a known workaround to null-out primary keys during the sync operation. I'd love to make this a native concept but intentionally nulling/nuking the key-properties will get you the "append-only" behavior instead of the default upserts. (Note to self to add that link here.)
d
@aaronsteers Yep I found that behaviour accidentally. I am pulling from a set of views in my upstream Azure SQL source. I had set `table-key-properties`but because they were views, the tap
pipeline-tap-mssql
has some smarts to detect the
is-view
attribute and then read from
view-key-properties
instead. That all makes sense but I found another bug that meant if you left
table-key-properties
blank, the import would fail as it was a required field. This tricked me into thinking that I should use
table-keys
and not
view-keys
, and using both seemed...wrong. In my PR I have updated the logic to properly use
view-keys
for views and
table-keys
for tables and it works a treat for upserts now.