# singer-tap-development
stephen_bailey
Is there a recommended way to handle the situation where a stream has an `updatedAt` field and a `createdAt` field, but the `updatedAt` field is null after the initial creation (i.e. never updated)? Is it acceptable, for example, to use `post_process` to create a `created_or_updated_at` field that combines the two?
The problem is that using `updatedAt` results in replication key errors when the value is null:
```
File "/.../site-packages/singer_sdk/helpers/_state.py", line 220, in increment_state
    if old_rk_value is None or new_rk_value >= old_rk_value:
TypeError: '>=' not supported between instances of 'NoneType' and 'int'
```
edgar_ramirez_mondragon
> Is it acceptable, for example, to use `post_process` to create a `created_or_updated_at` field that combines the two?
I think it is perfectly acceptable 😄. I wonder if the SDK should expose an API for overriding the way the replication key value is extracted/composed from the record.
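For reference, a minimal sketch of that coalescing approach, assuming a `singer_sdk` `RESTStream` subclass; the stream name, `url_base`, and field names are illustrative only, not from the original tap:

```python
"""Sketch: coalesce updatedAt/createdAt into a derived replication key."""
from __future__ import annotations

from singer_sdk import typing as th
from singer_sdk.streams import RESTStream


class WidgetsStream(RESTStream):
    """Hypothetical stream; names and url_base are placeholders."""

    name = "widgets"
    path = "/widgets"
    url_base = "https://api.example.com"
    primary_keys = ["id"]
    replication_key = "created_or_updated_at"  # derived field, not from the API

    schema = th.PropertiesList(
        th.Property("id", th.IntegerType),
        th.Property("createdAt", th.DateTimeType),
        th.Property("updatedAt", th.DateTimeType),
        # Declare the derived property in the stream's schema as well.
        th.Property("created_or_updated_at", th.DateTimeType),
    ).to_dict()

    def post_process(self, row: dict, context: dict | None = None) -> dict | None:
        # Fall back to createdAt when the record was never updated (updatedAt is null).
        row["created_or_updated_at"] = row.get("updatedAt") or row["createdAt"]
        return row
```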
stephen_bailey
that's a very nice workaround on the tap side. unfortunately there are now API-side issues, because I would have to implement separate logic for the list of `created_after` records and then do a separate pull of `updated_after` records; if I just filter by `updated_after`, it won't pull records that were newly created but never updated! 😭
a
Re @edgar_ramirez_mondragon:
> I think it is perfectly acceptable...

Ditto. Yeah, that's what I'd do too. 👍 Using `post_process()` to add the custom field sounds perfect, and (friendly reminder) you'll want to declare that property in the stream's schema as well.

Re @stephen_bailey:
> ...unfortunately there are now api side issues...

😢 - sorry dude. Some APIs are just badly designed... If there's no way to "just get all the records created or modified since this timestamp", then you'll have to make two rounds of calls as you described. There's a way to do this with complex pagination tokens: basically define the `next_page_token` as a custom dict and make sure you loop through both extraction patterns before returning `None` from `get_next_page_token()` (see the sketch below).
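A rough sketch of what that two-pass pagination could look like, continuing the hypothetical stream above. The `next_cursor` response field and the `created_after`/`updated_after` query parameters are assumptions about the upstream API, not singer_sdk behavior; other stream attributes are omitted for brevity:

```python
from singer_sdk.streams import RESTStream


class WidgetsStream(RESTStream):
    """Hypothetical stream; schema and other attributes omitted."""

    name = "widgets"
    path = "/widgets"
    url_base = "https://api.example.com"

    def get_next_page_token(self, response, previous_token):
        token = previous_token or {"phase": "created", "cursor": None}
        cursor = response.json().get("next_cursor")  # assumed API pagination field
        if cursor:
            # More pages remain in the current phase.
            return {"phase": token["phase"], "cursor": cursor}
        if token["phase"] == "created":
            # The created_after pass is exhausted; start the updated_after pass.
            return {"phase": "updated", "cursor": None}
        return None  # both passes are done, so stop paginating

    def get_url_params(self, context, next_page_token):
        token = next_page_token or {"phase": "created", "cursor": None}
        params = {}
        if token["cursor"]:
            params["cursor"] = token["cursor"]
        start = self.get_starting_timestamp(context)  # singer_sdk bookmark helper
        if start:
            key = "created_after" if token["phase"] == "created" else "updated_after"
            params[key] = start.isoformat()
        return params
```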
stephen_bailey
one thing about working with APIs via Meltano is you begin to really appreciate the small things about good interfaces! 🙂
a
Right!? I remember a few years back in the data space everyone was so excited about API layers to abstract people away from direct SQL interfaces - but not many folks were aware how easy it was to build a bad API - in terms of ability to efficiently serve data that consumers need. @dan_mosora (Talend/Stitch) did a great talk on API design for ELT/ETL. Slide 24 from his PPT basically covers this topic: api-design-for-etl/API Design for ETL.pptx
dan_mosora
Thanks for the plug 😄 Since I'm here, I'll chime in a bit. In the past, we've done a sort of "coalesce" from the `created_at` into the `updated_at` to maintain consistency of the replication key, rather than inventing a new field. Both approaches are probably fine; it depends on what kind of flexibility you want.
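In singer_sdk terms, that coalesce-in-place variant might look something like the following (again just a sketch; it keeps `updatedAt` as the replication key instead of adding a new field):

```python
def post_process(self, row: dict, context: dict | None = None) -> dict | None:
    # Coalesce in place: backfill a null updatedAt with createdAt so the
    # existing replication key never compares against None.
    row["updatedAt"] = row.get("updatedAt") or row["createdAt"]
    return row
```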
To revisit this concept since it just came up again: I've just spoken with another group of developers who had a similar problem (rows were updated based on the max of two fields). The solution they chose, at least for now, is to have two fields listed in metadata as `valid-replication-keys`, and the sync mode goes through the data set and uses the max of the two. This is kind of an interesting concept because there's the question of whether to read multiple replication keys as `OR` or `AND`. Maybe it's a bit less of a concern than something like key properties, since a replication key is purely informational for either an orchestration/UI/platform actor or the tap's own sync mode, but I thought it was an interesting idea to bring up.