# singer-tap-development
stephen_bailey
Is there a recommended way to handle the situation where a stream has an `updatedAt` field and a `createdAt` field, but the `updatedAt` field is null after the initial creation (i.e. never updated)? Is it acceptable, for example, to use `post_process` to create a `created_or_updated_at` field that combines the two?
The problem is that using `updatedAt` results in replication key errors when the value is null:
```
File "/.../site-packages/singer_sdk/helpers/_state.py", line 220, in increment_state
    if old_rk_value is None or new_rk_value >= old_rk_value:
TypeError: '>=' not supported between instances of 'NoneType' and 'int'
```
edgar_ramirez_mondragon
> Is it acceptable, for example, to use `post_process` to create a `created_or_updated_at` field that combines the two?
I think it is perfectly acceptable 😄. I wonder if the SDK should expose an API for overriding the way the replication key value is extracted/composed from the record.
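For reference, a minimal sketch of that coalescing approach, assuming a `singer_sdk` `RESTStream` subclass; the stream name, `url_base`, and field names are illustrative only, not from the original tap:

```python
"""Sketch: coalesce updatedAt/createdAt into a derived replication key."""
from __future__ import annotations

from singer_sdk import typing as th
from singer_sdk.streams import RESTStream


class WidgetsStream(RESTStream):
    """Hypothetical stream; names and url_base are placeholders."""

    name = "widgets"
    path = "/widgets"
    url_base = "https://api.example.com"
    primary_keys = ["id"]
    replication_key = "created_or_updated_at"  # derived field, not from the API

    schema = th.PropertiesList(
        th.Property("id", th.IntegerType),
        th.Property("createdAt", th.DateTimeType),
        th.Property("updatedAt", th.DateTimeType),
        # Declare the derived property in the stream's schema as well.
        th.Property("created_or_updated_at", th.DateTimeType),
    ).to_dict()

    def post_process(self, row: dict, context: dict | None = None) -> dict | None:
        # Fall back to createdAt when the record was never updated (updatedAt is null).
        row["created_or_updated_at"] = row.get("updatedAt") or row["createdAt"]
        return row
```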
stephen_bailey
that's a very nice workaround on the tap side. unfortunately there are now API-side issues, because I would have to implement separate logic for the list of `created_after` records and then do a separate pull of `updated_after` records; if I just filter by `updated_after`, it won't pull records that were newly created but never updated! 😭
a
Re @edgar_ramirez_mondragon:
> I think it is perfectly acceptable...

Ditto. Yeah, that's what I'd do too. 👍 Using `post_process()` to add the custom field sounds perfect, and (friendly reminder) you'll want to declare that property in the stream's schema as well.

Re @stephen_bailey:
> ...unfortunately there are now api side issues...

😢 - sorry dude. Some APIs are just badly designed... If there's no way to "just get all the records created or modified since this timestamp", then you'll have to make two rounds of calls as you described. There's a way to do this with complex pagination tokens: basically define the `next_page_token` as a custom dict and make sure you loop through both extraction patterns before returning `None` from `get_next_page_token()` (see the sketch below).
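A rough sketch of what that two-pass pagination could look like, continuing the hypothetical stream above. The `next_cursor` response field and the `created_after`/`updated_after` query parameters are assumptions about the upstream API, not singer_sdk behavior; other stream attributes are omitted for brevity:

```python
from singer_sdk.streams import RESTStream


class WidgetsStream(RESTStream):
    """Hypothetical stream; schema and other attributes omitted."""

    name = "widgets"
    path = "/widgets"
    url_base = "https://api.example.com"

    def get_next_page_token(self, response, previous_token):
        token = previous_token or {"phase": "created", "cursor": None}
        cursor = response.json().get("next_cursor")  # assumed API pagination field
        if cursor:
            # More pages remain in the current phase.
            return {"phase": token["phase"], "cursor": cursor}
        if token["phase"] == "created":
            # The created_after pass is exhausted; start the updated_after pass.
            return {"phase": "updated", "cursor": None}
        return None  # both passes are done, so stop paginating

    def get_url_params(self, context, next_page_token):
        token = next_page_token or {"phase": "created", "cursor": None}
        params = {}
        if token["cursor"]:
            params["cursor"] = token["cursor"]
        start = self.get_starting_timestamp(context)  # singer_sdk bookmark helper
        if start:
            key = "created_after" if token["phase"] == "created" else "updated_after"
            params[key] = start.isoformat()
        return params
```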
stephen_bailey
one thing about working with APIs via Meltano is you begin to really appreciate the small things about good interfaces! 🙂
a
Right!? I remember a few years back in the data space everyone was so excited about API layers to abstract people away from direct SQL interfaces - but not many folks were aware how easy it was to build a bad API - in terms of ability to efficiently serve data that consumers need. @dan_mosora (Talend/Stitch) did a great talk on API design for ELT/ETL. Slide 24 from his PPT basically covers this topic: api-design-for-etl/API Design for ETL.pptx
dan_mosora
Thanks for the plug 😄 Since I'm here, I'll chime in a bit. In the past, we've done a sort of "coalesce" from the `created_at` into the `updated_at` to maintain consistency of the replication key, rather than inventing a new field. Both approaches are probably fine; it depends on what kind of flexibility you want.
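In singer_sdk terms, that coalesce-in-place variant might look something like the following (again just a sketch; it keeps `updatedAt` as the replication key instead of adding a new field):

```python
def post_process(self, row: dict, context: dict | None = None) -> dict | None:
    # Coalesce in place: backfill a null updatedAt with createdAt so the
    # existing replication key never compares against None.
    row["updatedAt"] = row.get("updatedAt") or row["createdAt"]
    return row
```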
To revisit this concept since it just came up again: I've just spoken with another group of developers who had a similar problem (rows were updated based on the max of two fields). The solution they chose, at least for now, is to have two fields listed in metadata as `valid-replication-keys`, and the sync mode goes through the data set and uses the max of the two. This is kind of an interesting concept because there's the question of whether to read multiple replication keys as `OR` or `AND`. Maybe it's a bit less of a concern than something like key properties, since a replication key is purely informational for either an orchestration/UI/platform actor or the tap's own sync mode, but I thought it was an interesting idea to bring up.