Is the replication key lowercase? With my hubspot ...
# singer-tap-development
s
Is the replication key lowercase? With my hubspot tap, I receive a value of updatedAt which I set as a replication key, but I'm still doing full page replications it seems instead of the incremental:
select:
- '*.*'
metadata:
  '*':
    replication-method: INCREMENTAL
    replication-key: updatedAt
e
Hi @Stéphane Burwash! Are you using meltano run?
s
No, meltano elt, more specifically
meltano elt tap-hubspot target-stitch--hubspot --job_id=blablabla
e
Ok so you're passing a job_id, that was gonna be my next suggestion 😅. It's uncommon to override the metadata for an API tap since that's usually baked in, so the tap may just be ignoring it.
If you dump the catalog, you may be able to see if it at least looks right:
meltano invoke --dump=catalog tap-hubspot
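Once the catalog is dumped to a file, a quick way to check it is to list the replication settings per stream. This is only a sketch: the field names follow the Singer catalog layout quoted later in this thread, and the helper name summarize_replication and the sample catalog are made up for illustration.

```python
import json


def summarize_replication(catalog: dict) -> dict:
    """Map each stream ID to its (replication_method, replication_key) pair."""
    return {
        stream["tap_stream_id"]: (
            stream.get("replication_method"),
            stream.get("replication_key"),
        )
        for stream in catalog.get("streams", [])
    }


# Illustrative catalog; in practice, load the JSON written by the dump command.
sample_catalog = json.loads(
    """
    {"streams": [{"tap_stream_id": "companies",
                  "replication_key": "updatedAt",
                  "replication_method": "INCREMENTAL",
                  "key_properties": ["id"]}]}
    """
)

summary = summarize_replication(sample_catalog)
print(summary)
```

A stream that shows None for either value would be a sign that the override never reached the catalog.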
s
I created my own tap with the sdk, so should I have set the replication method directly in my tap? I think my catalog file looks good, here is what's at the top:
"streams": [
    {
      "tap_stream_id": "companies",
      "replication_key": "updatedAt",
      "replication_method": "INCREMENTAL",
      "key_properties": [
        "id"
      ],
Is there a way to view the current state file? I could make a comparison with the data coming in
e
I created my own tap with the sdk, so should I have set the replication method directly in my tap?
Yeah, but with the SDK that's done automatically if you define a replication_key in the stream
s
I removed the metadata, but I'm still getting the issue, which is weird; with the bigquery loader, I'm even appending full tables instead of updating, which I thought was impossible
I went from 4255 entries to 8510
Which I'm guessing is linked to my replication key
e
which bigquery loader are you using?
s
e
yup, there's a replication_method config option: https://github.com/adswerve/target-bigquery#step-3-configure
s
Ok thanks! I shall look into it and get back to you 😄 Hopefully this is fixed shortly
e
cool. do let me know
s
Well sadly I can't find the source of my error; it's also harder to test since I'm performing all of my testing directly on bigquery, while my main tap actually goes through stitch, which is really our main issue since it's costing us a lot of rows every 15 minutes. Also, I've been having an issue with the start_date, where whatever value I put, the tap will always sync the entire table. Could this be linked to my issue?
e
Also, I've been having an issue with the start_date, where whatever value I put, the tap will always sync the entire table
is this in the custom tap you made with the sdk? it may just be that you need to implement the actual use of the bookmark in a url param or whatever the api expects, like here: https://github.com/MeltanoLabs/tap-stackexchange/blob/9a27f873c27c181c24271a250d8c94e275c32b8e/tap_stackexchange/client.py#L118
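A rough sketch of that idea, assuming the API accepts a timestamp filter as a query parameter. The parameter names updatedAfter and after are hypothetical, as is the helper build_url_params; in an SDK REST stream this logic would sit in get_url_params, with the start value coming from self.get_starting_timestamp(context).

```python
from datetime import datetime, timezone
from typing import Any, Dict, Optional


def build_url_params(
    start: Optional[datetime],
    next_page_token: Optional[str] = None,
) -> Dict[str, Any]:
    """Build query parameters that ask the API only for records after the bookmark."""
    params: Dict[str, Any] = {}
    if start is not None:
        # "updatedAfter" is a made-up name; use whatever filter the API documents.
        params["updatedAfter"] = start.isoformat()
    if next_page_token:
        # "after" is likewise a placeholder for the API's pagination parameter.
        params["after"] = next_page_token
    return params


# With no bookmark and no start_date there is nothing to filter on.
first_run = build_url_params(None)
# With a bookmark, the request is narrowed to newer records.
later_run = build_url_params(datetime(2022, 4, 27, 14, 13, 53, tzinfo=timezone.utc))
print(first_run, later_run)
```

If the endpoint has no such filter, the state value can only be applied after the records are fetched.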
s
Awesome, thank you so much! Back to my replication issue then 😉
Update: I'm back to square one. I created a testing api endpoint on stitch, but every time I sync a table (ex: owners in hubspot) it's counting the data as loaded. So for 132 owners, I'm now up to 396 loaded rows in stitch (but only 132 rows in gbq, my final warehouse). Does this mean the issue is with meltano, or stitch?
e
I guess that means stitch has processed the same 132 rows three times? If you're upserting in bq 132 seems right. Although it's not running incrementally it seems
s
{"type": "STATE", "value": {"bookmarks": {"owners": {"replication_key": "updatedAt", "replication_key_value": "2022-04-27T14:13:53.871Z", "replication_key_signpost": "2022-04-27T14:49:56.738716+00:00", "starting_replication_value": "2022-04-27T14:13:53.871Z", "progress_markers": {"Note": "Progress is not resumable if interrupted.", "replication_key": "updatedAt", "replication_key_value": "2020-03-10T06:42:02.879Z"}}}}} cmd_type=extractor job_id=test_hubspot-to-bigquery name=tap-hubspot (out) run_id=3628d47c-8497-429c-a1af-fd957390ebc8 stdio=stdout
Well it seems my replication key is set properly, and the state exists
Incremental state has been updated at 2022-04-27 144957.473453.
And at the end it says my incremental state has been updated, I just can't see anywhere where it was actually considered 😛
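For comparing the stored bookmark with incoming data, the value can be pulled out of a STATE line like the one above. This is a minimal sketch: the helper name bookmark_value is made up, and the sample line is a shortened copy of the logged message.

```python
import json
from typing import Optional


def bookmark_value(state_line: str, stream: str) -> Optional[str]:
    """Return the stored replication key value for a stream, if any."""
    message = json.loads(state_line)
    if message.get("type") != "STATE":
        return None
    bookmarks = message.get("value", {}).get("bookmarks", {})
    return bookmarks.get(stream, {}).get("replication_key_value")


# Shortened copy of the STATE message from the log.
line = (
    '{"type": "STATE", "value": {"bookmarks": {"owners": '
    '{"replication_key": "updatedAt", '
    '"replication_key_value": "2022-04-27T14:13:53.871Z"}}}}'
)

print(bookmark_value(line, "owners"))
```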
e
ok I just realized I had even starred your tap-hubspot repo 🤦‍♂️. I'm looking at it now...
so it seems you're not using that state anywhere to query the api. You still need to call get_starting_timestamp and use the value in the streams that can be filtered. And looking at other variants of the tap, it seems like some endpoints like owners don't really support filtering, so it's done after the fact: https://github.com/singer-io/tap-hubspot/blob/master/tap_hubspot/__init__.py#L862-L863
s
So I should manually be managing state? I thought the sdk managed that under the hood no?
But ok awesome! From here I should be able to adapt my code. Do you have a code example of someone manually managing state?
e
So I should manually be managing state?
At least reading state, yes. That is actually expected. For "faking" incremental replication on streams that don't really support filtering in the upstream system, we have an issue: https://gitlab.com/meltano/sdk/-/issues/227
But ok awesome! From here I should be able to adapt my code. Do you have a code example of someone manually managing state?
Yup. Something like https://github.com/MeltanoLabs/tap-stackexchange/blob/main/tap_stackexchange/client.py#L117-L118 except you'll most likely have to use it in post_process to filter out unwanted records
s
As always, your help is infinitely helpful, thank you so much! I'll update you when I'm done
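def post_process(self, row: dict, context: Optional[dict] = None) -> Optional[dict]:
    """As needed, append or transform raw data to match expected structure.

    Returns the row, or None if the row is to be excluded.
    """
    if self.replication_key:
        # Bookmark from the previous run; None when there is no state yet.
        start = self.get_starting_replication_key_value(context)
        # Plain string comparison; assumes both values are ISO 8601 strings in the same format.
        if start is not None and row[self.replication_key] < start:
            return None
    return row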
After pain and suffering, I think we got it! So just to recap, with the sdk, we need to manage most interactions / settings (ex: start date, replication, etc.)
Again, thank you so much @edgar_ramirez_mondragon
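One caveat with the filter above: it compares timestamps as strings, and the state shown earlier mixes a "Z" suffix with a "+00:00" offset. A slightly more defensive sketch parses both sides first. The helper names parse_ts and keep_updated and the sample rows are made up for illustration.

```python
from datetime import datetime
from typing import Iterable, List, Optional


def parse_ts(value: str) -> datetime:
    """Parse an ISO 8601 timestamp, accepting a trailing 'Z' for UTC."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def keep_updated(
    rows: Iterable[dict],
    bookmark: Optional[str],
    key: str = "updatedAt",
) -> List[dict]:
    """Keep rows whose replication key is at or after the bookmark."""
    if bookmark is None:
        # No state yet: behave like a full sync.
        return list(rows)
    start = parse_ts(bookmark)
    return [row for row in rows if parse_ts(row[key]) >= start]


rows = [
    {"id": 1, "updatedAt": "2020-03-10T06:42:02.879Z"},
    {"id": 2, "updatedAt": "2022-04-27T14:13:53.871Z"},
    {"id": 3, "updatedAt": "2022-04-27T15:00:00.000Z"},
]

kept = keep_updated(rows, "2022-04-27T14:13:53.871+00:00")
print([row["id"] for row in kept])
```

Rows equal to the bookmark are kept, matching the strict less-than check in post_process, so the most recent record may be re-sent on each run.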
e
So just to recap, with the sdk, we need to manage most interactions / settings (ex: start date, replication, etc.)
Yeah. Unfortunately we (team and community) haven't come up with the right abstractions that might work declaratively for any sort of filtering the source might do, so it's left for the dev to implement 🙂