# singer-taps
**Lior Naim Alon:**
I'm using tap-hubspot. My goal is to extract all contacts data, run some transformations in my DWH, then write the enriched data back to HubSpot. Ever since many records were added in HubSpot (it's a sandbox environment), the tap has been looping endlessly, writing the same duplicate records to my target over and over again. The records don't seem to differ between passes, so I suspect the tap is not using a proper paging mechanism to progress through the extraction. Here's my `meltano.yml`:

```yaml
version: 1
default_environment: dev
environments:
- name: dev
- name: staging
- name: prod
state_backend:
  uri: s3://dwh/meltano-states/
  s3:
    aws_access_key_id: ${AWS_ACCESS_KEY_ID}
    aws_secret_access_key: ${AWS_SECRET_ACCESS_KEY}
plugins:
  extractors:
  - name: tap-hubspot
    # python: python
    variant: meltanolabs
    pip_url: git+https://github.com/MeltanoLabs/tap-hubspot.git@v0.6.3
    config:
      start_date: '2020-01-01'
    select:
    - contacts.*
  loaders:
  - name: target-s3
    variant: crowemi
    pip_url: git+https://github.com/crowemi/target-s3.git
    config:
      append_date_to_filename: true
      append_date_to_filename_grain: microsecond
      partition_name_enabled: true
  - name: target-s3--hubspot
    inherit_from: target-s3
    config:
      format:
        format_type: parquet
      prefix: dwh/hubspot
      flattening_enabled: false
```
**Edgar Ramírez (Arch.dev):**
Hi @Lior Naim Alon! Contacts are extracted incrementally based on the `lastmodifieddate` field, so it's possible for duplicate data to be written by the S3 loader. Does your sync look similar to what's described in https://github.com/meltano/sdk/issues/2154?
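For anyone trying to understand how an incremental sync can loop like this, here is a minimal, hypothetical sketch — not the actual tap-hubspot code — of a tap that pages through an API by filtering on its replication key and resuming from the last record's timestamp. If more records share a single `lastmodifieddate` than fit in one page (e.g. after a bulk insert into a sandbox), an inclusive cursor can never advance past them, and the same page is re-emitted on every drain:

```python
from datetime import datetime, timedelta

PAGE_SIZE = 3  # tiny page size for illustration

def fetch_page(records, since):
    """Return up to PAGE_SIZE records with lastmodifieddate >= since."""
    matching = [r for r in records if r["lastmodifieddate"] >= since]
    return matching[:PAGE_SIZE]

def sync(records, start, max_pages=10):
    emitted = []
    bookmark = start
    for _ in range(max_pages):
        page = fetch_page(records, bookmark)
        if not page:
            break
        emitted.extend(page)
        # Naive cursor: resume from the last record's timestamp (inclusive).
        # When many records tie on that timestamp, the filter matches the
        # same records again and the sync makes no forward progress.
        bookmark = page[-1]["lastmodifieddate"]
    return emitted

t = datetime(2024, 1, 1)
# Five records bulk-created with the identical timestamp:
records = [{"id": i, "lastmodifieddate": t} for i in range(5)]
out = sync(records, start=t - timedelta(days=1))
# The same three records are emitted on every page: many rows, few ids.
print(len(out), len({r["id"] for r in out}))  # → 30 3
```

A real fix typically pages by a strictly-increasing tiebreaker (e.g. the record id) in addition to the timestamp, or uses an exclusive cursor; this sketch only illustrates the failure mode, not how any particular tap implements paging.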
**Lior Naim Alon:**
Hello @Edgar Ramírez (Arch.dev)! In general, yes, it seems very similar, except that in my case I cannot finish a single sync operation: the tap appears to replicate a seemingly random batch of 10k records every time the sink is drained. Since I'm using S3 as a target, I used Athena to look at the data as it was being written, and ran a query with `count(1)` and `count(distinct id)`. After the first batch (sink drain?) both counts were 10,000, but after the second batch the distinct count had only grown by a few hundred (to 10,230) while the total count was 20,000.

I don't have time to look into the source code, but I suspect that the tap does not maintain state between batches for some reason, or that there's a configuration issue I'm missing. In any case, for now I've started using the Airbyte variant, which is simple to use and does not seem to suffer from these symptoms.
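For reference, the duplicate check described above could look something like this in Athena (the table name here is hypothetical — it depends on how your Glue/Athena table over the S3 prefix is defined):

```sql
-- Compare total rows against distinct contact ids to measure duplication.
SELECT count(1)           AS total_rows,
       count(DISTINCT id) AS distinct_ids
FROM hubspot_contacts;
```

A growing gap between `total_rows` and `distinct_ids` across sink drains is a quick signal that the tap is re-emitting already-extracted records rather than advancing through new ones.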