# plugins-general
j
Hi, I'm running tap-shopify and I'm facing a weird behavior that I haven't found a solution for yet. The elt run doesn't seem to be respecting the bookmark, so every time I run the elt I get duplicated records in BQ. I'm running the elt using:
Copy code
meltano elt tap-shopify target-bigquery --full-refresh --job_id=shopify_to_bq
On my
.env
, I'm specifying the state and the catalog like:
Copy code
TAP_SHOPIFY__CATALOG=extract/tap-shopify.catalog.json
TAP_SHOPIFY__STATE=extract/tap-shopify.state.json
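(For what it's worth, I think the same extras could also live in meltano.yml instead of .env; a rough sketch, using the same paths as above:)
Copy code
plugins:
  extractors:
    - name: tap-shopify
      catalog: extract/tap-shopify.catalog.json
      state: extract/tap-shopify.state.json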
In my catalog, I have the replication method and key set as:
Copy code
"metadata": {
  "selected": true,
  "table-key-properties": ["id"],
  "forced-replication-method": "INCREMENTAL",
  "valid-replication-keys": ["updated_at"]
}
I can see the bookmark line in the console, but the state file remains empty:
Copy code
{
  "bookmarks": {
    "currently_sync_stream": "transactions",
    "orders": {
      "since_id": 123,
      "updated_at": "2021-05-24T22:18:48.000000Z"
    },
    "products": {
      "since_id": 456,
      "updated_at": "2021-05-24T22:18:55.000000Z"
    },
    "transaction_orders": {
      "since_id": 789,
      "updated_at": "2021-05-24T22:19:20.000000Z"
    },
    "transactions": {
      "created_at": "2021-05-24T22:17:37.000000Z"
    }
  }
}
Also, if I write the state file with the content above, it ignores that too and starts over again from
since_id=0
. Any idea how to fix this, or what I'm doing wrong?
a
In the metadata, can you try explicitly specifying
replication-method
and
replication-key
as documented here: https://meltano.com/docs/singer-spec.html#metadata
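For example, something roughly like this in the stream's metadata (using the updated_at key from your snippet):
Copy code
"metadata": {
  "selected": true,
  "table-key-properties": ["id"],
  "replication-method": "INCREMENTAL",
  "replication-key": "updated_at"
}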
c
I thought it always gets the most recent row twice, because it uses
>=
to guarantee it doesn’t miss anything. And then BigQuery rows are immutable, right?
j
Did you include the
--full-refresh
on purpose? I think that makes it ignore the state from your previous runs.
j
good catch @jules_huisman! I used that to try to overwrite the duplicated data, but it seems to be appending everything to the existing data
j
If you are using the
adswerve
variant of
target-bigquery
you can set
replication_method
in the config of the target to
truncate
in order to overwrite existing data. (Default is
append
, off the top of my head.)
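Roughly something like this in meltano.yml, assuming the setting exists in the version you have installed:
Copy code
plugins:
  loaders:
    - name: target-bigquery
      variant: adswerve
      config:
        replication_method: truncate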
j
so, what's the difference between using
truncate
and the
--full-refresh
argument? Should I use both?
I just did a small test re-importing everything and now I have more duplicates in the table
j
--full-refresh
relates mainly to the extraction of the data, whether to use the previous state of the job or to ignore the state and pull all the data (handled by Meltano). The
replication_method
config is used specifically for
target-bigquery
and specifies whether each table should be truncated on each load or the data simply appended to the existing table.
Oh sorry, I see now that the
replication_method
functionality is not present on the main branch of target-bigquery; I was probably working with another branch.
j
that's why it's always using the append!
any idea how to make this run?
j
Now I can see "WRITE_TRUNCATE" being sent to BQ, but I still don't understand why it's always duplicating the data. I deleted all records from my table, and after running twice:
Copy code
meltano elt tap-shopify target-bigquery --job_id=shopify_to_bq
I started getting duplicated records:
Copy code
SELECT COUNT(*) AS TT,
       ID
  FROM `<shopify-order-table>`
 GROUP BY ID
HAVING TT > 1
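While debugging, one way to collapse those duplicates is to keep only the newest row per ID; a rough sketch, assuming the orders table has the updated_at column from the bookmark:
Copy code
CREATE OR REPLACE TABLE `<shopify-order-table>` AS
SELECT * EXCEPT(rn)
  FROM (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY ID ORDER BY updated_at DESC) AS rn
      FROM `<shopify-order-table>`
  )
 WHERE rn = 1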
j
Ah, apparently it is a bug in target-bigquery. https://github.com/adswerve/target-bigquery/issues/2. Someone fixed it in this branch: https://github.com/adswerve/target-bigquery/tree/hotfix/issue2
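If you want to try that branch, you could probably point the loader's pip_url at it (branch name taken from the link above), something like:
Copy code
plugins:
  loaders:
    - name: target-bigquery
      variant: adswerve
      pip_url: git+https://github.com/adswerve/target-bigquery.git@hotfix/issue2
and then reinstall the loader (meltano install loader target-bigquery, I believe) to pick it up.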
m
Thanks for the info, I'm running into the same issue. It appears
adswerve
is the only variant that offers the truncate option, if I'm not mistaken.
j
yes! So I tested it by changing the pip URL, and then I started getting a different error, which seems to be related to the schema. I think this branch has the fix, but it might be outdated.
thanks for helping me on this, Jules! I'll open a new thread to discuss that different issue, now that we know about the truncate.
j
No problem!