Need advice: I'm trying to develop tap-bigquery (...
# singer-tap-development
a
Need advice: I'm trying to develop tap-bigquery (existing from alendata does not work for me beyond possibility to patch - too many changes required). Most of the Singer targets require
key_properties
while BigQuery does not have this notion and in most cases you cannot guess from column type whether it is a primary key. So we need to put this data into tap config. What would be the best place to put this data and in which form?
Would be nice to put it into
metadata
and somehow use
*
functionality from Meltano. But I do not understand how to extract this data at discover stage.
Or duplicate table names once more and put a dictionary of stream names into config?
dumb question sorry. it works out of the box in meltano
a
Not a dumb question at all - although I'm not sure I understand the use case. Are you thinking as a user, you would want to override what the tap sends as its primary keys (aka 'key properties')? Or something else?
On the target implementation side, are you planning to have a merge/upsert operation in Big Query or to just ignore keys and append records regardless?
s
BigQuery as the tap….
(jumping in here because I have a BigQuery->Snowflake migration project and was thinking of using meltano to help)
a
@aaronsteers BQ tables usually do not have primary keys in formalized form in schema. You have to provide this information externally. For now I'm using excellent Meltano functionality to override metadata. Like this:
Copy code
metadata:
      '*':
        replication-method: FULL_TABLE
      ml_test_db-ml_models:
        key-properties: [model_version]
@sterling_paramore you should know that to the best of my knowledge there's no fully functional tap-bigquery. I was extremely surprised 🙂
There's alendata tap-bigquery, but it does very strange things with schema discovery and requires date-column to work, even if you are going to do FULL_TABLE sync only.
This is my rough attempt at
tap-bigquery
following AJ youtube stream, with couple of bugfixes for singer_sdk: https://github.com/epoch8/tap-bigquery
a
Nice! Yes, this looks very good. And I see what you mean - yes, that's a good way to declare key-properties. The same/similar approach is very common to specify replication-key, since almost zero databases are able to detect which column is best for replication, users of SQL-type taps almost always will need to specify those columns (like
updatedOn
).
s
@andrey_tatarinov - I was trying out your
tap-bigquery
extractor and having trouble getting the key-properties to work. Here’s what I’ve got in `meltano.yml`:
Copy code
extractors:
        - name: tap-bigquery
          config:
            credentials_path: .secrets/client_secrets.json
            start_datetime: '2000-01-01 00:00:00'
            project_id: mse-prod-field-1
          select:
            - customer-package.*
          metadata:
            '*':
              replication-method: FULL_TABLE
            customer-package:
              replication-method: INCREMENTAL
              key-properties: [timestamp]
But in the log,
key_properties
is empty
Copy code
2022-06-13T23:29:11.264565Z [debug    ] {"type": "SCHEMA", "stream": "customer-package", "schema": {"properties": {"device": {"type": ["string"]}, "device.serial_number": {"type": ["string", "null"]}, "device.nickname": {"type": ["string", "null"]}, "device.facility_id": {"type": ["string", "null"]}, "device.type_id": {"type": ["string", "null"]}, "device.sequence": {"type": ["integer", "null"]}, "timestamp": {"format": "date-time", "type": ["string"]}, "energy_reactiveDelivered": {"type": ["number", "null"]}, "energy_reactiveReceived": {"type": ["number", "null"]}, "energy_realDelivered": {"type": ["number", "null"]}, "energy_realReceived": {"type": ["number", "null"]}, "insert_timestamp": {"format": "date-time", "type": ["string", "null"]}}, "type": "object"}, "key_properties": []} cmd_type=extractor name=tap-bigquery (out) run_id=128403b4-1348-473f-9b51-f110cd925aba state_id=field-customer stdio=stdout
a
The only explanations I can think of would be if this tap doesn't respect
key-properties
- which would be odd for something like bigquery, or else is it expected in a different place within
metadata
. If all else fails, you could use a mapper to override this setting between tap and target, although that adds additional complexity here that probably shouldn't be needed.
s
That’s what I would have thought too, but Andrey specifically said he addressed this and I was just wondering if I might be doing something wrong on my end.
a
This specific case indeed works for me. It's strange that there are troubles reproducing