# troubleshooting
m
Trouble with `tap-postgres` to `target-bigquery` 🧵
Using select to filter which schemas and fields to extract:
```yaml
select:
  - schem-foo.*
  - schem-bar.*
  - schem-buzz.id
  - schem-buzz.created_at
  - schem-buzz.updated_at
```
For one of my tables, I want to ignore some of the fields because they are giving me trouble when trying to load into BigQuery (see error below)
```
target-bigquery | CRITICAL 'type' or 'anyOf' are required fields in property: {'$ref': '#/definitions/sdc_recursive_string_array'}
```
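For reference, that CRITICAL message is consistent with a validation check along these lines (an illustrative sketch only, not the target's actual code): a property whose schema is just a `$ref`, with no inline `type` or `anyOf`, fails the check.

```python
def validate_property(name: str, prop: dict) -> None:
    # Illustrative sketch only -- not adswerve/target-bigquery's actual code.
    # A property that is only a '$ref', with no inline 'type' or 'anyOf',
    # would trip a check like this.
    if "type" not in prop and "anyOf" not in prop:
        raise ValueError(
            f"'type' or 'anyOf' are required fields in property: {prop}"
        )

try:
    validate_property("creators", {"$ref": "#/definitions/sdc_recursive_string_array"})
except ValueError as err:
    print(f"CRITICAL {err}")
```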
But looking at the logs, it still looks like `tap-postgres` is selecting all of the fields:
```
tap-postgres    | time=2021-05-25 13:33:59 name=tap_postgres level=INFO message=select statement: SELECT  "_id" ,CASE WHEN  "created_at"  < '0001-01-01 00:00:00.000' OR  "created_at"  > '9999-12-31 23:59:59.999' THEN '9999-12-31 23:59:59.999' ELSE  "created_at"  END AS  "created_at" , "creators" , "i_data" , "id" , "in_market" , "m_data" , "o_data" , "owners" , "tp_data" ,CASE WHEN  "updated_at"  < '0001-01-01 00:00:00.000' OR  "updated_at"  > '9999-12-31 23:59:59.999' THEN '9999-12-31 23:59:59.999' ELSE  "updated_at"  END AS  "updated_at" 
tap-postgres    |                                     FROM "schem"."buzz"
tap-postgres    |                                     ORDER BY  "updated_at"  ASC with itersize 20000
```
Is there a way for me to have the postgres tap completely ignore these fields in this select statement?
Alternatively, is there a way for me to handle this on the target side?
For context, here are some of the configs for the tap and target. Tap:
```yaml
- name: tap-postgres
  variant: transferwise
  pip_url: pipelinewise-tap-postgres
  config:
    host: <HOST>
    port: <PORT>
    dbname: <DBNAME>
    user: <USER>
  select:
    - schem-foo.*
    - schem-bar.*
    - schem-buzz.id
    - schem-buzz.created_at
    - schem-buzz.updated_at
  metadata:
    schem-foo:
      replication-method: INCREMENTAL
      replication-key: updated_at
    schem-bar:
      replication-method: INCREMENTAL
      replication-key: updated_at
    schem-buzz:
      replication-method: INCREMENTAL
      replication-key: updated_at
```
Target:
```yaml
- name: target-bigquery
  variant: adswerve
  pip_url: git+https://github.com/adswerve/target-bigquery.git@v0.10.2
  config:
    project_id: <PROJ_ID>
    dataset_id: <DATASET>
    credentials_path: <PATH>
    validate_records: true
    replication_method: append
```
a
Can you try the "not" syntax to exclude the columns you don't want? Example here: https://gitlab.com/meltano/hub/-/blob/main/meltano/meltano.yml#L15
Note that the string needs quotes, because otherwise a string starting with "!" would be interpreted as a special YAML operator.
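For example, a sketch of that exclusion syntax applied to the select rules above (assuming, for illustration, that `creators` and `owners` from the tap's SELECT log are the problem columns; swap in whichever columns actually trigger the error):

```yaml
select:
  - schem-foo.*
  - schem-bar.*
  - schem-buzz.*
  - '!schem-buzz.creators'   # quoted so "!" is not parsed as a YAML tag
  - '!schem-buzz.owners'
```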
Using `--list --all` should be effective for debugging the inclusion logic: https://meltano.com/docs/command-line-interface.html#select
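As an illustrative sketch of the inclusion logic being debugged here (not Meltano's actual implementation), selection rules with "!" negation can be modeled as glob patterns where the last matching rule wins:

```python
from fnmatch import fnmatch

def is_selected(entity: str, rules: list[str]) -> bool:
    """Illustrative sketch only -- not Meltano's actual implementation.
    Each rule is a glob pattern; a leading '!' excludes. The last rule
    that matches the entity decides whether it is selected."""
    selected = False
    for rule in rules:
        negated = rule.startswith("!")
        pattern = rule[1:] if negated else rule
        if fnmatch(entity, pattern):
            selected = not negated
    return selected

rules = ["schem-buzz.*", "!schem-buzz.creators"]
print(is_selected("schem-buzz.id", rules))        # True
print(is_selected("schem-buzz.creators", rules))  # False
```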
m
Let me try the "not" syntax and I'll see how that goes. While I'm trying that, it's also worth noting that when using `--list --all`, it showed the fields as not being selected, but according to the logs they still were 🤷
a
Oh - bummer. It sounds like this tap implementation does not apply filter rules to the SCHEMA messages it emits. Since RECORD messages are emitted separately from SCHEMA messages, there are unfortunately many taps which send the full set of schema columns downstream, regardless of selection rules. The data will still be selected/deselected, but the downstream schema would not be filtered. The SDK recently solved this for SDK-based taps, but others may not have this dual-filtering. Unfortunately for your use case, filtering out the schema is critical to resolve the type issues downstream.
@vladyslav_zavalykhatko recently ran into this `sdc_recursive_string_array` issue as well: https://meltano.slack.com/archives/CMN8HELB0/p1621875990047500
m
That’s unfortunate. Best idea I have for a workaround is to just avoid that table entirely in the select and use another method for EL (such as extract from source, load to storage bucket in flat file format, then load from storage bucket to warehouse). Thanks for looking into that 👍
a
Yeah - if you have access to create views on the DB, you could also pull from the view definition. I’ve posted a new thread here in case there are other workarounds: https://meltano.slack.com/archives/C01TCRBBJD7/p1621970218073800
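A sketch of that view-based workaround, assuming you have CREATE privileges on the source database (the view name `buzz_clean` is hypothetical; the column list mirrors the original select rules):

```sql
-- Hypothetical view exposing only the columns that load cleanly into BigQuery;
-- point the tap at schem.buzz_clean instead of schem.buzz.
CREATE VIEW "schem"."buzz_clean" AS
SELECT "id", "created_at", "updated_at"
FROM "schem"."buzz";
```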