# troubleshooting
m
Trouble with `tap-postgres` to `target-bigquery` 🧵
Using select to filter which schemas and fields to extract:
```yaml
select:
  - schem-foo.*
  - schem-bar.*
  - schem-buzz.id
  - schem-buzz.created_at
  - schem-buzz.updated_at
```
For one of my tables, I want to ignore some of the fields because they are giving me trouble when trying to load into BigQuery (see error below)
```
target-bigquery | CRITICAL 'type' or 'anyOf' are required fields in property: {'$ref': '#/definitions/sdc_recursive_string_array'}
```
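For reference, that CRITICAL message is consistent with a validation check along these lines (an illustrative sketch only, not the target's actual code): a property whose schema is just a `$ref`, with no inline `type` or `anyOf`, fails the check.

```python
def validate_property(name: str, prop: dict) -> None:
    # Illustrative sketch only -- not adswerve/target-bigquery's actual code.
    # A property that is only a '$ref', with no inline 'type' or 'anyOf',
    # would trip a check like this.
    if "type" not in prop and "anyOf" not in prop:
        raise ValueError(
            f"'type' or 'anyOf' are required fields in property: {prop}"
        )

try:
    validate_property("creators", {"$ref": "#/definitions/sdc_recursive_string_array"})
except ValueError as err:
    print(f"CRITICAL {err}")
```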
But looking at the logs, it still looks like `tap-postgres` is selecting all of the fields:
```
tap-postgres    | time=2021-05-25 13:33:59 name=tap_postgres level=INFO message=select statement: SELECT  "_id" ,CASE WHEN  "created_at"  < '0001-01-01 00:00:00.000' OR  "created_at"  > '9999-12-31 23:59:59.999' THEN '9999-12-31 23:59:59.999' ELSE  "created_at"  END AS  "created_at" , "creators" , "i_data" , "id" , "in_market" , "m_data" , "o_data" , "owners" , "tp_data" ,CASE WHEN  "updated_at"  < '0001-01-01 00:00:00.000' OR  "updated_at"  > '9999-12-31 23:59:59.999' THEN '9999-12-31 23:59:59.999' ELSE  "updated_at"  END AS  "updated_at" 
tap-postgres    |                                     FROM "schem"."buzz"
tap-postgres    |                                     ORDER BY  "updated_at"  ASC with itersize 20000
```
Is there a way for me to have the postgres tap completely ignore these fields in this select statement?
Alternatively, is there a way for me to handle this on the target side?
For context, here are some of the configs for the tap and target. Tap:
```yaml
- name: tap-postgres
  variant: transferwise
  pip_url: pipelinewise-tap-postgres
  config:
    host: <HOST>
    port: <PORT>
    dbname: <DBNAME>
    user: <USER>
  select:
    - schem-foo.*
    - schem-bar.*
    - schem-buzz.id
    - schem-buzz.created_at
    - schem-buzz.updated_at
  metadata:
    schem-foo:
      replication-method: INCREMENTAL
      replication-key: updated_at
    schem-bar:
      replication-method: INCREMENTAL
      replication-key: updated_at
    schem-buzz:
      replication-method: INCREMENTAL
      replication-key: updated_at
```
Target:
```yaml
- name: target-bigquery
  variant: adswerve
  pip_url: git+https://github.com/adswerve/target-bigquery.git@v0.10.2
  config:
    project_id: <PROJ_ID>
    dataset_id: <DATASET>
    credentials_path: <PATH>
    validate_records: true
    replication_method: append
```
a
Can you try the "not" syntax to exclude the columns you don't want? Example here: https://gitlab.com/meltano/hub/-/blob/main/meltano/meltano.yml#L15
Note that the string needs quotes, because otherwise a string starting with "!" would be interpreted as a special YAML operator.
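For example, a sketch of that exclusion syntax applied to the select rules above (assuming, for illustration, that `creators` and `owners` from the tap's SELECT log are the problem columns; swap in whichever columns actually trigger the error):

```yaml
select:
  - schem-foo.*
  - schem-bar.*
  - schem-buzz.*
  - '!schem-buzz.creators'   # quoted so "!" is not parsed as a YAML tag
  - '!schem-buzz.owners'
```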
Using `--list --all` should be effective for debugging the inclusion logic: https://meltano.com/docs/command-line-interface.html#select
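As an illustrative sketch of the inclusion logic being debugged here (not Meltano's actual implementation), selection rules with "!" negation can be modeled as glob patterns where the last matching rule wins:

```python
from fnmatch import fnmatch

def is_selected(entity: str, rules: list[str]) -> bool:
    """Illustrative sketch only -- not Meltano's actual implementation.
    Each rule is a glob pattern; a leading '!' excludes. The last rule
    that matches the entity decides whether it is selected."""
    selected = False
    for rule in rules:
        negated = rule.startswith("!")
        pattern = rule[1:] if negated else rule
        if fnmatch(entity, pattern):
            selected = not negated
    return selected

rules = ["schem-buzz.*", "!schem-buzz.creators"]
print(is_selected("schem-buzz.id", rules))        # True
print(is_selected("schem-buzz.creators", rules))  # False
```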
m
Let me try the "not" syntax and I'll see how that goes. While I'm trying that, it's also worth noting that when using `--list --all`, it showed the fields as not being selected, but according to the logs they still were 🤷
a
Oh - bummer. It sounds like this tap implementation does not apply filter rules to the SCHEMA messages it emits. Since RECORD messages are emitted separately from SCHEMA messages, there are unfortunately many taps which send the full set of schema columns downstream, regardless of selection rules. The data will still be selected/deselected, but the downstream schema would not be filtered. The SDK recently solved this for SDK-based taps, but others may not have this dual-filtering. Unfortunately for your use case, filtering out the schema is critical to resolve the type issues downstream.
@vladyslav_zavalykhatko recently ran into this `sdc_recursive_string_array` issue as well: https://meltano.slack.com/archives/CMN8HELB0/p1621875990047500
m
That’s unfortunate. Best idea I have for a workaround is to just avoid that table entirely in the select and use another method for EL (such as extract from source, load to storage bucket in flat file format, then load from storage bucket to warehouse). Thanks for looking into that 👍
a
Yeah - if you have access to create views on the DB, you could also pull from the view definition. I’ve posted a new thread here in case there are other workarounds: https://meltano.slack.com/archives/C01TCRBBJD7/p1621970218073800
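A sketch of that view-based workaround, assuming you have CREATE privileges on the source database (the view name `buzz_clean` is hypothetical; the column list mirrors the original select rules):

```sql
-- Hypothetical view exposing only the columns that load cleanly into BigQuery;
-- point the tap at schem.buzz_clean instead of schem.buzz.
CREATE VIEW "schem"."buzz_clean" AS
SELECT "id", "created_at", "updated_at"
FROM "schem"."buzz";
```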