ashutosh_shanker
10/27/2023, 5:20 AM
name: target-parquet
variant: estrategiahq
does not copy the source schema to the target. Not sure if I am missing something, or how others are using this target.
I am loading a table from Postgres and writing it to multiple Parquet files using target-parquet. The issue is that each Parquet file ends up with a different type for a numeric Postgres column, depending on the values in that file: the schema is not carried over from the source and appears to be generated on the fly.
e.g. an amount numeric(10,2) column from the Postgres source becomes decimal(6,2) in one file and decimal(7,2) in another, based on the precision of the data in each file.
Is my understanding correct?
ashutosh_shanker
10/27/2023, 5:21 AM
ashutosh_shanker
10/27/2023, 5:22 AM
meta:
external_location: "read_parquet('{{ env_var('MELTANO_PROJECT_ROOT') }}/output/loader/parquet/{name}/*.parquet')"
- name: payment
ashutosh_shanker
10/27/2023, 5:35 AM
def create_dataframe(list_dict):
    fields = set()
    for d in list_dict:
        fields = fields.union(d.keys())
    dataframe = pa.table({f: [row.get(f) for row in list_dict] for f in fields})
    return dataframe
ashutosh_shanker
10/27/2023, 12:49 PM
The union_by_name option can be used to unify the schema of files that have different or missing columns. For files that do not have certain columns, NULL values are filled in.
SELECT * FROM read_parquet('flights*.parquet', union_by_name=true);