# troubleshooting
boggdan_barrientos
Hello, I'm using tap-s3-csv and target-redshift for my pipeline. The schema inferred from my CSV data isn't correct, so I decided to extract all the data as strings. Now, for performance and disk space usage, I'm trying to correct that by copying the schema from my actual ETL, creating the table with the proper column types manually, and setting the default schema in target-redshift. My understanding was that if Meltano finds the table it won't re-create it, but instead it is versioning the table: it renames each existing column and adds a new one over the schema.
```
target-redshift       | loader    | time=2021-11-16 17:46:14 name=target_redshift level=INFO message=Versioning column: ALTER TABLE raw_s3_appunisuper."INVENTARIO" RENAME COLUMN "_SDC_DELETED_AT" TO "_SDC_DELETED_AT_20211116_1746"
target-redshift       | loader    | time=2021-11-16 17:46:14 name=target_redshift level=INFO message=Adding column: ALTER TABLE raw_s3_appunisuper."INVENTARIO" ADD COLUMN "_SDC_DELETED_AT" character varying(10000)
```
I want to avoid this versioning and force target-redshift to use the schema I created for the table. Any suggestions on how to do that? I'm running Meltano 1.87, and in the thread I've shared the tap-s3-csv variant I'm using. I can't use the schema extra from Meltano because this tap only has the state capability.
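For reference, a minimal sketch (not a verified recipe) of what the manually created raw table could look like so that the existing column types match what the target expects and nothing gets versioned. The business column names here are hypothetical; only `_SDC_DELETED_AT` and its `character varying(10000)` type are taken from the log above, and the exact set of `_sdc_*` metadata columns and their types depends on the target-redshift variant, so they are best copied from a table the target created itself:

```sql
-- Sketch only: business columns are hypothetical; copy the exact _sdc_*
-- metadata columns and types from a table this target-redshift variant
-- created itself.
CREATE TABLE raw_s3_appunisuper."INVENTARIO" (
    "CODIGO"          CHARACTER VARYING(10000),  -- CSV columns, loaded as strings
    "DESCRIPCION"     CHARACTER VARYING(10000),
    "CANTIDAD"        CHARACTER VARYING(10000),
    "_SDC_DELETED_AT" CHARACTER VARYING(10000)   -- matches the type in the log above
    -- ...plus the remaining _sdc_* metadata columns the target expects
);
```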
pat_nadolny
@aaronsteers @boggdan_barrientos My memory was that character lengths do matter in Redshift, but I wasn't positive, so I went searching and found a couple of links. It seems that for storage it doesn't matter, but for complex queries that use temp tables it does make them less efficient. It's a small caveat, but worth calling out as a use-case-dependent optimization. In Snowflake, on the other hand, the declared size has no impact.
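As a rough sketch of that caveat (column names and widths beyond the raw table in the log are hypothetical), one way to limit it is to cast the oversized varchar(10000) raw columns down to realistic widths before they land in an intermediate or temp table used by a complex query:

```sql
-- Rough illustration with hypothetical column names and widths: narrow the
-- oversized raw columns before materializing an intermediate/temp table.
CREATE TEMP TABLE inventario_slim AS
SELECT
    CAST("CODIGO"      AS VARCHAR(32))  AS codigo,
    CAST("DESCRIPCION" AS VARCHAR(256)) AS descripcion,
    CAST("CANTIDAD"    AS INTEGER)      AS cantidad
FROM raw_s3_appunisuper."INVENTARIO";

-- Downstream joins/aggregations then run against the slimmer table.
SELECT descripcion, SUM(cantidad) AS total_cantidad
FROM inventario_slim
GROUP BY descripcion;
```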
aaronsteers
Thanks, @pat_nadolny. I think the best practice downstream in dbt (or other transformation tool) would be to have a staging/casting/naming layer between the raw layer and any complex transforms.
Importantly, in that dbt transformation layer no data is lost even if a type is too small, since the raw data still has full fidelity and dbt's transformations are simply "on top" of that raw data.
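For illustration, a minimal dbt staging model along those lines (the source, column names, and cast widths are hypothetical); the raw table keeps full fidelity as strings, and this model only casts and renames on top of it:

```sql
-- models/staging/stg_inventario.sql  (hypothetical source and column names)
-- The raw table stays all-string and untouched; this model casts and renames.
with source as (

    select * from {{ source('raw_s3_appunisuper', 'INVENTARIO') }}

)

select
    "CODIGO"::varchar(32)            as codigo,
    "DESCRIPCION"::varchar(256)      as descripcion,
    nullif("CANTIDAD", '')::integer  as cantidad

from source
```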
boggdan_barrientos
Thanks @pat_nadolny and @aaronsteers, this knowledge is so useful. That's what we were looking for with this. Thanks again! 🙌
pat_nadolny
@aaronsteers Yeah, that makes sense! So it sounds like it's good practice to use the staging layer to clean up any oversized raw columns before complex queries run downstream.
aaronsteers
Yeah, that's what I'd recommend. Then type declarations are optimizations which can be incrementally added/improved purely in the T layer, without risk of breaking the EL component.