# troubleshooting
c
We're trying out `target-redshift` to pull in some data to a database in our Redshift cluster.
• In our test VPC it creates schemas/tables/columns with lowercase.
• In our prod VPC it creates schemas/tables/columns with uppercase.
Any ideas for the difference in behaviour? Trying to nail it down so that I can run the dbt transforms that were expecting lowercase.
j
That's strange... Do you have the same `target-redshift` variant in both environments? What are the variant(s)?
c
Both are using the `transferwise` variant. The only config changes were `host`/`s3_bucket` in `meltano.yml` and `TARGET_REDSHIFT_PASSWORD`/`MELTANO_STATE_BACKEND_URI` in `.env` to swap between the test/prod VPCs.
With the above, it's the same Docker container, re-used with `--environment` plus the `.env` file.
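For context, a hedged sketch of what that per-environment override can look like in `meltano.yml` (hosts and bucket names are placeholders, not the real config):

```yaml
# meltano.yml (illustrative fragment; values are assumptions)
environments:
  - name: test
    config:
      plugins:
        loaders:
          - name: target-redshift
            config:
              host: test-cluster.example.eu-west-1.redshift.amazonaws.com
              s3_bucket: my-test-staging-bucket
  - name: prod
    config:
      plugins:
        loaders:
          - name: target-redshift
            config:
              host: prod-cluster.example.eu-west-1.redshift.amazonaws.com
              s3_bucket: my-prod-staging-bucket
```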
j
Another thing... I thought Redshift is case-insensitive, hence this would work no problem:
```sql
create temporary table MY_TABLE (
    id int
);

select ID from my_table;
```
This is not the case, however, if you quote the identifiers in dbt. Perhaps an option could be to disable the quoting 🤷
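If disabling quoting is worth a try, dbt exposes this via the `quoting` config in `dbt_project.yml` (a sketch; whether it is safe depends on the models):

```yaml
# dbt_project.yml (fragment): stop dbt from double-quoting identifiers
quoting:
  database: false
  schema: false
  identifier: false
```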
c
That makes sense on the dbt side. I was thinking of trying lookups without quotes. However, I'd like to figure out the cause of the difference in behaviour on the Meltano side.
j
For some reason I feel like the transferwise target always lowercases the identifiers, but just in case it does not: do you have exactly the same taps/config for the different envs?
Might it be that the transferwise target re-uses the identifiers as-is and it is the source that is creating the problem? 🤷
c
I guess that's the only difference between the two. I'm using `tap-spreadsheets-anywhere` to pull in CSVs. I did it iteratively on the test VPC, with new CSVs being dropped in regularly, and a few more files have been added since the last clean run. The prod VPC has all of the files being pulled in in one go.
j
Hmm... I don't think the tap is a variable here. For what it's worth, transferwise's `target-redshift` will lowercase the identifiers; here is the one for the schema: https://github.com/transferwise/pipelinewise-target-redshift/blob/80796b6c6e4647a85ddc9492e343547a027258e4/target_redshift/db_sync.py#L593 where you can see the use of the `.lower()` string method. Are you certain the table objects in PROD were not created by something else, with the target simply re-using them? I can also confirm this by looking at our Redshift pipeline: in the exporter's (`mssql`) stream I have `dbo-Myschema.MyTable`, which is then translated to and loaded into Redshift as `<default_schema>.mytable`.
I am also not able to create uppercase table names in Redshift 😄 (Serverless):
```sql
CREATE TABLE "UPPER_TABLE_NAME"
(
    id INT
);

CREATE TABLE lower_table_name
(
    id INT
);

SELECT
    table_name
FROM information_schema.tables
WHERE table_name ILIKE '%table_name';

DROP TABLE lower_table_name;
DROP TABLE upper_table_name;
```
The lookup on `information_schema` yields:
```
+------------------+
| table_name       |
+------------------+
| upper_table_name |
| lower_table_name |
+------------------+
```
🤷
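As a minimal sketch (not the actual pipelinewise code), the folding behaviour described above boils down to lowercasing identifiers before emitting DDL, so mixed-case stream names still land as lowercase Redshift objects:

```python
# Sketch of the identifier folding discussed above; the real target's
# logic lives in db_sync.py (linked), this just mimics the .lower() call.

def fold_identifier(identifier: str) -> str:
    """Lowercase an identifier the way a case-folding target would."""
    return identifier.lower()

print(fold_identifier("Myschema.MyTable"))  # myschema.mytable
```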
c
On the prod VPC, I created the database in Redshift after a failed Meltano run (I keep forgetting that `target-redshift` doesn't create the database if it doesn't exist), then did another run to create and populate the schema. On the test VPC, to do a "clean" run, I usually go through a cycle of dropping the schemas and deleting the S3 state before doing a run.
j
If you are able to produce an MRE for this, I'd love to try to reproduce it 😕
c
I'll have a play with the prod environment and let you know in a bit...
Hmmm... well, it looks like not all Redshifts are configured the same:
• Prod Redshift:
```
table_name
UPPER_TABLE_NAME
lower_table_name
```
• Test Redshift:
```
table_name
upper_table_name
lower_table_name
```
I think I need to dig into the Redshift docs and pass a config change on to the Ops team. Thanks for helping me dig, @janis_puris!!
j
c
^^ awesome! Thanks for the link. You're definitely making my afternoon debug session a lot easier!! 😄
j
I was wrong about PW's `target-redshift` emitting only lowercase identifiers! Here is where it retrieves the identifier for the `CREATE TABLE ...` statement:
1. It will lowercase the table name from the config.
2. Then, when it assembles the "path", i.e. `[<schema>.]<table>`, it will uppercase the path with `.upper()`.
Additionally, it will quote the table name in `"`. Coupled with the preservation of identifier casing in the prod VPC Redshift's config, this would result in the disparity you're observing.
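The steps above can be sketched roughly like this (a paraphrase of what the thread describes, not the verbatim target code):

```python
# Paraphrase of the sequence described above: lowercase the configured
# table name, assemble the [<schema>.]<table> path, uppercase the path,
# then double-quote the table name.

def table_path(table: str, schema: str = "") -> str:
    table = table.lower()                            # step 1: lowercase from config
    path = f"{schema}.{table}" if schema else table  # step 2: assemble the path
    path = path.upper()                              # ... and uppercase it
    schema_part, _, table_part = path.rpartition(".")
    quoted = f'"{table_part}"'                       # quote the table name
    return f"{schema_part}.{quoted}" if schema_part else quoted

print(table_path("MyTable", "analytics"))  # ANALYTICS."MYTABLE"
```

With `enable_case_sensitive_identifier` on, that quoted uppercase name is preserved verbatim; on a case-insensitive cluster it is folded back to lowercase, which matches the test/prod split seen in the thread.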
c
Confirmed:
SHOW enable_case_sensitive_identifier;
is
on
in our Production Redshift, but
off
in our Test Redshift. Will chat with our Data Team to figure out the best way of cleaning up. --- Had also found the
.upper()
that you had just found, but now all of this is making more sense around why there is a difference.
From quick testing, I'm still getting the same behaviour on Our Prod Redshift after: • Dropping schemas in Redshift. • Deleting Meltano state from S3. •
SET enable_case_sensitive_identifier to off;
in Prod database. • Re-run meltano extraction/load. So it looks like there is a bit more to chip away at.
j
You are not able to reproduce this, right? https://meltano.slack.com/archives/C01TCRBBJD7/p1689692478586699?thread_ts=1689689366.984449&cid=C01TCRBBJD7 In other words, both TEST and PROD produce the same table names?
c
With your code above plus `SET enable_case_sensitive_identifier to off;`, the upper table is created as `upper_table_name`, so that's good. Now I'm just trying to figure out what `target-redshift` is playing at.
j
Are you sure that `SET` does not simply change the setting for your session only?
Can you do the test again in another session, just to be sure?
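For reference, a sketch of the scope difference (the user name is illustrative): `SET` only affects the current session, while `ALTER USER ... SET` changes the default for that user's future sessions.

```sql
-- Session scope: lost when the connection closes
SET enable_case_sensitive_identifier TO off;

-- Per-user default: applies to new sessions for that user
ALTER USER some_user SET enable_case_sensitive_identifier TO off;

-- Inspect the value in effect for the current session
SHOW enable_case_sensitive_identifier;
```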
c
Doh! Yep, I'm only setting it for my session. Good catch!
i
The `safe_column_name` function seems to be the root of the issue, and there's no way to circumvent it: https://github.com/transferwise/pipelinewise-target-redshift/blob/80796b6c6e4647a85ddc9492e343547a027258e4/target_redshift/db_sync.py#L94
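A paraphrase of what a `safe_column_name`-style helper does, as described in this thread (an assumption based on the discussion, not the verbatim source): wrap the name in double quotes and uppercase it, with no flag to skip the uppercasing.

```python
# Sketch of the quote-then-uppercase pattern discussed above.

def safe_column_name(name: str) -> str:
    """Double-quote a column name and force it to uppercase."""
    return f'"{name}"'.upper()

print(safe_column_name("my_column"))  # "MY_COLUMN"
```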
c
Learning Redshift. Looks like I need to chat to Our Ops team to set it permanently (https://docs.aws.amazon.com/redshift/latest/dg/cm_chap_ConfigurationRef.html#t_Modifying_the_default_settings).
Did a quick proof of concept with:
ALTER USER <user> SET enable_case_sensitive_identifier to off;
, and that now sets my DB to be the same as Our Test instance. So Meltano + DBT runs complete. That's good enough to get it off my plate for a little while and pass it over with a lovely list of caveats šŸ˜„ Thanks again @janis_puris for your help!!!