# troubleshooting
d
I have tap-postgres (meltanolabs) being fed to target-redshift (wise or ticketswap). I have columns in my source db that have a max length, e.g. varchar(256). When they end up in Redshift, the columns are all varchar(10000). When I use target-postgres, the resulting DB is just over 1GB. Postgres ends up with `text` columns, which are apparently space-efficient. Redshift is reporting more like 250GB used. I know Redshift has compression which should shrink this, but either it's reporting pre-compression sizes, or the compression just isn't effective here. Either way, Redshift advises using the smallest possible column size for data (see here and here). When I look at the tap-postgres discovered schema, there's no mention of length - just that the fields are type string, e.g.
"call_record_id": {
            "type": [
              "string"
            ]
          },
Then when I look at target-redshift, both variants use a default of 10000 for varchars - only using different values for certain special cases of other types encoded as strings. JSON Schema has a maxLength property, according to: https://json-schema.org/understanding-json-schema/reference/string#length
So, what I'm thinking is this:
• the tap should emit maxLength in its properties (perhaps turned on/off with a setting)
• the target should use this maxLength when creating the schema, but use a default otherwise
Does this sound reasonable?
BTW - I'm using incremental replication in the tap
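To make that concrete, here's a rough sketch of the sizing logic I'm proposing on the target side. This is not the actual target-redshift code; the names and structure are made up for illustration:
```python
# Hypothetical sketch of the proposed target-side sizing logic, not the actual
# target-redshift code.
DEFAULT_VARCHAR_LENGTH = 10000   # what both target variants default to today
REDSHIFT_MAX_VARCHAR = 65535     # Redshift's varchar(max)

def varchar_length_for(property_schema: dict) -> int:
    """Pick a column width for a JSON Schema string property."""
    max_length = property_schema.get("maxLength")
    if max_length is None:
        return DEFAULT_VARCHAR_LENGTH
    return min(max_length, REDSHIFT_MAX_VARCHAR)

# With the proposal, a varchar(256) source column would be discovered as:
prop = {"type": ["string"], "maxLength": 256}
print(f"varchar({varchar_length_for(prop)})")  # -> varchar(256)
```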
e
So, what I'm thinking is this:
• the tap should emit maxLength in its properties (perhaps turned on/off with a setting).
• the target should use this maxLength when creating the schema, but use a default
Does this sound reasonable?
That seems like the best approach. If MeltanoLabs/tap-postgres doesn't do the former, do create an issue for us to fix it. Or even better, you can submit a PR 🙂
d
Thanks!
v
To generalize this problem a bit, as I've seen this in other domains going from Oracle to MSSQL: some folks need the data and a data type but don't care about the type being as close as possible; some folks need the data type to be as close as possible, and it does depend on the target. It's like a `schema_discovery_method` setting that allows you to say "for float we want the JSON schema to be xyz". (We set ours up to be decimal(19,4) or something like that in target-mssql, because FLOATs in Oracle aren't IEEE floats, BINARY_FLOATs are - long story - but the point is the same, and there are other ways to fix this specific problem generally, but the point is similar.) Individual stream overrides are already handled with Meltano, which covers a ton of these scenarios, but when you need it for 100's of tables you want to be able to map every data type of a certain scenario to something else.
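No such setting exists today as far as I know; purely as an illustration, a blanket per-type override applied to a discovered schema before it's emitted could look something like this (all names here are hypothetical):
```python
# Hypothetical illustration only: a bulk per-type override applied to every
# discovered stream's schema, instead of per-stream overrides.
TYPE_OVERRIDES = {
    # JSON Schema type found in discovery -> property fragment to emit instead
    "number": {"type": ["number", "null"], "multipleOf": 0.0001},  # roughly decimal(19,4)
}

def apply_type_overrides(schema: dict) -> dict:
    """Rewrite properties of a discovered schema in bulk by JSON Schema type."""
    props = schema.get("properties", {})
    for name, prop in props.items():
        types = prop.get("type", [])
        types = [types] if isinstance(types, str) else types
        for json_type, replacement in TYPE_OVERRIDES.items():
            if json_type in types:
                props[name] = dict(replacement)
    return schema

schema = {"properties": {"amount": {"type": ["number"]}}}
print(apply_type_overrides(schema))
```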
c
For Redshift, `varchar(10000)` (or even the typical `varchar(max)`) doesn't impact storage size unless you're actually using all 10,000 characters (it's 4 bytes + the total bytes of a record's field value). That being said, being able to define what compression a column uses could be useful, and that is a column-by-column decision, though in practice it is sometimes hard to know the best compression algorithm to use on a column if you don't have experience with it. (Side note: the default length for a varchar column from an ELT tool targeting Redshift should probably just be `max` if it's not configurable - people are putting JSON in those columns!)
My typical workflow with Redshift was to load data, then use the AWS provided column analyzer tools to get a report on which compression was best, and use that same tool to redefine whichever columns had to be changed from LZO (my go-to for text). Something like this workflow should be considered when working with columnar-store data warehouses like Redshift - often once you've done it once, assuming a large enough initial load, you are likely okay for subsequent ELT runs. No idea if the schema definition for new tables in your target is generating columns with compression at all (depends if you can find `ENCODE AUTO` in the source code). 250GB storage used for 1GB of PostgreSQL seems wildly off, but Redshift's storage usage includes a lot of empty pages. (Probably... not that many.)
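The command behind that kind of report is ANALYZE COMPRESSION. A minimal sketch of running it yourself, assuming a psycopg2 connection and substituting your own DSN and table name:
```python
# Minimal sketch, assuming a psycopg2 connection to the cluster and a table
# named "call_records" (substitute your own DSN and table name).
import psycopg2

conn = psycopg2.connect("host=... dbname=... user=... password=... port=5439")
conn.autocommit = True  # ANALYZE COMPRESSION locks the table; keep it out of a transaction
with conn.cursor() as cur:
    # Redshift samples the table and suggests an encoding per column.
    cur.execute("ANALYZE COMPRESSION call_records;")
    for table, column, encoding, est_reduction_pct in cur.fetchall():
        print(f"{table}.{column}: suggested {encoding} (~{est_reduction_pct}% smaller)")
conn.close()
```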
d
use the AWS provided column analyzer tools
Are you referring to "analyze compression" here, or something else?
c
something else: https://github.com/awslabs/amazon-redshift-utils - but analyze compression is part of that workflow (the tooling just uses the output of that to generate alter table statements and handles the transformation of the table)
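The transformation those utilities generate essentially amounts to a deep copy into a table created with the suggested encodings. Hand-written, it looks roughly like this sketch (table names, columns, and encodings are placeholders, not output from the tool):
```python
# Sketch of the deep-copy pattern the redshift-utils encoding script automates.
# Table/column names and encodings below are placeholders.
DEEP_COPY_SQL = """
BEGIN;
CREATE TABLE call_records_new (
    call_record_id VARCHAR(256) ENCODE zstd,
    payload        VARCHAR(MAX) ENCODE lzo
);
INSERT INTO call_records_new SELECT call_record_id, payload FROM call_records;
ALTER TABLE call_records RENAME TO call_records_old;
ALTER TABLE call_records_new RENAME TO call_records;
DROP TABLE call_records_old;
COMMIT;
"""
# Execute with the same connection as above, e.g. cur.execute(DEEP_COPY_SQL).
```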
d
Thank you
t
I am looking into this issue at the moment. I think the problem here is that the tap does not emit maxLength right now - unless you use stream maps.
Right now, even with stream maps, I don't see how this could be implemented, unless the maxLength property were in the schema. Is that the case with the postgres tap? @dylan_just
d
The postgres tap doesn't emit max length. I logged an issue on that repo. It would need to be addressed on both ends.
Also, AWS Support tell me it's a good idea to use smaller columns, but I kinda don't believe them 🙂
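For the tap side, the length is available from information_schema, so the change would be roughly along these lines (just an illustration, not tap-postgres's actual discovery code):
```python
# Rough illustration of the tap-side change, not tap-postgres's actual code.
# Assumes a psycopg2 connection to the source database.
import psycopg2

DISCOVERY_SQL = """
SELECT column_name, data_type, character_maximum_length
FROM information_schema.columns
WHERE table_schema = %s AND table_name = %s;
"""

def discover_string_properties(conn, schema_name: str, table_name: str) -> dict:
    """Build JSON Schema properties, adding maxLength for bounded varchars."""
    properties = {}
    with conn.cursor() as cur:
        cur.execute(DISCOVERY_SQL, (schema_name, table_name))
        for column_name, data_type, char_max_len in cur.fetchall():
            if data_type in ("character varying", "character"):
                prop = {"type": ["string"]}
                if char_max_len is not None:  # None for unbounded varchar/text
                    prop["maxLength"] = char_max_len
                properties[column_name] = prop
    return properties
```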
t
Alright, I will take a look to see if the fix is easy on their end.
👍 1