dylan_just
08/07/2024, 12:42 AM
text columns, which are apparently space-efficient. Redshift is reporting more like 250GB used.
I know Redshift has compression which should shrink this, but either it's reporting pre-compression sizes or the compression just isn't effective here. Either way, Redshift advises using the smallest possible column size for data (see here and here).
When I look at the tap-postgres discovered schema, there's no mention of length - just that the fields are of type string, e.g.
"call_record_id": {
"type": [
"string"
]
},
Then when I look at target-redshift, both variants use a default of 10000 for varchars - only using different values for certain special cases of other types encoded as strings.
JSON schema has a maxLength property, according to:
https://json-schema.org/understanding-json-schema/reference/string#length
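For comparison, a property discovered from a Postgres varchar(64) column could carry that limit like this (the 64 and the exact shape are illustrative, not tap-postgres's current output):

"call_record_id": {
  "type": [
    "string"
  ],
  "maxLength": 64
},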
So, what I'm thinking is this:
• the tap should emit maxLength in its properties (perhaps turned on/off with a setting).
• the target should use this maxLength when creating the schema, falling back to a default when it's absent
Does this sound reasonable?
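A minimal sketch of the target-side half of this proposal, assuming a hypothetical helper and keeping target-redshift's current 10000 default as the fallback (not the target's actual code):

DEFAULT_VARCHAR_LENGTH = 10000  # target-redshift's current default

def redshift_varchar(property_schema: dict) -> str:
    # Use the tap-supplied maxLength when present, otherwise fall back to the default.
    length = property_schema.get("maxLength", DEFAULT_VARCHAR_LENGTH)
    return f"VARCHAR({length})"

# {"type": ["string"], "maxLength": 64} -> VARCHAR(64)
# {"type": ["string"]}                  -> VARCHAR(10000)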
dylan_just
08/07/2024, 12:45 AM
Edgar Ramírez (Arch.dev)
08/07/2024, 3:55 AM
So, what I'm thinking is this:
• the tap should emit maxLength in its properties (perhaps turned on/off with a setting).
• the target should use this maxLength when creating the schema, falling back to a default when it's absent
Does this sound reasonable?
That seems like the best approach. If MeltanoLabs/tap-postgres doesn't do the former, do create an issue for us to fix it. Or even better, you can submit a PR 🙂
dylan_just
08/07/2024, 3:57 AM
dylan_just
08/07/2024, 4:50 AM
visch
08/07/2024, 12:50 PM
schema_discovery_method that allows you to say "For float we want the json schema to be xyz". (We set up ours to be decimal(19,4) or something like that in target-mssql, because FLOATs in Oracle aren't IEEE floats, binary_floats are; long story, but the point is the same, and there are other ways to fix this specific problem generally, but the point is similar.)
Individual stream overrides are already handled with Meltano, which handles a ton of these scenarios, but when you need it for hundreds of tables you want to be able to map every data type of a certain scenario to something else.
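As a rough illustration of that kind of blanket type mapping (the names and the float example below are invented for illustration, not an existing setting in any tap or target):

# Invented sketch: map every occurrence of a source type to a chosen JSON schema,
# instead of overriding streams one by one.
TYPE_OVERRIDES = {
    # source data type -> JSON schema to emit for every column of that type
    "float": {"type": ["number"], "multipleOf": 0.0001},  # roughly "treat as decimal(19,4)"
}

def discovered_schema(column_type: str, default_schema: dict) -> dict:
    # Apply the override for this source type if one is configured.
    return TYPE_OVERRIDES.get(column_type, default_schema)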
Charles Feduke
08/07/2024, 1:42 PM
varchar(10000) (or even the typical varchar(max)) doesn't impact storage size unless you're actually using all 10,000 characters (it's 4 bytes + total bytes of a record's field value). That being said, being able to define what compression a column uses could be useful, and that is a column-by-column decision, though in practice it is sometimes hard to know the best compression algorithm to use on a column if you don't have experience with it. (Side note: the default length for a varchar column from an ELT tool targeting Redshift should probably just be max if it's not configurable - people are putting JSON in those columns!)
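(Worked example of that formula: a 20-character ASCII value stored in a varchar(10000) column takes roughly 4 + 20 = 24 bytes, not 10,000.)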
My typical workflow with Redshift was to load data, then use the AWS provided column analyzer tools to get a report on which compression was best, and use that same tool to redefine whichever columns had to be changed from LZO (my go-to for text). Something like this workflow should be considered when working with columnar-store data warehouses like Redshift - once you've done it once, assuming a large enough initial load, you're likely okay for subsequent ELT runs.
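The report described here is likely Redshift's ANALYZE COMPRESSION command (dylan asks about exactly that below); a sample invocation, with a placeholder table name:

ANALYZE COMPRESSION my_schema.my_table;

It returns a suggested encoding per column, which can then be applied when the columns are redefined.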
No idea if the schema definition for new tables in your target is generating columns with compression at all (depends if you can find ENCODE AUTO in the source code). 250GB storage used for 1GB of PostgreSQL seems wildly off, but Redshift's storage usage includes a lot of empty pages. (Probably… not that many.)
dylan_just
08/07/2024, 9:11 PM
use the AWS provided column analyzer tools
Are you referring to "analyze compression" here, or something else?
Charles Feduke
08/07/2024, 9:13 PM
dylan_just
08/07/2024, 9:14 PM
tobias_cadee
08/21/2024, 7:52 AM
tobias_cadee
08/21/2024, 7:58 AM
dylan_just
08/21/2024, 7:59 AM
dylan_just
08/21/2024, 8:00 AM
tobias_cadee
08/21/2024, 8:00 AM