# plugins-general
j
Is it possible to configure Meltano loaders in a way that they optimize loaded tables? Specifically, I am talking about PostgreSQL indexes, Snowflake cluster keys, etc.
t
I think the answer is no, not without forking the loader and adding whatever capabilities you're interested in. It depends on what you mean by "optimize" though.
FWIW we use dbt to do some of that - meltano (target-postgres, really) loads the data into "raw" tables; we transform that data into "optimized" tables with dbt and add indexes as part of that transformation. Then our analytic/reporting models use the "optimized" tables.
That makes deleted rows easier to deal with too, BTW... we want them in the "raw" tables with _sdc_deleted_at NOT NULL for historical reasons, but in the tables that actually drive reporting we don't want deleted rows included. So that intermediate transformation gives us an opportunity to create tables that contain only the latest data.
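Conceptually it's just this (a sketch; `raw.orders` and `optimized.orders` are made-up table names):
```sql
-- Hypothetical "optimized" table: deleted rows stay behind in "raw"
CREATE TABLE optimized.orders AS
SELECT *
FROM raw.orders
WHERE _sdc_deleted_at IS NULL;
```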
j
This is exactly what I just did in my demo. But when you implement incremental ELT in dbt, you want to do something like this:
```sql
-- Rows to be updated
SELECT ... FROM raw
WHERE raw.id IN (SELECT id FROM target)
  AND raw.last_updated > (SELECT max(last_updated) FROM target)
```
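In a dbt incremental model that ends up looking roughly like this (a sketch; the source, table, and column names are assumptions):
```sql
-- Hypothetical incremental model; names are made up
{{ config(materialized='incremental', unique_key='id') }}

SELECT id, last_updated, payload
FROM {{ source('raw', 'orders') }}
{% if is_incremental() %}
-- only pull rows newer than what's already in the target
WHERE last_updated > (SELECT max(last_updated) FROM {{ this }})
{% endif %}
```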
You can design indexes/... for `target` in dbt. But you would have to create indexes/... on `raw` manually after the first load, which is dangerous: when you decide to replicate the solution in a different infra (e.g. a new region), you can easily forget to apply the manual optimizations šŸ˜‰ In GoodData, we implemented custom loaders in the past and allowed users to define such optimizations per table in a declarative way. However, we only had to do it for Vertica šŸ˜‰ Generally, it makes sense to allow specifying the following features (sketched below):
• Indexes in standard OLTP-like DBs
• Clustering, sort columns, ... in clustered columnar MPP DBs
• Partitioning
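For concreteness, the kind of DDL such declarative config would need to emit (a sketch; table/column names are made up):
```sql
-- OLTP-style index (PostgreSQL)
CREATE INDEX IF NOT EXISTS idx_orders_last_updated ON raw.orders (last_updated);

-- Cluster key on a columnar MPP warehouse (Snowflake)
ALTER TABLE raw.orders CLUSTER BY (order_date);

-- Range partitioning (PostgreSQL, declared at table creation)
CREATE TABLE raw.events (
    id         bigint,
    created_at timestamptz
) PARTITION BY RANGE (created_at);
```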
t
All fair points. I actually use a pre_hook in the models for the "optimized" tables to create an index on the "raw" tables if one doesn't exist. I'd forgotten about that šŸ˜
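The pre_hook looks roughly like this (a sketch; index and table names are made up):
```sql
-- Hypothetical "optimized" model; the pre_hook indexes the raw side if needed
{{ config(
    materialized='table',
    pre_hook="CREATE INDEX IF NOT EXISTS idx_raw_orders_id ON raw.orders (id)"
) }}

SELECT *
FROM {{ source('raw', 'orders') }}
```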
The code for the pipelinewise Postgres loader isn't too bad to work with... you could certainly hack it up to create indexes for you. I think the challenge would be how to define which indexes you want to create... nothing comes from the tap to tell you what indexes exist there, so you'd have to define them all in configuration somewhere. šŸ˜• šŸ¤”
An index on _sdc_batched_at should be created unconditionally though... I actually filed a bug for that and then forgot about it 🤣
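i.e. something like (made-up table name):
```sql
CREATE INDEX IF NOT EXISTS idx_orders_sdc_batched_at ON raw.orders (_sdc_batched_at);
```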