# plugins-general
j
Is it possible to configure Meltano loaders in a way that they optimize loaded tables? Specifically, I am talking about PostgreSQL indexes, Snowflake cluster keys, etc.
t
I think the answer is no, not without forking the loader and adding whatever capabilities you're interested in. It depends on what you mean by "optimize" though.
FWIW we use dbt to do some of that - meltano (target-postgres, really) loads the data into "raw" tables; we transform that data into "optimized" tables with dbt and add indexes as part of that transformation. Then our analytic/reporting models use the "optimized" tables.
That makes deleted rows easier to deal with too, BTW... we want them in the "raw" tables with _sdc_deleted_at NOT NULL for historical reasons, but in the tables that actually drive reporting we don't want deleted rows included. So that intermediate transformation gives us an opportunity to create tables that contain only the latest data.
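Conceptually it's just this (a sketch; `raw.orders` and `optimized.orders` are made-up table names):
```sql
-- Hypothetical "optimized" table: deleted rows stay behind in "raw"
CREATE TABLE optimized.orders AS
SELECT *
FROM raw.orders
WHERE _sdc_deleted_at IS NULL;
```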
j
This is exactly what I just did in my demo. But when you implement incremental ELT in dbt, you want to do something like this:
```sql
-- Rows to be updated
SELECT ... FROM raw
WHERE raw.id IN (SELECT id FROM target)
  AND raw.last_updated > (SELECT max(last_updated) FROM target)
```
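In a dbt incremental model that ends up looking roughly like this (a sketch; the source, table, and column names are assumptions):
```sql
-- Hypothetical incremental model; names are made up
{{ config(materialized='incremental', unique_key='id') }}

SELECT id, last_updated, payload
FROM {{ source('raw', 'orders') }}
{% if is_incremental() %}
-- only pull rows newer than what's already in the target
WHERE last_updated > (SELECT max(last_updated) FROM {{ this }})
{% endif %}
```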
You can design indexes/... for `target` in dbt. But you would have to create indexes/... on `raw` manually after the first load, which is dangerous: when you decide to replicate the solution in a different infra (e.g. a new region), you can easily forget to apply the manual optimizations šŸ˜‰ In GoodData, we implemented custom loaders in the past and allowed users to define such optimizations per table in a declarative way. However, we only had to do it for Vertica šŸ˜‰ Generally, it makes sense to allow specifying the following features (sketched below):
• Indexes in standard OLTP-like DBs
• Clustering, sort columns, ... in clustered columnar MPP DBs
• Partitioning
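For concreteness, the kind of DDL such declarative config would need to emit (a sketch; table/column names are made up):
```sql
-- OLTP-style index (PostgreSQL)
CREATE INDEX IF NOT EXISTS idx_orders_last_updated ON raw.orders (last_updated);

-- Cluster key on a columnar MPP warehouse (Snowflake)
ALTER TABLE raw.orders CLUSTER BY (order_date);

-- Range partitioning (PostgreSQL, declared at table creation)
CREATE TABLE raw.events (
    id         bigint,
    created_at timestamptz
) PARTITION BY RANGE (created_at);
```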
t
All fair points. I actually use a pre_hook in the models for the "optimized" tables to create an index on the "raw" tables if one doesn't exist. I'd forgotten about that šŸ˜
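The pre_hook looks roughly like this (a sketch; index and table names are made up):
```sql
-- Hypothetical "optimized" model; the pre_hook indexes the raw side if needed
{{ config(
    materialized='table',
    pre_hook="CREATE INDEX IF NOT EXISTS idx_raw_orders_id ON raw.orders (id)"
) }}

SELECT *
FROM {{ source('raw', 'orders') }}
```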
The code for the pipelinewise Postgres loader isn't too bad to work with... you could certainly hack it up to create indexes for you. I think the challenge would be how to define which indexes you want to create... nothing comes from the tap to tell you what indexes exist there, so you'd have to define them all in configuration somewhere. šŸ˜• šŸ¤”
An index on _sdc_batched_at should be created unconditionally though... I actually filed a bug for that and then forgot about it 🤣
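i.e. something like (made-up table name):
```sql
CREATE INDEX IF NOT EXISTS idx_orders_sdc_batched_at ON raw.orders (_sdc_batched_at);
```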