# best-practices
j
hi folks, very basic question here - I'm using `meltano run` to load a large PostgreSQL table, and queries on it are taking a long time, even simple ones (see the linked analysis of a `SELECT COUNT(*) FROM table WHERE timestamp > '2022-01-01'` for reference). from my limited knowledge of databases, I understand that setting proper indexes on the columns I query most will improve performance. what would the best practice be here? running a data transformation with dbt that creates a new table, essentially duplicating the data but with proper indexes? to be clear, I'm concerned that if I create an index on the table meltano created for loading, things will break. but maybe that's not the case?
reverse-FAQ:
• "did you do xyz to make sure that the problem is not latency etc?" - have a look at the analysis of the query
• "did you tune the database?" - I have no idea where to start šŸ™ƒ it's a PostgreSQL running on AWS RDS, for the record
• "did you do a `VACUUM ANALYZE`?" - yep
• "why don't you use <insert cool OLAP db technology>?" - while I'm open to changing the underlying technology, that doesn't address my specific problem with PostgreSQL queries being slow
t
Can you define "large" and "slow"? And what are the specs on your RDS instance? The first step is to ensure that your expectations are correct. šŸ˜‰
Whether indexes will help depends on the nature of the queries and the data. If most of the rows have a timestamp value greater than 2022-01-01 then an index won't help, for example (it might actually make things worse).
From the query plan info you linked to it doesn't look like the table has a lot of rows. Does it have a lot of columns? I don't think it'll hurt to put an index directly on the table and then see how the query(s) perform. I think the "best practice" will depend on the queries and the data though. For example if the table has lots of columns it might be worthwhile to use dbt to create a new table with just the timestamp values to reduce the I/O necessary to count them. Or maybe just use a materialized view. šŸ¤·šŸ»ā€ā™‚ļø
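A minimal sketch of those two options, using the same hypothetical `events` / `event_ts` names (not taken from the thread):

```sql
-- Option 1: index the loaded table directly
CREATE INDEX idx_events_event_ts ON events (event_ts);

-- Option 2: a materialized view with just the timestamp column,
-- cutting the I/O needed to count rows in a date range
CREATE MATERIALIZED VIEW events_ts_only AS
SELECT event_ts FROM events;

-- refresh after each load so the counts stay current
REFRESH MATERIALIZED VIEW events_ts_only;
```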
I'm a database guy, so I may have the wrong perspective, but this feels like a pretty standard database/query optimization problem, not so much a meltano/ELT question
j
> I'm a database guy, so I may have the wrong perspective, but this feels like a pretty standard database/query optimization problem, not so much a meltano/ELT question
right, I should have made my concern more explicit: I'm worried that if I put an index on the table meltano created for loading, things will break. but maybe that's not the case
> Can you define "large" and "slow"? And what are the specs on your RDS instance? The first step is to ensure that your expectations are correct.
so as you can see, the table has ~1M rows, and this simple query `SELECT COUNT(*) FROM table WHERE timestamp > '2022-01-01'` takes 90 seconds. I'm not a database expert, but that looks like too much to me - I would have expected something at least 10x faster.
there aren't many columns, like ~10 or so
the RDS lives on a `db.t3.micro` instance, so IIUC it has 2 vCPU and 1 GB RAM
t
For better or worse, whether adding an index will cause problems is probably not a meltano question... it's a question of how the target is implemented. Assuming you're using the default PG target, that's actually not produced by meltano. I know I'm being a bit pedantic here but this is the double-edged sword of meltano's reliance on F/OSS components... it makes a lot of things possible but creates a lot of potential problems that meltano can't actually help you fix. šŸ¤”
j
that's very helpful still @thomas_briggs, thanks a lot! I guess the only way to know is to make some experiments. my limited knowledge of PG is slowing me down here, will get better with time I hope šŸ™‚
t
90 seconds to scan 1M rows definitely seems slow. My experience with Postgres is limited, unfortunately, but usually that pattern (queries that scan lots of rows and take a long time) means inadequate I/O. And of course the best solution to that is indexing. šŸ™‚
well, easiest, anyway... "best" is relative
j
fair enough
t
For fun I would try a similar query on a different column, if you can... I've seen time zone conversion cause bizarre performance issues in other DBs so it's conceivable (to me) that the problem is filtering by a timestamp. Not saying that's a likely explanation, just saying it's a possible explanation. šŸ˜‰
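(one way to run that experiment, again with hypothetical names; a `status` column stands in for "any non-timestamp column":)

```sql
-- same shape of query, different filter column
SELECT COUNT(*) FROM events WHERE status = 'active';
-- if this is fast while the timestamp filter is slow over a similar
-- number of rows, the timestamp handling itself deserves a closer look
```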
j
well, I went ahead and added an index and magically those queries are blazing fast now šŸ”„ whether or not this will have consequences down the road, as you say, that's a problem between me and `pipelinewise-target-postgres`, not meltano. I will run the experiment with the other columns out of curiosity in any case šŸ™‚ thanks for your help!
t
My pleasure. I could talk DB optimization all day so if you run into anything else let me know. šŸ˜‰
a
try a `brin` index on `timestamp` - super slim. I optimized the crap out of RDS postgres on AWS before moving to a real OLAP database
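A sketch of that suggestion, with the same hypothetical names as above; BRIN indexes have a tiny footprint and work best when timestamp values correlate with physical row order, which is typical for append-only loads:

```sql
-- BRIN index on the timestamp column: very small, cheap to maintain
CREATE INDEX idx_events_event_ts_brin ON events USING brin (event_ts);

-- optionally tune the block-range granularity (default is 128 pages)
-- CREATE INDEX ... USING brin (event_ts) WITH (pages_per_range = 64);
```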
j
will do, thanks @alexander_butler!
happy ending: index tuning did the trick šŸ˜‰ https://dba.stackexchange.com/a/312194/251415