# best-practices
j
hi folks, very basic question here - I'm using `meltano run` to load a large PostgreSQL table, and queries on it are taking a long time, even simple ones (see the linked analysis of a `SELECT COUNT(*) FROM table WHERE timestamp > '2022-01-01'` for reference). from my limited knowledge of databases, I understand that setting proper indexes on the columns I query most will improve performance. what would the best practice be here? running a data transformation with dbt that creates a new table, essentially duplicating the data but with proper indexes? to be clear, I'm concerned that if I create an index on the table meltano created for loading, things will break. but maybe that's not the case?
reverse-FAQ:
• "did you do xyz to make sure that the problem is not latency etc?" - have a look at the analysis of the query
• "did you tune the database?" - I have no idea where to start šŸ™ƒ it's a PostgreSQL running on AWS RDS, for the record
• "did you do a `VACUUM ANALYZE`?" - yep
• "why don't you use <insert cool OLAP db technology>?" - while I'm open to changing the underlying technology, that doesn't address my specific problem with PostgreSQL queries being slow
t
Can you define "large" and "slow"? And what are the specs on your RDS instance? The first step is to ensure that your expectations are correct. šŸ˜‰
Whether indexes will help depends on the nature of the queries and the data. If most of the rows have a timestamp value greater than 2022-01-01 then an index won't help, for example (it might actually make things worse).
From the query plan info you linked to it doesn't look like the table has a lot of rows. Does it have a lot of columns? I don't think it'll hurt to put an index directly on the table and then see how the query(s) perform. I think the "best practice" will depend on the queries and the data though. For example if the table has lots of columns it might be worthwhile to use dbt to create a new table with just the timestamp values to reduce the I/O necessary to count them. Or maybe just use a materialized view. šŸ¤·šŸ»ā€ā™‚ļø
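A minimal sketch of those two options, using the same hypothetical `events` / `event_ts` names (not taken from the thread):

```sql
-- Option 1: index the loaded table directly
CREATE INDEX idx_events_event_ts ON events (event_ts);

-- Option 2: a materialized view with just the timestamp column,
-- cutting the I/O needed to count rows in a date range
CREATE MATERIALIZED VIEW events_ts_only AS
SELECT event_ts FROM events;

-- refresh after each load so the counts stay current
REFRESH MATERIALIZED VIEW events_ts_only;
```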
I'm a database guy, so I may have the wrong perspective, but this feels like a pretty standard database/query optimization problem, not so much a meltano/ELT question
j
> I'm a database guy, so I may have the wrong perspective, but this feels like a pretty standard database/query optimization problem, not so much a meltano/ELT question
right, I should have made my concern more explicit: I'm worried that if I put an index on the table meltano created for loading, things will break. but maybe that's not the case
> Can you define "large" and "slow"? And what are the specs on your RDS instance? The first step is to ensure that your expectations are correct.
so as you can see, the table has ~1M rows, and this simple query `SELECT COUNT(*) FROM table WHERE timestamp > '2022-01-01'` takes 90 seconds. I'm not a database expert, but that looks like too much to me - I would have expected something at least 10x faster.
there aren't many columns, like ~10 or so
the RDS lives on a `db.t3.micro` instance, so IIUC it has 2 vCPU and 1 GB RAM
t
For better or worse, whether adding an index will cause problems is probably not a meltano question... it's a question of how the target is implemented. Assuming you're using the default PG target, that's actually not produced by meltano. I know I'm being a bit pedantic here but this is the double-edged sword of meltano's reliance on F/OSS components... it makes a lot of things possible but creates a lot of potential problems that meltano can't actually help you fix. šŸ¤”
j
that's very helpful still @thomas_briggs, thanks a lot! I guess the only way to know is to make some experiments. my limited knowledge of PG is slowing me down here, will get better with time I hope šŸ™‚
t
90 seconds to scan 1M rows definitely seems slow. My experience with Postgres is limited, unfortunately, but usually that pattern (queries that scan lots of rows and take a long time) means inadequate I/O. And of course the best solution to that is indexing. šŸ™‚
well, easiest, anyway... "best" is relative
j
fair enough
t
For fun I would try a similar query on a different column, if you can... I've seen time zone conversion cause bizarre performance issues in other DBs so it's conceivable (to me) that the problem is filtering by a timestamp. Not saying that's a likely explanation, just saying it's a possible explanation. šŸ˜‰
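(one way to run that experiment, again with hypothetical names; a `status` column stands in for "any non-timestamp column":)

```sql
-- same shape of query, different filter column
SELECT COUNT(*) FROM events WHERE status = 'active';
-- if this is fast while the timestamp filter is slow over a similar
-- number of rows, the timestamp handling itself deserves a closer look
```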
j
well, I went ahead and added an index and magically those queries are blazing fast now šŸ”„ whether or not this will have consequences down the road, as you say, that's a problem between me and `pipelinewise-target-postgres`, not meltano. I will run the experiment with the other columns out of curiosity in any case šŸ™‚ thanks for your help!
t
My pleasure. I could talk DB optimization all day so if you run into anything else let me know. šŸ˜‰
a
try a `brin` index on `timestamp` - super slim. I optimized the crap out of RDS postgres on AWS before moving to a real OLAP database
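A sketch of that suggestion, with the same hypothetical names as above; BRIN indexes have a tiny footprint and work best when timestamp values correlate with physical row order, which is typical for append-only loads:

```sql
-- BRIN index on the timestamp column: very small, cheap to maintain
CREATE INDEX idx_events_event_ts_brin ON events USING brin (event_ts);

-- optionally tune the block-range granularity (default is 128 pages)
-- CREATE INDEX ... USING brin (event_ts) WITH (pages_per_range = 64);
```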
j
will do, thanks @alexander_butler!
happy ending: index tuning did the trick šŸ˜‰ https://dba.stackexchange.com/a/312194/251415