# best-practices
d
Maybe a dumb question, but when picking a type of compute, would Meltano benefit from a memory- or CPU-optimized host?
In my head, memory-optimized compute would make more sense.
v
Compute, in almost all scenarios. The buffer is low in Meltano and most of the taps are optimized for super low memory use.
I honestly don't think about it when deploying things until I hit an issue
d
Cool. The reason I ask is that initial syncs are taking a very long time on tables ~10M rows in size, and I'm looking for ways to optimize. I've worked with itersize and the loader's batch size, but I'm looking for additional ways to tweak.
Your reply makes me wonder whether a lower batch/itersize plus compute optimization is better than a higher batch/itersize.
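(For reference, itersize is a setting on the pipelinewise tap-postgres variant controlling the server-side cursor fetch size, so it can be tuned from the CLI. A sketch, assuming that variant:)

```sh
# Rows fetched per round trip by tap-postgres's server-side cursor
# (pipelinewise variant; defaults to 20000 there)
meltano config tap-postgres set itersize 20000
```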
t
I think the answer varies depending on the scenario. The postgres target, for example, buffers rows for all streams until there are 100k rows for a particular stream, then flushes that one stream. So if you are replicating a lot of tables with lots of rows (which we are), then target-postgres will suck up gigs and gigs of memory.
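That 100k threshold corresponds to the batch_size_rows setting in the pipelinewise target-postgres, so lowering it trades some throughput for a smaller per-stream memory footprint. A sketch, assuming that variant:

```sh
# Flush each stream after 20k rows instead of the 100k default,
# capping how much target-postgres buffers per stream
meltano config target-postgres set batch_size_rows 20000
```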
Replicating tables with a lot of rows is a different issue. I spent a bunch of time a couple of months ago attempting to tune things in our environment and eventually concluded that (a) the Python MySQL client is slow, and (b) the Python JSON parser is slow. More compute doesn't help; it's just slow. So we actually do the initial setup outside Meltano, then set the state data appropriately and use Meltano to replicate changes after that. Otherwise it would take us weeks to stand up a new environment, which is obviously not practical.
d
How did you migrate state? Copy the state file over?
t
By tweaking the state data in the meltano database using SQL 😱 But there's a new `state` command coming (or already released? not sure) that should make that a little cleaner...
v
already released 🙂
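It looks roughly like this; the state ID format and the bookmark JSON below are illustrative Singer-style assumptions, not copied from a real project:

```sh
# List the state IDs Meltano knows about, then inspect one
meltano state list
meltano state get dev:tap-postgres-to-target-postgres

# Seed state after an out-of-band initial load
# (hypothetical stream name and bookmark value)
meltano state set dev:tap-postgres-to-target-postgres \
  '{"bookmarks": {"public-users": {"replication_key": "updated_at", "replication_key_value": "2022-06-01T00:00:00Z"}}}'
```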
d
Jeez, big woof on me for not realizing, but is state 100% controlled in the DB?
I was literally up last night trying to figure out how Meltano maintains state after I rebuild my Docker images from a clean slate.
t
It's stored in the DB when the job finishes, yeah. The payload field of the job table, IIRC.
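If you want to see it for yourself, something like this works against the default SQLite system database (a sketch; the path assumes a local project, and the schema has changed across Meltano versions):

```sh
# Show the most recent successful run's state payload
# (job table / payload column per older Meltano schemas)
sqlite3 .meltano/meltano.db \
  "SELECT job_id, ended_at, payload FROM job WHERE state = 'SUCCESS' ORDER BY ended_at DESC LIMIT 1;"
```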
d
You're the best 😄
t
That field actually contains JSON, but JSON is ultimately just a string, so if you're careful you can use SQL to manipulate it 😛
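A sketch of that kind of surgery using SQLite's JSON1 functions; the singer_state wrapper, stream name, and job_id here are assumptions, so check your actual payload first (and back up the database):

```sh
# Rewrite one bookmark inside the state JSON (hypothetical names throughout)
sqlite3 .meltano/meltano.db <<'SQL'
UPDATE job
SET payload = json_set(
  payload,
  '$.singer_state.bookmarks."public-users".replication_key_value',
  '2022-06-01T00:00:00Z'
)
WHERE job_id = 'dev:tap-postgres-to-target-postgres';
SQL
```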