# troubleshooting
m
We're running into an issue where meltano is failing because of
SSL SYSCALL error: Connection timed out
and then
During handling of the above exception, another exception occurred:
FileNotFoundError: [Errno 2] No such file or directory:
/project/.meltano/run/elt/...
Because of this, the run isn't marked as failed. So when we try to run it again it fails because meltano thinks there is another run of the same job still in progress. Has anyone dealt with this before? Any suggestions?
v
Need more context: full logs, what you're running when this happens, etc.
m
Sure thing:
• we're running meltano elt --state-id=some-id tap-postgres target-snowflake
  ◦ variants are transferwise
• it's running on ECS using Fargate containers
Logs are attached. Thanks!
v
I can only guess with that info; the logs look like the server can't access the postgres database. More information we'd need: how is Fargate actually getting called to run this? The postgres access issue is potentially networking / IAM roles, but it could be a number of things. Where I'd start is verifying those containers can access the database.
e
Yeah, I'd verify that the Meltano container can reach the db. I've logged https://github.com/meltano/meltano/issues/8733 too.
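If it helps, something like this run from inside the container is a quick way to check reachability. It's only a sketch: the host, database name, and credentials are placeholders, not your actual settings.
```python
# Quick reachability check to run from inside the Meltano container.
# All connection values below are placeholders; psycopg2 is assumed to be
# available in the container alongside Meltano's postgres support.
import sys

import psycopg2

try:
    conn = psycopg2.connect(
        host="your-db-endpoint.example.com",
        port=5432,
        dbname="your_db",
        user="your_user",
        password="...",
        connect_timeout=10,  # fail fast instead of hanging on a dead route
    )
    with conn.cursor() as cur:
        cur.execute("SELECT 1")
        print("connected:", cur.fetchone())
    conn.close()
except psycopg2.OperationalError as exc:
    print("could not reach the database:", exc)
    sys.exit(1)
```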
m
Fair enough. I guess I didn't mention that this is intermittent; it seems to happen randomly, which points to a networking issue that causes a failure AND then won't allow that failure to be logged because of the same networking issue. It looks like it's losing the connection to the meta-db between when it creates the run (the id is logged in the run table) and actually starting the replication.
👀 1
d
I was working with Matt on this and we confirmed that there was an issue with the RDS instance the metadata db lives on. RDS auto-recovery was kicked off (probably due to a hardware failure), which took about 5 minutes to complete. However, the connection in meltano took around 2 hours to time out, at which point we saw this error. I'm guessing it didn't try to re-connect to mark the run as failed because of the nature of the error?
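Side note for anyone who hits this later: 2 hours is suspiciously close to the Linux default tcp_keepalive_time of 7200s, which points at a half-open connection that only gets noticed once keepalives finally fire. Enabling keepalives plus a connect timeout on the metadata-DB connection should cut that down a lot. A rough sketch of the parameters involved; this is not Meltano's actual engine setup, the endpoint and values are illustrative, and in Meltano itself the equivalent would presumably be query parameters on the database_uri setting.
```python
# Sketch: shorten the hang after an RDS failover with TCP keepalives and a
# connect timeout on the metadata-DB engine. Not Meltano's own code; it only
# shows which psycopg2/libpq parameters are involved.
from sqlalchemy import create_engine, text

engine = create_engine(
    "postgresql+psycopg2://meltano:password@metadata-db.example.com:5432/meltano",
    pool_pre_ping=True,           # detect dead pooled connections before reuse
    connect_args={
        "connect_timeout": 10,    # give up on connecting after 10 seconds
        "keepalives": 1,          # libpq TCP keepalive parameters
        "keepalives_idle": 60,
        "keepalives_interval": 10,
        "keepalives_count": 5,
    },
)

with engine.connect() as conn:
    conn.execute(text("SELECT 1"))
```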
e
What version of Meltano is this? (I'm trying to figure out where exactly it's failing and come up with an MRE.)
d
v2.20.0, we still haven't migrated to v3
e
Oh gotcha
Ok, so that probably means Meltano is on SQLAlchemy 1.4, but the error seems to be coming from the more stable parts of their API, so I don't think bumping versions would make things better here... The error is ultimately a psycopg2.OperationalError crashing things here: https://github.com/meltano/meltano/blob/4bda2aa5ae8d260f5d031c00cce77fb0b478af2f/src/meltano/core/settings_store.py#L933-L947
So I wonder if:
1. We should try catching more exceptions there, but it's not clear to me which ones (rough sketch of what I mean below)
2. The psycopg3 adapter would handle things better, but that does require a bump to v3 🤔
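Something like this is roughly the shape I have in mind for (1). It's only a sketch: the helper name, the SQL, and the retry policy are made up for illustration, not the actual settings_store.py code.
```python
# Sketch of option 1: retry the metadata-DB write when the connection drops,
# instead of letting the error escape and leave the run looking "in progress".
# _mark_run_failed, the SQL, and the retry policy are hypothetical.
import time

from sqlalchemy import text
from sqlalchemy.exc import OperationalError  # wraps psycopg2.OperationalError


def _mark_run_failed(session_factory, run_id, attempts=3, delay=5.0):
    for attempt in range(1, attempts + 1):
        try:
            with session_factory() as session:
                session.execute(
                    text("UPDATE runs SET state = 'FAILED' WHERE id = :id"),
                    {"id": run_id},
                )
                session.commit()
            return
        except OperationalError:
            # Connection-level failure (e.g. the SSL SYSCALL timeout above);
            # back off and retry on a fresh connection from the pool.
            if attempt == attempts:
                raise
            time.sleep(delay)
```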
🙌 1
d
I'm finally going to have time to work on the v3 update in the next few weeks, so I'll get that much done at least
👌 1
To follow up here, we're now running Meltano 3.4.2 and are still seeing this issue
v
can you post the additional information now that you've upgraded? logs, debug logs, meltano.yml, etc.
e
also @dean_morin, are you using psycopg3?
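(For context on why I'm asking: on the SQLAlchemy side the v3 switch is just a different driver name in the URL. Sketch below with placeholder credentials; whether your Meltano install actually bundles the psycopg v3 driver is something to verify.)
```python
# Sketch: psycopg2 vs psycopg (v3) differ only in the SQLAlchemy driver name.
# Placeholders throughout; the psycopg (v3) dialect requires SQLAlchemy 2.0+.
PSYCOPG2_URI = "postgresql+psycopg2://meltano:password@metadata-db.example.com:5432/meltano"
PSYCOPG3_URI = "postgresql+psycopg://meltano:password@metadata-db.example.com:5432/meltano"
```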
d
We're using psycopg2-binary 2.9.9. Added with:
poetry add meltano@3.4.2 --extras psycopg2
I'll get back to you on those other details
👍 1
Hey sorry for the delay, I'll DM you the files @visch
v
Can you just put them here? The odds of me having the time to go do this for you aren't high, and we try to have the community help.
d
Definitely, here they are