# troubleshooting
m
Is this a timeout against the Meltano system database? (We’re using AWS RDS Postgres for that.) Is there an obvious thing to look into here? (This job errored out after ~2 hours, but we’ve had a number of other jobs run fine for longer durations.)
2023-02-15T17:24:04.365968Z [error    ] (psycopg2.OperationalError) could not receive data from server: Connection timed out
SSL SYSCALL error: Connection timed out

[SQL: SELECT runs.state AS runs_state, runs.id AS runs_id, runs.job_name AS runs_job_name, runs.run_id AS runs_run_id, runs.started_at AS runs_started_at, runs.last_heartbeat_at AS runs_last_heartbeat_at, runs.ended_at AS runs_ended_at, runs.payload AS runs_payload, runs.payload_flags AS runs_payload_flags, runs.trigger AS runs_trigger 
FROM runs 
WHERE runs.state = %(param_1)s AND (runs.last_heartbeat_at IS NOT NULL AND runs.last_heartbeat_at < %(last_heartbeat_at_1)s OR runs.last_heartbeat_at IS NULL AND runs.started_at < %(started_at_1)s) AND runs.job_name = %(job_name_1)s]
[parameters: {'param_1': 'RUNNING', 'last_heartbeat_at_1': datetime.datetime(2023, 2, 15, 17, 2, 51, 894075), 'started_at_1': datetime.datetime(2023, 2, 14, 17, 7, 51, 894075), 'job_name_1': 'dev:tap-google-analytics-1-to-target-redshift'}]
(Background on this error at: https://sqlalche.me/e/14/e3q8)
c
Yup. The system DB connection looks to have been idle for a long time, and something on the network tore down the TCP session, based on what other people are saying: https://stackoverflow.com/a/22701976/6056177

Keepalives might help you out, depending on which element in the end-to-end TCP path is actually responsible for the teardown. If you know whether there is any "stateful" network device in your architecture, you could probably pinpoint it and deal with the issue more directly: https://www.postgresql.org/docs/current/libpq-connect.html#LIBPQ-KEEPALIVES-IDLE

I'll take a stab at it and guess this is a "containerized" environment?
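For the client-side route, something like this is what I mean (just a sketch; the host, credentials, and timing values are placeholders, and keepalives / keepalives_idle / keepalives_interval / keepalives_count are the standard libpq connection parameters that psycopg2 passes through):

```python
from sqlalchemy import create_engine, text

# Placeholder host/credentials; the libpq keepalive parameters are handed
# straight to psycopg2 via connect_args.
engine = create_engine(
    "postgresql+psycopg2://user:password@my-rds-host:5432/meltano",
    connect_args={
        "keepalives": 1,            # enable TCP keepalives
        "keepalives_idle": 60,      # seconds of idle before the first probe
        "keepalives_interval": 10,  # seconds between probes
        "keepalives_count": 5,      # failed probes before the OS drops the socket
    },
)

with engine.connect() as conn:
    conn.execute(text("SELECT 1"))
```

Since Meltano builds the engine itself, the same parameters can usually be appended as query-string options on the Postgres URI you point it at (e.g. `?keepalives=1&keepalives_idle=60`), but worth double-checking the Meltano docs for the system DB URI setting.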
Keepalives can also be set on the server side via the RDS parameter group instead of on the libpq client side, which might be an easier and quicker way to enable them: https://www.postgresql.org/docs/current/runtime-config-connection.html#RUNTIME-CONFIG-CONNECTION-SETTINGS https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_WorkingWithParamGroups.html
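On the RDS side that would look roughly like this (a sketch, assuming the instance is attached to a custom parameter group; the group name and values are made up, and tcp_keepalives_idle / _interval / _count are the standard Postgres settings):

```python
import boto3

rds = boto3.client("rds")

# Placeholder parameter group name; the RDS instance must use a custom
# (non-default) parameter group for these settings to be modifiable.
rds.modify_db_parameter_group(
    DBParameterGroupName="my-meltano-postgres-params",
    Parameters=[
        {"ParameterName": "tcp_keepalives_idle", "ParameterValue": "60", "ApplyMethod": "immediate"},
        {"ParameterName": "tcp_keepalives_interval", "ParameterValue": "10", "ApplyMethod": "immediate"},
        {"ParameterName": "tcp_keepalives_count", "ParameterValue": "5", "ApplyMethod": "immediate"},
    ],
)
```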
m
I’ll take a stab at it and guess this is a “containerized” environment?
I’m running these jobs in GitHub actions, but I’m not running them within a container there.
This issue occurs for me consistently when trying to load Google Analytics data with tap-google-analytics - it's the first (historical) load, so there's a lot of data; maybe there's a long-running API call?
c
I’m running these jobs in GitHub actions, but I’m not running them within a container there.
I was just putting a hypothesis out there that some container networking might be what's tearing down the idle TCP stream. Sounds like that's not the case. Is there a good old "Enterprise Firewall" between the GitHub runner and the AWS RDS instance? Or is it a GitHub-hosted (Microsoft-managed) runner?
This issue occurs for me consistently when trying to load Google Analytics data with tap-google-analytics - it's the first (historical) load, so there's a lot of data; maybe there's a long-running API call?
Well, the hypothesis here is that the Meltano system DB connection (managed via SQLAlchemy) is being kept open for the whole duration of the tap-to-target invocation (i.e. the "run", not sure what the correct terminology for it is 😂).

Then some stateful network device sees that the TCP stream from the GitHub runner to the AWS RDS endpoint hasn't had any traffic for a while (because Meltano is waiting for that one long-running stream to finish so that it can save the bookmark for it in the state table), and that device probably has some idle-TCP-session logic that tears the session down.

When the long-running stream finally completes and Meltano wants to write the state information for the stream into the system DB, that's when the TCP stack on the GitHub runner says: "nope, sorry, that TCP socket is not there anymore."

I could be wildly off with this speculation, but that is what it smells like to me.
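If you want to test that hypothesis directly, a rough sketch (the DSN and the idle duration here are placeholders) would be to open a connection from a GitHub runner, let it sit idle for longer than the suspected timeout, then run a query - once without keepalives and once with them:

```python
import time
import psycopg2

DSN = "postgresql://user:password@my-rds-host:5432/meltano"  # placeholder

# Run once with the keepalive kwargs and once without them to compare.
conn = psycopg2.connect(
    DSN,
    keepalives=1,
    keepalives_idle=60,
    keepalives_interval=10,
    keepalives_count=5,
)

# Simulate a long-running stream during which the system DB sees no traffic.
time.sleep(30 * 60)

with conn.cursor() as cur:
    # Raises psycopg2.OperationalError if the idle session was torn down.
    cur.execute("SELECT 1")
    print(cur.fetchone())

conn.close()
```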