# troubleshooting
d
I’m trying out logical replication in postgres, and I’ve run into an issue where my replication slot keeps growing in size. Does anyone have any experience with this? This is likely not meltano-specific, but I figured someone here may know a thing or two. I’m using the transferwise variant of tap-postgres. My meltano job was originally running as the admin postgres user, then I created a new user specifically for it. I messed up some of the privileges initially, so the job started failing and the replication slot started using up disk space. I fixed the privileges and the job started completing successfully, but the space used by the replication slot kept growing. I provided a new `--job-id`, which caused a new failure with the snowflake target:
```
snowflake.connector.errors.ProgrammingError: 100080 (22000): Number of columns in file (19) does not match that of the corresponding table (20), use file format option error_on_column_count_mismatch=false to ignore this error
```
I dropped the existing table since it only had test data. The job then ran successfully, and there was no longer any space used by the slot. Over the weekend something happened causing my user to lose access (it’s a shared test db 🤷). I re-granted everything and the job started succeeding again, but once again the disk used by the replication slot kept growing. When this happened, I would see this as the final log message in my job:
```
tap-postgres-tables | time=2021-05-17 17:26:02 name=tap_postgres level=INFO message=Lastest wal message received was 30/310DEC38
```
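Side note for anyone debugging this: the `30/310DEC38` in that log line is a standard Postgres LSN (log sequence number): two hex numbers, the high and low 32 bits of a 64-bit byte position in the WAL. A small sketch (helper names are mine, not from the tap) for converting LSNs to byte offsets, which makes it easy to quantify how far a slot has fallen behind:

```python
def lsn_to_bytes(lsn: str) -> int:
    """Convert a Postgres LSN like '30/310DEC38' to an absolute byte offset.

    The part before the slash is the high 32 bits and the part after is
    the low 32 bits, both in hexadecimal.
    """
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def wal_lag_bytes(current_lsn: str, restart_lsn: str) -> int:
    """Approximate bytes of WAL retained between a slot's restart_lsn and now."""
    return lsn_to_bytes(current_lsn) - lsn_to_bytes(restart_lsn)


# e.g. compare the server's current WAL position against the slot's restart_lsn
lag = wal_lag_bytes("30/310DEC38", "30/31000000")
```

Postgres can compute the same thing server-side with `pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)` against the `pg_replication_slots` view, which is a handy way to watch whether a slot's retained WAL keeps growing.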
When I use a brand-new job id and delete the existing table so that everything works as expected, I’ll see this:
```
tap-postgres-tables | time=2021-05-17 22:15:22 name=tap_postgres level=INFO message=Waiting for first wal message
```
I guess my questions are:
1. Why does it sometimes not “clear out” the replication slot?
2. Why is there a column number mismatch when I use a new job id?
Thanks!
t
@aaronsteers or @douwe_maan any ideas?
d
I don't know tap-postgres's internals well enough to be able to diagnose this, I'm afraid
a
@dean_morin - Firstly, I suggest enabling the mentioned setting of `error_on_column_count_mismatch=false` to ignore this error. Extra columns on the target generally shouldn’t be a problem for you, and it sounds like the target may be over-aggressively trying to protect you from changing data schemas. There may be metadata columns added selectively depending on other settings (or whether any deletions occurred), and I don’t think that check has much practical value. That said, just to be sure, I do recommend keeping an eye on the column list to see whether new columns are being added, and which ones.

Regarding the postgres replication slot and disk space used by WAL (write-ahead log) messages, I’m afraid I can’t speak to those. Are you also in the Singer slack? You might try pinging the #tap-postgres or #target-postgres channels for advice from folks using Postgres regularly. Most likely it sounds like you need to manage how much WAL to keep and for how long (it’s stored separately from table data), and I’m afraid I don’t have any experience doing that on Postgres specifically.
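A low-tech way to “keep an eye on the column list” is to diff the columns in the staged file against the target table. A sketch with made-up column names (not from this thread); the extra-column culprit is often a tap/target metadata column, e.g. a soft-delete marker like `_sdc_deleted_at`:

```python
# Hypothetical column lists for illustration only.
file_columns = {"id", "name", "updated_at"}
table_columns = {"id", "name", "updated_at", "_sdc_deleted_at"}

# Columns present in the table but missing from the staged file,
# and vice versa. A non-empty set on either side explains a
# "number of columns does not match" style error.
only_in_table = table_columns - file_columns
only_in_file = file_columns - table_columns

print(sorted(only_in_table))  # columns the file is missing
print(sorted(only_in_file))   # columns the table is missing
```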
d
thank you @aaronsteers! I’ll try out that setting and join the Singer slack
For posterity, there were a couple of things contributing to this:
1. I had `logical_poll_total_seconds` set to `5`. My job runs every 5 minutes, so I don’t want to keep polling for data on the replication slot if there’s nothing new. However, this value was so short that it was timing out while trying to read the data that actually was there.
2. I had overlapping jobs running. I’m temporarily using AWS cloudevents to trigger these jobs. It was triggering a second job while the first one was still running.
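For anyone hitting the same thing, the first fix maps to the tap’s config. A sketch of the relevant meltano.yml fragment, assuming the transferwise/pipelinewise variant of tap-postgres (the value 300 is illustrative, not the one used in this thread):

```yaml
plugins:
  extractors:
    - name: tap-postgres
      config:
        # How long the tap polls the replication slot for new WAL messages.
        # Too low and it can time out before reading data that is actually
        # pending; tune this to your batch window.
        logical_poll_total_seconds: 300
```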