# troubleshooting
**ryan_andrei_perez:** Does anyone have trouble with the `pipelinewise` variant of `tap-postgres` suddenly not working with either one of the two WALs in a single database instance? There is data on the WAL, but the pipeline just says `Waiting for first wal message`.
**Raül Bonet:** Hello @ryan_andrei_perez. Whenever that happened to us, it has always been that the Postgres instance was having (a lot of) trouble allocating resources to deliver WAL messages, either because the database was busy or because there were very big transactions, and loading those requires a lot of resources. That got confirmed in the end by manually logging in to the Postgres instance and using the function `pg_logical_slot_peek_binary_changes()`. Even setting the `upto_lsn` to 1 led to the function taking about 45 minutes to start showing content, when it is usually immediate. In our case, we believe a very big transaction was causing the problem in rendering some messages; once those messages were consumed, everything went back to normal.
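For reference, a minimal sketch of that manual check, peeking at a slot without consuming its changes. The slot name `pipelinewise_slot` is a placeholder, not from the thread; look up your actual slot name in `pg_replication_slots`:

```sql
-- Sketch: peek at pending changes on a logical replication slot without
-- consuming them. 'pipelinewise_slot' is a placeholder slot name.
-- Limiting upto_nchanges keeps the call small, although a huge in-flight
-- transaction can still make it very slow to return anything.
SELECT lsn, xid, data
FROM pg_logical_slot_peek_binary_changes(
    'pipelinewise_slot',  -- slot_name
    NULL,                 -- upto_lsn (NULL = no LSN limit)
    1                     -- upto_nchanges (stop after the first change)
);
```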
**ryan_andrei_perez:** Hi @Raül Bonet! Appreciate the response! Did you guys do something to the state (or config), or just wait? If it's the latter... this one has been stuck on `Waiting for first wal message` since last Friday.
Peeking at the changes does seem like an interesting approach to debug this further, I'll check that!
**Raül Bonet:** No problem, happy to try to save you the nightmare that debugging this was for us, haha. The only thing we did manually was unset a config parameter that made the pipeline stop after some time (`max_run_seconds`, I think). Otherwise the pipeline would stop, start again, and all the progress was lost. Apart from that, yes, just wait, but in our case it was maybe 1 hour maximum... Other things you can check: at the beginning of the execution, the pipeline says something like `requesting WAL messages up to LSN <something>`. You can use the `pg_replication_slots` view to see `confirmed_flush_lsn` and what the state of the slot is. But yeah, definitely try peeking changes! If that also takes forever, the problem is in the database itself, not your pipeline.
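A quick sketch of that slot check; column names are as in the `pg_replication_slots` view, and `pg_wal_lsn_diff()` gives the decoding backlog in bytes:

```sql
-- Sketch: inspect the slot state and how far behind the consumer is.
-- confirmed_flush_lsn is the last position the client has acknowledged;
-- a large gap to pg_current_wal_lsn() means a lot of WAL still to decode.
SELECT slot_name,
       plugin,
       active,
       restart_lsn,
       confirmed_flush_lsn,
       pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn) AS lag_bytes
FROM pg_replication_slots;
```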
**ryan_andrei_perez:** You're a lifesaver, man! In our case it was not `max_run_seconds` that was causing the pipeline to be cut off (it was already at 3600 seconds) but `logical_poll_total_seconds`, which was set to 120 seconds, and it was also a case of a heavy transaction.
**visch:** Great info here, just wanted to say that's amazing. @Raül Bonet would it be ok for me to capture this and throw it in the readme for MeltanoLabs' `tap-postgres`? I know it's not the same tap, but I'm pretty sure we'd hit the same thing too!
I wonder if it's caused by the `wal2json` plugin; I have some reservations about that plugin myself, but at the end of the day I don't want to run a different one 😕 so we used the same thing.
**Raül Bonet:** Hello @visch. Sorry, I have been sick and not able to follow up on this! Yes, please. I must say a huge part of the merit goes to @maarten_van_gijssel (GitHub here), my tech lead at HackerOne, with whom we debugged this (in case you want to acknowledge somebody 😉). About `wal2json`: we had the same theory. The version of the tap that we are using does not leverage the `write-in-chunks` option either, which probably only makes things worse. I am very curious about your reservations and about why you don't want to run a different one. We were concerned about its efficiency in the decoding process; in the end, a binary format like `pgoutput` should be faster. Could you elaborate on your reasons?
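For what it's worth, `wal2json` options such as `write-in-chunks` can also be passed when peeking at a slot manually, which can help confirm whether one giant transaction is the culprit. A rough sketch, again with a placeholder slot name, and assuming the slot was created with the `wal2json` plugin using format version 1 (where `write-in-chunks` applies):

```sql
-- Sketch: peek with wal2json's write-in-chunks option so a huge transaction
-- is emitted in smaller pieces instead of one enormous JSON document.
-- 'pipelinewise_slot' is a placeholder; options are passed as name/value pairs.
SELECT lsn, xid, data
FROM pg_logical_slot_peek_changes(
    'pipelinewise_slot',
    NULL,                  -- upto_lsn
    10,                    -- upto_nchanges
    'write-in-chunks', '1'
);
```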
**visch:** Created https://github.com/MeltanoLabs/tap-postgres/issues/402 to document this at some point in the tap readme. Thanks for the heads up, and thank you @maarten_van_gijssel 😄 I think you're right @Raül Bonet that using `pgoutput` would be a better option for the tap. The reason we didn't go with it is that I couldn't find a Python lib (quickly, that is, as I'm sure one exists) that would do the work of translating the binary format for us; `wal2json` did the job by getting us to a serialization format. I'm not a Postgres expert when it comes to the WAL. My limited personal experience here normally would have led me to just use Debezium, but there are a lot of benefits to doing this on the Singer side in Python. I knew this would be a deep knowledge dive, so we went with something that'd get us up and running quickly.