mert_bakir
07/18/2023, 2:16 PM
[2023-07-18, 09:38:26 UTC] {subprocess.py:92} INFO - 2023-07-18T09:38:26.414445Z [info ] time=2023-07-18 09:38:26 name=tap_postgres level=INFO message=Beginning sync of stream(public-table) with sync method(logical_initial) cmd_type=elb consumer=False name=tap-postgres producer=True stdio=stderr string_id=tap-postgres
[2023-07-18, 09:38:26 UTC] {subprocess.py:92} INFO - 2023-07-18T09:38:26.460792Z [info ] time=2023-07-18 09:38:26 name=tap_postgres level=INFO message=Performing initial full table sync cmd_type=elb consumer=False name=tap-postgres producer=True stdio=stderr string_id=tap-postgres
....
[2023-07-18, 08:04:49 UTC] {subprocess.py:92} INFO - 2023-07-18T08:04:49.404981Z [info ] Incremental state has been updated at 2023-07-18 08:04:49.404844.
....
[2023-07-18, 08:30:20 UTC] {subprocess.py:92} INFO - 2023-07-18T08:30:20.142742Z [info ] time=2023-07-18 08:30:20 name=target_postgres level=INFO message=Loading 66838 rows into 'public."table"' cmd_type=elb consumer=True name=target-postgres producer=False stdio=stderr string_id=target-postgres
[2023-07-18, 08:30:22 UTC] {subprocess.py:92} INFO - 2023-07-18T08:30:22.181746Z [info ] time=2023-07-18 08:30:22 name=target_postgres level=INFO message=Loading into public."table": {"inserts": 0, "updates": 66838, "size_bytes": 15638185} cmd_type=elb consumer=True name=target-postgres producer=False stdio=stderr string_id=target-postgres
The job ends successfully.
The second job doesn't do log-based replication either; it again starts with an initial full table sync. It finds that the table already exists in the target database and then updates all the records in batches. Every subsequent job does the same: initial full table sync, updating all records.
Question 1: Why is this the case? Same configuration, same database; every other table - with thousands of rows, not even a million - has been running on log-based replication for months.
In the meantime, the WAL size on the source keeps growing. pg_wal was huge (50 GB); looking at disk usage over time, the growth started just after I included the above table in the tap config.
I am no expert here, but I assume Postgres didn't recycle the logs because they haven't been consumed by any client yet. When we create a replication slot on PostgreSQL we use select pg_create_logical_replication_slot('pipelinewise_<database_name>', 'wal2json');. The slot is defined per database, not per table, right?
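A quick way to see how much WAL that slot is holding back (a sketch, assuming PostgreSQL 10 or newer and the pipelinewise_<database_name> slot created above):

-- per-slot retained WAL; a slot that nothing consumes pins restart_lsn, so pg_wal keeps growing
select slot_name,
       active,
       restart_lsn,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) as retained_wal
from pg_replication_slots;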
Question 2: Before I included the table in the tap config, wasn't Postgres already generating WAL for that table? Or did it only start writing WAL for that table after some client (tap-postgres) tried to read it?
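For what it's worth, wal_level is a cluster-wide setting, so the extra logical-decoding WAL is written for every table regardless of which tables a client reads; a minimal check on the source (nothing assumed beyond an ordinary session):

-- must be 'logical' for wal2json / logical replication slots to work;
-- at this level the server adds the logical-decoding info for every table, consumer or not
show wal_level;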
janis_puris
07/18/2023, 4:10 PM
tap-postgres utilises the publications, hence I could be way off here :3
mert_bakir
07/25/2023, 11:56 AM
plugins:
  extractors:
    - name: tap-postgres--view-01
      inherit_from: tap-postgres--select-schema
      config:
        ssl: false
        filter_schemas: public
        default_replication_method: LOG_BASED
        logical_poll_total_seconds: 600
        max_run_seconds: 1500
        break_at_end_lsn: true

plugins:
  extractors:
    - name: tap-postgres--select-schema
      inherit_from: tap-postgres
      select:
        - public-table1.*
        - public-table2.*
        - public-table3.*
This was my configuration. The select list had hundreds of tables. It had been running LOG_BASED replication for months.
Then I added another table, table-x, and the new table never started log-based replication; it always repeated the initial full table sync, as I explained in the first post.
Today I figured out that every new table (I tried different tables and different source DBs) shows the same behaviour. At each run, it logs Beginning sync of stream(public-table1) with sync method(logical_initial).
Then I removed all the existing tables (table-1, table-2, table-3) from the select list and added only table-x. After performing the full table sync once, it started running log-based replication on the next runs.
Then I added back the previously removed tables (table-1, table-2, table-3); now they are stuck performing the initial full table sync, while table-x works fine.
I think there is a problem with updating the state. I don't know enough about Meltano to figure this out alone.