mert_bakir
07/18/2023, 2:16 PM
[2023-07-18, 09:38:26 UTC] {subprocess.py:92} INFO - 2023-07-18T09:38:26.414445Z [info ] time=2023-07-18 09:38:26 name=tap_postgres level=INFO message=Beginning sync of stream(public-table) with sync method(logical_initial) cmd_type=elb consumer=False name=tap-postgres producer=True stdio=stderr string_id=tap-postgres
[2023-07-18, 09:38:26 UTC] {subprocess.py:92} INFO - 2023-07-18T09:38:26.460792Z [info ] time=2023-07-18 09:38:26 name=tap_postgres level=INFO message=Performing initial full table sync cmd_type=elb consumer=False name=tap-postgres producer=True stdio=stderr string_id=tap-postgres
....
[2023-07-18, 08:04:49 UTC] {subprocess.py:92} INFO - 2023-07-18T08:04:49.404981Z [info ] Incremental state has been updated at 2023-07-18 08:04:49.404844.
....
[2023-07-18, 08:30:20 UTC] {subprocess.py:92} INFO - 2023-07-18T08:30:20.142742Z [info ] time=2023-07-18 08:30:20 name=target_postgres level=INFO message=Loading 66838 rows into 'public."table"' cmd_type=elb consumer=True name=target-postgres producer=False stdio=stderr string_id=target-postgres
[2023-07-18, 08:30:22 UTC] {subprocess.py:92} INFO - 2023-07-18T08:30:22.181746Z [info ] time=2023-07-18 08:30:22 name=target_postgres level=INFO message=Loading into public."table": {"inserts": 0, "updates": 66838, "size_bytes": 15638185} cmd_type=elb consumer=True name=target-postgres producer=False stdio=stderr string_id=target-postgres
Job ends successfully.
The second job doesn't do log-based replication either; it starts again as an initial full table sync, finds that the table already exists in the database, and then updates all the records in batches.
All subsequent jobs do the same: initial full table sync, update all records.
Question 1: Why is this the case? Same configuration, same database; every other table - with thousands of rows, not even a million - has been running on log-based replication for months.
In the meantime, the WAL size on the source grows. pg_wal was huge (50 GB), and judging by disk usage over time, the growth started right after I included the above table in the tap config.
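For reference, I think the pg_wal size can also be tracked from inside the database with something like this (it needs superuser or the pg_monitor role, and pg_ls_waldir() only exists on PostgreSQL 10+, so take it as a sketch):

-- total size of the files currently sitting in pg_wal
SELECT pg_size_pretty(sum(size)) AS pg_wal_size
FROM pg_ls_waldir();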
I am no expert here, but I assume Postgres didn't flush the logs because they haven't been consumed by any client yet. When we create a replication slot on PostgreSQL we use select pg_create_logical_replication_slot('pipelinewise_<database_name>', 'wal2json');. It's defined per database, not per table, right?
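Since the slot is per database, I'd expect a query along these lines to show whether it is the slot that's pinning the old WAL (retained_wal is just the distance between the current LSN and the slot's restart_lsn; again assuming PostgreSQL 10+):

-- how much WAL each replication slot is still holding back
SELECT slot_name,
       active,
       restart_lsn,
       confirmed_flush_lsn,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots;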
Question 2: Before I included the table in the tap config, wasn't Postgres already generating WAL for that table? Or did it only start writing WAL for that table after some client (tap-postgres) tried to read it?
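My (possibly wrong) understanding is that with wal_level = logical the server writes WAL for every table in the database regardless of what the tap selects, and the slot only controls how long that WAL is retained; at least these show the setting and that the WAL position keeps advancing:

SHOW wal_level;               -- has to be 'logical' for wal2json replication slots
SELECT pg_current_wal_lsn();  -- moves forward whenever any table changes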
janis_puris
07/18/2023, 4:10 PM
janis_puris
07/18/2023, 4:10 PM
tap-postgres utilises the publications, hence I could be way off here :3
mert_bakir
07/18/2023, 4:25 PM
mert_bakir
07/25/2023, 11:56 AM
plugins:
  extractors:
    - name: tap-postgres--view-01
      inherit_from: tap-postgres--select-schema
      config:
        ssl: false
        filter_schemas: public
        default_replication_method: LOG_BASED
        logical_poll_total_seconds: 600
        max_run_seconds: 1500
        break_at_end_lsn: true
plugins:
  extractors:
    - name: tap-postgres--select-schema
      inherit_from: tap-postgres
      select:
        - public-table1.*
        - public-table2.*
        - public-table3.*
This was my configuration. The select list had hundreds of tables, and it had been running LOG_BASED replication for months.
Then I added another table, table-x, and the new table never started running log-based replication; it always repeated the initial full table sync, as I explained in the first post.
Today I figured out that every new table shows the same behaviour (I tried different tables and different source DBs). At each run it logs Beginning sync of stream(public-table1) with sync method(logical_initial).
Then I removed all the existing tables (table-1, table-2, table-3) from the select list and kept only table-x. After performing the full table sync once, it started running log-based replication on the next runs.
Then I added the previously removed tables (table-1, table-2, table-3) back; now they are stuck performing the initial full table sync on every run, while table-x works fine.
I think there is a problem with updating the state. I don't know enough about Meltano to figure this out alone.