# troubleshooting
r
Hello, I am having some difficulty trying to get my head around how to best set up log_based replication.
I have a postgres database with a bunch of tables which I want to dump to parquet. Currently we have been doing this with full table replication, as this was the easiest to set up, but now we are looking to move to incremental updates. I have configured the database to allow for log-based replication and have managed to get it running without any errors. However, this doesn't generate any output (no files are saved). I guess this makes sense as there haven't been any new updates to the database, so would the approach here be to first do a full table replication, then set up log-based replication and have the ETL run with incremental updates?
v
Yes, but we tend to say "log-based updates" rather than "incremental updates", as incremental updates are different: they're a WHERE clause based on a replication key (normally an updated_at field). Normally the progression I see is full table, then incremental, then log-based.
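For reference, key-based incremental replication is usually configured through stream metadata in meltano.yml. A minimal sketch, assuming the tables expose an `updated_at` column that can serve as the replication key:

```yaml
plugins:
  extractors:
  - name: tap-postgres
    variant: meltanolabs
    pip_url: git+https://github.com/MeltanoLabs/tap-postgres.git
    metadata:
      # "*" applies to all streams; use a specific stream name to target one table
      "*":
        replication_method: INCREMENTAL
        replication_key: updated_at
```

Each run then only extracts rows whose `updated_at` is greater than the bookmark saved in state by the previous run.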
r
What would be the ideal way of running this from within Airflow? I've currently had to modify the orchestrate.py file that is autogenerated by the Airflow plugin to work correctly with environments. What I'm looking for is a way of specifying the replication method from the CLI, but it seems as though the following does not work:
```yaml
plugins:
  extractors:
  - name: tap-postgres
    variant: meltanolabs
    pip_url: git+https://github.com/MeltanoLabs/tap-postgres.git
    config:
      user: <user>
      default_replication_method: ${REPLICATION_METHOD}
    metadata:
      "*":
        replication_method: ${REPLICATION_METHOD}
        replication_key: _sdc_lsn
```
Looking for some tips on what would be the best way of running this.
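One alternative to interpolating `${REPLICATION_METHOD}` into meltano.yml: Meltano maps plugin settings to environment variables named `<PLUGIN_NAME>_<SETTING_NAME>`, so, assuming the variant really does honor the `default_replication_method` setting used in the config above, the method could be switched per invocation without templating the file at all (`target-parquet` is a placeholder for whatever loader is in use):

```bash
# Hypothetical invocation: override the setting via Meltano's env var convention
# (TAP_POSTGRES_DEFAULT_REPLICATION_METHOD maps to default_replication_method)
TAP_POSTGRES_DEFAULT_REPLICATION_METHOD=LOG_BASED meltano run tap-postgres target-parquet
```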
v
What I tend to do for most of our stuff is
```yaml
plugins:
  extractors:
  - name: tap-postgres
    variant: meltanolabs
    pip_url: git+https://github.com/MeltanoLabs/tap-postgres.git
    config:
      user: <user>
```
As it just defaults to full table and we're done. Is your question about how you should do this for log-based / incremental? If so, are you only doing it for some tables and not others?
r
I think I might still need to wrap my head around how we would get this to work for new tables. We have the data landing in a storage account on Azure, and what I would like is to be able to easily configure it such that the first time we run Meltano we get the full table dumped, and afterwards it starts doing log-based updates. Our current setup is full table replication using schedules running on Airflow. What I was thinking is if we could have something in the code that does: if this is the first time running the schedule then do a FULL_TABLE replication, otherwise do LOG_BASED replication.
v
Makes sense. My first question is: are these tables so large that just syncing them via full replication takes too long? If so, does incremental replication during the "day", plus a nightly full table load, solve it for you? Only then would I go to log-based, but I say that just because log-based can have its own issues that are not fun to diagnose, as you're in the internals of the DB, not good ole SQL anymore.
r
Yeah, I think something like that could work quite well. How would I go about defining one schedule that does incremental on a quicker cadence than one that does a full load? I currently have quite a bit of inheritance on top of the above plugin definition (we have 4 postgres databases with the same user and different environments), so it would be great if there were a way of specifying at the schedule level whether the replication method should be FULL_TABLE or LOG_BASED.
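One way a per-schedule replication method could be expressed, sketched under the assumption that Meltano's `inherit_from` plugin inheritance and elt-style schedules fit the existing setup (`target-parquet` again stands in for the actual loader):

```yaml
plugins:
  extractors:
  # base tap-postgres definition as above, plus two inherited flavours
  - name: tap-postgres--log-based
    inherit_from: tap-postgres
    metadata:
      "*":
        replication_method: LOG_BASED
  - name: tap-postgres--full
    inherit_from: tap-postgres
    metadata:
      "*":
        replication_method: FULL_TABLE

schedules:
# frequent log-based sync during the day
- name: hourly-log-based
  extractor: tap-postgres--log-based
  loader: target-parquet
  transform: skip
  interval: '@hourly'
# nightly full reload
- name: nightly-full
  extractor: tap-postgres--full
  loader: target-parquet
  transform: skip
  interval: '@daily'
```

Since each inherited extractor is tracked under its own name, the two schedules keep separate state and shouldn't trample each other's bookmarks.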
v
If you set the replication type to incremental, then when you run the job with `meltano run` you can pass the `--full-refresh` flag (https://docs.meltano.com/reference/command-line-interface#parameters-3), i.e.
`meltano run --full-refresh postgres_snowflake`
And that should do what you're after.
I admittedly don't do a ton of incremental loads (luckily), but it's something we're about to start doing with a few of our larger tables.
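Putting that together, a plausible first-run / steady-state sequence (plugin names here are illustrative):

```bash
# One-off initial load: --full-refresh ignores saved state and syncs everything
meltano run --full-refresh tap-postgres target-parquet

# Subsequent (scheduled) runs resume from the stored state / WAL position
meltano run tap-postgres target-parquet

# To force another full load later, saved state can also be inspected or cleared
meltano state list
meltano state clear <state_id>
```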
r
Thanks for your help, I'll do a deeper dive into this and take your pointers into account! Appreciate it!
v
np