# troubleshooting
r
Hello, I am having some difficulty trying to get my head around how to best set up log_based replication.
I have a postgres database with a bunch of tables which I want to dump to parquet. Currently we have been doing this with full table replication, as this was the easiest to set up, but now we are looking to move to incremental updates. I have configured the database to allow for log-based replication and have managed to get it running without any errors. However, this doesn't generate any output (no files are saved). I guess this makes sense as there haven't been any new updates to the database, so would the approach here be to first do a full table replication, then set up log-based replication and have the ETL run with incremental updates?
v
Yes, but we tend to say "log-based updates" rather than "incremental updates", as incremental updates are different: they're a WHERE clause based on a replication key (normally an updated_at field). Normally the progression I see is full table, then incremental, then log-based.
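For reference, key-based incremental replication is usually configured through stream metadata in meltano.yml. A minimal sketch, assuming the tables expose an `updated_at` column that can serve as the replication key:

```yaml
plugins:
  extractors:
  - name: tap-postgres
    variant: meltanolabs
    pip_url: git+https://github.com/MeltanoLabs/tap-postgres.git
    metadata:
      # "*" applies to all streams; use a specific stream name to target one table
      "*":
        replication_method: INCREMENTAL
        replication_key: updated_at
```

Each run then only extracts rows whose `updated_at` is greater than the bookmark saved in state by the previous run.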
r
What would be the ideal way of running this from within Airflow? I've currently had to modify the orchestrate.py file that is autogenerated by the Airflow plugin to work correctly with environments. What I'm looking for is a way of specifying the replication method from the CLI, but it seems as though the following does not work:
```yaml
plugins:
  extractors:
  - name: tap-postgres
    variant: meltanolabs
    pip_url: git+https://github.com/MeltanoLabs/tap-postgres.git
    config:
      user: <user>
      default_replication_method: ${REPLICATION_METHOD}
    metadata:
      "*":
        replication_method: ${REPLICATION_METHOD}
        replication_key: _sdc_lsn
```
Looking for some tips on what would be the best way of running this.
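One alternative to interpolating `${REPLICATION_METHOD}` into meltano.yml: Meltano maps plugin settings to environment variables named `<PLUGIN_NAME>_<SETTING_NAME>`, so, assuming the variant really does honor the `default_replication_method` setting used in the config above, the method could be switched per invocation without templating the file at all (`target-parquet` is a placeholder for whatever loader is in use):

```bash
# Hypothetical invocation: override the setting via Meltano's env var convention
# (TAP_POSTGRES_DEFAULT_REPLICATION_METHOD maps to default_replication_method)
TAP_POSTGRES_DEFAULT_REPLICATION_METHOD=LOG_BASED meltano run tap-postgres target-parquet
```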
v
What I tend to do for most of our stuff is
```yaml
plugins:
  extractors:
  - name: tap-postgres
    variant: meltanolabs
    pip_url: git+https://github.com/MeltanoLabs/tap-postgres.git
    config:
      user: <user>
```
As it just defaults to full table and we're done. Is your question about how you should do this for log-based / incremental? If so, are you only doing it for some tables and not others?
r
I think I might still need to wrap my head around how we would get this to work for new tables. We have the data landing in a storage account on Azure, and what I would like is to be able to easily configure it such that the first time we run Meltano we get the full table dumped, and afterwards it starts doing log-based updates. Our current setup is full table replication using schedules running on Airflow. What I was thinking is if we could have something in the code that does: if this is the first time running the schedule then do a FULL_TABLE replication, otherwise do LOG_BASED replication.
v
Makes sense. My first question is: are these tables so large that just syncing them via full replication takes too long? If so, does incremental replication during the "day", plus a nightly full table load, solve it for you? Only then would I go to log-based, but I say that just because log-based can have its own issues that are not fun to diagnose, as you're in the internals of the DB, not good ole SQL anymore.
r
Yeah, I think something like that could work quite well. How would I go about defining one schedule that does incremental on a quicker cadence than one that does a full load? I currently have quite a bit of inheritance on top of the above plugin definition (we have 4 postgres databases with the same user and different environments), so it would be great if there were a way of specifying at the schedule level whether the replication method should be FULL_TABLE or LOG_BASED.
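One way a per-schedule replication method could be expressed, sketched under the assumption that Meltano's `inherit_from` plugin inheritance and elt-style schedules fit the existing setup (`target-parquet` again stands in for the actual loader):

```yaml
plugins:
  extractors:
  # base tap-postgres definition as above, plus two inherited flavours
  - name: tap-postgres--log-based
    inherit_from: tap-postgres
    metadata:
      "*":
        replication_method: LOG_BASED
  - name: tap-postgres--full
    inherit_from: tap-postgres
    metadata:
      "*":
        replication_method: FULL_TABLE

schedules:
# frequent log-based sync during the day
- name: hourly-log-based
  extractor: tap-postgres--log-based
  loader: target-parquet
  transform: skip
  interval: '@hourly'
# nightly full reload
- name: nightly-full
  extractor: tap-postgres--full
  loader: target-parquet
  transform: skip
  interval: '@daily'
```

Since each inherited extractor is tracked under its own name, the two schedules keep separate state and shouldn't trample each other's bookmarks.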
v
If you set the replication type to incremental, then when you run the job with `meltano run` you can pass the `--full-refresh` flag (https://docs.meltano.com/reference/command-line-interface#parameters-3), i.e.
`meltano run --full-refresh postgres_snowflake`
And that should do what you're after.
I admittedly don't do a ton of incremental loads (luckily), but it's something we're about to start doing with a few of our larger tables.
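Putting that together, a plausible first-run / steady-state sequence (plugin names here are illustrative):

```bash
# One-off initial load: --full-refresh ignores saved state and syncs everything
meltano run --full-refresh tap-postgres target-parquet

# Subsequent (scheduled) runs resume from the stored state / WAL position
meltano run tap-postgres target-parquet

# To force another full load later, saved state can also be inspected or cleared
meltano state list
meltano state clear <state_id>
```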
r
Thanks for your help, I'll do a deeper dive into this and take your pointers into account! Appreciate it!
v
np