# troubleshooting
**Conner:** Hey all! I am looking at the `metadata` extra here. Unfortunately it looks like I cannot provide the extra if I am providing a catalog manually. Is there a recommended way to accomplish something like that? I have a very large database which takes a long time to discover on the fly, so I want to provide a catalog, but I also don't want to update the catalog to set every single table to `LOG_BASED` every time the catalog needs to be updated or regenerated.
**visch:** Depending on how you're deployed: if I were still set on manually providing the catalog file (I believe right now the catalog file is cached by default for runs, so you may actually be able to use that to your advantage), I would set up a manual CI job that generates a new catalog file and then shoves up a new MR with the updated catalog to the repo, and I'd have Meltano generate it.
**Conner:** @visch What do you mean by "shove up a new MR"?
**visch:** Sorry, I'm spitballing here. It'd look something like this (see the sketch below):
1. Manual flow triggered
2. `meltano invoke tap-name-catalog-generator --dump=catalog > input_catalog.json`
3. `git add input_catalog.json && git commit -m "Automated commit: Catalog now matches database metadata"`
4. `git push`
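Something like this as a GitLab CI job, maybe (just a sketch, since I don't know your setup; the job name, image, bot identity, and the `merge_request.create` push option are all assumptions on my part, and the runner would need push access to the repo):

```yaml
# .gitlab-ci.yml (sketch) - manually triggered job that refreshes the catalog
refresh-catalog:
  image: python:3.11
  rules:
    - when: manual            # only runs when triggered by hand
  script:
    - pip install meltano
    - meltano install
    - meltano invoke tap-name-catalog-generator --dump=catalog > input_catalog.json
    - git config user.email "catalog-bot@example.com"
    - git config user.name "catalog-bot"
    - git checkout -b catalog-refresh-$CI_PIPELINE_ID
    - git add input_catalog.json
    - 'git commit -m "Automated commit: Catalog now matches database metadata"'
    # this push option asks GitLab to open an MR for the new branch
    - git push origin HEAD -o merge_request.create
```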
**Conner:** @visch Got it, but that wouldn't fix the issue of getting the `metadata` extra applied to the provided catalog.
**visch:** How wouldn't it? Could you share your meltano.yml? The assumption I'm making is that the catalog you are providing can be generated with Meltano; you just don't want to because discovery takes too long for you right now.
**Conner:** @visch Unfortunately it does not generate the correct catalog, since it does not contain the `replication-method`. The generated catalog looks something like this:
**visch:** You can have Meltano put in the correct replication method with `metadata`. Can you share your meltano.yml?
**Conner:** @visch:

```yaml
version: 1
default_environment: prod
project_id: 4172a1e0-ae95-4520-b408-050067e78d09
environments:
  - name: prod
    state_id_suffix: prod
plugins:
  extractors:
    - name: tap-mssql
      variant: wintersrd
      pip_url: tap-mssql

    - name: tap-mssql--ptfm_tmsn
      inherit_from: tap-mssql
      catalog: extract/ptfm_tmsn.json
      config:
        database: PTFM_TMSN
        use_date_datatype: true
        use_singer_decimal: true
        cursor_array_size: 10000

  loaders:
    - name: target-postgres
      variant: meltanolabs
      pip_url: meltanolabs-target-postgres

    - name: target-jsonl
      variant: andyh1203
      pip_url: target-jsonl
  utilities:
    - name: dagster
      variant: quantile-development
      pip_url: dagster-ext dagster-postgres dagster-dbt grpcio<1.65
      commands:
        start-dagster:
          args: dev -f $REPOSITORY_DIR/meltano_jobs.py -h 0.0.0.0 -d $REPOSITORY_DIR
          executable: dagster_invoker
```

**visch:**

```yaml
    - name: tap-mssql
      variant: wintersrd
      pip_url: tap-mssql
      metadata:
        "*":
          replication-method: LOG_BASED
```

Right? Then this would work, correct?
**Conner:** @visch You can't use that if you're also providing a catalog manually, which led me to my original question.

**Conner:** @visch Sorry, I really appreciate the help. Didn't mean to come across as combative! Unfortunately I am a bit lost, since I cannot use config like this:

```yaml
    - name: tap-mssql
      variant: wintersrd
      pip_url: tap-mssql
      metadata:
        "*":
          replication-method: LOG_BASED
```

If I am also providing a catalog generated with a command like this: `meltano invoke tap-name-catalog-generator --dump=catalog > input_catalog.json`.
When you use a catalog (either in meltano.yml or with `--catalog`), anything in the `metadata` extra is ignored. This is also in the documentation: "These rules are not applied when a catalog is provided manually."
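So as far as I can tell, the replication method would have to be baked into the provided catalog file itself, with each stream carrying it in its top-level metadata entry. Roughly like this (just a hand-written sketch, not my real catalog; the stream name and empty schema are placeholders):

```json
{
  "streams": [
    {
      "tap_stream_id": "dbo-example_table",
      "table_name": "example_table",
      "schema": { "type": "object", "properties": {} },
      "metadata": [
        {
          "breadcrumb": [],
          "metadata": {
            "selected": true,
            "replication-method": "LOG_BASED"
          }
        }
      ]
    }
  ]
}
```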
v
but tap-mssql doesn't have a catalog, right? Just the tap-mssql--ptfm_tmsn has one. The point is you can generate your manual one without much issue. If you're really worried about it being in the "chain" then you can make another extractor and do the same thing (or just make another inheirt from tap-mssql-catalog-generator
c
No it doesn't have a catalog, but I'm not sure what having the metadata extra at that level is meant to do? Sorry if I am missing something. But the goal is to be able to supply the generated catalog for
tap-mssql--ptfm_tmsn
and also be able to make sure the replication is LOG_BASED, despite it not being generated that way.
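The only workaround I can think of, purely as a sketch, is to post-process the generated catalog and force the replication method onto every stream before committing it, e.g. with jq (using the placeholder tap name from earlier; the output path is just my catalog file from meltano.yml):

```bash
# Hypothetical post-processing step: set LOG_BASED on every stream's top-level metadata entry
meltano invoke tap-name-catalog-generator --dump=catalog \
  | jq '(.streams[].metadata[] | select(.breadcrumb == []) | .metadata."replication-method") = "LOG_BASED"' \
  > extract/ptfm_tmsn.json
```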
**Edgar:** Hey @Conner Panarella! So a few comments and questions about your workflow:
1. Does `meltano invoke tap-name-catalog-generator --dump=catalog` generate a correct catalog, or do you apply manual changes to the output?
2. "I have a very large database which takes a long time to discover on-the-fly": that shouldn't be a problem if you're using `meltano run`, which will automatically cache the catalog. A `--refresh-catalog` CLI option will be shipped in an upcoming 3.5.0 release (currently in alpha), so you'll be able to update this cached catalog whenever you know things changed upstream (rough example below).
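Concretely, that would look something like this (a rough sketch using the extractor and loader names from your meltano.yml; the `--refresh-catalog` flag assumes the 3.5.0 behavior described above):

```bash
# meltano run reuses the catalog cached under .meltano/ between invocations
meltano run tap-mssql--ptfm_tmsn target-postgres

# Once 3.5.0 is out: force a fresh discovery when the upstream schema has changed
meltano run --refresh-catalog tap-mssql--ptfm_tmsn target-postgres
```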
**Conner:** @Edgar Ramírez (Arch.dev) 1. No, the catalog is missing `replication-method`, so that must be set to `LOG_BASED`. 2. Got it. This is all related to the fact that `tap-mssql` starts failing once columns are added to a source database. Perhaps I should reach out to them to see what the recommended way to handle this is.
**Edgar:** The first point is definitely weird. Meltano should apply the catalog changes regardless of whether the tap will respect the overrides. I can confirm it does. Can you confirm that adding `metadata`

```yaml
    - name: tap-mssql
      variant: wintersrd
      pip_url: tap-mssql
      metadata:
        "*":
          replication-method: LOG_BASED
```

and running

```
meltano invoke tap-name-catalog-generator --dump=catalog
```

doesn't add the replication method to the generated catalog? I know your workflow requires you to supply a pre-saved catalog, but I want to double-check my assumptions. Also, do you run Meltano in an environment with ephemeral storage, where the cached catalog saved to `.meltano/tap-mssql/tap.properties.json` would be lost for the next run? PS: I think tap-mysql has a similar CDC implementation and it wasn't applying schema overrides: https://github.com/transferwise/pipelinewise-tap-mysql/pull/186. But that's not the problem you're having.
**Conner:** @Edgar Ramírez (Arch.dev) Thank you for the explanation! Very insightful! Maybe I've done a poor job of explaining the issue I am trying to solve, so let me take a step back. I have a large MS SQL database I want to replicate using `tap-mssql`, which supports CDC. This database can have columns added to tables on a semi-regular basis. I would like to avoid doing a full refresh (which is a necessity when a column is added, I think) until it is manually initiated, since these are large tables with many records.
I thought that by using a static catalog I could avoid Meltano / `tap-mssql` trying to replicate the newly added column before the CDC was re-initialized to include that column.
But I did test the command you provided, instead of the one I was using:
`meltano invoke --dump=catalog tap-mssql--au_tdc > test.json`
vs.
`meltano invoke tap-mssql--au_tdc --discover > test.json`
The first one included the `metadata` extra in meltano.yml as expected! The second did not.
👍 2
**Edgar:**
> I have a large MS SQL database I want to replicate using `tap-mssql`, which supports CDC. This database can have columns added to tables on a semi-regular basis. I would like to avoid doing a full refresh (which is a necessity when a column is added, I think) until it is manually initiated, since these are large tables with many records.

Ah gotcha, that makes sense.