# troubleshooting
**Conner:** Hey all! I am looking at the `metadata` extra here. Unfortunately it looks like I cannot provide the extra if I am providing a catalog manually. Is there a recommended way to accomplish something like that? I have a very large database which takes a long time to discover on the fly, so I want to provide a catalog, but I also don't want to update the catalog to set every single table to `LOG_BASED` every time the catalog needs to be updated or regenerated.
**visch:** Depending on how you're deployed: if I were still set on manually providing the catalog file (I believe right now the catalog file is cached by default for runs, so you may actually be able to use that to your advantage), I would set up a manual CI job that generates a new catalog file and then shoves up a new MR with the updated catalog to the repo, and I'd have Meltano generate it.
**Conner:** @visch What do you mean by "shove up a new MR"?
**visch:** Sorry, I'm spitballing here. It'd look something like this (see the sketch below):
1. Manual flow triggered
2. `meltano invoke tap-name-catalog-generator --dump=catalog > input_catalog.json`
3. `git add input_catalog.json && git commit -m "Automated commit: Catalog now matches database metadata"`
4. `git push`
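Something like this as a GitLab CI job, maybe (just a sketch, since I don't know your setup; the job name, image, bot identity, and the `merge_request.create` push option are all assumptions on my part, and the runner would need push access to the repo):

```yaml
# .gitlab-ci.yml (sketch) - manually triggered job that refreshes the catalog
refresh-catalog:
  image: python:3.11
  rules:
    - when: manual            # only runs when triggered by hand
  script:
    - pip install meltano
    - meltano install
    - meltano invoke tap-name-catalog-generator --dump=catalog > input_catalog.json
    - git config user.email "catalog-bot@example.com"
    - git config user.name "catalog-bot"
    - git checkout -b catalog-refresh-$CI_PIPELINE_ID
    - git add input_catalog.json
    - 'git commit -m "Automated commit: Catalog now matches database metadata"'
    # this push option asks GitLab to open an MR for the new branch
    - git push origin HEAD -o merge_request.create
```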
**Conner:** @visch Got it, but that wouldn't fix the issue of getting the `metadata` extra applied to the provided catalog.
**visch:** How wouldn't it? Could you share your meltano.yml? The assumption I'm making is that the catalog you are providing can be generated with Meltano; you just don't want to because discovery takes too long for you right now.
**Conner:** @visch Unfortunately it does not generate the correct catalog, since it does not contain the `replication-method`. The generated catalog looks something like this:
**visch:** You can have Meltano put in the correct replication method with `metadata`. Can you share your meltano.yml?
**Conner:** @visch:

```yaml
version: 1
default_environment: prod
project_id: 4172a1e0-ae95-4520-b408-050067e78d09
environments:
  - name: prod
    state_id_suffix: prod
plugins:
  extractors:
    - name: tap-mssql
      variant: wintersrd
      pip_url: tap-mssql

    - name: tap-mssql--ptfm_tmsn
      inherit_from: tap-mssql
      catalog: extract/ptfm_tmsn.json
      config:
        database: PTFM_TMSN
        use_date_datatype: true
        use_singer_decimal: true
        cursor_array_size: 10000

  loaders:
    - name: target-postgres
      variant: meltanolabs
      pip_url: meltanolabs-target-postgres

    - name: target-jsonl
      variant: andyh1203
      pip_url: target-jsonl
  utilities:
    - name: dagster
      variant: quantile-development
      pip_url: dagster-ext dagster-postgres dagster-dbt grpcio<1.65
      commands:
        start-dagster:
          args: dev -f $REPOSITORY_DIR/meltano_jobs.py -h 0.0.0.0 -d $REPOSITORY_DIR
          executable: dagster_invoker
```

**visch:**

```yaml
    - name: tap-mssql
      variant: wintersrd
      pip_url: tap-mssql
      metadata:
        "*":
          replication-method: LOG_BASED
```

Right? Then this would work, correct?
**Conner:** @visch You can't use that if you're also providing a catalog manually, which led me to my original question.

**Conner:** @visch Sorry, I really appreciate the help. Didn't mean to come across as combative! Unfortunately I am a bit lost, since I cannot use config like this:

```yaml
    - name: tap-mssql
      variant: wintersrd
      pip_url: tap-mssql
      metadata:
        "*":
          replication-method: LOG_BASED
```

If I am also providing a catalog generated with a command like this: `meltano invoke tap-name-catalog-generator --dump=catalog > input_catalog.json`.
When you use a catalog (either in meltano.yml or with `--catalog`), anything in the `metadata` extra is ignored. This is also in the documentation: "These rules are not applied when a catalog is provided manually."
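So as far as I can tell, the replication method would have to be baked into the provided catalog file itself, with each stream carrying it in its top-level metadata entry. Roughly like this (just a hand-written sketch, not my real catalog; the stream name and empty schema are placeholders):

```json
{
  "streams": [
    {
      "tap_stream_id": "dbo-example_table",
      "table_name": "example_table",
      "schema": { "type": "object", "properties": {} },
      "metadata": [
        {
          "breadcrumb": [],
          "metadata": {
            "selected": true,
            "replication-method": "LOG_BASED"
          }
        }
      ]
    }
  ]
}
```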
v
but tap-mssql doesn't have a catalog, right? Just the tap-mssql--ptfm_tmsn has one. The point is you can generate your manual one without much issue. If you're really worried about it being in the "chain" then you can make another extractor and do the same thing (or just make another inheirt from tap-mssql-catalog-generator
c
No it doesn't have a catalog, but I'm not sure what having the metadata extra at that level is meant to do? Sorry if I am missing something. But the goal is to be able to supply the generated catalog for
tap-mssql--ptfm_tmsn
and also be able to make sure the replication is LOG_BASED, despite it not being generated that way.
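The only workaround I can think of, purely as a sketch, is to post-process the generated catalog and force the replication method onto every stream before committing it, e.g. with jq (using the placeholder tap name from earlier; the output path is just my catalog file from meltano.yml):

```bash
# Hypothetical post-processing step: set LOG_BASED on every stream's top-level metadata entry
meltano invoke tap-name-catalog-generator --dump=catalog \
  | jq '(.streams[].metadata[] | select(.breadcrumb == []) | .metadata."replication-method") = "LOG_BASED"' \
  > extract/ptfm_tmsn.json
```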
**Edgar:** Hey @Conner Panarella! So a few comments and questions about your workflow:
1. Does `meltano invoke tap-name-catalog-generator --dump=catalog` generate a correct catalog, or do you apply manual changes to the output?
2. "I have a very large database which takes a long time to discover on-the-fly": that shouldn't be a problem if you're using `meltano run`, which will automatically cache the catalog. A `--refresh-catalog` CLI option will be shipped in an upcoming 3.5.0 release (currently in alpha), so you'll be able to update this cached catalog whenever you know things changed upstream (rough example below).
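Concretely, that would look something like this (a rough sketch using the extractor and loader names from your meltano.yml; the `--refresh-catalog` flag assumes the 3.5.0 behavior described above):

```bash
# meltano run reuses the catalog cached under .meltano/ between invocations
meltano run tap-mssql--ptfm_tmsn target-postgres

# Once 3.5.0 is out: force a fresh discovery when the upstream schema has changed
meltano run --refresh-catalog tap-mssql--ptfm_tmsn target-postgres
```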
**Conner:** @Edgar Ramírez (Arch.dev) 1. No, the catalog is missing `replication-method`, so that must be set to `LOG_BASED`. 2. Got it. This is all related to the fact that `tap-mssql` starts failing once columns are added to a source database. Perhaps I should reach out to them to see what the recommended way to handle this is.
**Edgar:** The first point is definitely weird. Meltano should apply the catalog changes regardless of whether the tap will respect the overrides. I can confirm it does. Can you confirm that adding `metadata`

```yaml
    - name: tap-mssql
      variant: wintersrd
      pip_url: tap-mssql
      metadata:
        "*":
          replication-method: LOG_BASED
```

and running

```
meltano invoke tap-name-catalog-generator --dump=catalog
```

doesn't add the replication method to the generated catalog? I know your workflow requires you to supply a pre-saved catalog, but I want to double-check my assumptions. Also, do you run Meltano in an environment with ephemeral storage, where the cached catalog saved to `.meltano/tap-mssql/tap.properties.json` would be lost for the next run? PS: I think tap-mysql has a similar CDC implementation and it wasn't applying schema overrides: https://github.com/transferwise/pipelinewise-tap-mysql/pull/186. But that's not the problem you're having.
**Conner:** @Edgar Ramírez (Arch.dev) Thank you for the explanation! Very insightful! Maybe I've done a poor job of explaining the issue I am trying to solve, so let me take a step back. I have a large MS SQL database I want to replicate using `tap-mssql`, which supports CDC. This database can have columns added to tables on a semi-regular basis. I would like to avoid doing a full refresh (which is a necessity when a column is added, I think) until it is manually initiated, since these are large tables with many records.
I thought that by using a static catalog I could avoid Meltano / `tap-mssql` trying to replicate the newly added column before the CDC was re-initialized to include that column.
But I did test the command you provided, instead of the one I was using:
`meltano invoke --dump=catalog tap-mssql--au_tdc > test.json`
vs.
`meltano invoke tap-mssql--au_tdc --discover > test.json`
The first one included the `metadata` extra in meltano.yml as expected! The second did not.
👍 2
**Edgar:**
> I have a large MS SQL database I want to replicate using `tap-mssql`, which supports CDC. This database can have columns added to tables on a semi-regular basis. I would like to avoid doing a full refresh (which is a necessity when a column is added, I think) until it is manually initiated, since these are large tables with many records.

Ah gotcha, that makes sense.