Conner Panarella
07/12/2024, 1:54 PM
metadata extra here. Unfortunately it looks like I cannot provide the extra if I am providing a catalog manually. Is there a recommended way to accomplish something like that? I have a very large database which takes a long time to discover on-the-fly, so I want to provide a catalog, but I also don't want to update the catalog to set every single table to LOG_BASED every time the catalog needs to be updated or regenerated.
visch
07/12/2024, 2:54 PM
Conner Panarella
07/12/2024, 3:05 PM
visch
07/12/2024, 3:09 PM
meltano invoke tap-name-catalog-generator --dump=catalog > input_catalog.json
3. git add input_catalog.json
   git commit -m "Automated commit: Catalog now matches database metadata"
4. git push
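Editorial sketch (not from the thread): if the regenerated catalog does not carry the desired replication method, one option is to post-process it before the commit step. The snippet below assumes the standard Singer catalog layout, where stream-level metadata is the entry with an empty breadcrumb. The function name force_replication_method and the stream name dbo-orders are hypothetical.

```python
import json


def force_replication_method(catalog, method="LOG_BASED"):
    """Set the stream-level replication-method on every stream (in place)."""
    for stream in catalog.get("streams", []):
        entries = stream.setdefault("metadata", [])
        # In a Singer catalog, stream-level metadata is the entry
        # whose breadcrumb is an empty list.
        root = next((e for e in entries if e.get("breadcrumb") == []), None)
        if root is None:
            # No stream-level entry yet: create one.
            root = {"breadcrumb": [], "metadata": {}}
            entries.append(root)
        root.setdefault("metadata", {})["replication-method"] = method
    return catalog


# Tiny in-memory catalog standing in for input_catalog.json
# (hypothetical stream name, for illustration only).
example = {
    "streams": [
        {
            "tap_stream_id": "dbo-orders",
            "metadata": [
                {"breadcrumb": [], "metadata": {"selected": True}},
                {"breadcrumb": ["properties", "id"], "metadata": {"selected": True}},
            ],
        }
    ]
}
print(json.dumps(force_replication_method(example), indent=2))
```

To apply this to a real file, load it with json.load, pass the result through the function, and write it back with json.dump before the git add step. Column-level entries are left untouched.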
Conner Panarella
07/12/2024, 3:11 PM
visch
07/12/2024, 3:11 PM
visch
07/12/2024, 3:12 PM
Conner Panarella
07/12/2024, 3:16 PM
Conner Panarella
07/12/2024, 3:20 PM
visch
07/12/2024, 5:40 PM
Conner Panarella
07/12/2024, 6:03 PM
version: 1
default_environment: prod
project_id: 4172a1e0-ae95-4520-b408-050067e78d09
environments:
- name: prod
  state_id_suffix: prod
plugins:
  extractors:
  - name: tap-mssql
    variant: wintersrd
    pip_url: tap-mssql
  - name: tap-mssql--ptfm_tmsn
    inherit_from: tap-mssql
    catalog: extract/ptfm_tmsn.json
    config:
      database: PTFM_TMSN
      use_date_datatype: true
      use_singer_decimal: true
      cursor_array_size: 10000
  loaders:
  - name: target-postgres
    variant: meltanolabs
    pip_url: meltanolabs-target-postgres
  - name: target-jsonl
    variant: andyh1203
    pip_url: target-jsonl
  utilities:
  - name: dagster
    variant: quantile-development
    pip_url: dagster-ext dagster-postgres dagster-dbt grpcio<1.65
    commands:
      start-dagster:
        args: dev -f $REPOSITORY_DIR/meltano_jobs.py -h 0.0.0.0 -d $REPOSITORY_DIR
        executable: dagster_invoker
visch
07/12/2024, 6:23 PM
- name: tap-mssql
  variant: wintersrd
  pip_url: tap-mssql
  metadata:
    "*":
      replication-method: LOG_BASED
right? Then this would work, correct?
Conner Panarella
07/12/2024, 6:33 PM
visch
07/12/2024, 6:45 PM
Conner Panarella
07/12/2024, 6:51 PM
- name: tap-mssql
  variant: wintersrd
  pip_url: tap-mssql
  metadata:
    "*":
      replication-method: LOG_BASED
If I am also providing a catalog generated with a command like this: meltano invoke tap-name-catalog-generator --dump=catalog > input_catalog.json
When you use a catalog (either in the meltano.yaml or with --catalog), anything in the metadata extra in meltano.yaml is ignored. This is also in the documentation:
"These rules are not applied when a catalog is provided manually."
visch
07/12/2024, 6:53 PM
Conner Panarella
07/12/2024, 7:13 PM
tap-mssql--ptfm_tmsn and also be able to make sure the replication is LOG_BASED, despite it not being generated that way.
Edgar Ramírez (Arch.dev)
07/12/2024, 8:09 PM
1. Does meltano invoke tap-name-catalog-generator --dump=catalog generate a correct catalog or do you apply manual changes to the output?
2. "I have a very large database which takes a long time to discover on-the-fly". That shouldn't be a problem if you're using meltano run, which will automatically cache the catalog. A --refresh-catalog CLI option will be shipped in an upcoming 3.5.0 release (currently in alpha), so you'll be able to update this cached catalog whenever you know things changed upstream.
Conner Panarella
07/12/2024, 8:42 PM
replication-method so that must be set to LOG_BASED
2. Got it, this is all related to the fact that tap-mssql starts failing once columns are added to a source database. Perhaps I should reach out to them to see what the recommended way to handle this is.
Edgar Ramírez (Arch.dev)
07/12/2024, 8:58 PM
metadata
- name: tap-mssql
  variant: wintersrd
  pip_url: tap-mssql
  metadata:
    "*":
      replication-method: LOG_BASED
and running
meltano invoke tap-name-catalog-generator --dump=catalog
doesn't add the replication method to the generated catalog.
I know your workflow requires you to supply a pre-saved catalog, but I want to double-check my assumptions.
Also, do you run Meltano in an environment with ephemeral storage where the cached catalog saved to .meltano/tap-mssql/tap.properties.json would be lost for the next run?
PS: I think tap-mysql has a similar cdc implementation and it wasn't applying schema overrides: https://github.com/transferwise/pipelinewise-tap-mysql/pull/186. But that's not the problem you're having.
Conner Panarella
07/15/2024, 7:56 PM
I have a large MS SQL database I want to replicate using tap-mssql which supports cdc. This database can have columns added to tables on a semi-regular basis. I would like to avoid doing a full refresh (which is a necessity when a column is added, I think) until it is manually initiated since these are large tables with many records.
Conner Panarella
07/15/2024, 8:02 PM
tap-mssql trying to replicate the newly added column before the cdc was re-initialized to include that column.
Conner Panarella
07/15/2024, 8:07 PM
meltano invoke --dump=catalog tap-mssql--au_tdc > test.json
vs.
meltano invoke tap-mssql--au_tdc --discover > test.json
The first one included the metadata extra in meltano.yml as expected! The second did not.
Edgar Ramírez (Arch.dev)
07/15/2024, 9:57 PM
"I have a large MS SQL database I want to replicate using tap-mssql which supports cdc. This database can have columns added to tables on a semi-regular basis. I would like to avoid doing a full refresh (which is a necessity when a column is added, I think) until it is manually initiated since these are large tables with many records."
Ah gotcha, that makes sense
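Editorial sketch (not from the thread): the difference between the two outputs above (--dump=catalog vs. --discover) can be checked mechanically by listing the streams whose stream-level metadata lacks the expected replication method. This assumes the standard Singer catalog layout, with stream-level metadata stored under an empty breadcrumb. The function name streams_without_method and the sample stream names are hypothetical.

```python
from typing import List


def streams_without_method(catalog: dict, method: str = "LOG_BASED") -> List[str]:
    """Return the tap_stream_id of every stream not set to the given method."""
    missing = []
    for stream in catalog.get("streams", []):
        # Stream-level metadata lives in the entry with an empty breadcrumb.
        root = next(
            (e.get("metadata", {}) for e in stream.get("metadata", [])
             if e.get("breadcrumb") == []),
            {},
        )
        if root.get("replication-method") != method:
            missing.append(stream.get("tap_stream_id", "<unknown>"))
    return missing


# Two small in-memory catalogs standing in for the two test.json outputs
# (hypothetical stream names, for illustration only).
with_rules = {"streams": [
    {"tap_stream_id": "dbo-orders",
     "metadata": [{"breadcrumb": [], "metadata": {"replication-method": "LOG_BASED"}}]},
]}
without_rules = {"streams": [
    {"tap_stream_id": "dbo-orders",
     "metadata": [{"breadcrumb": [], "metadata": {}}]},
]}

print(streams_without_method(with_rules))     # prints []
print(streams_without_method(without_rules))  # prints ['dbo-orders']
```

An empty list means every stream in that catalog already carries the expected method at the stream level.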