I m finding that when I set TAP NAME CATALOG Meltano no long Meltano #troubleshooting

Hayden Ness

08/27/2024, 6:07 AM

I'm finding that when I set TAP_NAME__CATALOG, Meltano no longer downloads any source data. With the variable set, and the state deleted, I get the following.

Environment 'dev' is active
Reading state from Local Filesystem
No state found for dev:table-test_sysprocompanysb-dbo-testtable-to-target-test_sysprocompanysb.
No state was found, complete import.
Found catalog in /Users/hayden.ness/mltproject/.meltano/run/source-test_sysprocompanysb/tap.properties.json
...spam...
2024-08-27 15:53:44,860 | INFO     | target-s3            | Target 'target-s3' completed reading 1 lines of input (0 schemas, 0 records, 0 batch manifests, 1 state messages).
2024-08-27 15:53:44,861 | INFO     | target-s3            | Emitting completed target state {"currently_syncing": null}
Writing state to Local Filesystem
Incremental state has been updated at 2024-08-27 05:53:44.876604+00:00.
Block run completed.

If I unset the variable, the data is downloaded and written into parquet files in my target as expected. If it matters, I'm using tap-mssql, and target-s3. Is this the intended behaviour? If so, what am I missing?

✅ 1

visch

08/27/2024, 2:39 PM

You're asking what the catalog extra does, I think? https://docs.meltano.com/concepts/plugins/#catalog-extra Yes, if your catalog isn't right, or says not to pull any data, it won't pull any data.

visch

08/27/2024, 2:40 PM

Normally with Meltano you don't provide a catalog; the select / metadata / etc. settings build that for you automatically based on discovery from the tap. But there can be cases where folks want to control it themselves, and if you do, then you have to maintain how that thing is generated.

Edgar Ramírez (Arch.dev)

08/27/2024, 3:36 PM

Yeah +1 to what Derek said. I'm curious what value you're setting TAP_NAME__CATALOG to.

Hayden Ness

08/27/2024, 10:42 PM

My project involves being able to deal with some fairly large sources, and latency is something to be reduced. To that end, I am creating workers/subprocesses to each handle a share of the work. My plugins are structured like this:

plugins:
  - name: tap-mssql
    metadata:
      '*':
        replication-method: LOG_BASED
  - name: source-database_one
    inherit_from: tap-mssql
  - name: source-database_two
    inherit_from: tap-mssql
  - name: source-database_one-table-table_one
    inherit_from: source-database_one
    select:
      - 'dbo-TableOne.*'
      - '!dbo-TableOne.TimeStamp'
  - name: source-database_one-table-table_two
    inherit_from: source-database_one
    select:
      - 'dbo-TableTwo.*'
      - '!dbo-TableTwo.TimeStamp'

I then set SOURCE_DATABASE_ONE__CATALOG=.meltano/run/source-database_one/tap.properties.json. I generate the catalog file with the following (is there a better way to generate this?):

meltano invoke source-database_one --discover > .meltano/run/source-source_database_one/tap.properties.json

I'm doing this because it greatly speeds up how fast my workers start doing useful work, and ideally each worker would have the same source of truth. (Side note: in the above, I was trying one worker per table, which I've found to be slower than many tables per worker due to overhead, but I'll revert that later. I want to avoid the catalog overhead in any case.)

Edgar Ramírez (Arch.dev)

08/27/2024, 11:14 PM

> I generate the catalog file with (Is there a better way to generate this?)

Can you try

meltano invoke --dump=catalog source-database_one > .meltano/run/source-source_database_one/tap.properties.json

👍 1

Edgar Ramírez (Arch.dev)

08/27/2024, 11:15 PM

The difference is this option respects your select patches, instead of giving you the default catalog with everything unselected.
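One way to see which kind of catalog a file is: count the streams it marks as selected. The sketch below assumes the usual Singer catalog layout, where a metadata entry with an empty breadcrumb describes the stream itself and carries a "selected" flag. The helper name and the tiny example catalog are made up for illustration, and some taps may treat a missing flag differently.

```python
import json
from pathlib import Path


def selected_streams(catalog: dict) -> list:
    """Return the IDs of streams whose stream-level metadata says selected: true."""
    selected = []
    for stream in catalog.get("streams", []):
        for entry in stream.get("metadata", []):
            # An empty breadcrumb addresses the stream itself, not one of its properties.
            is_stream_level = entry.get("breadcrumb") == []
            if is_stream_level and entry.get("metadata", {}).get("selected") is True:
                selected.append(stream.get("tap_stream_id", stream.get("stream")))
                break
    return selected


def load_catalog(path: str) -> dict:
    """Read a catalog file such as tap.properties.json."""
    return json.loads(Path(path).read_text())


# Hand-written example catalog (hypothetical, not real tap-mssql output).
example = {
    "streams": [
        {
            "tap_stream_id": "dbo-TableOne",
            "metadata": [{"breadcrumb": [], "metadata": {"selected": True}}],
        },
        {
            "tap_stream_id": "dbo-TableTwo",
            "metadata": [{"breadcrumb": [], "metadata": {"selected": False}}],
        },
    ]
}

print(selected_streams(example))
```

If this returns an empty list for the file passed through the catalog extra, that would be consistent with a run that emits 0 schemas and 0 records.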

Hayden Ness

08/27/2024, 11:25 PM

Hmm, that does work, but Meltano tries to sync everything. i.e. '*'. So it seems my understanding of how the catalog works has been incorrect, and 'select: ' (and presumably *__SELECT) only matters for catalog generation, not afterwards. That allows me to solve my problem by pre-generating a catalog per worker, which will be good enough, so thank you very much for that. Are there any improvements that can be made on top of that?
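Pre-generating a catalog per worker could be scripted roughly as follows. This is only a sketch: it wraps the meltano invoke --dump=catalog command from this thread, while the catalogs/ output directory, the file naming, and the function names are assumptions chosen for illustration.

```python
import subprocess
from pathlib import Path


def dump_catalog_command(plugin_name: str) -> list:
    """Build the CLI call suggested in the thread for dumping a patched catalog."""
    return ["meltano", "invoke", "--dump=catalog", plugin_name]


def worker_catalog_path(plugin_name: str, out_dir: Path) -> Path:
    """Hypothetical naming scheme: one catalog file per worker plugin."""
    return out_dir / f"{plugin_name}.catalog.json"


def write_worker_catalog(plugin_name: str, out_dir: Path) -> Path:
    """Run the dump command and save its stdout as the worker's catalog file."""
    out_dir.mkdir(parents=True, exist_ok=True)
    result = subprocess.run(
        dump_catalog_command(plugin_name),
        check=True,           # raise if meltano exits with a non-zero status
        capture_output=True,  # the catalog is read from stdout
        text=True,
    )
    out_path = worker_catalog_path(plugin_name, out_dir)
    out_path.write_text(result.stdout)
    return out_path
```

Inside a Meltano project, something like write_worker_catalog("source-database_one-table-table_one", Path("catalogs")) would be called once per worker plugin, and each worker's catalog extra would then point at its own file.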

Edgar Ramírez (Arch.dev)

08/28/2024, 1:05 AM

> Hmm, that does work, but Meltano tries to sync everything. i.e. '*'.

That does sound like a bug. Would you like to report a bug? Or, I should probably ask, does that happen during meltano invoke or during meltano run?

> So it seems my understanding of how the catalog works has been incorrect, and 'select: ' (and presumably *__SELECT) only matters for catalog generation, not afterwards.

Pretty much. When you use the catalog extra, all other catalog-patching settings are ignored, yes.

> That allows me to solve my problem by pre-generating a catalog per worker, which will be good enough, so thank you very much for that.

Awesome!

> Are there any improvements that can be made on top of that?

Not at the moment, I think. We could look into ways of making the cached catalogs (and everything Meltano caches in general) more portable. Perhaps with a command for explicitly purging unwanted stuff from the cache. That would make it easy to save and restore the cache for use cases like yours. Another option would be a way to tell Meltano to apply select and related catalog patches on the passed catalog. Maybe a new patch_catalog boolean flag or similar. By all means do create a feature request, and if you're curious, this happens in https://github.com/meltano/meltano/blob/38e895682425b29d4e4041f86d3605d2c6dd4978/src/meltano/core/plugin/singer/tap.py#L565-L571.

Edgar Ramírez (Arch.dev)

08/28/2024, 1:09 AM

Actually, looking at the code there, I wonder if using select_filter might be what you're looking for...

Edgar Ramírez (Arch.dev)

08/28/2024, 1:10 AM

An extractor's select_filter extra holds an array of entity selection filter rules that are applied to the extractor's discovered or provided catalog file when the extractor is run using meltano run, meltano invoke, or meltano elt, after schema, selection, and metadata rules are applied.
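Combining a shared catalog with a per-worker select_filter might look like the sketch below. It assumes that extras are exposed as PLUGIN_NAME__EXTRA environment variables (as with TAP_NAME__CATALOG earlier in this thread) and that array values are passed as JSON strings. The helper names and the choice of entity are illustrative only.

```python
import json
import os
from typing import Optional


def extra_env_var(plugin_name: str, extra: str) -> str:
    """Assumed convention: upper-case the plugin name, map '-' to '_', then add '__EXTRA'."""
    return f"{plugin_name.upper().replace('-', '_')}__{extra.upper()}"


def worker_env(plugin_name: str, catalog_path: str, entities: list,
               base_env: Optional[dict] = None) -> dict:
    """Return an environment that sets the catalog and select_filter extras for one worker."""
    env = dict(os.environ if base_env is None else base_env)
    env[extra_env_var(plugin_name, "catalog")] = catalog_path
    # select_filter is an array, so it is serialized as a JSON list (assumption).
    env[extra_env_var(plugin_name, "select_filter")] = json.dumps(entities)
    return env


# base_env={} keeps the example output small; a real worker would inherit os.environ.
env = worker_env(
    "source-database_one",
    ".meltano/run/source-database_one/tap.properties.json",
    ["dbo-TableOne"],
    base_env={},
)
print(env)
```

The resulting dictionary could be passed as env= to subprocess.run(["meltano", "run", ...]) for each worker, leaving base_env unset so that PATH and the rest of the parent environment are inherited. The filter rules name whole entities (tables), not individual attributes.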

Hayden Ness

08/28/2024, 2:21 AM

So the select_filter option does allow me to load in the table as needed, so that's good to know. Unfortunately, tap-mssql crashes with a parse error on certain columns, and select_filter doesn't work for attributes. In general though, it's quite possible we will end up selecting only the columns we need to be transferred, so it won't be suitable for us.

> That does sound like a bug. Would you like to report a bug? Or, I should probably ask, does that happen during meltano invoke or during meltano run?

I've only been using run; I haven't really explored the use of invoke or el yet.

> Not at the moment, I think. We could look into ways of making the cached catalogs (and everything...

I'm not sure what the right approach is for most people or for Meltano's philosophy, but for my specific (but broad) use case (get data from here and put it over there), all I would really need is for Meltano to cache the database schema (an expensive and tap-dependent operation, and to enable * selection), and to record the state. I think I've really only had problems with anything beyond that. I should be able to find some time in a few weeks to write up git issues on problematic behaviours. I can manage without additional features for now, but I'm hoping the python API can address many more problems eloquently.
