I'm finding that when I set TAP_NAME__CATALOG, Mel...
# troubleshooting
h
I'm finding that when I set TAP_NAME__CATALOG, Meltano no longer downloads any source data. With the variable set, and the state deleted, I get the following.
Environment 'dev' is active
Reading state from Local Filesystem
No state found for dev:table-test_sysprocompanysb-dbo-testtable-to-target-test_sysprocompanysb.
No state was found, complete import.
Found catalog in /Users/hayden.ness/mltproject/.meltano/run/source-test_sysprocompanysb/tap.properties.json
...spam...
2024-08-27 15:53:44,860 | INFO     | target-s3            | Target 'target-s3' completed reading 1 lines of input (0 schemas, 0 records, 0 batch manifests, 1 state messages).
2024-08-27 15:53:44,861 | INFO     | target-s3            | Emitting completed target state {"currently_syncing": null}
Writing state to Local Filesystem
Incremental state has been updated at 2024-08-27 05:53:44.876604+00:00.
Block run completed.
If I unset the variable, the data is downloaded and written into parquet files in my target as expected. If it matters, I'm using tap-mssql, and target-s3. Is this the intended behaviour? If so, what am I missing?
v
I think you're asking what the `catalog` extra does? https://docs.meltano.com/concepts/plugins/#catalog-extra Yes, if your catalog isn't right, or says not to pull any data, it won't pull any data.
Normally with Meltano you don't provide a catalog; the select / metadata / etc. settings build that for you automatically, based on discovery from the tap. But there can be cases where folks want to control it themselves, and if you do, then you have to maintain how that catalog is generated.
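One way to check whether a hand-maintained catalog "says to not pull any data" is to list the streams it marks as selected. The following is a minimal sketch, not part of Meltano: it assumes the file follows the usual Singer catalog layout, where stream-level metadata has an empty `breadcrumb` and carries a `selected` flag. The function names are made up for illustration, and individual taps may treat other keys (such as `selected-by-default`) as selection too.

```python
import json
from pathlib import Path


def selected_streams(catalog: dict) -> list:
    """Return the tap_stream_id of every stream explicitly marked as selected."""
    selected = []
    for stream in catalog.get("streams", []):
        for entry in stream.get("metadata", []):
            # Assumption: stream-level metadata is the entry with an empty breadcrumb.
            is_stream_level = entry.get("breadcrumb") == []
            if is_stream_level and entry.get("metadata", {}).get("selected") is True:
                selected.append(stream.get("tap_stream_id"))
                break
    return selected


def load_catalog(path: str) -> dict:
    """Read a catalog file such as tap.properties.json from disk."""
    return json.loads(Path(path).read_text())
```

Calling `selected_streams(load_catalog(".meltano/run/source-database_one/tap.properties.json"))` and getting an empty list back would be consistent with the "0 schemas, 0 records" line in the log above.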
e
Yeah, +1 to what Derek said. I'm curious what value you're setting `TAP_NAME__CATALOG` to.
h
My project involves being able to deal with some fairly large sources, and latency is something to be reduced. To that end, I am creating workers/subprocesses to each handle a share of the work. My plugins are structured like this:
plugins:
  - name: tap-mssql
    metadata:
      '*':
        replication-method: LOG_BASED
  - name: source-database_one
    inherit_from: tap-mssql
  - name: source-database_two
    inherit_from: tap-mssql
  - name: source-database_one-table-table_one
    inherit_from: source-database_one
    select:
      - 'dbo-TableOne.*'
      - '!dbo-TableOne.TimeStamp'
  - name: source-database_one-table-table_two
    inherit_from: source-database_one
    select:
      - 'dbo-TableTwo.*'
      - '!dbo-TableTwo.TimeStamp'
I then set SOURCE_DATABASE_ONE__CATALOG=.meltano/run/source-database_one/tap.properties.json. I generate the catalog file with the following command (is there a better way to generate this?):
meltano invoke source-database_one --discover > .meltano/run/source-database_one/tap.properties.json
I'm doing this because it greatly reduces the time before my workers start doing useful work, and ideally each worker would have the same source of truth. (Side note: in the above, I was trying one worker per table, which I've found to be slower than many tables per worker due to overhead, but I'll revert that later. I want to avoid the catalog overhead in any case.)
e
> I generate the catalog file with (Is there a better way to generate this?)
Can you try
meltano invoke --dump=catalog source-database_one > .meltano/run/source-database_one/tap.properties.json
The difference is that this option respects your `select` patches, instead of giving you the default catalog with everything unselected.
h
Hmm, that does work, but Meltano tries to sync everything. i.e. '*'. So it seems my understanding of how the catalog works has been incorrect, and 'select: ' (and presumably *__SELECT) only matters for catalog generation, not afterwards. That allows me to solve my problem by pre-generating a catalog per worker, which will be good enough, so thank you very much for that. Are there any improvements that can be made on top of that?
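Pre-generating one catalog per worker could be scripted along these lines. This is a rough sketch rather than anything Meltano provides: the `catalogs/` output directory and all function names are hypothetical, and the environment variable naming (upper-cased plugin name, dashes replaced by underscores, then `__CATALOG`) is inferred from the `SOURCE_DATABASE_ONE__CATALOG` example in this thread. The only Meltano call it relies on is the `meltano invoke --dump=catalog <plugin>` command suggested above.

```python
import subprocess
from pathlib import Path


def dump_catalog_command(plugin_name: str) -> list:
    """Build the command from this thread that dumps a catalog with select rules applied."""
    return ["meltano", "invoke", "--dump=catalog", plugin_name]


def catalog_env_var(plugin_name: str) -> str:
    """Assumed naming pattern for the `catalog` extra's environment variable."""
    return plugin_name.upper().replace("-", "_") + "__CATALOG"


def catalog_path(project_root: Path, plugin_name: str) -> Path:
    """Hypothetical layout: one catalog file per worker plugin."""
    return project_root / "catalogs" / f"{plugin_name}.json"


def write_worker_catalog(project_root: Path, plugin_name: str) -> Path:
    """Run the dump command in the project directory and save its stdout as the catalog."""
    target = catalog_path(project_root, plugin_name)
    target.parent.mkdir(parents=True, exist_ok=True)
    result = subprocess.run(
        dump_catalog_command(plugin_name),
        cwd=project_root,
        check=True,          # raise if Meltano exits with an error
        capture_output=True,
        text=True,
    )
    target.write_text(result.stdout)
    return target
```

Each worker would then be started with the variable named by `catalog_env_var(...)` pointing at its own file. It is probably safest to generate the catalogs while that variable is unset, since a provided catalog appears to bypass the `select` rules.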
e
> Hmm, that does work, but Meltano tries to sync everything. i.e. '*'.
That does sound like a bug. Would you like to report it? Or, I should probably ask, does that happen during `meltano invoke` or during `meltano run`?
> So it seems my understanding of how the catalog works has been incorrect, and 'select: ' (and presumably *__SELECT) only matters for catalog generation, not afterwards.
Pretty much. When you use the `catalog` extra, all other catalog-patching settings are ignored, yes.
> That allows me to solve my problem by pre-generating a catalog per worker, which will be good enough, so thank you very much for that.
Awesome!
> Are there any improvements that can be made on top of that?
Not at the moment, I think. We could look into ways of making the cached catalogs (and everything Meltano caches in general) more portable, perhaps with a command for explicitly purging unwanted stuff from the cache. That would make it easy to save and restore the cache for use cases like yours. Another option would be a way to tell Meltano to apply `select` and related catalog patches on the passed catalog, maybe via a new `patch_catalog` boolean flag or similar. By all means do create a feature request. If you're curious, this happens in https://github.com/meltano/meltano/blob/38e895682425b29d4e4041f86d3605d2c6dd4978/src/meltano/core/plugin/singer/tap.py#L565-L571.
Actually, looking at the code there, I wonder if using `select_filter` might be what you're looking for...
> An extractor's `select_filter` extra holds an array of entity selection filter rules that are applied to the extractor's discovered or provided catalog file when the extractor is run using `meltano run`, `meltano invoke`, or `meltano elt`, after schema, selection, and metadata rules are applied.
h
So the `select_filter` option does allow me to load in the table as needed, so that's good to know. Unfortunately, tap-mssql crashes with a parse error on certain columns, and `select_filter` doesn't work for attributes. In general though, it's quite possible we will end up selecting only the columns we need to be transferred, so it won't be suitable for us.
> That does sound like bug. Would you like to report a bug? Or, I should probably ask, does that happen during `meltano invoke` or during `meltano run`?
I've only been using `run`; I haven't really explored the use of `invoke` or `el` yet.
> Not at the moment, I think. We could look into ways of making the cached catalogs (and everything...
I'm not sure what the right approach is for most people or for Meltano's philosophy, but for my specific (but broad) use case (get data from here and put it over there), all I would really need is for Meltano to cache the database schema (an expensive and tap-dependent operation, and needed to enable * selection) and to record the state. I think I've really only had problems with anything beyond that. I should be able to find some time in a few weeks to write up git issues on the problematic behaviours. I can manage without additional features for now, but I'm hoping the Python API can address many more problems eloquently.
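For the column-level case that `select_filter` does not cover, one possible workaround is to discover once and then derive each worker's catalog from that single file, writing the stream and column selection into the catalog itself. The sketch below is an assumption-laden illustration, not Meltano's own logic: it assumes the standard Singer metadata layout (`breadcrumb` of `[]` for the stream, `["properties", <column>]` for a column), and `apply_selection` with its parameters is a made-up helper. Taps may still emit columns they mark as `inclusion: automatic`, whatever the `selected` flag says.

```python
import copy
from fnmatch import fnmatchcase


def apply_selection(catalog, stream_patterns, excluded_properties=()):
    """Return a copy of `catalog` with only matching streams (minus excluded columns) selected.

    stream_patterns: glob patterns matched against tap_stream_id, e.g. ["dbo-TableOne"].
    excluded_properties: (tap_stream_id, column_name) pairs to deselect.
    """
    result = copy.deepcopy(catalog)  # leave the shared, discovered catalog untouched
    excluded = set(excluded_properties)
    for stream in result.get("streams", []):
        stream_id = stream.get("tap_stream_id", "")
        keep = any(fnmatchcase(stream_id, pattern) for pattern in stream_patterns)
        entries = stream.setdefault("metadata", [])
        has_stream_entry = False
        for entry in entries:
            breadcrumb = entry.get("breadcrumb", [])
            meta = entry.setdefault("metadata", {})
            if not breadcrumb:
                # Stream-level entry: select or deselect the whole stream.
                has_stream_entry = True
                meta["selected"] = keep
            elif len(breadcrumb) == 2 and breadcrumb[0] == "properties":
                # Column-level entry: selected only if the stream is kept and the column is not excluded.
                meta["selected"] = keep and (stream_id, breadcrumb[1]) not in excluded
        if not has_stream_entry:
            entries.append({"breadcrumb": [], "metadata": {"selected": keep}})
    return result
```

For the plugins shown earlier, `apply_selection(full_catalog, ["dbo-TableOne"], [("dbo-TableOne", "TimeStamp")])` would roughly correspond to the `dbo-TableOne.*` and `!dbo-TableOne.TimeStamp` select rules, without re-running discovery for each worker.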