hi how do I access the `settings` and `select` str...
# getting-started
m
hi how do I access the `settings` and `select` streams from meltano.yml in the tap itself? In the tap-mongodb codebase there is a part that hits all collections in the database, but this is unnecessary if I’m only running a select on one collection
for collection in self.database.list_collection_names(authorizedCollections=True, nameOnly=True):
   ...
I was thinking I can add a configuration for the behavior
if self.discovery_mode == 'select':
    collections = <get current selected streams>
else:
    collections = self.database.list_collection_names(authorizedCollections=True, nameOnly=True)
I can force it in a way that’s probably super bad practice and wouldn’t generalize across different env configurations:
selected_collections = yaml.safe_load(open('meltano.yml')).get('plugins').get('extractors')[0].get('select')
r
What variant of `tap-mongodb`?
m
meltano labs
but my question should be for any meltano tap. I just want to pull the `select` values from meltano.yml
r
Yeah, just wanted to know if you were using an SDK-based one or not. In the tap class/stream classes, you can use `self.config` to access setting values and `SQLStream.get_selected_schema` for the selected schema.
m
hmm.. where would this `get_selected_schema` start being available? I’m not sure where to call it? In discover_catalog_entries there is no value. When I look at `self.config` I only see a mapping proxy:
self.config
mappingproxy({'database': <DB>, 'mongodb_connection_string': <db_connection_string>, 'discovery_mode': 'select', 'datetime_conversion': 'datetime', 'prefix': '', 'add_record_metadata': False, 'allow_modify_change_streams': False, 'operation_types': ['create', 'delete', 'insert', 'replace', 'update']})
for i in dir(self):
    print(i)
    
_PluginBase__initialized_at
__abstractmethods__
__annotations__
__class__
__delattr__
__dict__
__dir__
__doc__
__eq__
__format__
__ge__
__getattribute__
__gt__
__hash__
__init__
__init_subclass__
__le__
__lt__
__module__
__ne__
__new__
__reduce__
__reduce_ex__
__repr__
__setattr__
__sizeof__
__str__
__subclasshook__
__weakref__
_abc_impl
_catalog
_catalog_dict
_config
_env_var_config
_get_about_info
_get_mongo_connection_string
_get_mongo_options
_get_package_version
_get_supported_python_versions
_input_catalog
_is_secret_config
_mapper
_reset_state_progress_markers
_set_compatible_replication_methods
_singer_catalog
_state
_streams
_validate_config
append_builtin_config
capabilities
catalog
catalog_dict
catalog_json_text
cb_discover
cb_test
cb_version
cli
config
config_from_cli_args
config_jsonschema
connector
discover_streams
get_plugin_version
get_sdk_version
get_singer_command
get_supported_python_versions
initialized_at
input_catalog
invoke
load_state
load_streams
logger
mapper
metrics_logger
name
package_name
plugin_version
print_about
print_version
run_connection_test
run_discovery
run_sync_dry_run
sdk_version
setup_mapper
state
streams
sync_all
write_schemas
self.streams
{}
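The `mappingproxy` printed above is a read-only view over a dict, so setting values are read with ordinary mapping access. A minimal standard-library sketch follows; the values are placeholders, and `filter_collections` is a hypothetical setting that the tap does not define at this point in the thread.

```python
from types import MappingProxyType

# Stand-in for the tap's `self.config`; keys mirror the output above,
# values are placeholders.
config = MappingProxyType({
    "database": "example_db",
    "discovery_mode": "select",
    "prefix": "",
})

# Reads behave like a normal dict.
discovery_mode = config["discovery_mode"]

# `filter_collections` is hypothetical here; .get() returns None when unset.
filter_collections = config.get("filter_collections")

# Writes are rejected because the proxy is read-only.
try:
    config["prefix"] = "x"
    writable = True
except TypeError:
    writable = False
```

Nothing in this mapping carries the `select` rules, which matches the output above: only declared settings show up.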
r
Hmm... It won't be available in the connector class where you are trying to apply it, since it is a method available to `SQLStream` classes only - i.e. `MongoDBCollectionStream` which are instantiated using the connector: https://github.com/MeltanoLabs/tap-mongodb/blob/b84a1e04052d83bade0f49fad79a1c9311b7a109/tap_mongodb/tap.py#L240-L243
m
right, but the issue is here: https://github.com/MeltanoLabs/tap-mongodb/blob/b84a1e04052d83bade0f49fad79a1c9311b7a109/tap_mongodb/connector.py#L119 I want to add the following code but it requires me to pull the select values from meltano.yml
if self.discovery_mode == 'select':
    collections = <pull selected streams>
else:
    collections = self.database.list_collection_names(authorizedCollections=True, nameOnly=True)
r
I think you probably want to implement a setting that accepts an array of collection names to filter for (i.e. `filter_collections`) during catalog discovery. `select` is used to filter streams/properties from the discovered catalog. You would be able to omit your `select` rules if you specified `filter_collections` as you would have already "selected" collections during discovery. A couple other taps follow a similar design pattern:
• `tap-mssql` and `filter_dbs`: https://hub.meltano.com/extractors/tap-mssql/#filter_dbs-setting
• `tap-mysql` and `filter_dbs`: https://hub.meltano.com/extractors/tap-mysql/#filter_dbs-setting
• `tap-postgres` and `filter_schemas`: https://hub.meltano.com/extractors/tap-postgres#filter_schemas-setting
• `tap-snowflake` and `tables`: https://hub.meltano.com/extractors/tap-snowflake#tables-setting
Then you would just need to modify the `MongoDBConnector` constructor to accept `filter_collections` from config: https://github.com/MeltanoLabs/tap-mongodb/blob/b84a1e04052d83bade0f49fad79a1c9311b7a109/tap_mongodb/tap.py#L207-L213
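The filtering idea described here could look roughly like the sketch below. It is not the tap's actual code: `collections_to_discover` and `fake_list_all` are hypothetical names, and the fake lister stands in for `self.database.list_collection_names(...)` so the example runs without MongoDB.

```python
from typing import Callable, Iterable, List, Optional


def collections_to_discover(
    filter_collections: Optional[Iterable[str]],
    list_all: Callable[[], List[str]],
) -> List[str]:
    """Return the collections that discovery should inspect."""
    if filter_collections:
        # An explicit filter means the database is never asked for the full list.
        return list(filter_collections)
    return list_all()


calls: List[str] = []


def fake_list_all() -> List[str]:
    # Hypothetical stand-in for the real collection listing call.
    calls.append("listed")
    return ["col1", "col2", "col3"]


filtered = collections_to_discover(["col1"], fake_list_all)
everything = collections_to_discover(None, fake_list_all)
print(filtered, everything, calls)
```

With a filter set, the lister is never called, which is where the discovery-time saving would come from.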
m
but i would want to run it like this
meltano el tap-mongodb target-snowflake --select collection1 --state-id tap_mongodb_collection1
^^ then I’d have to change the config param for every time I run a new tap with `--select`
we shouldn’t need 2 different select configurations in meltano.yml right? That would be error prone / buggy, ugly, and probably not do what I would want it to do?
i think the easiest solution would be to extract the selected streams somewhere within the tap
r
I agree there is some duplication in concept, but we are talking about two different stages of the tap process here: discovery and sync. If you perform filtering of collections during discovery, stream selection is implied, i.e.
meltano.yml
select:
- collection1.*

# same as above during sync, but with improved discovery performance
config:
  filter_collections:
  - collection1
> then I’d have to change the config param for every time I run a new tap with `--select`
Not sure what you mean here by "new tap" - do you have multiple `tap-mongodb` instances defined? If you wanted the config directly in the command rather than `meltano.yml`, you could do
TAP_MONGODB_FILTER_COLLECTIONS='["collection1"]' meltano el tap-mongodb target-snowflake --state-id tap_mongodb_collection1
m
** new stream, not tap
I’m not sure what adding `filter_collections` to the config would do if I run this tap 3 times with different streams selected. For example:
for collection in ['col1', 'col2', 'col3']:
    subprocess.run(['meltano', 'el', 'tap-mongodb', 'target-snowflake', '--select', collection, '--state-id', f'state_{collection}'])
^ it would still run the discovery mode for all collections under `filter_collections`
r
You could set the environment variable for the setting in `env` of `subprocess.run`:
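import json
import os
import subprocess

for collection in ['col1', 'col2', 'col3']:
    subprocess.run(
      ['meltano', 'el', 'tap-mongodb', 'target-snowflake', '--state-id', f'state_{collection}'],
      env={
        **os.environ,  # `env` replaces the whole environment, so keep PATH etc.
        "TAP_MONGODB_FILTER_COLLECTIONS": json.dumps([collection])
      }
    )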
In fact, you could implement it so the setting could accept a single collection also for your use-case:
for collection in ['col1', 'col2', 'col3']:
    subprocess.run(
      ['meltano', 'el', 'tap-mongodb', 'target-snowflake', '--state-id', f'state_{collection}'],
      env={
        **os.environ,  # requires `import os`; keeps PATH etc. in the child process
        "TAP_MONGODB_FILTER_COLLECTIONS": collection
      }
    )
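One detail worth checking in the loops above: `subprocess.run(..., env=...)` replaces the child's entire environment, so the parent environment usually needs to be merged in. A small sketch, with the hypothetical helper `build_invocation` and the actual `subprocess.run` call left commented out so it runs outside a Meltano project:

```python
import json
import os
from typing import Dict, List, Tuple


def build_invocation(collection: str) -> Tuple[List[str], Dict[str, str]]:
    """Build the command and environment for one per-collection run."""
    cmd = [
        "meltano", "el", "tap-mongodb", "target-snowflake",
        "--state-id", f"state_{collection}",
    ]
    # Merge the parent environment so PATH and credentials survive,
    # then add the setting as a JSON array string.
    env = {
        **os.environ,
        "TAP_MONGODB_FILTER_COLLECTIONS": json.dumps([collection]),
    }
    return cmd, env


for collection in ["col1", "col2", "col3"]:
    cmd, env = build_invocation(collection)
    # subprocess.run(cmd, env=env, check=True)  # enable inside a real project
    print(cmd[-1], env["TAP_MONGODB_FILTER_COLLECTIONS"])
```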
m
Hmm that's interesting! But then wouldn't it run the catalog discovery for all collections 3 times in this case? It's still going to hit the DB with find_one() an unnecessary amount of times 🤔
Also, this tap discovery mode assumes that it will pull the entire DB whenever it runs. I would think if the collection is not listed under the select section then the stream should not run or even appear in "deselected stream". If I set the tap like this it should not look for any streams outside of col1 and col2
select:
  - col1
  - col2
If I run meltano with --select col1, It should only show deselected stream col2, not deselected stream col3, ... coln
r
So that's where your implementation of `filter_collections` would come in:
connector.py
class MongoDBConnector:
    """MongoDB/DocumentDB connector class"""

    def __init__(  # pylint: disable=too-many-arguments
        self,
        connection_string: str,
        options: Dict[str, Any],
        db_name: str,
        datetime_conversion: str,
        prefix: Optional[str] = None,
        collections: Optional[List[str]] = None,
    ) -> None:
        self._connection_string = connection_string
        self._options = options
        self._db_name = db_name
        self._datetime_conversion: str = datetime_conversion.upper()
        self._prefix: Optional[str] = prefix
        self._collections = collections
        self._logger: Logger = getLogger(__name__)
        self._version: Optional[MongoVersion] = None

    ...

    def discover_catalog_entries(self) -> List[Dict[str, Any]]:
        """Return a list of catalog entries from discovery.

        Returns:
            The discovered catalog entries as a list.
        """
        result: List[Dict] = []
        collections = self._collections or self.database.list_collection_names(authorizedCollections=True, nameOnly=True)
        for collection in collections:
            ...
tap.py
    @cached_property
    def connector(self) -> MongoDBConnector:
        """Get MongoDBConnector instance. Instance is cached and reused."""
        return MongoDBConnector(
            self._get_mongo_connection_string(),
            self._get_mongo_options(),
            self.config.get("database"),
            self.config.get("datetime_conversion"),
            prefix=self.config.get("prefix", None),
            collections=self.config.get("filter_collections"),
        )
m
Wait but when you set collections=self.config.get('filtered_collections') it will still hit all of those collections when you run the tap with a --select flag
Hmmm... hacky solution could be I pass a single collection to the MongoConnector that reads from env in a loop... Difficult to read/understand. I wish we could just easily access the selected streams from within the tap. Are you sure that's not possible?
r
> I would think if the collection is not listed under the select section then the stream should not run or even appear in "deselected stream"
I think the point of discovery and stream/property selection being separate concepts by default is that the discovered catalog is exposed to a user, who can select what entities they want to sync without any prior knowledge of what data is available. The performance issue you are running into is really only a problem for taps that perform dynamic discovery (rather than others with well-known schemas that are statically defined) e.g. most SQL taps, `tap-google-sheets`.
> If I set the tap like this it should not look for any streams outside of col1 and col2
This isn't how `select` works though - it operates on streams/properties that have already been "looked up" i.e. discovered. That configuration would include only `col1` and `col2` in the sync, but all collections would still be discovered beforehand.
> Wait but when you set collections=self.config.get('filtered_collections') it will still hit all of those collections when you run the tap with a --select flag
Not in your case as you would be setting `filter_collections` to a single collection via environment variable in a loop. `--select` would be redundant as the discovered catalog would only contain information about `colN`, so it's safe to omit.
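The two stages described here can be illustrated with a simplified sketch. This is not Meltano's real selection logic (which also handles property-level and exclusion rules); `discover` and `apply_select` are hypothetical names, and the matching uses plain `fnmatch` on the stream part of each pattern.

```python
from fnmatch import fnmatch
from typing import Dict, List


def discover(all_collections: List[str]) -> List[str]:
    # Stage 1: every collection is inspected, whatever the select rules say.
    return list(all_collections)


def apply_select(streams: List[str], patterns: List[str]) -> Dict[str, bool]:
    # Stage 2: mark each already-discovered stream as selected or not.
    stream_patterns = [p.split(".")[0] for p in patterns]
    return {s: any(fnmatch(s, p) for p in stream_patterns) for s in streams}


catalog = discover(["col1", "col2", "col3"])
selection = apply_select(catalog, ["col1.*", "col2.*"])
print(selection)
```

`col3` still appears in the catalog and is merely marked as not selected, which is why select rules alone do not reduce discovery work.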
m
Hmm ok I see what you're saying. So you're also saying that it's not possible to access the select parameters from within the tap?
r
Not in the context you want to.
I would advocate for the filtering setting, as a lot of other database taps implement that pattern for the exact performance reason you were describing.
m
Hmm ok. Maybe it's something that should be a feature request for a bigger project? It would be nice to be able to access the selected streams from the tap itself
r
@Edgar Ramírez (Arch.dev) any advice/ideas here? You will know more than me about the design here.
@matt_elgazar Try this `tap-mongodb` definition with the subprocess.run change:
plugins:
  extractors:
  - name: tap-mongodb
    variant: meltanolabs
    pip_url: git+https://github.com/ReubenFrankel/tap-mongodb.git@filter-collections
    settings:
    - name: filter_collections
https://github.com/ReubenFrankel/tap-mongodb/tree/filter-collections
m
TypeError: __init__() got an unexpected keyword argument 'discover_streams'
Does meltano.yml look like this?
filtered_collections:
        - 'col1.*'
r
plugins:
  extractors:
  - name: tap-mongodb
    variant: meltanolabs
    pip_url: git+https://github.com/ReubenFrankel/tap-mongodb.git@filter-collections
    settings:
    - name: filter_collections
    config:
      filter_collections: col1
      # or
      # filter_collections: [col1]
Sorry - `filter_collections`, not `filtered_collections`
m
I’m still getting `TypeError: __init__() got an unexpected keyword argument 'discover_streams'`
do you have a mongo database you can connect to and try it on?
r
Oh, that's my bad. I just cranked this out quickly without checking, one sec...
Try again now.
You may have to
meltano install --clean extractor tap-mongodb
m
same error
r
> do you have a mongo database you can connect to and try it on? Never used MongoDB before - just worked with the SDK a lot.
m
For more detailed log messages re-run the command using 'meltano --log-level=debug ...' CLI flag. cmd_type=elt name=meltano run_id=3b0ce514-13b6-44eb-a069-523e5a67c1aa state_id=tap_mongodb_testing_casinodata stdio=stderr
2024-12-05T23:26:57.947817Z [info     ] Note that you can also check the generated log file at '/Users/melgazar9/scripts/github/DG/data-science/projects/elt/meltano_projects/tap-mongodb/.meltano/logs/elt/tap_mongodb_testing_casinodata/3b0ce514-13b6-44eb-a069-523e5a67c1aa/elt.log'. cmd_type=elt name=meltano run_id=3b0ce514-13b6-44eb-a069-523e5a67c1aa state_id=tap_mongodb_testing_casinodata stdio=stderr
2024-12-05T23:26:57.947875Z [info     ] For more information on debugging and logging: <https://docs.meltano.com/reference/command-line-interface#debugging> cmd_type=elt name=meltano run_id=3b0ce514-13b6-44eb-a069-523e5a67c1aa state_id=tap_mongodb_testing_casinodata stdio=stderr
Need help fixing this problem? Visit <http://melta.no/> for troubleshooting steps, or to
join our friendly Slack community.

ELT could not be completed: Cannot start extractor: Catalog discovery failed: command ['/Users/melgazar9/scripts/github/DG/data-science/projects/elt/meltano_projects/tap-mongodb/.meltano/extractors/tap-mongodb/venv/bin/tap-mongodb', '--config', '/Users/melgazar9/scripts/github/DG/data-science/projects/elt/meltano_projects/tap-mongodb/.meltano/run/elt/tap_mongodb_testing_casinodata/3b0ce514-13b6-44eb-a069-523e5a67c1aa/tap.e30f052b-3a01-47b1-aef0-91df12c2d032.config.json', '--discover'] returned 1 with stderr:
 Traceback (most recent call last):
  File "/Users/melgazar9/scripts/github/DG/data-science/projects/elt/meltano_projects/tap-mongodb/.meltano/extractors/tap-mongodb/venv/bin/tap-mongodb", line 5, in <module>
    from tap_mongodb.tap import TapMongoDB
  File "/Users/melgazar9/scripts/github/DG/data-science/projects/elt/meltano_projects/tap-mongodb/.meltano/extractors/tap-mongodb/venv/lib/python3.9/site-packages/tap_mongodb/tap.py", line 22, in <module>
    class TapMongoDB(Tap):
  File "/Users/melgazar9/scripts/github/DG/data-science/projects/elt/meltano_projects/tap-mongodb/.meltano/extractors/tap-mongodb/venv/lib/python3.9/site-packages/tap_mongodb/tap.py", line 27, in TapMongoDB
    config_jsonschema = th.PropertiesList(
  File "/Users/melgazar9/scripts/github/DG/data-science/projects/elt/meltano_projects/tap-mongodb/.meltano/extractors/tap-mongodb/venv/lib/python3.9/site-packages/singer_sdk/typing.py", line 242, in to_dict
    return self.type_dict  # type: ignore[no-any-return]
  File "/Users/melgazar9/scripts/github/DG/data-science/projects/elt/meltano_projects/tap-mongodb/.meltano/extractors/tap-mongodb/venv/lib/python3.9/site-packages/singer_sdk/typing.py", line 692, in type_dict
    merged_props.update(w.to_dict())
  File "/Users/melgazar9/scripts/github/DG/data-science/projects/elt/meltano_projects/tap-mongodb/.meltano/extractors/tap-mongodb/venv/lib/python3.9/site-packages/singer_sdk/typing.py", line 566, in to_dict
    type_dict = append_type(type_dict, "null")
  File "/Users/melgazar9/scripts/github/DG/data-science/projects/elt/meltano_projects/tap-mongodb/.meltano/extractors/tap-mongodb/venv/lib/python3.9/site-packages/singer_sdk/helpers/_typing.py", line 74, in append_type
    raise ValueError(msg)
ValueError: Could not append type because the JSON schema for the dictionary `{'oneOf': [{'type': ['string']}, {'type': 'array', 'items': {'type': ['string']}}]}` appears to be invalid.
.
yea i can take what you have done on that commit and play around with it to get it to work.
no worries, if you don’t have a mongodb db to connect to it’s going to be pretty hard to get this to work IMO. I will be using this tap in production so it has to be robust
r
This is still stuck at config validation 🤦 Sorry, rushing too much.
m
no rush, take your time 🙂
I’m working on it as well
r
Ok, it's getting past config validation now... Can you clean install and try again?
m
it runs, but same issue as before.. it still hits all of the collections.
@Edgar Ramírez (Arch.dev) is it possible to extract the selected stream text from `meltano.yml` within the extractor without hard coding using `yaml.safe_load`?
r
If you put
        self._logger.info(self._collections)
        self._logger.info(collections)
here, what do you see?
e
if the purpose is to skip discovery of unselected collections, this is equivalent to https://github.com/meltano/sdk/issues/1234. (Sorry if this was mentioned, I only skimmed the 50+ messages in the thread 😅).
> is it possible to extract the selected stream text from `meltano.yml` within the extractor without hard coding using `yaml.safe_load`?
I wouldn't recommend coupling a Meltano project to a tap implementation, we should be able to solve this with config + acting early in the implementation of `discover_catalog_entries`.
The branch looks good, are there still problems with that approach? (looking at https://github.com/MeltanoLabs/tap-mongodb/compare/main...ReubenFrankel:tap-mongodb:filter-collections)
r
Yeah, #1234 is how I think Matt wants `select` to work here (through `meltano el --select`) - thanks for finding the issue, would definitely be good to have.
e
The reason this feels clunky is that the pattern is not supported by Meltano/Singer, which is why taps have opted for settings like `filter_collections`. What do `select` patterns mean when there aren't any streams against which they can be applied before discovery? You would need an invocation like `tap --discover --catalog pre-selected-catalog.json`, but how do you generate that catalog if you only know the patterns, not the actual streams? I'm certainly open to ideas, but this is the reason this request hasn't moved forward.
r
@matt_elgazar I spun up a test cloud instance of MongoDB with some sample data and the branch change is working as expected:
• All collections: ~1m40s, 67661 records
• Single collection: ~0m5s, 185 records
No `select` rules for either, though `filter_collections` is behaving as `select` otherwise would.
Here is just discovery - negligible difference, but maybe that is because there are only six collections here (rather than 1000s):
m
heyyy
• Yes!! The purpose is to filter discovery to only selected streams as you mentioned in that meltano sdk issue https://github.com/meltano/sdk/issues/1234.
• where would you put `TAP_MONGODB_FILTER_COLLECTIONS=collection1` if I’m running `meltano el`? Do I need to do anything in meltano.yml if I’m calling the filter collections parameter via the CLI?
i tried this
TAP_MONGODB_FILTER_COLLECTIONS=test_collection meltano --environment=dev el tap-mongodb target-jsonl --state-id test --full-refresh
but I get this error:
Failed to parse JSON array from string: 'test_collection'
also tried `TAP_MONGODB__CONFIG__FILTER_COLLECTIONS` and `MELTANO_EXTRACTORS__TAP_MONGODB__CONFIG__FILTER_COLLECTIONS` but these do nothing and hit the entire DB during discovery. When I use `TAP_MONGODB_FILTER_COLLECTIONS='["test_collection"]'` then it also hits the entire db
r
Yeah, that's an artifact of specifying `kind: array` for the setting and trying to support both a single collection as a string and multiple collections as an array of strings.
TAP_MONGODB_FILTER_COLLECTIONS='["test_collection"]'
should be correct. When you say it's hitting all collections, are you seeing multiple `Discovered collection` log messages?
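Supporting both a single name and an array on the tap side could be sketched as below. This only illustrates tap-side normalization under the assumption that the raw value reaches the tap unchanged; it does not change how Meltano itself parses a `kind: array` setting from an environment variable, which is where the `Failed to parse JSON array` error above comes from. `normalize_collections` is a hypothetical helper.

```python
import json
from typing import List, Optional, Union


def normalize_collections(
    value: Union[str, List[str], None],
) -> Optional[List[str]]:
    """Accept None, a list, a JSON array string, or a bare collection name."""
    if value is None:
        return None
    if isinstance(value, list):
        return [str(v) for v in value]
    text = value.strip()
    if text.startswith("["):
        parsed = json.loads(text)
        if not isinstance(parsed, list):
            raise ValueError(f"expected a JSON array, got: {text!r}")
        return [str(v) for v in parsed]
    # A bare string is treated as a single collection name.
    return [text] if text else None


print(normalize_collections('["test_collection"]'))
print(normalize_collections("test_collection"))
```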
m
yep
TAP_MONGODB_FILTER_COLLECTIONS='["test_collection"]' meltano --environment=dev el tap-mongodb target-jsonl --state-id test --full-refresh

...
2024-12-06T18:15:11.286080Z [warning  ] Stream `videostreamdata` was not found in the catalog
2024-12-06T18:15:11.286124Z [warning  ] Stream `wearabledispensepriceditems` was not found in the catalog
2024-12-06T18:15:11.286175Z [warning  ] Stream `wearabledispenseritems` was not found in the catalog
2024-12-06T18:15:11.286219Z [warning  ] Stream `wearabledispenserpaymenttokens` was not found in the catalog
2024-12-06T18:15:11.286261Z [warning  ] Stream `wearabledispensers` was not found in the catalog
2024-12-06T18:15:11.286307Z [warning  ] Stream `xdgrewardtrees` was not found in the catalog
2024-12-06T18:15:13.290110Z [info     ] Writing state to Azure Blob Storage
2024-12-06T18:15:13.974264Z [info     ] uploading part #1, 17 bytes (total 0.000GB)
2024-12-06T18:15:14.224768Z [info     ] uploading part #1, 50 bytes (total 0.000GB)
2024-12-06T18:15:14.475834Z [info     ] Incremental state has been updated at 2024-12-06 18:15:14.475721.
2024-12-06T18:15:14.482935Z [info     ] Extract & load complete!       name=meltano run_id=de83ae67-4260-4360-9b0e-c112cb6d9b3a state_id=test
r
TAP_MONGODB_FILTER_COLLECTIONS='["test_collection"]' meltano config tap-mongodb list
will show you what configuration the tap will use - you should see the array value for `filter_collections` listed.
m
yea that command still shows all the collections. It shouldn’t even find those right?
ON_METHOD] current value: 'INCREMENTAL' (from `meltano.yml`)
_metadata.wearablerewards.replication-key [env: TAP_MONGODB__METADATA_WEARABLEREWARDS_REPLICATION_KEY] current value: 'replication_key' (from `meltano.yml`)
_metadata.wearablerewards.replication-method [env: TAP_MONGODB__METADATA_WEARABLEREWARDS_REPLICATION_METHOD] current value: 'INCREMENTAL' (from `meltano.yml`)
_metadata.wearabledispensepriceditems.replication-key [env: TAP_MONGODB__METADATA_WEARABLEDISPENSEPRICEDITEMS_REPLICATION_KEY] current value: 'replication_key' (from `meltano.yml`)
_metadata.wearabledispensepriceditems.replication-method [env: TAP_MONGODB__METADATA_WEARABLEDISPENSEPRICEDITEMS_REPLICATION_METHOD] current value: 'INCREMENTAL' (from `meltano.yml`)
_metadata.wearabledispenseritems.replication-key [env: TAP_MONGODB__METADATA_WEARABLEDISPENSERITEMS_REPLICATION_KEY] current value: 'replication_key' (from `meltano.yml`)
_metadata.wearabledispenseritems.replication-method [env: TAP_MONGODB__METADATA_WEARABLEDISPENSERITEMS_REPLICATION_METHOD] current value: 'INCREMENTAL' (from `meltano.yml`)
_metadata.wearabledispenserpaymenttokens.replication-key [env: TAP_MONGODB__METADATA_WEARABLEDISPENSERPAYMENTTOKENS_REPLICATION_KEY] current value: 'replication_key' (from `meltano.yml`)
_metadata.wearabledispenserpaymenttokens.replication-method [env: TAP_MONGODB__METADATA_WEARABLEDISPENSERPAYMENTTOKENS_REPLICATION_METHOD] current value: 'INCREMENTAL' (from `meltano.yml`)
_metadata.wearabledispensers.replication-key [env: TAP_MONGODB__METADATA_WEARABLEDISPENSERS_REPLICATION_KEY] current value: 'replication_key' (from `meltano.yml`)
_metadata.wearabledispensers.replication-method [env: TAP_MONGODB__METADATA_WEARABLEDISPENSERS_REPLICATION_METHOD] current value: 'INCREMENTAL' (from `meltano.yml`)
_metadata.xdgrewardtrees.replication-key [env: TAP_MONGODB__METADATA_XDGREWARDTREES_REPLICATION_KEY] current value: 'replication_key' (from `meltano.yml`)
_metadata.xdgrewardtrees.replication-method [env: TAP_MONGODB__METADATA_XDGREWARDTREES_REPLICATION_METHOD] current value: 'INCREMENTAL' (from `meltano.yml`)
r
Oh, you have `metadata` defined? That's probably messing with this.
m
yea i need to define metadata, because some collections require INCREMENTAL stream and some LOG_BASED. This is super important
metadata:
      'accessorynftinfos':
        replication-key: replication_key
        replication-method: INCREMENTAL
      'activebanners':
        replication-key: replication_key
        replication-method: LOG_BASED
r
`Stream {} was not found in the catalog` implies that it's not hitting the collection, no? Just that you have metadata defined for that stream, but it's not present in the discovered catalog after using `filter_collections`?
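That reading of the warnings can be sketched as a simple set difference: streams that have `metadata` rules but are absent from the filtered catalog. The helper name is hypothetical and the stream names are taken from the log lines earlier in the thread.

```python
from typing import Iterable, List


def streams_missing_from_catalog(
    metadata_streams: Iterable[str],
    catalog_streams: Iterable[str],
) -> List[str]:
    """Streams with metadata rules that were not discovered."""
    present = set(catalog_streams)
    return sorted(s for s in metadata_streams if s not in present)


metadata_streams = ["videostreamdata", "wearabledispensers", "test_collection"]
catalog_streams = ["test_collection"]  # what discovery returns once filtered

missing = streams_missing_from_catalog(metadata_streams, catalog_streams)
print(missing)
```

Each name in `missing` would correspond to one "was not found in the catalog" warning, without the collection itself being queried.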
m
ah yes, good catch!
ok but when I run
TAP_MONGODB_FILTER_COLLECTIONS='["test_collection"]' meltano --environment=dev el tap-mongodb target-jsonl --state-id test --full-refresh
I don’t see any jsonl file in the `output/` directory. I also tried `test_collection.*`
r
Do you actually have a collection named `test_collection`?
m
yea
oh wait I see this
2024-12-06T18:40:01.812740Z [warning  ] Stream `test_collection` was not found in the catalog
r
What collections do you see if you run
meltano select tap-mongodb --list
?
m
wait i keep getting `RuntimeError: Could not connect to MongoDB` but i can definitely connect because I login with that same URI via compass… hmm
Cannot list the selected attributes: Catalog discovery failed: command ['/Users/melgazar9/scripts/github/DG/data-science/projects/elt/meltano_projects/tap-mongodb/.meltano/extractors/tap-mongodb/venv/bin/tap-mongodb', '--config', '/Users/melgazar9/scripts/github/DG/data-science/projects/elt/meltano_projects/tap-mongodb/.meltano/run/tap-mongodb/tap.575bc33b-3bd2-4f59-a518-2ce983bcb92f.config.json', '--discover'] returned 1 with stderr:
 Config validation failed: 'database' is a required property
but `database` is defined in meltano.yml
hmm this is happening with the main branch as well
r
Do you see it in
meltano config tap-mongodb list
?
m
oohhh with that command I see
meltano config tap-mongodb list | grep testcollection
2024-12-06T18:54:10.534217Z [info     ] The default environment 'dev' will be ignored for `meltano config`. To configure a specific environment, please use the option `--environment=<environment name>`.
_metadata.testcollection.replication-key [env: TAP_MONGODB__METADATA_TESTCOLLECTION_REPLICATION_KEY] current value: 'replication_key' (from `meltano.yml`)
_metadata.testcollection.replication-method [env: TAP_MONGODB__METADATA_TESTCOLLECTION_REPLICATION_METHOD] current value: 'LOG_BASED' (from `meltano.yml`)
any ideas why i’m not getting any data in the output? Do you get data in the output when you run the tap with `meltano el`?
r
I would focus on why
meltano select tap-mongodb --list
doesn't work for you. The tap outputs records with my config: https://meltano.slack.com/archives/C06A1LKFAAC/p1733448784844729?thread_ts=1733425135.546469&cid=C06A1LKFAAC If you're not seeing any files in the `output` directory when running with `target-jsonl`, most likely the tap is not outputting any records.
m
what does your meltano.yml look like?
m
can you try it with metadata and environments set up? mine looks like this and haven’t changed anything except the pip url
version: 1
send_anonymous_usage_stats: true
project_id: tap-mongodb

default_environment: dev

state_backend:
  type: remote
  uri: ${AZURE_TAP_MONGODB_STATE_URI}
  azure:
    connection_string: ${AZURE_TAP_MONGODB_STATE_CONNECTION_STRING}

plugins:
  extractors:
  - name: tap-mongodb
    namespace: tap_mongodb
    pip_url: git+https://github.com/ReubenFrankel/tap-mongodb.git@filter-collections
    capabilities:
      - state
      - catalog
      - discover
      - about
      - stream-maps

    config:
      add_record_metadata: true
      allow_modify_change_streams: true

    select:
      - 'accessorynftinfos.*'
      - 'activebanners.*'
      - 'testcollection.*'

    metadata:
      'accessorynftinfos':
        replication-key: replication_key
        replication-method: INCREMENTAL
      'activebanners':
        replication-key: replication_key
        replication-method: LOG_BASED

      'activebanners':
        replication-key: replication_key
        replication-method: INCREMENTAL

  loaders:
    - name: target-jsonl
      variant: andyh1203
      pip_url: git+https://github.com/andyhuynh3/target-jsonl.git

environments:
  - name: dev
    config:
      plugins:
        extractors:
          - name: tap-mongodb
            config:
              mongodb_connection_string: ${DEV_MONGODB_CONNECTION_STRING}
              database: Dev_DB
        loaders:
          - name: target-snowflake
            config:
              default_target_schema: MONGODB_DEV
r
Since you have `database` configured for the `dev` environment, try
meltano --environment dev select tap-mongodb --list
m
ah awesome! that did it 🙂
(it should be --environment=dev)
2024-12-06T19:45:39.570655Z [warning  ] Stream `internalauthservers` was not found in the catalog
2024-12-06T19:45:39.570905Z [warning  ] Stream `packages` was not found in the catalog
Legend:
	selected
	excluded
	automatic

Enabled patterns:
	bankedicemarketplaceinfos.*
	bannedusers.*
	casinodata.*
	realworldprizeredemptions.*
	wearablerewards.*

Selected attributes:
	[selected ] bankedicemarketplaceinfos._sdc_batched_at
	[selected ] bankedicemarketplaceinfos._sdc_extracted_at
	[selected ] bankedicemarketplaceinfos.cluster_time
	[selected ] bankedicemarketplaceinfos.document
	[selected ] bankedicemarketplaceinfos.namespace
	[selected ] bankedicemarketplaceinfos.namespace.collection
	[selected ] bankedicemarketplaceinfos.namespace.database
	[selected ] bankedicemarketplaceinfos.object_id
	[selected ] bankedicemarketplaceinfos.operation_type
	[automatic] bankedicemarketplaceinfos.replication_key
	[selected ] bannedusers._sdc_batched_at
	[selected ] bannedusers._sdc_extracted_at
	[selected ] bannedusers.cluster_time
	[selected ] bannedusers.document
	[selected ] bannedusers.namespace
	[selected ] bannedusers.namespace.collection
	[selected ] bannedusers.namespace.database
	[selected ] bannedusers.object_id
	[selected ] bannedusers.operation_type
	[automatic] bannedusers.replication_key
r
Cool, now try
TAP_MONGODB_FILTER_COLLECTIONS='["casinodata"]' meltano --environment dev select tap-mongodb --list
Right, I think I figured your no data issue out: you had `testcollection` configured for `filter_collections` before, but you have other streams selected - not including `testcollection`, so it would have been excluded.
Enabled patterns:
	bankedicemarketplaceinfos.*
	bannedusers.*
	casinodata.*
	realworldprizeredemptions.*
	wearablerewards.*
Remove these selection rules and try again.
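The interaction described here, where `filter_collections` narrows discovery and the select patterns then narrow it further, can be sketched as an intersection. This is a simplification using `fnmatch` on stream names only, and it assumes that an empty rule list means everything is selected; `streams_that_will_sync` is a hypothetical helper.

```python
from fnmatch import fnmatch
from typing import List

# Enabled patterns as listed in the output above.
ENABLED_PATTERNS = [
    "bankedicemarketplaceinfos.*",
    "bannedusers.*",
    "casinodata.*",
    "realworldprizeredemptions.*",
    "wearablerewards.*",
]


def streams_that_will_sync(
    filter_collections: List[str],
    select_patterns: List[str],
) -> List[str]:
    """Streams left after discovery filtering and then selection."""
    if not select_patterns:
        # Assumption: with no rules, every discovered stream is selected.
        return list(filter_collections)
    stream_patterns = [p.split(".")[0] for p in select_patterns]
    return [
        c for c in filter_collections
        if any(fnmatch(c, p) for p in stream_patterns)
    ]


print(streams_that_will_sync(["testcollection"], ENABLED_PATTERNS))
print(streams_that_will_sync(["casinodata"], ENABLED_PATTERNS))
print(streams_that_will_sync(["testcollection"], []))
```

Under these assumptions, filtering to `testcollection` while the patterns above are enabled leaves nothing to sync, which would explain the empty `output/` directory.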
m
hmmm why is it selecting other streams?
melgazar9@MacBook-Pro tap-mongodb % TAP_MONGODB_FILTER_COLLECTIONS='["casinodata"]' meltano --environment=dev select tap-mongodb --list
2024-12-06T19:52:52.802146Z [info     ] Environment 'dev' is active   
2024-12-06T19:52:54.019420Z [warning  ] Stream `internalauthservers` was not found in the catalog
2024-12-06T19:52:54.019542Z [warning  ] Stream `packages` was not found in the catalog
Legend:
	selected
	excluded
	automatic

Enabled patterns:
	bankedicemarketplaceinfos.*
	bannedusers.*
	casinodata.*
	realworldprizeredemptions.*
	wearablerewards.*

Selected attributes:
	[selected ] bankedicemarketplaceinfos._sdc_batched_at
	[selected ] bankedicemarketplaceinfos._sdc_extracted_at
	[selected ] bankedicemarketplaceinfos.cluster_time
	[selected ] bankedicemarketplaceinfos.document
	[selected ] bankedicemarketplaceinfos.namespace
	[selected ] bankedicemarketplaceinfos.namespace.collection
	[selected ] bankedicemarketplaceinfos.namespace.database
	[selected ] bankedicemarketplaceinfos.object_id
	[selected ] bankedicemarketplaceinfos.operation_type
	[automatic] bankedicemarketplaceinfos.replication_key
	[selected ] bannedusers._sdc_batched_at
	[selected ] bannedusers._sdc_extracted_at
	[selected ] bannedusers.cluster_time
	[selected ] bannedusers.document
	[selected ] bannedusers.namespace
	[selected ] bannedusers.namespace.collection
	[selected ] bannedusers.namespace.database
	[selected ] bannedusers.object_id
	[selected ] bannedusers.operation_type
	[automatic] bannedusers.replication_key
	[selected ] casinodata._sdc_batched_at
	[selected ] casinodata._sdc_extracted_at
	[selected ] casinodata.cluster_time
	[selected ] casinodata.document
	[selected ] casinodata.namespace
	[selected ] casinodata.namespace.collection
	[selected ] casinodata.namespace.database
	[selected ] casinodata.object_id
	[selected ] casinodata.operation_type
	[automatic] casinodata.replication_key
	[selected ] realworldprizeredemptions._sdc_batched_at
	[selected ] realworldprizeredemptions._sdc_extracted_at
	[selected ] realworldprizeredemptions.cluster_time
	[selected ] realworldprizeredemptions.document
	[selected ] realworldprizeredemptions.namespace
	[selected ] realworldprizeredemptions.namespace.collection
	[selected ] realworldprizeredemptions.namespace.database
	[selected ] realworldprizeredemptions.object_id
	[selected ] realworldprizeredemptions.operation_type
	[automatic] realworldprizeredemptions.replication_key
	[selected ] wearablerewards._sdc_batched_at
	[selected ] wearablerewards._sdc_extracted_at
	[selected ] wearablerewards.cluster_time
	[selected ] wearablerewards.document
	[selected ] wearablerewards.namespace
	[selected ] wearablerewards.namespace.collection
	[selected ] wearablerewards.namespace.database
	[selected ] wearablerewards.object_id
	[selected ] wearablerewards.operation_type
	[automatic] wearablerewards.replication_key
hmm, why is it selecting other collections?
r
Are those select patterns in your
meltano.yml
?
m
ah yes
but it selects all streams in meltano.yml if i do
Copy code
TAP_MONGODB_FILTER_COLLECTIONS='["casinodata"]' meltano --environment=dev el tap-mongodb target-jsonl --state-id test
Are you saying remove everything selected in meltano.yml?
r
Yes.
m
oh no that selected everything 😅 sorry am I missing something?
Copy code
TAP_MONGODB_FILTER_COLLECTIONS='["casinodata"]' meltano --environment=dev el tap-mongodb target-jsonl --state-id test
r
You should only see
casinodata
in the select list running
Copy code
TAP_MONGODB_FILTER_COLLECTIONS='["casinodata"]' meltano --environment=dev select tap-mongodb --list
with no
select
config in your
meltano.yml
.
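The value passed in `TAP_MONGODB_FILTER_COLLECTIONS` in these commands is a JSON array literal inside shell single quotes. As a minimal sketch of how such a value can be read, assuming an array-kind setting supplied through an environment variable is interpreted as JSON, the hypothetical helper `read_array_setting` below parses it into a Python list. It is not part of Meltano or the tap.

```python
import json
import os


def read_array_setting(env_var, environ=os.environ):
    """Hypothetical helper: parse a JSON array from an environment variable."""
    raw = environ.get(env_var)
    if raw is None:
        return None  # setting not provided
    value = json.loads(raw)
    if not isinstance(value, list):
        raise ValueError(f"{env_var} must be a JSON array, got {type(value).__name__}")
    return value


# Simulated environment, mirroring the command used in this thread.
fake_env = {"TAP_MONGODB_FILTER_COLLECTIONS": '["casinodata"]'}
print(read_array_setting("TAP_MONGODB_FILTER_COLLECTIONS", fake_env))
```

A bare value such as `casinodata` (without brackets and double quotes) would fail in `json.loads`, which is why the commands wrap the array in single quotes.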
m
I only have metadata defined in meltano.yml, but no select. I removed the select completely
Copy code
config:
  add_record_metadata: true
  allow_modify_change_streams: true
r
https://meltano.slack.com/archives/C06A1LKFAAC/p1733515321494379?thread_ts=1733425135.546469&cid=C06A1LKFAAC So what happens when you run this? You see a file for each collection in
output
?
m
Copy code
melgazar9@MacBook-Pro tap-mongodb % ls -ltra output 
total 8
-rw-r--r--   1 melgazar9  staff   14 Nov  1  2023 .gitignore
drwxr-xr-x   3 melgazar9  staff   96 Dec  6 14:05 .
drwxr-xr-x  27 melgazar9  staff  864 Dec  6 14:06 ..
melgazar9@MacBook-Pro tap-mongodb % TAP_MONGODB_FILTER_COLLECTIONS='["casinodata"]' meltano --environment=dev el tap-mongodb target-jsonl --state-id test
2024-12-06T20:08:42.434383Z [info     ] Environment 'dev' is active   
2024-12-06T20:08:43.072167Z [info     ] Running extract & load...      name=meltano run_id=4b899814-3f67-46cd-b2f5-fc6fc64e1ba6 state_id=test
2024-12-06T20:08:44.190784Z [info     ] Reading state from Azure Blob Storage
2024-12-06T20:08:45.249587Z [info     ] uploading part #1, 16 bytes (total 0.000GB)
2024-12-06T20:09:01.513999Z [warning  ] Stream `internalauthservers` was not found in the catalog
2024-12-06T20:09:01.514431Z [warning  ] Stream `packages` was not found in the catalog
2024-12-06T20:09:02.263224Z [info     ] 2024-12-06 14:09:02,262 | INFO     | tap-mongodb          | Beginning incremental sync of 'accessorynftinfos'... cmd_type=extractor name=tap-mongodb run_id=4b899814-3f67-46cd-b2f5-fc6fc64e1ba6 state_id=test stdio=stderr
2024-12-06T20:09:02.263640Z [info     ] 2024-12-06 14:09:02,263 | INFO     | tap-mongodb          | Tap has custom mapper. Using 1 provided map(s). cmd_type=extractor name=tap-mongodb run_id=4b899814-3f67-46cd-b2f5-fc6fc64e1ba6 state_id=test stdio=stderr

......

2024-12-06T20:10:03.967851Z [info     ] Writing state to Azure Blob Storage
2024-12-06T20:10:04.633706Z [info     ] uploading part #1, 17 bytes (total 0.000GB)
2024-12-06T20:10:04.898355Z [info     ] uploading part #1, 826 bytes (total 0.000GB)
2024-12-06T20:10:05.154122Z [info     ] Incremental state has been updated at 2024-12-06 20:10:05.154066.
2024-12-06T20:10:05.165889Z [info     ] Extract & load complete!       name=meltano run_id=1556e4be-7089-4215-a139-d767a521e116 state_id=test
2024-12-06T20:10:05.167017Z [info     ] Transformation skipped.        name=meltano run_id=1556e4be-7089-4215-a139-d767a521e116 state_id=test
melgazar9@MacBook-Pro tap-mongodb % ls -ltra output 
total 48464
-rw-r--r--   1 melgazar9  staff        14 Nov  1  2023 .gitignore
drwxr-xr-x  27 melgazar9  staff       864 Dec  6 14:06 ..
-rw-r--r--   1 melgazar9  staff      7594 Dec  6 14:09 accessorynftinfos.jsonl
-rw-r--r--   1 melgazar9  staff       404 Dec  6 14:09 activenotice.jsonl
-rw-r--r--   1 melgazar9  staff       450 Dec  6 14:09 activepoap.jsonl
-rw-r--r--   1 melgazar9  staff      2143 Dec  6 14:09 activerpc.jsonl
-rw-r--r--   1 melgazar9  staff       904 Dec  6 14:09 allowedorigins.jsonl
-rw-r--r--   1 melgazar9  staff     21636 Dec  6 14:09 appconfig.jsonl
-rw-r--r--   1 melgazar9  staff  23252941 Dec  6 14:09 arcadehandanalyticsdata.jsonl
drwxr-xr-x  11 melgazar9  staff       352 Dec  6 14:10 .
-rw-r--r--   1 melgazar9  staff    609066 Dec  6 14:10 casinodata.jsonl
r
😮‍💨 I think
metadata
is conflicting with what we're trying to do here. What happens if you add
--select casinodata
to the command?
Otherwise, I'm pretty stumped.
m
with only
--select casinodata
it only runs casinodata but it still shows other streams. One sec let me time the 2
Copy code
ironment=dev el tap-mongodb target-jsonl --state-id test --select casinodata
2024-12-06T20:15:11.344063Z [info     ] Environment 'dev' is active   
2024-12-06T20:15:11.978909Z [info     ] Running extract & load...      name=meltano run_id=4f1852c2-5793-42a8-8f6a-342b4f7dc40f state_id=test
2024-12-06T20:15:13.102493Z [info     ] Reading state from Azure Blob Storage
2024-12-06T20:15:13.816461Z [info     ] uploading part #1, 17 bytes (total 0.000GB)
2024-12-06T20:15:29.707819Z [warning  ] Stream `internalauthservers` was not found in the catalog
2024-12-06T20:15:29.708050Z [warning  ] Stream `packages` was not found in the catalog
2024-12-06T20:15:30.325400Z [info     ] 2024-12-06 14:15:30,325 | INFO     | tap-mongodb          | Skipping deselected stream 'accessorynftinfos'. cmd_type=extractor name=tap-mongodb run_id=4f1852c2-5793-42a8-8f6a-342b4f7dc40f state_id=test stdio=stderr
2024-12-06T20:15:30.325820Z [info     ] 2024-12-06 14:15:30,325 | INFO     | tap-mongodb          | Skipping deselected stream 'activebanners'. cmd_type=extractor name=tap-mongodb run_id=4f1852c2-5793-42a8-8f6a-342b4f7dc40f state_id=test stdio=stderr
2024-12-06T20:15:30.325903Z [info     ] 2024-12-06 14:15:30,325 | INFO     | tap-mongodb          | Skipping deselected stream 'activenotice'. cmd_type=extractor name=tap-mongodb run_id=4f1852c2-5793-42a8-8f6a-342b4f7dc40f state_id=test stdio=stderr
2024-12-06T20:15:30.325986Z [info     ] 2024-12-06 14:15:30,325 | INFO     | tap-mongodb          | Skipping deselected stream 'activepoap'. cmd_type=extractor name=tap-mongodb run_id=4f1852c2-5793-42a8-8f6a-342b4f7dc40f state_id=test stdio=stderr
Copy code
time TAP_MONGODB_FILTER_COLLECTIONS='["casinodata"]' meltano --environment=dev el   4.22s user 0.57s system 19% cpu 24.809 total
Copy code
meltano --environment=dev el tap-mongodb target-jsonl --state-id test --selec  4.19s user 0.51s system 19% cpu 24.278 total
hmm yea it looks like they’re both hitting all of the collections
if i take out one of the metadata collections and run this it still hits the collection that I removed
Copy code
time TAP_MONGODB_FILTER_COLLECTIONS='["casinodata"]' meltano --environment=dev el tap-mongodb target-jsonl --state-id test
without the
--select
it doesn’t hit the collection, but then it streams everything
if i remove all metadata completely except for
casinodata
it still tries to hit everything
e
fwiw you probably wanna run with
--refresh-catalog
each time in case the catalog cache is not being invalidated for whatever reason
m
where do you put that command?
@Edgar Ramírez (Arch.dev) do you think the select streams being accessible from the meltano tap is a feature that would be added in the near future?
e
meltano el --refresh-catalog ...
m
Copy code
time TAP_MONGODB_FILTER_COLLECTIONS='["casinodata"]' meltano --environment=dev el --refresh-catalog tap-mongodb target-jsonl --state-id test --full-refresh
Try 'meltano el --help' for help.

Error: No such option: --refresh-catalog Did you mean --catalog?
TAP_MONGODB_FILTER_COLLECTIONS='["casinodata"]' meltano --environment=dev el   0.72s user 0.17s system 56% cpu 1.586 total
e
what version of Meltano are you on?
m
3.4.2
i can upgrade
ok on 3.5.4
Copy code
meltano --version
meltano, version 3.5.4
melgazar9@MacBook-Pro tap-mongodb % time TAP_MONGODB_FILTER_COLLECTIONS='["casinodata"]' meltano --environment=dev el --refresh-catalog tap-mongodb target-jsonl --state-id test --full-refreshme
Usage: meltano el [OPTIONS] EXTRACTOR LOADER
Try 'meltano el --help' for help.

Error: No such option: --full-refreshme Did you mean --full-refresh?
TAP_MONGODB_FILTER_COLLECTIONS='["casinodata"]' meltano --environment=dev el   0.71s user 0.13s system 72% cpu 1.161 total
e
I think you got a typo in there
👍 2
m
oops LOL
Not sure if there’s anything obvious I’m doing wrong here, but with every combination of CLI commands I run, it discovers all streams
r
I'll try with some
metadata
config when I get back home later, maybe that will clear some things up.
m
ok thanks!
r
OK, here's how my project plays with
meltano select
(no
metadata
config):
meltano.yml
Copy code
version: 1
default_environment: dev
project_id: 26ced3f3-b65d-421e-bace-bc4daaf99f7d
environments:
- name: dev
  config:
    plugins:
      extractors:
      - name: tap-mongodb
        config:
          database: sample_mflix
- name: staging
- name: prod
plugins:
  extractors:
  - name: tap-mongodb
    variant: meltanolabs
    pip_url: git+https://github.com/ReubenFrankel/tap-mongodb.git@filter-collections
    settings:
    - name: filter_collections
      kind: array
  loaders:
  - name: target-jsonl
    variant: andyh1203
    pip_url: target-jsonl
(
TAP_MONGODB_MONGODB_CONNECTION_STRING
set in
.env
)
1. All streams selected (no selection criteria)
Copy code
reuben@reuben-Inspiron-14-5425:/tmp/mongodb-test$ meltano --environment dev select tap-mongodb --list --all
2024-12-06T22:24:32.785127Z [info     ] Environment 'dev' is active   
Legend:
	selected
	excluded
	automatic

Enabled patterns:
	*.*

Selected attributes:
	[selected ] comments._sdc_batched_at
	[selected ] comments._sdc_extracted_at
	[selected ] comments.cluster_time
	[selected ] comments.document
	[selected ] comments.namespace
	[selected ] comments.namespace.collection
	[selected ] comments.namespace.database
	[selected ] comments.object_id
	[selected ] comments.operation_type
	[automatic] comments.replication_key
	[selected ] embedded_movies._sdc_batched_at
	[selected ] embedded_movies._sdc_extracted_at
	[selected ] embedded_movies.cluster_time
	[selected ] embedded_movies.document
	[selected ] embedded_movies.namespace
	[selected ] embedded_movies.namespace.collection
	[selected ] embedded_movies.namespace.database
	[selected ] embedded_movies.object_id
	[selected ] embedded_movies.operation_type
	[automatic] embedded_movies.replication_key
	[selected ] movies._sdc_batched_at
	[selected ] movies._sdc_extracted_at
	[selected ] movies.cluster_time
	[selected ] movies.document
	[selected ] movies.namespace
	[selected ] movies.namespace.collection
	[selected ] movies.namespace.database
	[selected ] movies.object_id
	[selected ] movies.operation_type
	[automatic] movies.replication_key
	[selected ] sessions._sdc_batched_at
	[selected ] sessions._sdc_extracted_at
	[selected ] sessions.cluster_time
	[selected ] sessions.document
	[selected ] sessions.namespace
	[selected ] sessions.namespace.collection
	[selected ] sessions.namespace.database
	[selected ] sessions.object_id
	[selected ] sessions.operation_type
	[automatic] sessions.replication_key
	[selected ] theaters._sdc_batched_at
	[selected ] theaters._sdc_extracted_at
	[selected ] theaters.cluster_time
	[selected ] theaters.document
	[selected ] theaters.namespace
	[selected ] theaters.namespace.collection
	[selected ] theaters.namespace.database
	[selected ] theaters.object_id
	[selected ] theaters.operation_type
	[automatic] theaters.replication_key
	[selected ] users._sdc_batched_at
	[selected ] users._sdc_extracted_at
	[selected ] users.cluster_time
	[selected ] users.document
	[selected ] users.namespace
	[selected ] users.namespace.collection
	[selected ] users.namespace.database
	[selected ] users.object_id
	[selected ] users.operation_type
	[automatic] users.replication_key
2. users stream selected
tap-mongodb
config in
meltano.yml
Copy code
  - name: tap-mongodb
    variant: meltanolabs
    pip_url: git+https://github.com/ReubenFrankel/tap-mongodb.git@filter-collections
    settings:
    - name: filter_collections
      kind: array
    select:
    - users.*
Copy code
reuben@reuben-Inspiron-14-5425:/tmp/mongodb-test$ meltano --environment dev select tap-mongodb --list --all
2024-12-06T22:25:41.931677Z [info     ] Environment 'dev' is active   
Legend:
	selected
	excluded
	automatic

Enabled patterns:
	users.*

Selected attributes:
	[excluded ] comments._sdc_batched_at
	[excluded ] comments._sdc_extracted_at
	[excluded ] comments.cluster_time
	[excluded ] comments.document
	[excluded ] comments.namespace
	[excluded ] comments.namespace.collection
	[excluded ] comments.namespace.database
	[excluded ] comments.object_id
	[excluded ] comments.operation_type
	[excluded ] comments.replication_key
	[excluded ] embedded_movies._sdc_batched_at
	[excluded ] embedded_movies._sdc_extracted_at
	[excluded ] embedded_movies.cluster_time
	[excluded ] embedded_movies.document
	[excluded ] embedded_movies.namespace
	[excluded ] embedded_movies.namespace.collection
	[excluded ] embedded_movies.namespace.database
	[excluded ] embedded_movies.object_id
	[excluded ] embedded_movies.operation_type
	[excluded ] embedded_movies.replication_key
	[excluded ] movies._sdc_batched_at
	[excluded ] movies._sdc_extracted_at
	[excluded ] movies.cluster_time
	[excluded ] movies.document
	[excluded ] movies.namespace
	[excluded ] movies.namespace.collection
	[excluded ] movies.namespace.database
	[excluded ] movies.object_id
	[excluded ] movies.operation_type
	[excluded ] movies.replication_key
	[excluded ] sessions._sdc_batched_at
	[excluded ] sessions._sdc_extracted_at
	[excluded ] sessions.cluster_time
	[excluded ] sessions.document
	[excluded ] sessions.namespace
	[excluded ] sessions.namespace.collection
	[excluded ] sessions.namespace.database
	[excluded ] sessions.object_id
	[excluded ] sessions.operation_type
	[excluded ] sessions.replication_key
	[excluded ] theaters._sdc_batched_at
	[excluded ] theaters._sdc_extracted_at
	[excluded ] theaters.cluster_time
	[excluded ] theaters.document
	[excluded ] theaters.namespace
	[excluded ] theaters.namespace.collection
	[excluded ] theaters.namespace.database
	[excluded ] theaters.object_id
	[excluded ] theaters.operation_type
	[excluded ] theaters.replication_key
	[selected ] users._sdc_batched_at
	[selected ] users._sdc_extracted_at
	[selected ] users.cluster_time
	[selected ] users.document
	[selected ] users.namespace
	[selected ] users.namespace.collection
	[selected ] users.namespace.database
	[selected ] users.object_id
	[selected ] users.operation_type
	[automatic] users.replication_key
Notice how collections are excluded, rather than outright not present
3. All streams selected (no selection criteria), filtering single collection
tap-mongodb
config in
meltano.yml
Copy code
  - name: tap-mongodb
    variant: meltanolabs
    pip_url: git+https://github.com/ReubenFrankel/tap-mongodb.git@filter-collections
    settings:
    - name: filter_collections
      kind: array
Copy code
reuben@reuben-Inspiron-14-5425:/tmp/mongodb-test$ TAP_MONGODB_FILTER_COLLECTIONS='["users"]' meltano --environment dev select tap-mongodb --list --all
2024-12-06T22:33:12.223370Z [info     ] Environment 'dev' is active   
Legend:
	selected
	excluded
	automatic

Enabled patterns:
	*.*

Selected attributes:
	[selected ] users._sdc_batched_at
	[selected ] users._sdc_extracted_at
	[selected ] users.cluster_time
	[selected ] users.document
	[selected ] users.namespace
	[selected ] users.namespace.collection
	[selected ] users.namespace.database
	[selected ] users.object_id
	[selected ] users.operation_type
	[automatic] users.replication_key
Notice how collections that were previously excluded are now not discovered at all
4. users stream selected, filtering same collection (functionally the same as example 3)
tap-mongodb
config in
meltano.yml
Copy code
  - name: tap-mongodb
    variant: meltanolabs
    pip_url: git+https://github.com/ReubenFrankel/tap-mongodb.git@filter-collections
    settings:
    - name: filter_collections
      kind: array
    select:
    - users.*
Copy code
reuben@reuben-Inspiron-14-5425:/tmp/mongodb-test$ TAP_MONGODB_FILTER_COLLECTIONS='["users"]' meltano --environment dev select tap-mongodb --list --all
2024-12-06T22:37:17.179517Z [info     ] Environment 'dev' is active   
Legend:
	selected
	excluded
	automatic

Enabled patterns:
	users.*

Selected attributes:
	[selected ] users._sdc_batched_at
	[selected ] users._sdc_extracted_at
	[selected ] users.cluster_time
	[selected ] users.document
	[selected ] users.namespace
	[selected ] users.namespace.collection
	[selected ] users.namespace.database
	[selected ] users.object_id
	[selected ] users.operation_type
	[automatic] users.replication_key
This is demonstrating that
select
can be redundant when using
filter_collections
5. users stream selected, filtering movies collection
tap-mongodb
config in
meltano.yml
Copy code
  - name: tap-mongodb
    variant: meltanolabs
    pip_url: git+https://github.com/ReubenFrankel/tap-mongodb.git@filter-collections
    settings:
    - name: filter_collections
      kind: array
    select:
    - users.*
Copy code
reuben@reuben-Inspiron-14-5425:/tmp/mongodb-test$ TAP_MONGODB_FILTER_COLLECTIONS='["movies"]' meltano --environment dev select tap-mongodb --list --all
2024-12-06T22:40:50.958754Z [info     ] Environment 'dev' is active   
2024-12-06T22:40:51.276788Z [warning  ] Stream `users` was not found in the catalog
Legend:
	selected
	excluded
	automatic

Enabled patterns:
	users.*

Selected attributes:
	[excluded ] movies._sdc_batched_at
	[excluded ] movies._sdc_extracted_at
	[excluded ] movies.cluster_time
	[excluded ] movies.document
	[excluded ] movies.namespace
	[excluded ] movies.namespace.collection
	[excluded ] movies.namespace.database
	[excluded ] movies.object_id
	[excluded ] movies.operation_type
	[excluded ] movies.replication_key
Notice the stream not found in catalog warning - this is because only the
movies
collection was discovered, but we are trying to select
users
which was not discovered
6. All streams selected (no selection criteria), filtering collection that does not exist
tap-mongodb
config in
meltano.yml
Copy code
  - name: tap-mongodb
    variant: meltanolabs
    pip_url: git+https://github.com/ReubenFrankel/tap-mongodb.git@filter-collections
    settings:
    - name: filter_collections
      kind: array
Copy code
reuben@reuben-Inspiron-14-5425:/tmp/mongodb-test$ TAP_MONGODB_FILTER_COLLECTIONS='["doesnotexist"]' meltano --environment dev select tap-mongodb --list --all
2024-12-06T22:48:39.749041Z [info     ] Environment 'dev' is active   
Legend:
	selected
	excluded
	automatic

Enabled patterns:
	*.*

Selected attributes:
Nothing was discovered since the collection
doesnotexist
does - in fact - not exist
Repeating all cases with
metadata
config for
users
and `movies`:
Copy code
    metadata:
      users:
        replication-key: replication_key
        replication-method: INCREMENTAL
      movies:
        replication-key: replication_key
        replication-method: INCREMENTAL
1. All streams selected (no selection criteria)
Copy code
reuben@reuben-Inspiron-14-5425:/tmp/mongodb-test$ meltano --environment dev select tap-mongodb --list --all
2024-12-06T23:05:36.773532Z [info     ] Environment 'dev' is active   
Legend:
	selected
	excluded
	automatic

Enabled patterns:
	*.*

Selected attributes:
	[selected ] comments._sdc_batched_at
	[selected ] comments._sdc_extracted_at
	[selected ] comments.cluster_time
	[selected ] comments.document
	[selected ] comments.namespace
	[selected ] comments.namespace.collection
	[selected ] comments.namespace.database
	[selected ] comments.object_id
	[selected ] comments.operation_type
	[automatic] comments.replication_key
	[selected ] embedded_movies._sdc_batched_at
	[selected ] embedded_movies._sdc_extracted_at
	[selected ] embedded_movies.cluster_time
	[selected ] embedded_movies.document
	[selected ] embedded_movies.namespace
	[selected ] embedded_movies.namespace.collection
	[selected ] embedded_movies.namespace.database
	[selected ] embedded_movies.object_id
	[selected ] embedded_movies.operation_type
	[automatic] embedded_movies.replication_key
	[selected ] movies._sdc_batched_at
	[selected ] movies._sdc_extracted_at
	[selected ] movies.cluster_time
	[selected ] movies.document
	[selected ] movies.namespace
	[selected ] movies.namespace.collection
	[selected ] movies.namespace.database
	[selected ] movies.object_id
	[selected ] movies.operation_type
	[automatic] movies.replication_key
	[selected ] sessions._sdc_batched_at
	[selected ] sessions._sdc_extracted_at
	[selected ] sessions.cluster_time
	[selected ] sessions.document
	[selected ] sessions.namespace
	[selected ] sessions.namespace.collection
	[selected ] sessions.namespace.database
	[selected ] sessions.object_id
	[selected ] sessions.operation_type
	[automatic] sessions.replication_key
	[selected ] theaters._sdc_batched_at
	[selected ] theaters._sdc_extracted_at
	[selected ] theaters.cluster_time
	[selected ] theaters.document
	[selected ] theaters.namespace
	[selected ] theaters.namespace.collection
	[selected ] theaters.namespace.database
	[selected ] theaters.object_id
	[selected ] theaters.operation_type
	[automatic] theaters.replication_key
	[selected ] users._sdc_batched_at
	[selected ] users._sdc_extracted_at
	[selected ] users.cluster_time
	[selected ] users.document
	[selected ] users.namespace
	[selected ] users.namespace.collection
	[selected ] users.namespace.database
	[selected ] users.object_id
	[selected ] users.operation_type
	[automatic] users.replication_key
2. users stream selected
Copy code
reuben@reuben-Inspiron-14-5425:/tmp/mongodb-test$ meltano --environment dev select tap-mongodb --list --all
2024-12-06T23:10:18.459047Z [info     ] Environment 'dev' is active   
Legend:
	selected
	excluded
	automatic

Enabled patterns:
	users.*

Selected attributes:
	[excluded ] comments._sdc_batched_at
	[excluded ] comments._sdc_extracted_at
	[excluded ] comments.cluster_time
	[excluded ] comments.document
	[excluded ] comments.namespace
	[excluded ] comments.namespace.collection
	[excluded ] comments.namespace.database
	[excluded ] comments.object_id
	[excluded ] comments.operation_type
	[excluded ] comments.replication_key
	[excluded ] embedded_movies._sdc_batched_at
	[excluded ] embedded_movies._sdc_extracted_at
	[excluded ] embedded_movies.cluster_time
	[excluded ] embedded_movies.document
	[excluded ] embedded_movies.namespace
	[excluded ] embedded_movies.namespace.collection
	[excluded ] embedded_movies.namespace.database
	[excluded ] embedded_movies.object_id
	[excluded ] embedded_movies.operation_type
	[excluded ] embedded_movies.replication_key
	[excluded ] movies._sdc_batched_at
	[excluded ] movies._sdc_extracted_at
	[excluded ] movies.cluster_time
	[excluded ] movies.document
	[excluded ] movies.namespace
	[excluded ] movies.namespace.collection
	[excluded ] movies.namespace.database
	[excluded ] movies.object_id
	[excluded ] movies.operation_type
	[excluded ] movies.replication_key
	[excluded ] sessions._sdc_batched_at
	[excluded ] sessions._sdc_extracted_at
	[excluded ] sessions.cluster_time
	[excluded ] sessions.document
	[excluded ] sessions.namespace
	[excluded ] sessions.namespace.collection
	[excluded ] sessions.namespace.database
	[excluded ] sessions.object_id
	[excluded ] sessions.operation_type
	[excluded ] sessions.replication_key
	[excluded ] theaters._sdc_batched_at
	[excluded ] theaters._sdc_extracted_at
	[excluded ] theaters.cluster_time
	[excluded ] theaters.document
	[excluded ] theaters.namespace
	[excluded ] theaters.namespace.collection
	[excluded ] theaters.namespace.database
	[excluded ] theaters.object_id
	[excluded ] theaters.operation_type
	[excluded ] theaters.replication_key
	[selected ] users._sdc_batched_at
	[selected ] users._sdc_extracted_at
	[selected ] users.cluster_time
	[selected ] users.document
	[selected ] users.namespace
	[selected ] users.namespace.collection
	[selected ] users.namespace.database
	[selected ] users.object_id
	[selected ] users.operation_type
	[automatic] users.replication_key
3. All streams selected (no selection criteria), filtering single collection
Copy code
TAP_MONGODB_FILTER_COLLECTIONS='["users"]' meltano --environment dev select tap-mongodb --list --all
2024-12-06T23:06:16.279460Z [info     ] Environment 'dev' is active   
2024-12-06T23:06:23.062268Z [warning  ] Stream `movies` was not found in the catalog
Legend:
	selected
	excluded
	automatic

Enabled patterns:
	*.*

Selected attributes:
	[selected ] users._sdc_batched_at
	[selected ] users._sdc_extracted_at
	[selected ] users.cluster_time
	[selected ] users.document
	[selected ] users.namespace
	[selected ] users.namespace.collection
	[selected ] users.namespace.database
	[selected ] users.object_id
	[selected ] users.operation_type
	[automatic] users.replication_key
Notice the new stream not found in catalog warning here, presumably coming from the
metadata
config
4. users stream selected, filtering same collection (functionally the same as example 3)
Copy code
reuben@reuben-Inspiron-14-5425:/tmp/mongodb-test$ TAP_MONGODB_FILTER_COLLECTIONS='["users"]' meltano --environment dev select tap-mongodb --list --all
2024-12-06T23:11:05.135721Z [info     ] Environment 'dev' is active   
2024-12-06T23:11:11.975323Z [warning  ] Stream `movies` was not found in the catalog
Legend:
	selected
	excluded
	automatic

Enabled patterns:
	users.*

Selected attributes:
	[selected ] users._sdc_batched_at
	[selected ] users._sdc_extracted_at
	[selected ] users.cluster_time
	[selected ] users.document
	[selected ] users.namespace
	[selected ] users.namespace.collection
	[selected ] users.namespace.database
	[selected ] users.object_id
	[selected ] users.operation_type
	[automatic] users.replication_key
Same stream not found in catalog warning here
5. users stream selected, filtering movies collection
Copy code
reuben@reuben-Inspiron-14-5425:/tmp/mongodb-test$ TAP_MONGODB_FILTER_COLLECTIONS='["movies"]' meltano --environment dev select tap-mongodb --list --all
2024-12-06T23:13:43.564784Z [info     ] Environment 'dev' is active   
2024-12-06T23:13:45.461848Z [warning  ] Stream `users` was not found in the catalog
Legend:
	selected
	excluded
	automatic

Enabled patterns:
	users.*

Selected attributes:
	[excluded ] movies._sdc_batched_at
	[excluded ] movies._sdc_extracted_at
	[excluded ] movies.cluster_time
	[excluded ] movies.document
	[excluded ] movies.namespace
	[excluded ] movies.namespace.collection
	[excluded ] movies.namespace.database
	[excluded ] movies.object_id
	[excluded ] movies.operation_type
	[excluded ] movies.replication_key
Same kind of warning, but this time for
users
6. All streams selected (no selection criteria), filtering collection that does not exist
Copy code
reuben@reuben-Inspiron-14-5425:/tmp/mongodb-test$ TAP_MONGODB_FILTER_COLLECTIONS='["doesnotexist"]' meltano --environment dev select tap-mongodb --list --all
2024-12-06T23:14:30.426717Z [info     ] Environment 'dev' is active   
2024-12-06T23:14:32.204437Z [warning  ] Stream `users` was not found in the catalog
2024-12-06T23:14:32.204743Z [warning  ] Stream `movies` was not found in the catalog
Legend:
	selected
	excluded
	automatic

Enabled patterns:
	*.*

Selected attributes:
Same warnings for both
users
and
movies
Overall, no functional difference when specifying
metadata
in this case it seems...
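The warnings in the runs above are consistent with a simple rule: any stream named explicitly in `select` or `metadata` that discovery did not return produces a "not found in the catalog" warning. The sketch below models only that rule. `missing_stream_warnings` is a hypothetical helper written for illustration, not Meltano's implementation; the message text is copied from the logs above.

```python
def missing_stream_warnings(discovered, select_patterns=(), metadata_streams=()):
    """Hypothetical model of the 'not found in the catalog' warnings."""
    referenced = []
    for pattern in select_patterns:
        stream = pattern.split(".", 1)[0]
        # Wildcard stream patterns such as '*.*' do not name a specific stream.
        if "*" not in stream and stream not in referenced:
            referenced.append(stream)
    for stream in metadata_streams:
        if stream not in referenced:
            referenced.append(stream)
    return [
        f"Stream `{stream}` was not found in the catalog"
        for stream in referenced
        if stream not in discovered
    ]


# Case 5 with metadata: only 'movies' is discovered, 'users' is selected.
print(missing_stream_warnings(["movies"], ["users.*"], ["users", "movies"]))
# Case 6 with metadata: nothing is discovered, no select rules.
print(missing_stream_warnings([], [], ["users", "movies"]))
```

Under this model the warnings are informational: they flag rules that reference an undiscovered stream and do not change which discovered streams are selected.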
e
do you think the select streams being accessible from the meltano tap is a feature that would be added in the near future?
I plan to talk about this in office hours next week, cause I would love to implement it but I just have no idea how to tackle it 🙂
👍 2
m
Ahhhhhh nice! @Reuben (Matatika) - the issue I was having was case sensitivity! The collection is actually called
casinoData
, so when I wrote
TAP_MONGODB_FILTER_COLLECTIONS='["casinodata"]'
it did not detect it but
TAP_MONGODB_FILTER_COLLECTIONS='["casinoData"]'
did!
Copy code
TAP_MONGODB_FILTER_COLLECTIONS='["casinoData"]' meltano --environment dev select tap-mongodb --list --all
2024-12-07T00:30:13.729047Z [info     ] Environment 'dev' is active   
Legend:
	selected
	excluded
	automatic

Enabled patterns:
	*.*

Selected attributes:
	[selected ] casinodata._sdc_batched_at
	[selected ] casinodata._sdc_extracted_at
	[selected ] casinodata.cluster_time
	[selected ] casinodata.document
	[selected ] casinodata.namespace
	[selected ] casinodata.namespace.collection
	[selected ] casinodata.namespace.database
	[selected ] casinodata.object_id
	[selected ] casinodata.operation_type
	[automatic] casinodata.replication_key
🙌 1
r
I can't believe it 😭 😂
m
hilarious… hmm not too opinionated about this, but maybe we can add a
.lower()
line when searching the collections in discovery mode
or match how the tap is moving
casinoData
->
casinodata
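A minimal sketch of the `.lower()` idea, assuming the filter is applied to the list of collection names returned during discovery: compare lowercased names, but return the original names so the database is still queried with the real collection name. `filter_collection_names` is a hypothetical helper, not code from tap-mongodb.

```python
def filter_collection_names(available, requested):
    """Hypothetical helper: case-insensitive match of requested collection names."""
    wanted = {name.lower() for name in requested}
    # Preserve the original casing of the names reported by the database.
    return [name for name in available if name.lower() in wanted]


available = ["casinoData", "bannedUsers", "wearableRewards"]
print(filter_collection_names(available, ["casinodata"]))
print(filter_collection_names(available, ["casinoData"]))
```

One trade-off: MongoDB collection names are case-sensitive, so if two collections differ only by case, this approach would match both of them.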
r
m
will take a look in a bit!
r
👍 btw, there is a PR for this: https://github.com/MeltanoLabs/tap-mongodb/pull/36 I would switch the
pip_url
back over to the default if/once it is merged, or if you want as much stability as possible (given that this will be running in production), fork my fork (uncheck "Copy the
main
branch only") and update to your username in the
pip_url
.
m
hey @Reuben (Matatika) any idea why these 2 pre-commits keep failing? https://github.com/melgazar9/tap-mongodb/actions/runs/12261471675/job/34208550142
they are working locally but not on github.
Copy code
pre-commit clean
pre-commit run --all-files

Cleaned /Users/melgazar9/.cache/pre-commit.
[INFO] Initializing environment for https://github.com/pre-commit/pre-commit-hooks.
[INFO] Initializing environment for https://github.com/lyz-code/yamlfix.git.
[INFO] Initializing environment for https://github.com/adrienverge/yamllint.git.
[INFO] Initializing environment for https://github.com/charliermarsh/ruff-pre-commit.
[INFO] Initializing environment for https://github.com/psf/black.
[INFO] Installing environment for https://github.com/pre-commit/pre-commit-hooks.
[INFO] Once installed this environment will be reused.
[INFO] This may take a few minutes...
[INFO] Installing environment for https://github.com/lyz-code/yamlfix.git.
[INFO] Once installed this environment will be reused.
[INFO] This may take a few minutes...
[INFO] Installing environment for https://github.com/adrienverge/yamllint.git.
[INFO] Once installed this environment will be reused.
[INFO] This may take a few minutes...
[INFO] Installing environment for https://github.com/charliermarsh/ruff-pre-commit.
[INFO] Once installed this environment will be reused.
[INFO] This may take a few minutes...
[INFO] Installing environment for https://github.com/psf/black.
[INFO] Once installed this environment will be reused.
[INFO] This may take a few minutes...
check for added large files..............................................Passed
check toml...............................................................Passed
check vcs permalinks.....................................................Passed
detect private key.......................................................Passed
fix end of files.........................................................Passed
python tests naming......................................................Passed
pretty format json...................................(no files to check)Skipped
trim trailing whitespace.................................................Passed
yamlfix..................................................................Passed
yamllint.................................................................Passed
ruff.....................................................................Passed
black....................................................................Passed
pylint...................................................................Passed
r
I think the 3.7 failure is because you updated some pre-commit hooks to versions that may no longer support that Python version, and the 3.8 one because `maison>=2.0.0` only supports Python 3.9 and above? None of this should stop you from using your fork though.
m
ah makes sense, yea I just don’t like to see failures 😂 It all works great! Testing in a dev environment for the last few days
r
Nice! I noticed there are still some pre-commit failures in the PR that you addressed in your fork, so I'll get on those. 🙏
m
ok cool, also is it possible to use the branch I pushed? I updated pre-commits and added a different check in there a while back for `None` and `inf` datatypes
r
I would open separate PRs for those to keep the scope of changes narrow. I'm not a maintainer of the tap, but I would imagine that's how they would prefer contributions.
m
yea it’s been open for months 😅 https://github.com/MeltanoLabs/tap-mongodb/pull/32
r
Oh right 😂 I've seen some activity on my PR so maybe it will kick-start some other PR reviews again.
but also that is now doing much more than just fixing an edge case, so you might get asked to split out the changes into separate PRs or something 🤷