matt_elgazar
12/05/2024, 6:58 PM
settings and select streams from meltano.yml in the tap itself? In the tap-mongodb codebase there is a part that hits all collections in the database, but this is unnecessary if I’m only running a select on one collection:
for collection in self.database.list_collection_names(authorizedCollections=True, nameOnly=True):
    ...
I was thinking I can add a configuration for the behavior:
if self.discovery_mode == 'select':
    collections = <get current selected streams>
else:
    collections = self.database.list_collection_names(authorizedCollections=True, nameOnly=True)
I can force it in a way that’s probably super bad practice and wouldn’t generalize across different env configurations:
selected_collections = yaml.safe_load(open('meltano.yml')).get('plugins').get('extractors')[0].get('select')
Reuben (Matatika)
12/05/2024, 8:39 PM
tap-mongodb?
matt_elgazar
12/05/2024, 8:39 PM
matt_elgazar
12/05/2024, 8:41 PM
select values from meltano.yml
Reuben (Matatika)
12/05/2024, 8:51 PM
self.config to access setting values and SQLStream.get_selected_schema for the selected schema.
matt_elgazar
12/05/2024, 8:55 PM
get_selected_schema start being available? I’m not sure where to call it? In discover_catalog_entries there is no value. When I look at self.config I only see a mapping proxy:
self.config
mappingproxy({'database': <DB>, 'mongodb_connection_string': <db_connection_string>, 'discovery_mode': 'select', 'datetime_conversion': 'datetime', 'prefix': '', 'add_record_metadata': False, 'allow_modify_change_streams': False, 'operation_types': ['create', 'delete', 'insert', 'replace', 'update']})
matt_elgazar
12/05/2024, 8:57 PM
for i in dir(self):
    print(i)
_PluginBase__initialized_at
__abstractmethods__
__annotations__
__class__
__delattr__
__dict__
__dir__
__doc__
__eq__
__format__
__ge__
__getattribute__
__gt__
__hash__
__init__
__init_subclass__
__le__
__lt__
__module__
__ne__
__new__
__reduce__
__reduce_ex__
__repr__
__setattr__
__sizeof__
__str__
__subclasshook__
__weakref__
_abc_impl
_catalog
_catalog_dict
_config
_env_var_config
_get_about_info
_get_mongo_connection_string
_get_mongo_options
_get_package_version
_get_supported_python_versions
_input_catalog
_is_secret_config
_mapper
_reset_state_progress_markers
_set_compatible_replication_methods
_singer_catalog
_state
_streams
_validate_config
append_builtin_config
capabilities
catalog
catalog_dict
catalog_json_text
cb_discover
cb_test
cb_version
cli
config
config_from_cli_args
config_jsonschema
connector
discover_streams
get_plugin_version
get_sdk_version
get_singer_command
get_supported_python_versions
initialized_at
input_catalog
invoke
load_state
load_streams
logger
mapper
metrics_logger
name
package_name
plugin_version
print_about
print_version
run_connection_test
run_discovery
run_sync_dry_run
sdk_version
setup_mapper
state
streams
sync_all
write_schemas
self.streams
{}
Reuben (Matatika)
12/05/2024, 8:59 PM
SQLStream classes only - i.e. MongoDBCollectionStream which are instantiated using the connector: https://github.com/MeltanoLabs/tap-mongodb/blob/b84a1e04052d83bade0f49fad79a1c9311b7a109/tap_mongodb/tap.py#L240-L243
matt_elgazar
12/05/2024, 9:01 PM
if self.discovery_mode == 'select':
    collections = <pull selected streams>
else:
    collections = self.database.list_collection_names(authorizedCollections=True, nameOnly=True)
Reuben (Matatika)
12/05/2024, 9:18 PM
filter_collections) during catalog discovery. select is used to filter streams/properties from the discovered catalog. You would be able to omit your select rules if you specified filter_collections, as you would have already "selected" collections during discovery. A couple of other taps follow a similar design pattern:
• tap-mssql and `filter_dbs`: https://hub.meltano.com/extractors/tap-mssql/#filter_dbs-setting
• tap-mysql and `filter_dbs`: https://hub.meltano.com/extractors/tap-mysql/#filter_dbs-setting
• tap-postgres and `filter_schemas`: https://hub.meltano.com/extractors/tap-postgres#filter_schemas-setting
• tap-snowflake and `tables`: https://hub.meltano.com/extractors/tap-snowflake#tables-setting
Reuben (Matatika)
12/05/2024, 9:20 PM
MongoDBConnector constructor to accept filter_collections from config: https://github.com/MeltanoLabs/tap-mongodb/blob/b84a1e04052d83bade0f49fad79a1c9311b7a109/tap_mongodb/tap.py#L207-L213
matt_elgazar
12/05/2024, 9:20 PM
meltano el tap-mongodb target-snowflake --select collection1 --state-id tap_mongodb_collection1
^^ then I’d have to change the config param for every time I run a new tap with --select
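The difference between filtering at discovery time and filtering with select can be sketched in a few lines. This is a simplified model, not Meltano's or the tap's actual code: `discover`, `apply_select`, `ALL_COLLECTIONS` and the `inspected` list are hypothetical names, and matching the stream part of a pattern with `fnmatch` is an assumption about how select patterns behave.

```python
from fnmatch import fnmatch
from typing import Iterable, List, Optional

ALL_COLLECTIONS = ["col1", "col2", "col3"]  # stand-in for the database contents


def discover(filter_collections: Optional[List[str]], inspected: List[str]) -> List[str]:
    """Return discovered stream names, recording each collection that was inspected."""
    names = filter_collections or ALL_COLLECTIONS
    for name in names:
        inspected.append(name)  # the expensive per-collection work would happen here
    return list(names)


def apply_select(catalog: Iterable[str], patterns: List[str]) -> List[str]:
    """Keep streams whose name matches the stream part of any select pattern."""
    stream_patterns = [p.split(".", 1)[0] for p in patterns]
    return [s for s in catalog if any(fnmatch(s, p) for p in stream_patterns)]


# select alone: every collection is inspected, then one stream is kept
inspected_a: List[str] = []
selected_a = apply_select(discover(None, inspected_a), ["col1.*"])

# discovery-time filter: only the requested collection is inspected
inspected_b: List[str] = []
selected_b = apply_select(discover(["col1"], inspected_b), ["*.*"])
```

Both runs end up syncing the same stream; the only difference in this model is how many collections discovery had to touch.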
matt_elgazar
12/05/2024, 9:20 PMmatt_elgazar
12/05/2024, 9:21 PM
Reuben (Matatika)
12/05/2024, 9:37 PM
meltano.yml
select:
- collection1.*
# same as above during sync, but with improved discovery performance
config:
  filter_collections:
  - collection1
> then I’d have to change the config param for every time I run a new tap with --select
Not sure what you mean here by "new tap" - do you have multiple tap-mongodb instances defined? If you wanted the config directly in the command rather than meltano.yml, you could do
TAP_MONGODB_FILTER_COLLECTIONS='["collection1"]' meltano el tap-mongodb target-snowflake --state-id tap_mongodb_collection1
matt_elgazar
12/05/2024, 9:38 PM
matt_elgazar
12/05/2024, 9:41 PM
filter_collections to the config would do if I run this tap 3 times with different streams selected. For example:
for collection in ['col1', 'col2', 'col3']:
    subprocess.run(['meltano', 'el', 'tap-mongodb', 'target-snowflake', '--select', collection, '--state-id', f'state_{collection}'])
^ it would still run the discovery mode for all collections under filter_collections
Reuben (Matatika)
12/05/2024, 9:47 PM
env of `subprocess.run`:
import json
import os
import subprocess
for collection in ['col1', 'col2', 'col3']:
    subprocess.run(
        ['meltano', 'el', 'tap-mongodb', 'target-snowflake', '--state-id', f'state_{collection}'],
        # merge into the current environment so PATH etc. are preserved
        env={
            **os.environ,
            "TAP_MONGODB_FILTER_COLLECTIONS": json.dumps([collection]),
        },
    )
Reuben (Matatika)
12/05/2024, 9:48 PM
for collection in ['col1', 'col2', 'col3']:
    subprocess.run(
        ['meltano', 'el', 'tap-mongodb', 'target-snowflake', '--state-id', f'state_{collection}'],
        env={
            **os.environ,
            "TAP_MONGODB_FILTER_COLLECTIONS": collection,
        },
    )
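A variation on the loop above, split into small helpers so the pieces can be checked without running Meltano. It is only a sketch: `filter_env` and `el_command` are hypothetical helper names, and `TAP_MONGODB_FILTER_COLLECTIONS` assumes the proposed `filter_collections` setting is available in the installed tap. The parent environment is copied so that `PATH` and credentials survive, and the value is encoded as a JSON array.

```python
import json
import os
from typing import Dict, List, Mapping


def filter_env(collections: List[str], base_env: Mapping[str, str]) -> Dict[str, str]:
    """Copy base_env and add the collection filter as a JSON array string."""
    env = dict(base_env)
    env["TAP_MONGODB_FILTER_COLLECTIONS"] = json.dumps(collections)
    return env


def el_command(collection: str) -> List[str]:
    """Build the meltano invocation for one collection, with its own state ID."""
    return ["meltano", "el", "tap-mongodb", "target-snowflake", "--state-id", f"state_{collection}"]


for collection in ["col1", "col2", "col3"]:
    env = filter_env([collection], os.environ)
    cmd = el_command(collection)
    # subprocess.run(cmd, env=env, check=True)  # uncomment inside a Meltano project
    print(cmd[-1], env["TAP_MONGODB_FILTER_COLLECTIONS"])
```

Passing a dict that contains only the new variable would replace the whole environment of the child process, which is why the copy is made first.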
matt_elgazar
12/05/2024, 9:59 PM
matt_elgazar
12/05/2024, 10:05 PM
select:
- col1
- col2
If I run meltano with --select col1, it should only show deselected stream col2, not deselected stream col3, ... coln
Reuben (Matatika)
12/05/2024, 10:07 PM
filter_collections would come in:
connector.py
class MongoDBConnector:
    """MongoDB/DocumentDB connector class"""

    def __init__(  # pylint: disable=too-many-arguments
        self,
        connection_string: str,
        options: Dict[str, Any],
        db_name: str,
        datetime_conversion: str,
        prefix: Optional[str] = None,
        collections: Optional[List[str]] = None,
    ) -> None:
        self._connection_string = connection_string
        self._options = options
        self._db_name = db_name
        self._datetime_conversion: str = datetime_conversion.upper()
        self._prefix: Optional[str] = prefix
        self._collections = collections
        self._logger: Logger = getLogger(__name__)
        self._version: Optional[MongoVersion] = None

    ...

    def discover_catalog_entries(self) -> List[Dict[str, Any]]:
        """Return a list of catalog entries from discovery.

        Returns:
            The discovered catalog entries as a list.
        """
        result: List[Dict] = []
        # use the configured filter if set, otherwise list every collection
        collections = self._collections or self.database.list_collection_names(authorizedCollections=True, nameOnly=True)
        for collection in collections:
            ...
tap.py
    @cached_property
    def connector(self) -> MongoDBConnector:
        """Get MongoDBConnector instance. Instance is cached and reused."""
        return MongoDBConnector(
            self._get_mongo_connection_string(),
            self._get_mongo_options(),
            self.config.get("database"),
            self.config.get("datetime_conversion"),
            prefix=self.config.get("prefix", None),
            collections=self.config.get("filter_collections"),
        )
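The fallback in `discover_catalog_entries` can be exercised in isolation with a stub in place of the database. This is a minimal sketch rather than the tap's code: `normalize_filter`, `resolve_collections` and `fake_list_all` are hypothetical names, and accepting a bare string as well as a list is an extra assumption. The point it illustrates is that the database-wide listing is only requested when no filter is configured.

```python
from typing import Callable, List, Optional, Union


def normalize_filter(value: Union[None, str, List[str]]) -> Optional[List[str]]:
    """Accept a single name or a list of names; treat empty values as 'no filter'."""
    if not value:
        return None
    return [value] if isinstance(value, str) else list(value)


def resolve_collections(
    configured: Union[None, str, List[str]],
    list_all: Callable[[], List[str]],
) -> List[str]:
    """Use the configured filter if present, otherwise fall back to listing everything."""
    return normalize_filter(configured) or list_all()


calls: List[str] = []


def fake_list_all() -> List[str]:
    calls.append("list")  # records that the database-wide listing was requested
    return ["col1", "col2", "col3"]


filtered = resolve_collections("col1", fake_list_all)  # no listing call is made
unfiltered = resolve_collections(None, fake_list_all)  # falls back to the full listing
```

Because `or` short-circuits, the stub is never called for the filtered case, which mirrors the `self._collections or ...` expression in the connector.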
matt_elgazar
12/05/2024, 10:18 PM
matt_elgazar
12/05/2024, 10:30 PM
Reuben (Matatika)
12/05/2024, 10:35 PM
> I would think if the collection is not listed under the select section then the stream should not run or even appear in "deselected stream"
I think the point of discovery and stream/property selection being separate concepts by default is that the discovered catalog is exposed to a user, who can select what entities they want to sync without any prior knowledge of what data is available. The performance issue you are running into is really only a problem for taps that perform dynamic discovery (rather than others with well-known schemas that are statically defined), e.g. most SQL taps, tap-google-sheets.
> If I set the tap like this it should not look for any streams outside of col1 and col2
This isn't how select works though - it operates on streams/properties that have already been "looked up", i.e. discovered. That configuration would include only col1 and col2 in the sync, but all collections would still be discovered beforehand.
Reuben (Matatika)
12/05/2024, 10:38 PM
filter_collections to a single collection via environment variable in a loop. --select would be redundant as the discovered catalog would only contain information about colN, so it's safe to omit.
matt_elgazar
12/05/2024, 10:41 PM
Reuben (Matatika)
12/05/2024, 10:42 PM
Reuben (Matatika)
12/05/2024, 10:43 PM
matt_elgazar
12/05/2024, 10:46 PM
Reuben (Matatika)
12/05/2024, 10:48 PM
Reuben (Matatika)
12/05/2024, 11:09 PM
tap-mongodb definition with the subprocess.run change:
plugins:
  extractors:
  - name: tap-mongodb
    variant: meltanolabs
    pip_url: git+https://github.com/ReubenFrankel/tap-mongodb.git@filter-collections
    settings:
    - name: filter_collections
https://github.com/ReubenFrankel/tap-mongodb/tree/filter-collections
matt_elgazar
12/05/2024, 11:17 PM
TypeError: __init__() got an unexpected keyword argument 'discover_streams'
Does meltano.yml look like this?
filtered_collections:
- 'col1.*'
Reuben (Matatika)
12/05/2024, 11:19 PM
plugins:
  extractors:
  - name: tap-mongodb
    variant: meltanolabs
    pip_url: git+https://github.com/ReubenFrankel/tap-mongodb.git@filter-collections
    settings:
    - name: filter_collections
    config:
      filter_collections: col1
      # or
      # filter_collections: [col1]
Reuben (Matatika)
12/05/2024, 11:21 PM
filter_collections, not filtered_collections
matt_elgazar
12/05/2024, 11:22 PM
TypeError: __init__() got an unexpected keyword argument 'discover_streams'
matt_elgazar
12/05/2024, 11:24 PM
Reuben (Matatika)
12/05/2024, 11:24 PM
Reuben (Matatika)
12/05/2024, 11:25 PM
Reuben (Matatika)
12/05/2024, 11:25 PM
meltano install --clean extractor tap-mongodb
matt_elgazar
12/05/2024, 11:26 PM
Reuben (Matatika)
12/05/2024, 11:26 PM
matt_elgazar
12/05/2024, 11:27 PM
For more detailed log messages re-run the command using 'meltano --log-level=debug ...' CLI flag. cmd_type=elt name=meltano run_id=3b0ce514-13b6-44eb-a069-523e5a67c1aa state_id=tap_mongodb_testing_casinodata stdio=stderr
2024-12-05T23:26:57.947817Z [info ] Note that you can also check the generated log file at '/Users/melgazar9/scripts/github/DG/data-science/projects/elt/meltano_projects/tap-mongodb/.meltano/logs/elt/tap_mongodb_testing_casinodata/3b0ce514-13b6-44eb-a069-523e5a67c1aa/elt.log'. cmd_type=elt name=meltano run_id=3b0ce514-13b6-44eb-a069-523e5a67c1aa state_id=tap_mongodb_testing_casinodata stdio=stderr
2024-12-05T23:26:57.947875Z [info ] For more information on debugging and logging: <https://docs.meltano.com/reference/command-line-interface#debugging> cmd_type=elt name=meltano run_id=3b0ce514-13b6-44eb-a069-523e5a67c1aa state_id=tap_mongodb_testing_casinodata stdio=stderr
Need help fixing this problem? Visit <http://melta.no/> for troubleshooting steps, or to
join our friendly Slack community.
ELT could not be completed: Cannot start extractor: Catalog discovery failed: command ['/Users/melgazar9/scripts/github/DG/data-science/projects/elt/meltano_projects/tap-mongodb/.meltano/extractors/tap-mongodb/venv/bin/tap-mongodb', '--config', '/Users/melgazar9/scripts/github/DG/data-science/projects/elt/meltano_projects/tap-mongodb/.meltano/run/elt/tap_mongodb_testing_casinodata/3b0ce514-13b6-44eb-a069-523e5a67c1aa/tap.e30f052b-3a01-47b1-aef0-91df12c2d032.config.json', '--discover'] returned 1 with stderr:
Traceback (most recent call last):
File "/Users/melgazar9/scripts/github/DG/data-science/projects/elt/meltano_projects/tap-mongodb/.meltano/extractors/tap-mongodb/venv/bin/tap-mongodb", line 5, in <module>
from tap_mongodb.tap import TapMongoDB
File "/Users/melgazar9/scripts/github/DG/data-science/projects/elt/meltano_projects/tap-mongodb/.meltano/extractors/tap-mongodb/venv/lib/python3.9/site-packages/tap_mongodb/tap.py", line 22, in <module>
class TapMongoDB(Tap):
File "/Users/melgazar9/scripts/github/DG/data-science/projects/elt/meltano_projects/tap-mongodb/.meltano/extractors/tap-mongodb/venv/lib/python3.9/site-packages/tap_mongodb/tap.py", line 27, in TapMongoDB
config_jsonschema = th.PropertiesList(
File "/Users/melgazar9/scripts/github/DG/data-science/projects/elt/meltano_projects/tap-mongodb/.meltano/extractors/tap-mongodb/venv/lib/python3.9/site-packages/singer_sdk/typing.py", line 242, in to_dict
return self.type_dict # type: ignore[no-any-return]
File "/Users/melgazar9/scripts/github/DG/data-science/projects/elt/meltano_projects/tap-mongodb/.meltano/extractors/tap-mongodb/venv/lib/python3.9/site-packages/singer_sdk/typing.py", line 692, in type_dict
merged_props.update(w.to_dict())
File "/Users/melgazar9/scripts/github/DG/data-science/projects/elt/meltano_projects/tap-mongodb/.meltano/extractors/tap-mongodb/venv/lib/python3.9/site-packages/singer_sdk/typing.py", line 566, in to_dict
type_dict = append_type(type_dict, "null")
File "/Users/melgazar9/scripts/github/DG/data-science/projects/elt/meltano_projects/tap-mongodb/.meltano/extractors/tap-mongodb/venv/lib/python3.9/site-packages/singer_sdk/helpers/_typing.py", line 74, in append_type
raise ValueError(msg)
ValueError: Could not append type because the JSON schema for the dictionary `{'oneOf': [{'type': ['string']}, {'type': 'array', 'items': {'type': ['string']}}]}` appears to be invalid.
.
matt_elgazar
12/05/2024, 11:27 PM
matt_elgazar
12/05/2024, 11:28 PM
Reuben (Matatika)
12/05/2024, 11:29 PM
matt_elgazar
12/05/2024, 11:30 PM
matt_elgazar
12/05/2024, 11:30 PM
Reuben (Matatika)
12/05/2024, 11:38 PM
matt_elgazar
12/05/2024, 11:45 PM
matt_elgazar
12/05/2024, 11:48 PM
Is it possible to extract the selected stream text from meltano.yml within the extractor without hard coding using yaml.safe_load?
Reuben (Matatika)
12/05/2024, 11:56 PM
self._logger.info(self._collections)
self._logger.info(collections)
here, what do you see?
Edgar Ramírez (Arch.dev)
12/06/2024, 12:04 AM
> is it possible to extract the selected stream text from meltano.yml within the extractor without hard coding using yaml.safe_load?
I wouldn't recommend coupling a Meltano project to a tap implementation; we should be able to solve this with config + acting early in the implementation of discover_catalog_entries.
Edgar Ramírez (Arch.dev)
12/06/2024, 12:06 AM
Reuben (Matatika)
12/06/2024, 12:10 AM
select to work here (through meltano el --select) - thanks for finding the issue, would definitely be good to have.
Edgar Ramírez (Arch.dev)
12/06/2024, 1:08 AM
filter_collections.
What do select patterns mean when there aren't any streams against which they can be applied before discovery? You would need an invocation like tap --discover --catalog pre-selected-catalog.json, but how do you generate that catalog if you only know the patterns, not the actual streams? I'm certainly open to ideas, but this is the reason this request hasn't moved forward.
Reuben (Matatika)
12/06/2024, 1:33 AM
select rules for either, though filter_collections is behaving as select otherwise would.
Reuben (Matatika)
12/06/2024, 1:43 AM
matt_elgazar
12/06/2024, 5:40 PM
TAP_MONGODB_FILTER_COLLECTIONS=collection1 if I’m running meltano el? Do I need to do anything in meltano.yml if I’m calling the filter collections parameter via the CLI?
matt_elgazar
12/06/2024, 5:49 PM
TAP_MONGODB_FILTER_COLLECTIONS=test_collection meltano --environment=dev el tap-mongodb target-jsonl --state-id test --full-refresh
but I get this error:
Failed to parse JSON array from string: 'test_collection'
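The error above is what one would expect if an array-typed setting is read from the environment as JSON. A small illustration, assuming that behaviour; `parse_array_setting` is a hypothetical helper and not Meltano's implementation, and only the first error message is modelled on the one in the thread.

```python
import json
from typing import List


def parse_array_setting(raw: str) -> List[str]:
    """Parse an environment variable value that is expected to hold a JSON array."""
    try:
        value = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"Failed to parse JSON array from string: {raw!r}") from exc
    if not isinstance(value, list):
        raise ValueError(f"Expected a JSON array, got {type(value).__name__}: {raw!r}")
    return value


# a JSON array string parses cleanly
ok = parse_array_setting('["test_collection"]')

# a bare collection name is not valid JSON, so parsing fails
try:
    parse_array_setting("test_collection")
    bare_error = ""
except ValueError as exc:
    bare_error = str(exc)
```

In a shell, the single quotes around `'["test_collection"]'` matter: they keep the inner double quotes intact so the value arrives as valid JSON.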
matt_elgazar
12/06/2024, 5:49 PM
TAP_MONGODB__CONFIG__FILTER_COLLECTIONS and MELTANO_EXTRACTORS__TAP_MONGODB__CONFIG__FILTER_COLLECTIONS but these do nothing and hit the entire DB during discovery.
When I use TAP_MONGODB_FILTER_COLLECTIONS='["test_collection"]' then it also hits the entire db
Reuben (Matatika)
12/06/2024, 6:11 PM
kind: array for the setting and trying to support both a single collection as a string and multiple collections as an array of strings.
TAP_MONGODB_FILTER_COLLECTIONS='["test_collection"]' should be correct. When you say it's hitting all collections, are you seeing multiple Discovered collection log messages?
matt_elgazar
12/06/2024, 6:15 PM
TAP_MONGODB_FILTER_COLLECTIONS='["test_collection"]' meltano --environment=dev el tap-mongodb target-jsonl --state-id test --full-refresh
...
2024-12-06T18:15:11.286080Z [warning ] Stream `videostreamdata` was not found in the catalog
2024-12-06T18:15:11.286124Z [warning ] Stream `wearabledispensepriceditems` was not found in the catalog
2024-12-06T18:15:11.286175Z [warning ] Stream `wearabledispenseritems` was not found in the catalog
2024-12-06T18:15:11.286219Z [warning ] Stream `wearabledispenserpaymenttokens` was not found in the catalog
2024-12-06T18:15:11.286261Z [warning ] Stream `wearabledispensers` was not found in the catalog
2024-12-06T18:15:11.286307Z [warning ] Stream `xdgrewardtrees` was not found in the catalog
2024-12-06T18:15:13.290110Z [info ] Writing state to Azure Blob Storage
2024-12-06T18:15:13.974264Z [info ] uploading part #1, 17 bytes (total 0.000GB)
2024-12-06T18:15:14.224768Z [info ] uploading part #1, 50 bytes (total 0.000GB)
2024-12-06T18:15:14.475834Z [info ] Incremental state has been updated at 2024-12-06 18:15:14.475721.
2024-12-06T18:15:14.482935Z [info ] Extract & load complete! name=meltano run_id=de83ae67-4260-4360-9b0e-c112cb6d9b3a state_id=test
Reuben (Matatika)
12/06/2024, 6:15 PM
TAP_MONGODB_FILTER_COLLECTIONS='["test_collection"]' meltano config tap-mongodb list
will show you what configuration the tap will use - you should see the array value for filter_collections listed.
matt_elgazar
12/06/2024, 6:17 PM
ON_METHOD] current value: 'INCREMENTAL' (from `meltano.yml`)
_metadata.wearablerewards.replication-key [env: TAP_MONGODB__METADATA_WEARABLEREWARDS_REPLICATION_KEY] current value: 'replication_key' (from `meltano.yml`)
_metadata.wearablerewards.replication-method [env: TAP_MONGODB__METADATA_WEARABLEREWARDS_REPLICATION_METHOD] current value: 'INCREMENTAL' (from `meltano.yml`)
_metadata.wearabledispensepriceditems.replication-key [env: TAP_MONGODB__METADATA_WEARABLEDISPENSEPRICEDITEMS_REPLICATION_KEY] current value: 'replication_key' (from `meltano.yml`)
_metadata.wearabledispensepriceditems.replication-method [env: TAP_MONGODB__METADATA_WEARABLEDISPENSEPRICEDITEMS_REPLICATION_METHOD] current value: 'INCREMENTAL' (from `meltano.yml`)
_metadata.wearabledispenseritems.replication-key [env: TAP_MONGODB__METADATA_WEARABLEDISPENSERITEMS_REPLICATION_KEY] current value: 'replication_key' (from `meltano.yml`)
_metadata.wearabledispenseritems.replication-method [env: TAP_MONGODB__METADATA_WEARABLEDISPENSERITEMS_REPLICATION_METHOD] current value: 'INCREMENTAL' (from `meltano.yml`)
_metadata.wearabledispenserpaymenttokens.replication-key [env: TAP_MONGODB__METADATA_WEARABLEDISPENSERPAYMENTTOKENS_REPLICATION_KEY] current value: 'replication_key' (from `meltano.yml`)
_metadata.wearabledispenserpaymenttokens.replication-method [env: TAP_MONGODB__METADATA_WEARABLEDISPENSERPAYMENTTOKENS_REPLICATION_METHOD] current value: 'INCREMENTAL' (from `meltano.yml`)
_metadata.wearabledispensers.replication-key [env: TAP_MONGODB__METADATA_WEARABLEDISPENSERS_REPLICATION_KEY] current value: 'replication_key' (from `meltano.yml`)
_metadata.wearabledispensers.replication-method [env: TAP_MONGODB__METADATA_WEARABLEDISPENSERS_REPLICATION_METHOD] current value: 'INCREMENTAL' (from `meltano.yml`)
_metadata.xdgrewardtrees.replication-key [env: TAP_MONGODB__METADATA_XDGREWARDTREES_REPLICATION_KEY] current value: 'replication_key' (from `meltano.yml`)
_metadata.xdgrewardtrees.replication-method [env: TAP_MONGODB__METADATA_XDGREWARDTREES_REPLICATION_METHOD] current value: 'INCREMENTAL' (from `meltano.yml`)
Reuben (Matatika)
12/06/2024, 6:19 PM
metadata defined? That's probably messing with this.
matt_elgazar
12/06/2024, 6:20 PM
metadata:
  'accessorynftinfos':
    replication-key: replication_key
    replication-method: INCREMENTAL
  'activebanners':
    replication-key: replication_key
    replication-method: LOG_BASED
Reuben (Matatika)
12/06/2024, 6:25 PM
Stream {} was not found in the catalog implies that it's not hitting the collection, no? Just that you have metadata defined for that stream, but it's not present in the discovered catalog after using filter_collections?
matt_elgazar
12/06/2024, 6:31 PM
matt_elgazar
12/06/2024, 6:32 PM
TAP_MONGODB_FILTER_COLLECTIONS='["test_collection"]' meltano --environment=dev el tap-mongodb target-jsonl --state-id test --full-refresh
I don’t see any jsonl file in the output/ directory. I also tried test_collection.*
Reuben (Matatika)
12/06/2024, 6:38 PM
test_collection?
matt_elgazar
12/06/2024, 6:38 PM
matt_elgazar
12/06/2024, 6:40 PM
2024-12-06T18:40:01.812740Z [warning ] Stream `test_collection` was not found in the catalog
Reuben (Matatika)
12/06/2024, 6:44 PM
meltano select tap-mongodb --list?
matt_elgazar
12/06/2024, 6:49 PM
RuntimeError: Could not connect to MongoDB
matt_elgazar
12/06/2024, 6:49 PM
matt_elgazar
12/06/2024, 6:50 PM
Cannot list the selected attributes: Catalog discovery failed: command ['/Users/melgazar9/scripts/github/DG/data-science/projects/elt/meltano_projects/tap-mongodb/.meltano/extractors/tap-mongodb/venv/bin/tap-mongodb', '--config', '/Users/melgazar9/scripts/github/DG/data-science/projects/elt/meltano_projects/tap-mongodb/.meltano/run/tap-mongodb/tap.575bc33b-3bd2-4f59-a518-2ce983bcb92f.config.json', '--discover'] returned 1 with stderr:
Config validation failed: 'database' is a required property
matt_elgazar
12/06/2024, 6:51 PM
database is defined in meltano.yml
matt_elgazar
12/06/2024, 6:53 PM
Reuben (Matatika)
12/06/2024, 6:53 PM
meltano config tap-mongodb list?
matt_elgazar
12/06/2024, 6:55 PM
meltano config tap-mongodb list | grep testcollection
2024-12-06T18:54:10.534217Z [info ] The default environment 'dev' will be ignored for `meltano config`. To configure a specific environment, please use the option `--environment=<environment name>`.
_metadata.testcollection.replication-key [env: TAP_MONGODB__METADATA_TESTCOLLECTION_REPLICATION_KEY] current value: 'replication_key' (from `meltano.yml`)
_metadata.testcollection.replication-method [env: TAP_MONGODB__METADATA_TESTCOLLECTION_REPLICATION_METHOD] current value: 'LOG_BASED' (from `meltano.yml`)
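The `Stream ... was not found in the catalog` warnings earlier in the thread are consistent with metadata rules that name streams absent from a filtered catalog. A minimal sketch of that check follows; `missing_streams` is a hypothetical helper, the rule values are illustrative sample data, and the message format is copied from the log output rather than from Meltano's source.

```python
from typing import Dict, List


def missing_streams(metadata: Dict[str, dict], catalog_streams: List[str]) -> List[str]:
    """Return stream names that have metadata rules but are absent from the catalog."""
    present = set(catalog_streams)
    return [name for name in metadata if name not in present]


# sample metadata rules keyed by stream name (values are illustrative)
metadata_rules = {
    "accessorynftinfos": {"replication-method": "INCREMENTAL"},
    "activebanners": {"replication-method": "LOG_BASED"},
    "casinodata": {"replication-method": "INCREMENTAL"},
}

# with a discovery filter in place the catalog holds a single stream
filtered_catalog = ["casinodata"]

for name in missing_streams(metadata_rules, filtered_catalog):
    print(f"Stream `{name}` was not found in the catalog")
```

Under this reading the warnings are noise from leftover rules, not evidence that those collections were queried.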
matt_elgazar
12/06/2024, 7:13 PM
meltano el?
Reuben (Matatika)
12/06/2024, 7:20 PM
meltano select tap-mongodb --list doesn't work for you. The tap outputs records with my config: https://meltano.slack.com/archives/C06A1LKFAAC/p1733448784844729?thread_ts=1733425135.546469&cid=C06A1LKFAAC
If you're not seeing any files in the output directory when running with target-jsonl, most likely the tap is not outputting any records.
matt_elgazar
12/06/2024, 7:29 PM
Reuben (Matatika)
12/06/2024, 7:29 PM
matt_elgazar
12/06/2024, 7:34 PM
version: 1
send_anonymous_usage_stats: true
project_id: tap-mongodb
default_environment: dev
state_backend:
  type: remote
  uri: ${AZURE_TAP_MONGODB_STATE_URI}
  azure:
    connection_string: ${AZURE_TAP_MONGODB_STATE_CONNECTION_STRING}
plugins:
  extractors:
  - name: tap-mongodb
    namespace: tap_mongodb
    pip_url: git+https://github.com/ReubenFrankel/tap-mongodb.git@filter-collections
    capabilities:
    - state
    - catalog
    - discover
    - about
    - stream-maps
    config:
      add_record_metadata: true
      allow_modify_change_streams: true
    select:
    - 'accessorynftinfos.*'
    - 'activebanners.*'
    - 'testcollection.*'
    metadata:
      'accessorynftinfos':
        replication-key: replication_key
        replication-method: INCREMENTAL
      'activebanners':
        replication-key: replication_key
        replication-method: LOG_BASED
      'activebanners':
        replication-key: replication_key
        replication-method: INCREMENTAL
  loaders:
  - name: target-jsonl
    variant: andyh1203
    pip_url: git+https://github.com/andyhuynh3/target-jsonl.git
environments:
- name: dev
  config:
    plugins:
      extractors:
      - name: tap-mongodb
        config:
          mongodb_connection_string: ${DEV_MONGODB_CONNECTION_STRING}
          database: Dev_DB
      loaders:
      - name: target-snowflake
        config:
          default_target_schema: MONGODB_DEV
Reuben (Matatika)
12/06/2024, 7:45 PM
database configured for the dev environment, try
meltano --environment dev select tap-mongodb --list
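Why the environment flag matters here can be shown with a toy model of layered configuration. This is a simplification and not how Meltano resolves settings internally: `BASE_CONFIG`, `ENV_CONFIG`, `effective_config` and `missing_required` are hypothetical names, and the values mirror the pasted meltano.yml, where `database` is only set under the `dev` environment.

```python
from typing import Dict, List, Optional

# plugin-level config, shared by every environment
BASE_CONFIG = {"add_record_metadata": True, "allow_modify_change_streams": True}

# per-environment overrides; `database` exists only under `dev`
ENV_CONFIG = {"dev": {"mongodb_connection_string": "<from env var>", "database": "Dev_DB"}}

REQUIRED = ["database"]


def effective_config(environment: Optional[str]) -> Dict[str, object]:
    """Layer the active environment's values over the base plugin config."""
    merged: Dict[str, object] = dict(BASE_CONFIG)
    merged.update(ENV_CONFIG.get(environment, {}) if environment else {})
    return merged


def missing_required(config: Dict[str, object]) -> List[str]:
    """List required settings that the effective config does not provide."""
    return [key for key in REQUIRED if key not in config]


without_env = missing_required(effective_config(None))
with_env = missing_required(effective_config("dev"))
```

With no active environment the required `database` value is missing, which matches the `'database' is a required property` validation error shown earlier.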
matt_elgazar
12/06/2024, 7:45 PM
matt_elgazar
12/06/2024, 7:45 PM
matt_elgazar
12/06/2024, 7:46 PM
2024-12-06T19:45:39.570655Z [warning ] Stream `internalauthservers` was not found in the catalog
2024-12-06T19:45:39.570905Z [warning ] Stream `packages` was not found in the catalog
Legend:
selected
excluded
automatic
Enabled patterns:
bankedicemarketplaceinfos.*
bannedusers.*
casinodata.*
realworldprizeredemptions.*
wearablerewards.*
Selected attributes:
[selected ] bankedicemarketplaceinfos._sdc_batched_at
[selected ] bankedicemarketplaceinfos._sdc_extracted_at
[selected ] bankedicemarketplaceinfos.cluster_time
[selected ] bankedicemarketplaceinfos.document
[selected ] bankedicemarketplaceinfos.namespace
[selected ] bankedicemarketplaceinfos.namespace.collection
[selected ] bankedicemarketplaceinfos.namespace.database
[selected ] bankedicemarketplaceinfos.object_id
[selected ] bankedicemarketplaceinfos.operation_type
[automatic] bankedicemarketplaceinfos.replication_key
[selected ] bannedusers._sdc_batched_at
[selected ] bannedusers._sdc_extracted_at
[selected ] bannedusers.cluster_time
[selected ] bannedusers.document
[selected ] bannedusers.namespace
[selected ] bannedusers.namespace.collection
[selected ] bannedusers.namespace.database
[selected ] bannedusers.object_id
[selected ] bannedusers.operation_type
[automatic] bannedusers.replication_key
Reuben (Matatika)
12/06/2024, 7:48 PM
TAP_MONGODB_FILTER_COLLECTIONS='["casinodata"]' meltano --environment dev select tap-mongodb --list
Reuben (Matatika)
12/06/2024, 7:51 PM
testcollection configured for filter_collections before, but you have other streams selected - not including testcollection, so it would have been excluded.
Enabled patterns:
bankedicemarketplaceinfos.*
bannedusers.*
casinodata.*
realworldprizeredemptions.*
wearablerewards.*
Remove these selection rules and try again.
matt_elgazar
12/06/2024, 7:55 PM
melgazar9@MacBook-Pro tap-mongodb % TAP_MONGODB_FILTER_COLLECTIONS='["casinodata"]' meltano --environment=dev select tap-mongodb --list
2024-12-06T19:52:52.802146Z [info ] Environment 'dev' is active
2024-12-06T19:52:54.019420Z [warning ] Stream `internalauthservers` was not found in the catalog
2024-12-06T19:52:54.019542Z [warning ] Stream `packages` was not found in the catalog
Legend:
selected
excluded
automatic
Enabled patterns:
bankedicemarketplaceinfos.*
bannedusers.*
casinodata.*
realworldprizeredemptions.*
wearablerewards.*
Selected attributes:
[selected ] bankedicemarketplaceinfos._sdc_batched_at
[selected ] bankedicemarketplaceinfos._sdc_extracted_at
[selected ] bankedicemarketplaceinfos.cluster_time
[selected ] bankedicemarketplaceinfos.document
[selected ] bankedicemarketplaceinfos.namespace
[selected ] bankedicemarketplaceinfos.namespace.collection
[selected ] bankedicemarketplaceinfos.namespace.database
[selected ] bankedicemarketplaceinfos.object_id
[selected ] bankedicemarketplaceinfos.operation_type
[automatic] bankedicemarketplaceinfos.replication_key
[selected ] bannedusers._sdc_batched_at
[selected ] bannedusers._sdc_extracted_at
[selected ] bannedusers.cluster_time
[selected ] bannedusers.document
[selected ] bannedusers.namespace
[selected ] bannedusers.namespace.collection
[selected ] bannedusers.namespace.database
[selected ] bannedusers.object_id
[selected ] bannedusers.operation_type
[automatic] bannedusers.replication_key
[selected ] casinodata._sdc_batched_at
[selected ] casinodata._sdc_extracted_at
[selected ] casinodata.cluster_time
[selected ] casinodata.document
[selected ] casinodata.namespace
[selected ] casinodata.namespace.collection
[selected ] casinodata.namespace.database
[selected ] casinodata.object_id
[selected ] casinodata.operation_type
[automatic] casinodata.replication_key
[selected ] realworldprizeredemptions._sdc_batched_at
[selected ] realworldprizeredemptions._sdc_extracted_at
[selected ] realworldprizeredemptions.cluster_time
[selected ] realworldprizeredemptions.document
[selected ] realworldprizeredemptions.namespace
[selected ] realworldprizeredemptions.namespace.collection
[selected ] realworldprizeredemptions.namespace.database
[selected ] realworldprizeredemptions.object_id
[selected ] realworldprizeredemptions.operation_type
[automatic] realworldprizeredemptions.replication_key
[selected ] wearablerewards._sdc_batched_at
[selected ] wearablerewards._sdc_extracted_at
[selected ] wearablerewards.cluster_time
[selected ] wearablerewards.document
[selected ] wearablerewards.namespace
[selected ] wearablerewards.namespace.collection
[selected ] wearablerewards.namespace.database
[selected ] wearablerewards.object_id
[selected ] wearablerewards.operation_type
[automatic] wearablerewards.replication_key
matt_elgazar
12/06/2024, 7:55 PM
Reuben (Matatika)
12/06/2024, 7:56 PM
meltano.yml?
matt_elgazar
12/06/2024, 7:57 PM
matt_elgazar
12/06/2024, 7:59 PM
TAP_MONGODB_FILTER_COLLECTIONS='["casinodata"]' meltano --environment=dev el tap-mongodb target-jsonl --state-id test
Are you saying remove everything selected in meltano.yml?
Reuben (Matatika)
12/06/2024, 8:01 PM
matt_elgazar
12/06/2024, 8:01 PM
matt_elgazar
12/06/2024, 8:02 PM
TAP_MONGODB_FILTER_COLLECTIONS='["casinodata"]' meltano --environment=dev el tap-mongodb target-jsonl --state-id test
Reuben (Matatika)
12/06/2024, 8:03 PM
casinodata in the select list running
TAP_MONGODB_FILTER_COLLECTIONS='["casinodata"]' meltano --environment=dev select tap-mongodb --list
with no select config in your meltano.yml.
matt_elgazar
12/06/2024, 8:04 PM
matt_elgazar
12/06/2024, 8:06 PM
config:
  add_record_metadata: true
  allow_modify_change_streams: true
Reuben (Matatika)
12/06/2024, 8:08 PM
output?
matt_elgazar
12/06/2024, 8:10 PM
melgazar9@MacBook-Pro tap-mongodb % ls -ltra output
total 8
-rw-r--r-- 1 melgazar9 staff 14 Nov 1 2023 .gitignore
drwxr-xr-x 3 melgazar9 staff 96 Dec 6 14:05 .
drwxr-xr-x 27 melgazar9 staff 864 Dec 6 14:06 ..
melgazar9@MacBook-Pro tap-mongodb % TAP_MONGODB_FILTER_COLLECTIONS='["casinodata"]' meltano --environment=dev el tap-mongodb target-jsonl --state-id test
2024-12-06T20:08:42.434383Z [info ] Environment 'dev' is active
2024-12-06T20:08:43.072167Z [info ] Running extract & load... name=meltano run_id=4b899814-3f67-46cd-b2f5-fc6fc64e1ba6 state_id=test
2024-12-06T20:08:44.190784Z [info ] Reading state from Azure Blob Storage
2024-12-06T20:08:45.249587Z [info ] uploading part #1, 16 bytes (total 0.000GB)
2024-12-06T20:09:01.513999Z [warning ] Stream `internalauthservers` was not found in the catalog
2024-12-06T20:09:01.514431Z [warning ] Stream `packages` was not found in the catalog
2024-12-06T20:09:02.263224Z [info ] 2024-12-06 14:09:02,262 | INFO | tap-mongodb | Beginning incremental sync of 'accessorynftinfos'... cmd_type=extractor name=tap-mongodb run_id=4b899814-3f67-46cd-b2f5-fc6fc64e1ba6 state_id=test stdio=stderr
2024-12-06T20:09:02.263640Z [info ] 2024-12-06 14:09:02,263 | INFO | tap-mongodb | Tap has custom mapper. Using 1 provided map(s). cmd_type=extractor name=tap-mongodb run_id=4b899814-3f67-46cd-b2f5-fc6fc64e1ba6 state_id=test stdio=stderr
......
2024-12-06T20:10:03.967851Z [info ] Writing state to Azure Blob Storage
2024-12-06T20:10:04.633706Z [info ] uploading part #1, 17 bytes (total 0.000GB)
2024-12-06T20:10:04.898355Z [info ] uploading part #1, 826 bytes (total 0.000GB)
2024-12-06T20:10:05.154122Z [info ] Incremental state has been updated at 2024-12-06 20:10:05.154066.
2024-12-06T20:10:05.165889Z [info ] Extract & load complete! name=meltano run_id=1556e4be-7089-4215-a139-d767a521e116 state_id=test
2024-12-06T20:10:05.167017Z [info ] Transformation skipped. name=meltano run_id=1556e4be-7089-4215-a139-d767a521e116 state_id=test
melgazar9@MacBook-Pro tap-mongodb % ls -ltra output
total 48464
-rw-r--r-- 1 melgazar9 staff 14 Nov 1 2023 .gitignore
drwxr-xr-x 27 melgazar9 staff 864 Dec 6 14:06 ..
-rw-r--r-- 1 melgazar9 staff 7594 Dec 6 14:09 accessorynftinfos.jsonl
-rw-r--r-- 1 melgazar9 staff 404 Dec 6 14:09 activenotice.jsonl
-rw-r--r-- 1 melgazar9 staff 450 Dec 6 14:09 activepoap.jsonl
-rw-r--r-- 1 melgazar9 staff 2143 Dec 6 14:09 activerpc.jsonl
-rw-r--r-- 1 melgazar9 staff 904 Dec 6 14:09 allowedorigins.jsonl
-rw-r--r-- 1 melgazar9 staff 21636 Dec 6 14:09 appconfig.jsonl
-rw-r--r-- 1 melgazar9 staff 23252941 Dec 6 14:09 arcadehandanalyticsdata.jsonl
drwxr-xr-x 11 melgazar9 staff 352 Dec 6 14:10 .
-rw-r--r-- 1 melgazar9 staff 609066 Dec 6 14:10 casinodata.jsonl
Reuben (Matatika)
12/06/2024, 8:14 PM metadata is conflicting with what we're trying to do here. What happens if you add --select casinodata to the command?
Reuben (Matatika)
12/06/2024, 8:15 PMmatt_elgazar
12/06/2024, 8:16 PM --select casinodata it only runs casinodata but it still shows other streams. One sec let me time the 2
ironment=dev el tap-mongodb target-jsonl --state-id test --select casinodata
2024-12-06T20:15:11.344063Z [info ] Environment 'dev' is active
2024-12-06T20:15:11.978909Z [info ] Running extract & load... name=meltano run_id=4f1852c2-5793-42a8-8f6a-342b4f7dc40f state_id=test
2024-12-06T20:15:13.102493Z [info ] Reading state from Azure Blob Storage
2024-12-06T20:15:13.816461Z [info ] uploading part #1, 17 bytes (total 0.000GB)
2024-12-06T20:15:29.707819Z [warning ] Stream `internalauthservers` was not found in the catalog
2024-12-06T20:15:29.708050Z [warning ] Stream `packages` was not found in the catalog
2024-12-06T20:15:30.325400Z [info ] 2024-12-06 14:15:30,325 | INFO | tap-mongodb | Skipping deselected stream 'accessorynftinfos'. cmd_type=extractor name=tap-mongodb run_id=4f1852c2-5793-42a8-8f6a-342b4f7dc40f state_id=test stdio=stderr
2024-12-06T20:15:30.325820Z [info ] 2024-12-06 14:15:30,325 | INFO | tap-mongodb | Skipping deselected stream 'activebanners'. cmd_type=extractor name=tap-mongodb run_id=4f1852c2-5793-42a8-8f6a-342b4f7dc40f state_id=test stdio=stderr
2024-12-06T20:15:30.325903Z [info ] 2024-12-06 14:15:30,325 | INFO | tap-mongodb | Skipping deselected stream 'activenotice'. cmd_type=extractor name=tap-mongodb run_id=4f1852c2-5793-42a8-8f6a-342b4f7dc40f state_id=test stdio=stderr
2024-12-06T20:15:30.325986Z [info ] 2024-12-06 14:15:30,325 | INFO | tap-mongodb | Skipping deselected stream 'activepoap'. cmd_type=extractor name=tap-mongodb run_id=4f1852c2-5793-42a8-8f6a-342b4f7dc40f state_id=test stdio=stderr
matt_elgazar
12/06/2024, 8:18 PMtime TAP_MONGODB_FILTER_COLLECTIONS='["casinodata"]' meltano --environment=dev el 4.22s user 0.57s system 19% cpu 24.809 total
meltano --environment=dev el tap-mongodb target-jsonl --state-id test --selec 4.19s user 0.51s system 19% cpu 24.278 total
matt_elgazar
12/06/2024, 8:19 PMmatt_elgazar
12/06/2024, 8:21 PMtime TAP_MONGODB_FILTER_COLLECTIONS='["casinodata"]' meltano --environment=dev el tap-mongodb target-jsonl --state-id test
matt_elgazar
12/06/2024, 8:21 PM --select it doesn't hit the collection, but then it streams everything
matt_elgazar
12/06/2024, 8:22 PM casinodata it still tries to hit everything
Edgar Ramírez (Arch.dev)
12/06/2024, 8:24 PM --refresh-catalog each time in case the catalog cache is not being invalidated for whatever reason
matt_elgazar
12/06/2024, 8:24 PMmatt_elgazar
12/06/2024, 8:25 PMEdgar Ramírez (Arch.dev)
12/06/2024, 8:25 PMmeltano el --refresh-catalog ...
matt_elgazar
12/06/2024, 8:26 PMtime TAP_MONGODB_FILTER_COLLECTIONS='["casinodata"]' meltano --environment=dev el --refresh-catalog tap-mongodb target-jsonl --state-id test --full-refresh
Try 'meltano el --help' for help.
Error: No such option: --refresh-catalog Did you mean --catalog?
TAP_MONGODB_FILTER_COLLECTIONS='["casinodata"]' meltano --environment=dev el 0.72s user 0.17s system 56% cpu 1.586 total
Edgar Ramírez (Arch.dev)
12/06/2024, 8:27 PMmatt_elgazar
12/06/2024, 8:28 PMmatt_elgazar
12/06/2024, 8:28 PMmatt_elgazar
12/06/2024, 8:29 PMmeltano --version
meltano, version 3.5.4
melgazar9@MacBook-Pro tap-mongodb % time TAP_MONGODB_FILTER_COLLECTIONS='["casinodata"]' meltano --environment=dev el --refresh-catalog tap-mongodb target-jsonl --state-id test --full-refreshme
Usage: meltano el [OPTIONS] EXTRACTOR LOADER
Try 'meltano el --help' for help.
Error: No such option: --full-refreshme Did you mean --full-refresh?
TAP_MONGODB_FILTER_COLLECTIONS='["casinodata"]' meltano --environment=dev el 0.71s user 0.13s system 72% cpu 1.161 total
Edgar Ramírez (Arch.dev)
12/06/2024, 8:31 PMmatt_elgazar
12/06/2024, 8:35 PMmatt_elgazar
12/06/2024, 8:42 PMReuben (Matatika)
12/06/2024, 8:42 PM metadata config when I get back home later, maybe that will clear some things up.
matt_elgazar
12/06/2024, 8:43 PMReuben (Matatika)
12/06/2024, 10:50 PM meltano select (no metadata config):
meltano.yml
version: 1
default_environment: dev
project_id: 26ced3f3-b65d-421e-bace-bc4daaf99f7d
environments:
- name: dev
  config:
    plugins:
      extractors:
      - name: tap-mongodb
        config:
          database: sample_mflix
- name: staging
- name: prod
plugins:
  extractors:
  - name: tap-mongodb
    variant: meltanolabs
    pip_url: git+https://github.com/ReubenFrankel/tap-mongodb.git@filter-collections
    settings:
    - name: filter_collections
      kind: array
  loaders:
  - name: target-jsonl
    variant: andyh1203
    pip_url: target-jsonl
(TAP_MONGODB_MONGODB_CONNECTION_STRING set in .env)
1. All streams selected (no selection criteria)
reuben@reuben-Inspiron-14-5425:/tmp/mongodb-test$ meltano --environment dev select tap-mongodb --list --all
2024-12-06T22:24:32.785127Z [info ] Environment 'dev' is active
Legend:
selected
excluded
automatic
Enabled patterns:
*.*
Selected attributes:
[selected ] comments._sdc_batched_at
[selected ] comments._sdc_extracted_at
[selected ] comments.cluster_time
[selected ] comments.document
[selected ] comments.namespace
[selected ] comments.namespace.collection
[selected ] comments.namespace.database
[selected ] comments.object_id
[selected ] comments.operation_type
[automatic] comments.replication_key
[selected ] embedded_movies._sdc_batched_at
[selected ] embedded_movies._sdc_extracted_at
[selected ] embedded_movies.cluster_time
[selected ] embedded_movies.document
[selected ] embedded_movies.namespace
[selected ] embedded_movies.namespace.collection
[selected ] embedded_movies.namespace.database
[selected ] embedded_movies.object_id
[selected ] embedded_movies.operation_type
[automatic] embedded_movies.replication_key
[selected ] movies._sdc_batched_at
[selected ] movies._sdc_extracted_at
[selected ] movies.cluster_time
[selected ] movies.document
[selected ] movies.namespace
[selected ] movies.namespace.collection
[selected ] movies.namespace.database
[selected ] movies.object_id
[selected ] movies.operation_type
[automatic] movies.replication_key
[selected ] sessions._sdc_batched_at
[selected ] sessions._sdc_extracted_at
[selected ] sessions.cluster_time
[selected ] sessions.document
[selected ] sessions.namespace
[selected ] sessions.namespace.collection
[selected ] sessions.namespace.database
[selected ] sessions.object_id
[selected ] sessions.operation_type
[automatic] sessions.replication_key
[selected ] theaters._sdc_batched_at
[selected ] theaters._sdc_extracted_at
[selected ] theaters.cluster_time
[selected ] theaters.document
[selected ] theaters.namespace
[selected ] theaters.namespace.collection
[selected ] theaters.namespace.database
[selected ] theaters.object_id
[selected ] theaters.operation_type
[automatic] theaters.replication_key
[selected ] users._sdc_batched_at
[selected ] users._sdc_extracted_at
[selected ] users.cluster_time
[selected ] users.document
[selected ] users.namespace
[selected ] users.namespace.collection
[selected ] users.namespace.database
[selected ] users.object_id
[selected ] users.operation_type
[automatic] users.replication_key
2. users stream selected
tap-mongodb config in meltano.yml
  - name: tap-mongodb
    variant: meltanolabs
    pip_url: git+https://github.com/ReubenFrankel/tap-mongodb.git@filter-collections
    settings:
    - name: filter_collections
      kind: array
    select:
    - users.*
reuben@reuben-Inspiron-14-5425:/tmp/mongodb-test$ meltano --environment dev select tap-mongodb --list --all
2024-12-06T22:25:41.931677Z [info ] Environment 'dev' is active
Legend:
selected
excluded
automatic
Enabled patterns:
users.*
Selected attributes:
[excluded ] comments._sdc_batched_at
[excluded ] comments._sdc_extracted_at
[excluded ] comments.cluster_time
[excluded ] comments.document
[excluded ] comments.namespace
[excluded ] comments.namespace.collection
[excluded ] comments.namespace.database
[excluded ] comments.object_id
[excluded ] comments.operation_type
[excluded ] comments.replication_key
[excluded ] embedded_movies._sdc_batched_at
[excluded ] embedded_movies._sdc_extracted_at
[excluded ] embedded_movies.cluster_time
[excluded ] embedded_movies.document
[excluded ] embedded_movies.namespace
[excluded ] embedded_movies.namespace.collection
[excluded ] embedded_movies.namespace.database
[excluded ] embedded_movies.object_id
[excluded ] embedded_movies.operation_type
[excluded ] embedded_movies.replication_key
[excluded ] movies._sdc_batched_at
[excluded ] movies._sdc_extracted_at
[excluded ] movies.cluster_time
[excluded ] movies.document
[excluded ] movies.namespace
[excluded ] movies.namespace.collection
[excluded ] movies.namespace.database
[excluded ] movies.object_id
[excluded ] movies.operation_type
[excluded ] movies.replication_key
[excluded ] sessions._sdc_batched_at
[excluded ] sessions._sdc_extracted_at
[excluded ] sessions.cluster_time
[excluded ] sessions.document
[excluded ] sessions.namespace
[excluded ] sessions.namespace.collection
[excluded ] sessions.namespace.database
[excluded ] sessions.object_id
[excluded ] sessions.operation_type
[excluded ] sessions.replication_key
[excluded ] theaters._sdc_batched_at
[excluded ] theaters._sdc_extracted_at
[excluded ] theaters.cluster_time
[excluded ] theaters.document
[excluded ] theaters.namespace
[excluded ] theaters.namespace.collection
[excluded ] theaters.namespace.database
[excluded ] theaters.object_id
[excluded ] theaters.operation_type
[excluded ] theaters.replication_key
[selected ] users._sdc_batched_at
[selected ] users._sdc_extracted_at
[selected ] users.cluster_time
[selected ] users.document
[selected ] users.namespace
[selected ] users.namespace.collection
[selected ] users.namespace.database
[selected ] users.object_id
[selected ] users.operation_type
[automatic] users.replication_key
Notice how collections are excluded, rather than outright not present
3. All streams selected (no selection criteria), filtering single collection
tap-mongodb config in meltano.yml
  - name: tap-mongodb
    variant: meltanolabs
    pip_url: git+https://github.com/ReubenFrankel/tap-mongodb.git@filter-collections
    settings:
    - name: filter_collections
      kind: array
reuben@reuben-Inspiron-14-5425:/tmp/mongodb-test$ TAP_MONGODB_FILTER_COLLECTIONS='["users"]' meltano --environment dev select tap-mongodb --list --all
2024-12-06T22:33:12.223370Z [info ] Environment 'dev' is active
Legend:
selected
excluded
automatic
Enabled patterns:
*.*
Selected attributes:
[selected ] users._sdc_batched_at
[selected ] users._sdc_extracted_at
[selected ] users.cluster_time
[selected ] users.document
[selected ] users.namespace
[selected ] users.namespace.collection
[selected ] users.namespace.database
[selected ] users.object_id
[selected ] users.operation_type
[automatic] users.replication_key
Notice how collections that were previously excluded are now not discovered at all
4. users stream selected, filtering same collection (functionally the same as example 3)
tap-mongodb config in meltano.yml
  - name: tap-mongodb
    variant: meltanolabs
    pip_url: git+https://github.com/ReubenFrankel/tap-mongodb.git@filter-collections
    settings:
    - name: filter_collections
      kind: array
    select:
    - users.*
reuben@reuben-Inspiron-14-5425:/tmp/mongodb-test$ TAP_MONGODB_FILTER_COLLECTIONS='["users"]' meltano --environment dev select tap-mongodb --list --all
2024-12-06T22:37:17.179517Z [info ] Environment 'dev' is active
Legend:
selected
excluded
automatic
Enabled patterns:
users.*
Selected attributes:
[selected ] users._sdc_batched_at
[selected ] users._sdc_extracted_at
[selected ] users.cluster_time
[selected ] users.document
[selected ] users.namespace
[selected ] users.namespace.collection
[selected ] users.namespace.database
[selected ] users.object_id
[selected ] users.operation_type
[automatic] users.replication_key
This is demonstrating that select can be redundant when using filter_collections
5. users stream selected, filtering movies collection
tap-mongodb config in meltano.yml
  - name: tap-mongodb
    variant: meltanolabs
    pip_url: git+https://github.com/ReubenFrankel/tap-mongodb.git@filter-collections
    settings:
    - name: filter_collections
      kind: array
    select:
    - users.*
reuben@reuben-Inspiron-14-5425:/tmp/mongodb-test$ TAP_MONGODB_FILTER_COLLECTIONS='["movies"]' meltano --environment dev select tap-mongodb --list --all
2024-12-06T22:40:50.958754Z [info ] Environment 'dev' is active
2024-12-06T22:40:51.276788Z [warning ] Stream `users` was not found in the catalog
Legend:
selected
excluded
automatic
Enabled patterns:
users.*
Selected attributes:
[excluded ] movies._sdc_batched_at
[excluded ] movies._sdc_extracted_at
[excluded ] movies.cluster_time
[excluded ] movies.document
[excluded ] movies.namespace
[excluded ] movies.namespace.collection
[excluded ] movies.namespace.database
[excluded ] movies.object_id
[excluded ] movies.operation_type
[excluded ] movies.replication_key
Notice the stream not found in catalog warning - this is because only the movies collection was discovered, but we are trying to select users which was not discovered
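The behaviour in examples 1-5 can be approximated with a small toy model. This is only a sketch of what the output above shows, not Meltano's or the tap's actual code: `preview_selection` is a made-up helper, and it assumes the filter is an exact-name match applied at discovery time, with select patterns matched against whatever was discovered.

```python
from fnmatch import fnmatch


def preview_selection(all_collections, filter_collections, select_patterns):
    """Toy model: filter at discovery, then apply select patterns to the result."""
    # Discovery: with a filter set, only the named collections become streams.
    if filter_collections:
        discovered = [c for c in all_collections if c in filter_collections]
    else:
        discovered = list(all_collections)

    # Selection: only the stream part of each "stream.attribute" pattern matters here.
    stream_patterns = [p.split(".", 1)[0] for p in select_patterns]
    status = {
        stream: "selected"
        if any(fnmatch(stream, p) for p in stream_patterns)
        else "excluded"
        for stream in discovered
    }

    # A pattern naming a concrete stream that was never discovered gives the
    # "was not found in the catalog" style of warning (simplified: only "*" is
    # treated as a wildcard in this check).
    missing = [p for p in stream_patterns if "*" not in p and p not in discovered]
    return status, missing


mflix = ["comments", "embedded_movies", "movies", "sessions", "theaters", "users"]

# Example 5: users selected, but only movies is discovered.
print(preview_selection(mflix, ["movies"], ["users.*"]))
```

Under these assumptions, example 3 gives a single selected `users` stream with nothing missing, and example 5 gives an excluded `movies` stream with `users` reported as missing.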
6. All streams selected (no selection criteria), filtering collection that does not exist
tap-mongodb config in meltano.yml
  - name: tap-mongodb
    variant: meltanolabs
    pip_url: git+https://github.com/ReubenFrankel/tap-mongodb.git@filter-collections
    settings:
    - name: filter_collections
      kind: array
reuben@reuben-Inspiron-14-5425:/tmp/mongodb-test$ TAP_MONGODB_FILTER_COLLECTIONS='["doesnotexist"]' meltano --environment dev select tap-mongodb --list --all
2024-12-06T22:48:39.749041Z [info ] Environment 'dev' is active
Legend:
selected
excluded
automatic
Enabled patterns:
*.*
Selected attributes:
Nothing was discovered since the collection doesnotexist does - in fact - not exist
Reuben (Matatika)
12/06/2024, 11:15 PM metadata config for users and movies:
metadata:
  users:
    replication-key: replication_key
    replication-method: INCREMENTAL
  movies:
    replication-key: replication_key
    replication-method: INCREMENTAL
1. All streams selected (no selection criteria)
reuben@reuben-Inspiron-14-5425:/tmp/mongodb-test$ meltano --environment dev select tap-mongodb --list --all
2024-12-06T23:05:36.773532Z [info ] Environment 'dev' is active
Legend:
selected
excluded
automatic
Enabled patterns:
*.*
Selected attributes:
[selected ] comments._sdc_batched_at
[selected ] comments._sdc_extracted_at
[selected ] comments.cluster_time
[selected ] comments.document
[selected ] comments.namespace
[selected ] comments.namespace.collection
[selected ] comments.namespace.database
[selected ] comments.object_id
[selected ] comments.operation_type
[automatic] comments.replication_key
[selected ] embedded_movies._sdc_batched_at
[selected ] embedded_movies._sdc_extracted_at
[selected ] embedded_movies.cluster_time
[selected ] embedded_movies.document
[selected ] embedded_movies.namespace
[selected ] embedded_movies.namespace.collection
[selected ] embedded_movies.namespace.database
[selected ] embedded_movies.object_id
[selected ] embedded_movies.operation_type
[automatic] embedded_movies.replication_key
[selected ] movies._sdc_batched_at
[selected ] movies._sdc_extracted_at
[selected ] movies.cluster_time
[selected ] movies.document
[selected ] movies.namespace
[selected ] movies.namespace.collection
[selected ] movies.namespace.database
[selected ] movies.object_id
[selected ] movies.operation_type
[automatic] movies.replication_key
[selected ] sessions._sdc_batched_at
[selected ] sessions._sdc_extracted_at
[selected ] sessions.cluster_time
[selected ] sessions.document
[selected ] sessions.namespace
[selected ] sessions.namespace.collection
[selected ] sessions.namespace.database
[selected ] sessions.object_id
[selected ] sessions.operation_type
[automatic] sessions.replication_key
[selected ] theaters._sdc_batched_at
[selected ] theaters._sdc_extracted_at
[selected ] theaters.cluster_time
[selected ] theaters.document
[selected ] theaters.namespace
[selected ] theaters.namespace.collection
[selected ] theaters.namespace.database
[selected ] theaters.object_id
[selected ] theaters.operation_type
[automatic] theaters.replication_key
[selected ] users._sdc_batched_at
[selected ] users._sdc_extracted_at
[selected ] users.cluster_time
[selected ] users.document
[selected ] users.namespace
[selected ] users.namespace.collection
[selected ] users.namespace.database
[selected ] users.object_id
[selected ] users.operation_type
[automatic] users.replication_key
2. users stream selected
reuben@reuben-Inspiron-14-5425:/tmp/mongodb-test$ meltano --environment dev select tap-mongodb --list --all
2024-12-06T23:10:18.459047Z [info ] Environment 'dev' is active
Legend:
selected
excluded
automatic
Enabled patterns:
users.*
Selected attributes:
[excluded ] comments._sdc_batched_at
[excluded ] comments._sdc_extracted_at
[excluded ] comments.cluster_time
[excluded ] comments.document
[excluded ] comments.namespace
[excluded ] comments.namespace.collection
[excluded ] comments.namespace.database
[excluded ] comments.object_id
[excluded ] comments.operation_type
[excluded ] comments.replication_key
[excluded ] embedded_movies._sdc_batched_at
[excluded ] embedded_movies._sdc_extracted_at
[excluded ] embedded_movies.cluster_time
[excluded ] embedded_movies.document
[excluded ] embedded_movies.namespace
[excluded ] embedded_movies.namespace.collection
[excluded ] embedded_movies.namespace.database
[excluded ] embedded_movies.object_id
[excluded ] embedded_movies.operation_type
[excluded ] embedded_movies.replication_key
[excluded ] movies._sdc_batched_at
[excluded ] movies._sdc_extracted_at
[excluded ] movies.cluster_time
[excluded ] movies.document
[excluded ] movies.namespace
[excluded ] movies.namespace.collection
[excluded ] movies.namespace.database
[excluded ] movies.object_id
[excluded ] movies.operation_type
[excluded ] movies.replication_key
[excluded ] sessions._sdc_batched_at
[excluded ] sessions._sdc_extracted_at
[excluded ] sessions.cluster_time
[excluded ] sessions.document
[excluded ] sessions.namespace
[excluded ] sessions.namespace.collection
[excluded ] sessions.namespace.database
[excluded ] sessions.object_id
[excluded ] sessions.operation_type
[excluded ] sessions.replication_key
[excluded ] theaters._sdc_batched_at
[excluded ] theaters._sdc_extracted_at
[excluded ] theaters.cluster_time
[excluded ] theaters.document
[excluded ] theaters.namespace
[excluded ] theaters.namespace.collection
[excluded ] theaters.namespace.database
[excluded ] theaters.object_id
[excluded ] theaters.operation_type
[excluded ] theaters.replication_key
[selected ] users._sdc_batched_at
[selected ] users._sdc_extracted_at
[selected ] users.cluster_time
[selected ] users.document
[selected ] users.namespace
[selected ] users.namespace.collection
[selected ] users.namespace.database
[selected ] users.object_id
[selected ] users.operation_type
[automatic] users.replication_key
3. All streams selected (no selection criteria), filtering single collection
TAP_MONGODB_FILTER_COLLECTIONS='["users"]' meltano --environment dev select tap-mongodb --list --all
2024-12-06T23:06:16.279460Z [info ] Environment 'dev' is active
2024-12-06T23:06:23.062268Z [warning ] Stream `movies` was not found in the catalog
Legend:
selected
excluded
automatic
Enabled patterns:
*.*
Selected attributes:
[selected ] users._sdc_batched_at
[selected ] users._sdc_extracted_at
[selected ] users.cluster_time
[selected ] users.document
[selected ] users.namespace
[selected ] users.namespace.collection
[selected ] users.namespace.database
[selected ] users.object_id
[selected ] users.operation_type
[automatic] users.replication_key
Notice the new stream not found in catalog warning here, presumably coming from the metadata config
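If the warning does come from the metadata config, a quick pre-flight check could list which stream names under `metadata` will not be in the catalog once the filter is applied. A minimal sketch, assuming metadata keys are plain stream names; `streams_missing_from_catalog` is a hypothetical helper, not part of Meltano:

```python
def streams_missing_from_catalog(metadata_config, discovered_streams):
    """Return metadata keys that name a stream absent from the discovered catalog."""
    discovered = set(discovered_streams)
    # Keys containing "*" are treated as wildcard rules and skipped (simplification).
    return [
        name
        for name in metadata_config
        if "*" not in name and name not in discovered
    ]


# Metadata config from this thread.
metadata = {
    "users": {"replication-key": "replication_key", "replication-method": "INCREMENTAL"},
    "movies": {"replication-key": "replication_key", "replication-method": "INCREMENTAL"},
}

# Filtering to users only leaves the movies rule with nothing to apply to.
print(streams_missing_from_catalog(metadata, ["users"]))
```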
4. users stream selected, filtering same collection (functionally the same as example 3)
reuben@reuben-Inspiron-14-5425:/tmp/mongodb-test$ TAP_MONGODB_FILTER_COLLECTIONS='["users"]' meltano --environment dev select tap-mongodb --list --all
2024-12-06T23:11:05.135721Z [info ] Environment 'dev' is active
2024-12-06T23:11:11.975323Z [warning ] Stream `movies` was not found in the catalog
Legend:
selected
excluded
automatic
Enabled patterns:
users.*
Selected attributes:
[selected ] users._sdc_batched_at
[selected ] users._sdc_extracted_at
[selected ] users.cluster_time
[selected ] users.document
[selected ] users.namespace
[selected ] users.namespace.collection
[selected ] users.namespace.database
[selected ] users.object_id
[selected ] users.operation_type
[automatic] users.replication_key
Same stream not found in catalog warning here
5. users stream selected, filtering movies collection
reuben@reuben-Inspiron-14-5425:/tmp/mongodb-test$ TAP_MONGODB_FILTER_COLLECTIONS='["movies"]' meltano --environment dev select tap-mongodb --list --all
2024-12-06T23:13:43.564784Z [info ] Environment 'dev' is active
2024-12-06T23:13:45.461848Z [warning ] Stream `users` was not found in the catalog
Legend:
selected
excluded
automatic
Enabled patterns:
users.*
Selected attributes:
[excluded ] movies._sdc_batched_at
[excluded ] movies._sdc_extracted_at
[excluded ] movies.cluster_time
[excluded ] movies.document
[excluded ] movies.namespace
[excluded ] movies.namespace.collection
[excluded ] movies.namespace.database
[excluded ] movies.object_id
[excluded ] movies.operation_type
[excluded ] movies.replication_key
Same kind of warning, but this time for users
6. All streams selected (no selection criteria), filtering collection that does not exist
reuben@reuben-Inspiron-14-5425:/tmp/mongodb-test$ TAP_MONGODB_FILTER_COLLECTIONS='["doesnotexist"]' meltano --environment dev select tap-mongodb --list --all
2024-12-06T23:14:30.426717Z [info ] Environment 'dev' is active
2024-12-06T23:14:32.204437Z [warning ] Stream `users` was not found in the catalog
2024-12-06T23:14:32.204743Z [warning ] Stream `movies` was not found in the catalog
Legend:
selected
excluded
automatic
Enabled patterns:
*.*
Selected attributes:
Same warnings for both users and movies
Overall, no functional difference when specifying metadata in this case it seems...
Edgar Ramírez (Arch.dev)
12/06/2024, 11:20 PM
do you think the select streams being accessible from the meltano tap is a feature that would be added in the near future?
I plan to talk about this in office hours next week, cause I would love to implement it but I just have no idea how to tackle it 🙂
matt_elgazar
12/07/2024, 12:29 AM casinoData, so when I wrote TAP_MONGODB_FILTER_COLLECTIONS='["casinodata"]' it did not detect it but TAP_MONGODB_FILTER_COLLECTIONS='["casinoData"]' did!
matt_elgazar
12/07/2024, 12:30 AMTAP_MONGODB_FILTER_COLLECTIONS='["casinoData"]' meltano --environment dev select tap-mongodb --list --all
2024-12-07T00:30:13.729047Z [info ] Environment 'dev' is active
Legend:
selected
excluded
automatic
Enabled patterns:
*.*
Selected attributes:
[selected ] casinodata._sdc_batched_at
[selected ] casinodata._sdc_extracted_at
[selected ] casinodata.cluster_time
[selected ] casinodata.document
[selected ] casinodata.namespace
[selected ] casinodata.namespace.collection
[selected ] casinodata.namespace.database
[selected ] casinodata.object_id
[selected ] casinodata.operation_type
[automatic] casinodata.replication_key
Reuben (Matatika)
12/07/2024, 12:32 AMmatt_elgazar
12/07/2024, 12:32 AM .lower() line when searching the collections in discovery mode
matt_elgazar
12/07/2024, 12:33 AM casinoData -> casinodata
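One way to make the filter tolerant of this mismatch is to compare names case-insensitively while still returning the real collection names. A sketch only: `filter_collection_names` is a hypothetical helper, not the fork's code, and it assumes the setting arrives as a JSON array string like the TAP_MONGODB_FILTER_COLLECTIONS values used above.

```python
import json


def filter_collection_names(all_names, raw_filter):
    """Keep collections whose name matches a filter entry, ignoring case."""
    # raw_filter is a JSON array string, e.g. '["casinodata"]'.
    wanted = {name.lower() for name in json.loads(raw_filter)}
    # Return the original (case-preserved) names so later lookups still work.
    return [name for name in all_names if name.lower() in wanted]


collections = ["casinoData", "packages", "internalauthservers"]
print(filter_collection_names(collections, '["casinodata"]'))  # ['casinoData']
```

MongoDB collection names are case-sensitive, so two collections differing only by case would both match here; exact matching avoids that at the cost of the surprise seen above.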
Reuben (Matatika)
12/07/2024, 12:57 AMmatt_elgazar
12/07/2024, 12:58 AMReuben (Matatika)
12/07/2024, 1:06 AM pip_url back over to the default if/once it is merged - or if you want as much stability as possible (given that this will be running in production), fork my fork (uncheck "Copy the main branch only") and update to your username in the pip_url.
matt_elgazar
12/10/2024, 5:31 PMmatt_elgazar
12/10/2024, 5:32 PMpre-commit clean
pre-commit run --all-files
Cleaned /Users/melgazar9/.cache/pre-commit.
[INFO] Initializing environment for https://github.com/pre-commit/pre-commit-hooks.
[INFO] Initializing environment for https://github.com/lyz-code/yamlfix.git.
[INFO] Initializing environment for https://github.com/adrienverge/yamllint.git.
[INFO] Initializing environment for https://github.com/charliermarsh/ruff-pre-commit.
[INFO] Initializing environment for https://github.com/psf/black.
[INFO] Installing environment for https://github.com/pre-commit/pre-commit-hooks.
[INFO] Once installed this environment will be reused.
[INFO] This may take a few minutes...
[INFO] Installing environment for https://github.com/lyz-code/yamlfix.git.
[INFO] Once installed this environment will be reused.
[INFO] This may take a few minutes...
[INFO] Installing environment for https://github.com/adrienverge/yamllint.git.
[INFO] Once installed this environment will be reused.
[INFO] This may take a few minutes...
[INFO] Installing environment for https://github.com/charliermarsh/ruff-pre-commit.
[INFO] Once installed this environment will be reused.
[INFO] This may take a few minutes...
[INFO] Installing environment for https://github.com/psf/black.
[INFO] Once installed this environment will be reused.
[INFO] This may take a few minutes...
check for added large files..............................................Passed
check toml...............................................................Passed
check vcs permalinks.....................................................Passed
detect private key.......................................................Passed
fix end of files.........................................................Passed
python tests naming......................................................Passed
pretty format json...................................(no files to check)Skipped
trim trailing whitespace.................................................Passed
yamlfix..................................................................Passed
yamllint.................................................................Passed
ruff.....................................................................Passed
black....................................................................Passed
pylint...................................................................Passed
Reuben (Matatika)
12/10/2024, 7:59 PM maison>=2.0.0 only supports Python 3.9 and above? None of this should stop you from using your fork though.
matt_elgazar
12/10/2024, 8:23 PMReuben (Matatika)
12/10/2024, 9:11 PMmatt_elgazar
12/10/2024, 9:26 PM None and inf datatypes
Reuben (Matatika)
12/10/2024, 10:06 PMmatt_elgazar
12/10/2024, 10:07 PMReuben (Matatika)
12/10/2024, 10:08 PMReuben (Matatika)
12/10/2024, 10:11 PM