matt_elgazar
04/26/2023, 4:40 AM
`tap-mongodb`. I’m able to connect via a db connection string using the Python Mongo client, but parsing out that same string doesn’t seem to work when setting the variables in `meltano.yml`. Is there a way to directly use the db connection string URI without having to pass all of those parameters? If not, here is my `meltano.yml` file:
version: 1
default_environment: staging
project_id: 90ea9b0c-a807-4f93-9f54-0b8cec9c519f
environments:
- name: dev
- name: staging
- name: prod
plugins:
  extractors:
  - name: tap-mongodb
    variant: transferwise
    pip_url: pipelinewise-tap-mongodb
    config:
      user: staging
      host: <host>
      auth_database: Staging
      database: admin
      srv: mongodb+srv
      port: 27017
      replica_set: atlas-ehcfwa-shard-0
      ssl: 'true'
      select:
      - '*.*'
  - name: tap-csv
    variant: meltanolabs
    pip_url: git+https://github.com/MeltanoLabs/tap-csv.git
    config:
      files:
      - entity: tmp_csv_example
        path: csvs/tmp_csv_example.csv
        keys: [col1, col2]
  loaders:
  - name: target-snowflake
    variant: transferwise
    pip_url: pipelinewise-target-snowflake
    config:
      account: account
      user: MELTANO
      warehouse: MY_WAREHOUSE
      dbname: MELTANO
      file_format: CSV_FORMAT
      role: MELTANO
      default_target_schema: TMP_SCHEMA
  - name: target-jsonl
    variant: andyh1203
    pip_url: target-jsonl
When running `meltano config tap-mongodb test` I get a run timeout error:
```Plugin configuration is invalid
Catalog discovery failed: command ['/Users/melgazar9/scripts/github/DG/data-science/projects/elt/meltano_projects/tap_dg_mongodb/.meltano/extractors/tap-mongodb/venv/bin/tap-mongodb', '--config', '/Users/melgazar9/scripts/github/DG/data-science/projects/elt/meltano_projects/tap_dg_mongodb/.meltano/run/tap-mongodb/tap.fc8b4358-69cb-4558-b8c3-3f6986fbedea.config.json', '--discover'] returned 1 with stderr:
time=2023-04-25 23:32:05 name=tap_mongodb level=ERROR message=No replica set members available for replica set name "atlas-ehcfwa-shard-0", Timeout: 30s, Topology Description: <TopologyDescription id: 6448a927e78fa2444067000f, topology_type: ReplicaSetNoPrimary, servers: []>
Traceback (most recent call last):
File "/Users/melgazar9/scripts/github/DG/data-science/projects/elt/meltano_projects/tap_dg_mongodb/.meltano/extractors/tap-mongodb/venv/lib/python3.9/site-packages/tap_mongodb/__init__.py", line 322, in main
main_impl()
File "/Users/melgazar9/scripts/github/DG/data-science/projects/elt/meltano_projects/tap_dg_mongodb/.meltano/extractors/tap-mongodb/venv/lib/python3.9/site-packages/tap_mongodb/__init__.py", line 305, in main_impl
client.server_info().get('version', 'unknown'))
File "/Users/melgazar9/scripts/github/DG/data-science/projects/elt/meltano_projects/tap_dg_mongodb/.meltano/extractors/tap-mongodb/venv/lib/python3.9/site-packages/pymongo/mongo_client.py", line 1994, in server_info
return self.admin.command("buildinfo",
File "/Users/melgazar9/scripts/github/DG/data-science/projects/elt/meltano_projects/tap_dg_mongodb/.meltano/extractors/tap-mongodb/venv/lib/python3.9/site-packages/pymongo/database.py", line 757, in command
with self.__client._socket_for_reads(
File "/opt/homebrew/Cellar/python@3.9/3.9.16/Frameworks/Python.framework/Versions/3.9/lib/python3.9/contextlib.py", line 119, in __enter__
return next(self.gen)
File "/Users/melgazar9/scripts/github/DG/data-science/projects/elt/meltano_projects/tap_dg_mongodb/.meltano/extractors/tap-mongodb/venv/lib/python3.9/site-packages/pymongo/mongo_client.py", line 1387, in _socket_for_reads
server = self._select_server(read_preference, session)
File "/Users/melgazar9/scripts/github/DG/data-science/projects/elt/meltano_projects/tap_dg_mongodb/.meltano/extractors/tap-mongodb/venv/lib/python3.9/site-packages/pymongo/mongo_cl…Matt Menzenski
04/26/2023, 12:54 PM
`select` key is indented within the `config` options - it should be de-dented one unit so that it’s at the same indentation as the `config` key.
pat_nadolny
04/26/2023, 2:09 PM
matt_elgazar
04/26/2023, 3:15 PM
`meltano config tap-mongodb test`, which should just test the connection to MongoDB. Not sure what else I can do here. The db connection string works as expected when running in Python natively, but not with Meltano. I tested my connection to `target-snowflake` and those credentials work as expected. Something is going on with `tap-mongodb`.
Matt Menzenski
04/26/2023, 3:16 PM
matt_elgazar
04/26/2023, 3:19 PM
matt_elgazar
04/26/2023, 3:19 PM
Matt Menzenski
04/26/2023, 3:20 PM
Matt Menzenski
04/26/2023, 3:20 PM
matt_elgazar
04/26/2023, 3:21 PM
matt_elgazar
04/26/2023, 3:21 PM
`meltano.yml` file that would be great as well!
matt_elgazar
04/26/2023, 3:41 PM
`meltano add extractor tap-mongodb` but change the `meltano.yml` to look like this?
extractors:
- name: tap-mongodb
  variant: transferwise
  pip_url: git+https://github.com/menzenski/tap-mongodb.git@f51ecab
  config:
    mongodb_connection_string: <long-string>
    database_includes:
    - database: test-db
Would this work, or do I need to add an extractor that hits your git repo in the CLI?
Matt Menzenski
04/26/2023, 3:42 PM
plugins:
  extractors:
  - name: tap-mongodb
    namespace: tap_mongodb
    # variant: menzenski
    pip_url: git+https://github.com/menzenski/tap-mongodb.git@f51ecab72a5d7b2d4dae108356ec395c405dafc6
    executable: tap-mongodb
    capabilities:
    - state
    - catalog
    - discover
    - about
    - stream-maps
    config:
      add_record_metadata: true
      allow_modify_change_streams: true
    settings:
    - name: mongodb_connection_string
      kind: password
    - name: documentdb_credential_json_string
      kind: password
    - name: documentdb_credential_json_extra_options
      kind: string
    - name: prefix
      kind: string
    - name: start_date
      kind: date_iso8601
    - name: database_includes
      kind: array
    - name: add_record_metadata
      kind: boolean
    - name: allow_modify_change_streams
      kind: boolean
    - name: operation_types
      kind: array
Matt Menzenski
04/26/2023, 3:43 PM
`variant` and include a `namespace` (which I believe can be anything)
Matt Menzenski
04/26/2023, 3:43 PM
matt_elgazar
04/26/2023, 5:38 PM
- name: tap-mongodb
  variant: z3z1ma
  pip_url: git+https://github.com/z3z1ma/tap-mongodb.git
  config:
    mongo:
      host: '<DB Connection String>'
matt_elgazar
04/26/2023, 6:14 PM
`database_includes` filter but not a `collection_includes` filter. I’m super new to Meltano so I’m not sure if this is a setting somewhere. I’m looking for something like this:
config:
  mongo:
    host: <DB connection string>
  strategy: raw
  collections: ['collection1', 'collection2', 'collection3']
  query:
    collection1:
      load_type: full-refresh
      filter: {}
    collection2:
      load_type: incremental
      filter: {}
    collection3:
      load_type: incremental
      filter: {'_id': 1, 'winnerSeatNumber': 1, 'playerActions': 1, 'createdAt': 1, 'updatedAt': 1}
Matt Menzenski
04/26/2023, 6:15 PM
config:
  mongodb_connection_string: mongodb://admin:password@localhost:27017/
  database_includes:
  - database: test-database
    collection: TestDocument
and it will load incrementally by default
matt_elgazar
04/26/2023, 6:16 PM
Matt Menzenski
04/26/2023, 6:16 PM
`select` syntax should work
matt_elgazar
04/26/2023, 6:17 PM
Matt Menzenski
04/26/2023, 6:18 PM
Matt Menzenski
04/26/2023, 6:20 PM
pat_nadolny
04/26/2023, 6:31 PM
matt_elgazar
04/26/2023, 6:32 PM
- name: tap-mongodb
  namespace: tap_mongodb
  # variant: menzenski
  pip_url: git+https://github.com/menzenski/tap-mongodb.git@f51ecab72a5d7b2d4dae108356ec395c405dafc6
  executable: tap-mongodb
  capabilities:
  - state
  - catalog
  - discover
  - about
  - stream-maps
  config:
    add_record_metadata: true
    allow_modify_change_streams: true
    mongodb_connection_string: <db string>
    database_includes:
    - database: Staging
      collection: collection1, collection2
  settings:
  - name: mongodb_connection_string
    kind: password
  - name: documentdb_credential_json_string
    kind: password
  - name: documentdb_credential_json_extra_options
    kind: string
  - name: prefix
    kind: string
  - name: start_date
    kind: date_iso8601
  - name: database_includes
    kind: array
  - name: add_record_metadata
    kind: boolean
  - name: allow_modify_change_streams
    kind: boolean
  - name: operation_types
    kind: array
`meltano config tap-mongodb test`
2023-04-26T18:31:34.852644Z [info ] The default environment 'staging' will be ignored for `meltano config`. To configure a specific environment, please use the option `--environment=<environment name>`.
Need help fixing this problem? Visit http://melta.no/ for troubleshooting steps, or to
join our friendly Slack community.
Plugin configuration is invalid
ValueError: Unrecognized replication method FULL_TABLE. Only INCREMENTAL and LOG_BASED replication methods are supported.
Matt Menzenski
04/26/2023, 6:39 PM
`metadata` here should be at the same indentation level as `config` and `settings`
metadata:
  '*':
    replication-key: _id
    replication-method: INCREMENTAL
Matt Menzenski
04/26/2023, 6:39 PM
`FULL TABLE`
user
04/26/2023, 6:41 PM
`meltano select tap-mongodb --list` you’d see all available collections as individual streams and then you could choose the ones you want?
matt_elgazar
04/26/2023, 6:42 PM
Matt Menzenski
04/26/2023, 6:42 PM
Matt Menzenski
04/26/2023, 6:42 PM
matt_elgazar
04/26/2023, 6:50 PM
`outputs` and run `meltano run tap-mongodb target-jsonl --full-refresh` it says it’s ignoring the state, but no data gets populated in the `outputs` directory. Is this expected?
Matt Menzenski
04/26/2023, 6:51 PM
Matt Menzenski
04/26/2023, 6:51 PM
`output` or `outputs`? Singular or plural? I would expect `output` (singular) to receive that output
matt_elgazar
04/26/2023, 6:52 PM
`output`
matt_elgazar
04/26/2023, 6:56 PM
matt_elgazar
04/26/2023, 6:56 PM
matt_elgazar
04/26/2023, 7:13 PM
matt_elgazar
04/26/2023, 7:13 PM
Matt Menzenski
04/26/2023, 7:14 PM
database_includes:
- database: test-database
  collection: TestDocument
- database: test-database
  collection: OtherCollection
- database: other-database
  collection: ThirdCollection
matt_elgazar
04/26/2023, 7:14 PM
matt_elgazar
04/26/2023, 7:17 PM
matt_elgazar
04/26/2023, 8:18 PM
Matt Menzenski
04/26/2023, 8:20 PM
`env` setting in an environment to set a `TAP_MONGODB_MONGODB_CONNECTION_STRING` environment variable per Meltano environment, for example: https://docs.meltano.com/guide/configuration#environment-variables
matt_elgazar
04/26/2023, 8:22 PM
`MONGODB_DEV`, `MONGODB_STAGING`, `MONGODB_PROD`. Can I set those extractors and loaders in the yml file within your fork?
matt_elgazar
04/26/2023, 8:22 PM
Matt Menzenski
04/26/2023, 8:27 PM
`tap-mongodb` and a `target-snowflake` in the top level of your `meltano.yml` file (not within the `environments`).
Then I would add overrides of those values per environment, like in https://docs.meltano.com/guide/configuration#environment-variables:
plugins:
  extractors:
  - name: tap-mongodb
    # config here
  loaders:
  - name: target-snowflake
    # config here
environments:
- name: dev
  config:
    plugins:
      extractors:
      - name: tap-mongodb
        env:
          TAP_MONGODB_MONGODB_CONNECTION_STRING: "dev-connection-string"
      loaders:
      - name: target-snowflake
        env:
          TARGET_SNOWFLAKE_DEFAULT_TARGET_SCHEMA: "dev-schema"
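The variable names in the override above appear to follow a simple pattern: the plugin name and the setting name joined together, upper-cased, with dashes turned into underscores. A minimal Python sketch of that naming rule follows. The helper `setting_env_var` is hypothetical and only mirrors the names seen in this thread; it is not Meltano's own code.

```python
import re


def setting_env_var(plugin_name: str, setting_name: str) -> str:
    """Build an environment variable name like the ones used above.

    Hypothetical helper for illustration; it only mirrors the naming
    pattern visible in this thread.
    """
    raw = f"{plugin_name}_{setting_name}"
    # Replace anything that is not a letter or digit with an underscore.
    return re.sub(r"[^A-Za-z0-9]", "_", raw).upper()


print(setting_env_var("tap-mongodb", "mongodb_connection_string"))
print(setting_env_var("target-snowflake", "default_target_schema"))
```

The two calls print the same names that appear in the `env` blocks above.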
Matt Menzenski
04/26/2023, 8:27 PM
matt_elgazar
04/26/2023, 8:54 PM
Matt Menzenski
04/26/2023, 8:56 PM
metadata:
  '*':
    replication-key: _id
    replication-method: INCREMENTAL
on your tap-mongodb under the top `plugins:` block
Matt Menzenski
04/26/2023, 9:03 PM
`select` part:
select:
- _id
- address
- type
again, I haven’t tested `select` behavior myself but I would expect these to look more like `- collection1._id` or possibly even `- Staging_collection1._id`
matt_elgazar
04/26/2023, 9:17 PM
matt_elgazar
04/26/2023, 9:22 PM
`.env`? I don’t see where it’s referenced in the yml file
Matt Menzenski
04/26/2023, 9:24 PM
`.env` file for local testing, yes - when running Meltano non-locally we set them in the runtime environment as environment variables
matt_elgazar
04/26/2023, 9:25 PM
`.env` so it knows when you run `meltano --environment=dev run tap-csv target-jsonl` it reads the dev db connection string?
matt_elgazar
04/26/2023, 9:25 PM
Matt Menzenski
04/26/2023, 9:26 PM
Matt Menzenski
04/26/2023, 9:26 PM
Matt Menzenski
04/26/2023, 9:26 PM
`staging` environment, etc
matt_elgazar
04/26/2023, 9:27 PM
`.env` and one `meltano.yml` file. Is this reasonable?
matt_elgazar
04/26/2023, 9:27 PM
user
04/26/2023, 9:28 PM
Matt Menzenski
04/26/2023, 9:29 PM
`TAP_MONGODB_DEV_MONGODB_CONNECTION_STRING` (for dev) and `TAP_MONGODB_STAGING_MONGODB_CONNECTION_STRING` (for staging) environment variables in the `.env` file. I think that would work, based on my experience. But personally I’d recommend keeping your environments well separated.
user
04/26/2023, 9:29 PM
matt_elgazar
04/26/2023, 9:30 PM
matt_elgazar
04/27/2023, 6:11 PM
config:
  mongodb_connection_string: ${TESTING_MONGODB_CONNECTION_STRING}
  database_includes:
  - database: Testing
    collection: *
^ something like that
Matt Menzenski
04/27/2023, 6:11 PM
Matt Menzenski
04/27/2023, 6:12 PM
matt_elgazar
04/27/2023, 6:12 PM
matt_elgazar
04/28/2023, 4:29 AM
matt_elgazar
04/28/2023, 4:56 AM
if self.config.get("database_includes", []) != '*.*':
    collections_to_tap = self.config.get("database_includes", [])
else:
    collections_to_tap = client.get_collections()
for included in collections_to_tap:
    db_name = included["database"]
    collection = included["collection"]
Location in tap.py: https://github.com/menzenski/tap-mongodb/blob/main/tap_mongodb/tap.py#L227
user
04/28/2023, 1:05 PM
matt_elgazar
04/28/2023, 2:46 PM
`tap.py` debug mode it breaks at this point
if raise_errors:
    raise ConfigValidationError(summary)
`'Config validation failed: \'database_includes\' is a required property`…etc
user
04/28/2023, 3:06 PM
`.secrets/` directory that’s gitignored where people store their `config.json` containing real credentials
user
04/28/2023, 3:06 PM
"args": ["--config", ".secrets/config.json", "--discover"],
matt_elgazar
04/28/2023, 3:15 PM
`.secrets/config.json`
  ◦ `{'mongodb_connection_string': <string>}`
• Go in my IDE and debug `tap.py`
user
04/28/2023, 3:37 PM
`database_includes` so you need that in your config
matt_elgazar
04/28/2023, 3:52 PM
`meltano.yml`, are you saying that should be in `config.json`?
user
04/28/2023, 3:55 PM
`meltano config tap-x` to print your config contents. Also check out https://hub.meltano.com/singer/spec#taps to see more about Singer
Matt Menzenski
04/28/2023, 4:38 PM
`meltano config tap-mongodb > config.json` to create that JSON file IIRC
Matt Menzenski
04/28/2023, 4:41 PM
matt_elgazar
04/28/2023, 6:12 PM
`tap.py` I get stopped at the block `raise ConfigValidationError(summary)` because it’s missing config params. I think it’s pretty straightforward to add a rule like `if config value = '*.*' then grab all collections in database`, but I am having a hard time getting into debug mode 😂. Sorry, I’ve only been reading about Meltano for about 3 days now. Here are my steps:
• `meltano config tap-mongodb > config.json` and added my credentials
• run `tap.py` in debug mode
• get stopped at `raise ConfigValidationError(summary)`
• Same steps as above but copied `config.json` to `.secrets/`
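The validation error quoted earlier says `database_includes` is a required property, so a config file containing only the connection string would fail before the debugger gets anywhere useful. A small Python sketch of a pre-flight check is below. It assumes the required keys are just the two named in this thread; `REQUIRED_KEYS`, `missing_settings` and `check_config_file` are hypothetical names, not part of the tap.

```python
import json
from pathlib import Path

# Assumption: the two settings discussed in this thread; the tap's real
# schema may require more or fewer keys depending on the commit.
REQUIRED_KEYS = ("mongodb_connection_string", "database_includes")


def missing_settings(config: dict) -> list:
    """Return the required keys that are absent or empty in a tap config."""
    return [key for key in REQUIRED_KEYS if not config.get(key)]


def check_config_file(path: str) -> list:
    """Load a JSON config file and report missing settings."""
    config = json.loads(Path(path).read_text())
    return missing_settings(config)


if __name__ == "__main__":
    sample = {"mongodb_connection_string": "mongodb://localhost:27017/"}
    print(missing_settings(sample))  # ['database_includes']
```

Running `check_config_file(".secrets/config.json")` before starting a debug session would show which settings still need to be filled in.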
pat_nadolny
04/28/2023, 7:27 PM
matt_elgazar
04/29/2023, 12:37 AM
matt_elgazar
05/01/2023, 6:51 PM
matt_elgazar
05/01/2023, 6:51 PM
Matt Menzenski
05/01/2023, 7:58 PM
matt_elgazar
05/01/2023, 8:01 PM
Matt Menzenski
05/01/2023, 8:02 PM
Matt Menzenski
05/01/2023, 8:02 PM
Matt Menzenski
05/01/2023, 8:04 PM
plugins:
  extractors:
  - name: tap-mongodb-testing
    inherit_from: tap-mongodb
    config:
      database: Testing
    select:
    - A.*
    - B.*
    - C.*
  - name: tap-mongodb-prod
    inherit_from: tap-mongodb
    config:
      database: Prod
    select:
    - '*.*' # this is the default behavior so shouldn't need to be specified explicitly like this
matt_elgazar
05/01/2023, 8:06 PM
if collection == '*.*':
    < tap all collections >
Matt Menzenski
05/01/2023, 8:06 PM
matt_elgazar
05/01/2023, 8:07 PM
and in database: Testing, I’m assuming you can also replace `A.*` with `A.field1, A.field2`?
Matt Menzenski
05/01/2023, 8:07 PM
Matt Menzenski
05/01/2023, 8:08 PM
“and in database: Testing, I’m assuming you can also replace `A.*` with `A.field1, A.field2`?”
There’s a caveat for this - see the Settings header on https://hub.meltano.com/extractors/tap-mongodb--menzenski/ where I’ve added some notes on this.
Individual database collections may be selected using standard Meltano catalog selection. Note, though, that the field values which may be selected are not the fields on the database document, but rather the fields on the schema used by this tap. That is, while it is possible for example to opt out of the `ns` field:
```
select:
- '!*.ns'
```
the `document` field will always contain the entirety of the database document. This is true for log-based replication as well, as the change stream in that case is opened with the option `full_document="updateLookup"`. If you would prefer different behavior, please open an issue with the tap.
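To make the pattern syntax concrete, here is a rough Python sketch of glob-style matching on `entity.attribute` names with `!` exclusions. It uses the standard-library `fnmatch` module and only approximates Meltano's selection logic; the function `is_selected` and the sample stream names are made up for illustration.

```python
from fnmatch import fnmatchcase


def is_selected(name: str, patterns: list) -> bool:
    """Approximate select rules: include patterns, then '!' exclusions win."""
    includes = [p for p in patterns if not p.startswith("!")]
    excludes = [p[1:] for p in patterns if p.startswith("!")]
    if any(fnmatchcase(name, p) for p in excludes):
        return False
    return any(fnmatchcase(name, p) for p in includes)


patterns = ["bannedusers.*", "!*.ns"]
for field in ["bannedusers.document", "bannedusers.ns", "otherstream.document"]:
    print(field, is_selected(field, patterns))
```

With these patterns only `bannedusers.document` is selected: `bannedusers.ns` is excluded by `!*.ns`, and `otherstream.document` matches no include pattern.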
Matt Menzenski
05/01/2023, 8:09 PM
`document` field currently. The field selection syntax will control what other fields in the record schema for this tap (documented here) are present.
Matt Menzenski
05/01/2023, 8:10 PM
matt_elgazar
05/01/2023, 8:11 PM
matt_elgazar
05/02/2023, 12:50 AM
`*.*` means 😅 I looked through your PR and the only place I see it referencing `*.*` is in the config. I know generally `*` means everything, so `*.*` means select everything with any file type. Does it reference this via `glob` or something similar in `tap.py`?
matt_elgazar
05/02/2023, 4:41 AM
`tap_mongodb/connector.py` here https://github.com/menzenski/tap-mongodb/blob/main/tap_mongodb/connector.py#L43
Error
TypeError: unsupported operand type(s) for |: 'type' and 'NoneType'
I got it to run by changing `prefix: str | None = None` to `prefix=None`, but it does not run only one collection. Here’s my `meltano.yml` file
version: 1
send_anonymous_usage_stats: true
project_id: tap-mongodb
default_environment: dev
plugins:
  extractors:
  - name: tap-mongodb
    namespace: tap_mongodb
    pip_url: git+https://github.com/menzenski/tap-mongodb.git@74c80ab38db6a607b6d121ea9c720bbe1a93241c
    capabilities:
    - state
    - catalog
    - discover
    - about
    - stream-maps
    config:
      add_record_metadata: true
      allow_modify_change_streams: true
    metadata:
      '*':
        replication-key: _id
        replication-method: INCREMENTAL
  loaders:
  - name: target-jsonl
    variant: andyh1203
    pip_url: target-jsonl
  - name: target-snowflake
    variant: transferwise
    pip_url: pipelinewise-target-snowflake
    config:
      account: ${SNOWFLAKE_ACCOUNT}
      user: ${SNOWFLAKE_USER}
      warehouse: ${SNOWFLAKE_WAREHOUSE}
      dbname: ${SNOWFLAKE_DB}
      file_format: CSV_FORMAT
      role: ${SNOWFLAKE_ROLE}
      default_target_schema: MONGODB_DEV
environments:
- name: dev
  config:
    plugins:
      extractors:
      - name: tap-csv
        files:
        - entity: ban_exemptions
          path: csvs/ice_ban_exemptions.csv
          keys: [date_exempted, wallet]
      - name: tap-mongodb
        config:
          mongodb_connection_string: ${DG_DEV_MONGODB_CONNECTION_STRING}
          database: DG_Dev
          select:
          - tournamentnftinfos.*
      loaders:
      - name: target-snowflake
        env:
          TARGET_SNOWFLAKE_DEFAULT_TARGET_SCHEMA: MONGODB_DEV
- name: testing
  config:
    plugins:
      extractors:
      - name: tap-mongodb
        config:
          mongodb_connection_string: ${DG_TESTING_MONGODB_CONNECTION_STRING}
          database: DG_Testing
          select:
          - bannedusers.*
      loaders:
      - name: target-snowflake
        env:
          TARGET_SNOWFLAKE_DEFAULT_TARGET_SCHEMA: MONGODB_TESTING
Matt Menzenski
05/02/2023, 4:55 AM
`prefix: str | None = None,` syntax I think might be Python 3.10+ only. Changing that to `Optional[str]` (with `from typing import Optional`) should work if you’re on an older version of Python. (Feel free to log an issue, I should change this)
Matt Menzenski
05/02/2023, 4:57 AM
`select` key should be a sibling of `config`, not nested beneath it. The form is `entity.field`, where “entity” for this tap is equal to the name of the collection in lower case (plus the prefix if you are using it, which you don’t appear to be).
So if you have a collection `bannedUsers` in the `DG_Testing` database, you could specify
select:
- bannedusers.*
to include that collection with all possible fields.
matt_elgazar
05/02/2023, 5:10 AM
matt_elgazar
05/02/2023, 3:15 PM
`meltano run tap-mongodb target-jsonl` is hitting all collections given my yml above?
Matt Menzenski
05/02/2023, 3:27 PM
Matt Menzenski
05/02/2023, 3:27 PM
Matt Menzenski
05/02/2023, 4:18 PM
matt_elgazar
05/02/2023, 4:48 PM
ModuleNotFoundError: No module named 'backports'
Here are my steps:
• Go in meltano and change `pip_url: git+https://github.com/menzenski/tap-mongodb.git@80341d4a7f6d9ba3f7224e14f7492f17bd6f212f`
• in terminal run `meltano add extractor tap-mongodb`
• run `meltano run tap-mongodb target-jsonl`
matt_elgazar
05/02/2023, 4:50 PM
from backports.cached_property import cached_property
matt_elgazar
05/02/2023, 4:50 PM
Matt Menzenski
05/02/2023, 4:51 PM
Matt Menzenski
05/02/2023, 4:51 PM
Matt Menzenski
05/02/2023, 4:51 PM
Matt Menzenski
05/02/2023, 5:06 PM
matt_elgazar
05/02/2023, 5:08 PM
Matt Menzenski
05/02/2023, 5:08 PM
matt_elgazar
05/02/2023, 5:09 PM
environments:
- name: dev
  config:
    plugins:
      extractors:
      - name: tap-mongodb
        config:
          mongodb_connection_string: ${DG_DEV_MONGODB_CONNECTION_STRING}
          database: DG_Dev
          select:
          - tournamentnftinfos.*
      loaders:
      - name: target-snowflake
        env:
          TARGET_SNOWFLAKE_DEFAULT_TARGET_SCHEMA: MONGODB_DEV
Matt Menzenski
05/02/2023, 5:09 PM
config:
  mongodb_connection_string: ${DG_DEV_MONGODB_CONNECTION_STRING}
  database: DG_Dev
  select:
  - tournamentnftinfos.*
should be
config:
  mongodb_connection_string: ${DG_DEV_MONGODB_CONNECTION_STRING}
  database: DG_Dev
select:
- tournamentnftinfos.*
matt_elgazar
05/02/2023, 5:11 PM
matt_elgazar
05/02/2023, 5:11 PM
matt_elgazar
05/08/2023, 4:34 PM
• `for collection in self.database.list_collection_names():`
• to `for collection in <get collections from select in yml file>:`
Matt Menzenski
05/08/2023, 4:36 PM
`list_collection_names()` would error if you didn’t have permission to all collections.
Matt Menzenski
05/08/2023, 4:36 PM
matt_elgazar
05/08/2023, 4:52 PM
matt_elgazar
05/08/2023, 4:53 PM
`list_collection_names`
matt_elgazar
05/08/2023, 5:00 PM
matt_elgazar
05/08/2023, 7:42 PM
matt_elgazar
05/08/2023, 7:46 PM
`poetry run tap-mongodb --config .secrets/config.json`
user
05/08/2023, 7:49 PM
matt_elgazar
05/08/2023, 7:50 PM
matt_elgazar
05/08/2023, 7:50 PM
matt_elgazar
05/08/2023, 8:03 PM
matt_elgazar
05/08/2023, 9:36 PM
`--full-refresh`
Here’s the link.
https://github.com/melgazar9/tap-mongodb/commit/03c8e978a2f9c286594454e95e6cc5b7abbc04e8
matt_elgazar
05/08/2023, 9:37 PM
`meltano run tap-mongodb target-jsonl` I get that everything ran successfully, but no data is populated in `output/`
Matt Menzenski
05/08/2023, 9:38 PM
“collections = self.config.get('select')”
`select` is not within the `config` object, it’s a sibling of `config`. In theory, that line of code isn’t hit if you have the `select` key defined, because it should be using that provided catalog https://github.com/melgazar9/tap-mongodb/commit/03c8e978a2f9c286594454e95e6cc5b7abbc04e8#diff-a3089305435827b54212373[…]3c254329e112cdd8f4b68686L191
Matt Menzenski
05/08/2023, 9:38 PM
`input_catalog` is being set to
Matt Menzenski
05/08/2023, 9:38 PM
`select` they should be present in that `input_catalog` object
matt_elgazar
05/08/2023, 9:41 PM
extractors:
- name: tap-mongodb
  config:
    mongodb_connection_string: ${MONGODB_CONNECTION_STRING}
    database: Testing
    select:
    - clientConfigs.*
    - bannedusers.*
Matt Menzenski
05/08/2023, 9:41 PM
config:
  mongodb_connection_string: ${MONGODB_CONNECTION_STRING}
  database: Testing
  select:
  - clientConfigs.*
  - bannedusers.*
should be
config:
  mongodb_connection_string: ${MONGODB_CONNECTION_STRING}
  database: Testing
select:
- clientConfigs.*
- bannedusers.*
matt_elgazar
05/08/2023, 9:44 PM
`meltano config tap-mongodb > .secrets/config.json` no `select` appeared in the config using this way. It’s going to select everything then
Matt Menzenski
05/08/2023, 9:44 PM
Matt Menzenski
05/08/2023, 9:45 PM
`select` is not part of the `config` - it’ll be passed by Meltano to the tap as the `--catalog` argument
matt_elgazar
05/08/2023, 9:45 PM
`Skipping` it
matt_elgazar
05/08/2023, 9:45 PM
`list_collection_names()`
Matt Menzenski
05/08/2023, 9:45 PM
Matt Menzenski
05/08/2023, 9:46 PM
`clientConfigs` should be `clientconfigs` (lower case) too
matt_elgazar
05/08/2023, 9:46 PM
matt_elgazar
05/08/2023, 9:47 PM
`select` is not under the `config` portion?
Matt Menzenski
05/08/2023, 9:50 PM
`tap --config CONFIG [--state STATE] [--catalog CATALOG]` is the Singer interface for calling a tap. The `select` settings that you define in `meltano.yml` are parsed by Meltano and passed to the tap as that `--catalog` parameter (not as `--config`)
matt_elgazar
05/08/2023, 9:50 PM
matt_elgazar
05/08/2023, 9:50 PM
matt_elgazar
05/08/2023, 9:51 PM
collections = self.config.get('select')
if collections:
    result["streams"].extend(self.connector.discover_catalog_entries(collections=collections))
else:
    result["streams"].extend(self.connector.discover_catalog_entries())
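Since `select` reaches the tap through `--catalog` rather than `--config`, the selected collections would have to be read from the catalog, not from `self.config`. As a rough illustration, assuming the usual Singer catalog layout in which stream-level metadata has an empty `breadcrumb` and a `selected` flag, the selected stream names could be pulled out like this. The function `selected_streams` and the sample catalog are hypothetical, not code from the tap.

```python
def selected_streams(catalog: dict) -> list:
    """Return tap_stream_ids whose stream-level metadata marks them selected."""
    names = []
    for stream in catalog.get("streams", []):
        for entry in stream.get("metadata", []):
            # An empty breadcrumb refers to the stream itself, not a field.
            if entry.get("breadcrumb") == [] and entry.get("metadata", {}).get("selected"):
                names.append(stream["tap_stream_id"])
    return names


sample_catalog = {
    "streams": [
        {"tap_stream_id": "clientconfigs",
         "metadata": [{"breadcrumb": [], "metadata": {"selected": True}}]},
        {"tap_stream_id": "bannedusers",
         "metadata": [{"breadcrumb": [], "metadata": {"selected": False}}]},
    ]
}
print(selected_streams(sample_catalog))  # ['clientconfigs']
```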
Matt Menzenski
05/08/2023, 9:53 PM
`select` options to Meltano, it should pass them to the tap-mongodb tap as that `self.input_catalog` property.
In theory, if that `input_catalog` property exists, the tap will use that and not execute that `discover_catalog_entries` at all
Matt Menzenski
05/08/2023, 9:53 PM
`self.input_catalog` just before the return in `return self.input_catalog.to_dict()` on line 192 here https://github.com/melgazar9/tap-mongodb/commit/03c8e978a2f9c286594454e95e6cc5b7abbc04e8#diff-a3089305435827b54212373[…]3c254329e112cdd8f4b68686L191
Matt Menzenski
05/08/2023, 9:54 PM
`select` key on the same level as `config` (as a sibling of it, not as a child), does that `self.input_catalog` show those collections?
matt_elgazar
05/08/2023, 9:57 PM
matt_elgazar
05/08/2023, 10:06 PM
Matt Menzenski
05/08/2023, 10:23 PM
for collection in self.database.list_collection_names():
to
for collection in self.database.list_collection_names(authorizedCollections=True, nameOnly=True):
that this should resolve your permissions error.
per the docs:
“Users without the required privilege can run the command with both `authorizedCollections` and `nameOnly` options set to `true`. In this case, the command returns just the name and type of the collection(s) to which the user has privileges.”
• https://www.mongodb.com/docs/manual/reference/command/listCollections/#required-access
• https://pymongo.readthedocs.io/en/stable/api/pymongo/database.html#pymongo.database.Database.list_collection_names
• https://www.mongodb.com/docs/manual/reference/command/listCollections/#command-fields
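A minimal sketch of that change as a standalone helper, assuming `database` is a `pymongo` `Database` object. The function name is made up for illustration; the two keyword arguments are the `listCollections` options quoted above, passed through `list_collection_names` as in the suggested change.

```python
def authorized_collection_names(database) -> list:
    """List only the collections the connected user may access.

    `database` is assumed to be a pymongo Database; any object with a
    compatible list_collection_names method works for this sketch.
    """
    names = database.list_collection_names(
        authorizedCollections=True,
        nameOnly=True,
    )
    return sorted(names)
```

Sorting is only there to make the result deterministic; it is not part of the suggested fix.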
matt_elgazar
05/08/2023, 10:34 PM
matt_elgazar
05/09/2023, 3:39 AM
Matt Menzenski
05/09/2023, 3:42 AM
matt_elgazar
05/09/2023, 3:34 PM
`_id` objectId column in mongo is not set properly. Take a look at the screenshot. I thought this was the section in `meltano.yml` that accounts for it? Is there
config:
  add_record_metadata: true
  allow_modify_change_streams: true
metadata:
  '*':
    replication-key: _id
    replication-method: INCREMENTAL
Here is the prod section of `meltano.yml`
- name: production
  config:
    plugins:
      extractors:
      - name: tap-mongodb
        config:
          mongodb_connection_string: ${PRODUCTION_MONGODB_CONNECTION_STRING}
          database: Production
        select:
        - bannedusers.*
        - arcadehandanalyticsdata.*
      loaders:
      - name: target-snowflake
        env:
          TARGET_SNOWFLAKE_DEFAULT_TARGET_SCHEMA: MONGODB_PRODUCTION
matt_elgazar
05/09/2023, 3:35 PM
Matt Menzenski
05/09/2023, 3:43 PM
`_id` column is set to the string value of the document’s object ID (that is, to `str(document["_id"])`), the tap doesn’t work well - that hex string is not alphanumerically sortable, and that means that the tap cannot resume from a saved checkpoint. It must start over from the beginning each time it’s run.
My first attempt at a workaround (which you’re seeing there) involved using an ISO-8601 datetime string as the `_id` field, and setting that to the timestamp value of the document’s object ID (that is, to `document["_id"].generation_time.isoformat()`). That works, in that it is sortable (so the tap can now resume from a saved checkpoint if it errors out, for instance), but the timestamp component of the ObjectID is only granular to whole seconds. This is unlikely to cause an issue, but I was concerned about the potential for an individual record to be missed, if multiple records were added during the same second and the tap had an error when processing those records.
I pushed a change last night that sets the `_id` field (which probably should have been named `replication_key` or something instead, in hindsight) to a string with the format `2021-09-22T01:02:48+00:00|614a80b81ad8c60001b7d5f3` - this is the timestamp, then a pipe `|` delimiter, then the hex object ID. This should account for everything from the perspective of the tap’s replication key - it is both alphanumerically sortable and it uniquely identifies a record.
Last night I also updated the schema to add an explicit `object_id` field in the tap’s output. This column will always contain the hex ObjectID string, like `614a80b81ad8c60001b7d5f3`.
Matt Menzenski
05/09/2023, 3:44 PM
Matt Menzenski
05/09/2023, 3:44 PM
`object_id` column will have the value you expect
matt_elgazar
05/09/2023, 3:49 PM
`_id` object in MongoDB to a time-sensitive format in Python
matt_elgazar
05/09/2023, 3:49 PM
Matt Menzenski
05/09/2023, 3:49 PM
matt_elgazar
05/09/2023, 3:49 PM
Matt Menzenski
05/09/2023, 3:51 PM
matt_elgazar
05/09/2023, 3:53 PM
matt_elgazar
05/09/2023, 3:56 PM
Matt Menzenski
05/09/2023, 3:57 PM
`.meltano/logs`?
matt_elgazar
05/09/2023, 4:01 PM
`mongodb_staging` in Snowflake
2. ran `meltano --environment=staging run tap-mongodb target-snowflake`
   - errored out
3. ran `meltano --environment=staging run tap-mongodb target-snowflake --full-refresh`
   - no error
4. ran `meltano --environment=staging run tap-mongodb target-snowflake`
   - no error
I do see in Snowflake as the `_id` column: `2023-03-13T20:19:34+00:00|640f855606ce35ab100be533` so it pipes the timestamp and the object ID
matt_elgazar
05/09/2023, 4:01 PM
matt_elgazar
05/09/2023, 4:02 PM
raise ValueError("Invalid IncrementalId string")
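For reference, the combined key format described earlier in the thread (ISO-8601 timestamp, a pipe, then the hex ObjectId) can be reproduced with the standard library alone, because the first four bytes of an ObjectId are its creation time in seconds. This is an illustrative sketch, not the tap's own code; `make_replication_key` and `parse_replication_key` are hypothetical names.

```python
from datetime import datetime, timezone


def make_replication_key(object_id_hex: str) -> str:
    """Build '<ISO-8601 timestamp>|<hex ObjectId>' from a 24-char hex id."""
    if len(object_id_hex) != 24:
        raise ValueError("expected a 24-character hex ObjectId")
    # The first 4 bytes (8 hex chars) are a big-endian Unix timestamp.
    seconds = int(object_id_hex[:8], 16)
    created = datetime.fromtimestamp(seconds, tz=timezone.utc)
    return f"{created.isoformat()}|{object_id_hex}"


def parse_replication_key(key: str) -> tuple:
    """Split a combined key back into (datetime, hex ObjectId)."""
    timestamp, sep, object_id_hex = key.partition("|")
    if not sep or len(object_id_hex) != 24:
        raise ValueError("not a '<timestamp>|<object id>' key")
    return datetime.fromisoformat(timestamp), object_id_hex


print(make_replication_key("614a80b81ad8c60001b7d5f3"))
# 2021-09-22T01:02:48+00:00|614a80b81ad8c60001b7d5f3
```

Because the UTC timestamp prefix has a fixed width, keys built this way sort chronologically as plain strings, while the ObjectId suffix keeps them unique within the same second.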
Matt Menzenski
05/09/2023, 4:05 PM
Matt Menzenski
05/09/2023, 4:08 PM
Matt Menzenski
05/09/2023, 4:09 PM
`_id` column set to the hex string like `640f855606ce35ab100be533` so the last time you ran the tap?
Matt Menzenski
05/09/2023, 4:09 PM
matt_elgazar
05/09/2023, 4:10 PM
`6656fe34e3a8ebf2d7b8e26552f0ca9d591444b3`
matt_elgazar
05/09/2023, 4:11 PM
matt_elgazar
05/09/2023, 4:11 PM
Matt Menzenski
05/09/2023, 4:12 PM
matt_elgazar
05/09/2023, 4:12 PM
`meltano --environment=staging run tap-mongodb target-snowflake` and it worked after dropping the schema. I think the issue was the previous run was from a prior commit.
matt_elgazar
05/09/2023, 4:30 PM
Matt Menzenski
05/09/2023, 4:34 PM
Matt Menzenski
05/09/2023, 4:35 PM
matt_elgazar
05/09/2023, 4:46 PM
Matt Menzenski
05/09/2023, 4:46 PM
Matt Menzenski
05/09/2023, 4:46 PM
matt_elgazar
05/09/2023, 5:07 PM
matt_elgazar
05/16/2023, 2:25 PM
Matt Menzenski
05/16/2023, 4:08 PM
matt_elgazar
05/16/2023, 4:23 PM
Matt Menzenski
05/16/2023, 5:02 PM
`_id` column in the output has been renamed to `replication_key`
likewise,
`operationType` -> `operation_type`
`clusterTime` -> `cluster_time`
`ns` -> `namespace` (and child fields `db` -> `database`, `coll` -> `collection`)
matt_elgazar
05/18/2023, 8:02 PM
`meltano run` sometimes it updates the table but sometimes no rows are sent to the target. Is this just me?
matt_elgazar
05/18/2023, 8:03 PM
matt_elgazar
05/18/2023, 8:24 PM
Matt Menzenski
05/18/2023, 8:27 PM
Matt Menzenski
05/18/2023, 8:28 PM
matt_elgazar
05/18/2023, 9:31 PM
matt_elgazar
05/19/2023, 3:27 PM
KeyError: '_id'
Matt Menzenski
05/19/2023, 3:29 PM
`replication-key: replication_key`?
matt_elgazar
05/19/2023, 4:06 PM
Matt Menzenski
05/19/2023, 4:06 PM
matt_elgazar
05/19/2023, 4:07 PM
matt_elgazar
05/19/2023, 5:28 PM
matt_elgazar
05/19/2023, 5:28 PM
Matt Menzenski
05/19/2023, 5:32 PM
Matt Menzenski
05/19/2023, 5:33 PM
Matt Menzenski
05/19/2023, 5:33 PM
matt_elgazar
05/19/2023, 5:33 PM
andrej_gerasimov
11/09/2023, 7:43 AM