# plugins-general
h
Hi, I’m trying to use tap-mongodb, and I already got some steps done, but I’m stuck at the ‘selection’ step and not sure what to do next. 1. I can connect to the database and ‘inspect’ it by using
meltano invoke tap-mongodb -d
2. Not sure what to do with the command `meltano select tap-mongodb --list --all` because it does not show much information. 3. I’ve tried `meltano elt tap-mongodb target-postgres` but without much success. It does connect, but the
Sync Summary
is always empty. Can anyone help me get these final steps done? Thanks!!
After some trial and error I got something …
Copy code
meltano         | Incremental state has been updated at 2021-03-17 18:37:55.095300.
tap-mongodb     | INFO Must complete full table sync before starting oplog replication for eattasty-prd-Allergie
tap-mongodb     | INFO Starting full table sync for eattasty-prd-Allergie
meltano         | Incremental state has been updated at 2021-03-17 18:37:55.167207.
target-postgres | ERROR Allergie - Table for stream does not exist
tap-mongodb     | INFO Querying eattasty-prd-Allergie with:
tap-mongodb     | 	Find Parameters: {'$lte': 'sulphites'}
tap-mongodb     | INFO Syncd 14 records for eattasty-prd-Allergie
tap-mongodb     | INFO Starting oplog sync for eattasty-prd-Allergie
tap-mongodb     | INFO Querying eattasty-prd-Allergie with:
tap-mongodb     | 	Find Parameters: {'ts': {'$gte': Timestamp(1616006266, 1)}}
tap-mongodb     | 	Projection: {'ts': 1, 'ns': 1, 'op': 1, 'o2': 1, 'o': 1}
tap-mongodb     | 	oplog_replay: True
target-postgres | INFO Stream Allergie (allergie) with max_version 1616006275161 targetting 1616006275161
target-postgres | INFO Root table name Allergie
target-postgres | INFO Writing batch with 14 records for `Allergie` with `key_properties`: `['_id']`
target-postgres | INFO METRIC: {"type": "counter", "metric": "record_count", "value": 0, "tags": {"count_type": "batch_rows_persisted", "path": ["Allergie"], "database": "postgres", "schema": "test"}}
target-postgres | INFO METRIC: {"type": "timer", "metric": "job_duration", "value": 0.0007932186126708984, "tags": {"job_type": "batch", "path": ["Allergie"], "database": "postgres", "schema": "test", "status": "failed"}}
target-postgres | ERROR Exception writing records
target-postgres | Traceback (most recent call last):
target-postgres |   File "/Users/kimus/Develop/eattasty/meltano/mongo2pg/mongo2pg/.meltano/loaders/target-postgres/venv/lib/python3.8/site-packages/target_postgres/postgres.py", line 295, in write_batch
...
target-postgres |   File "/Users/kimus/Develop/eattasty/meltano/mongo2pg/mongo2pg/.meltano/loaders/target-postgres/venv/lib/python3.8/site-packages/target_postgres/postgres.py", line 309, in write_batch
target-postgres |     raise PostgresError(message, ex)
target-postgres | target_postgres.exceptions.PostgresError: ('Exception writing records', KeyError('_id'))
meltano         | Loading failed (1): target_postgres.exceptions.PostgresError: ('Exception writing records', KeyError('_id'))
meltano         | ELT could not be completed: Loader failed
ELT could not be completed: Loader failed
d
@helder_rossa Do your collections have an `_id` column? Did you make sure this column is selected and extracted? The `KeyError('_id')` we see in the logs suggests that the tap is telling the target to use the `_id` column as the primary key, but the key is actually missing from the extracted records
h
@douwe_maan this specific collection has an `_id` column of type string
d
@helder_rossa Can you run again with `meltano --log-level=debug` so that we can see all of the SCHEMA and RECORD messages printed?
I think the issue is that the SCHEMA message doesn't actually describe the fields
I looked into this a little while ago; you're seeing https://gitlab.com/meltano/meltano/-/issues/2517#note_489278113
h
Btw, I’m not sure if I did something wrong, but I just configured some exclusions in the `select:` option.
d
Which target-postgres are you using? the transferwise variant? You may have better luck with the datamill-co or meltano variant (which I used in that issue), or with https://github.com/transferwise/pipelinewise-tap-mongodb instead of https://github.com/singer-io/tap-mongodb as https://github.com/singer-io/tap-mongodb/issues/48#issuecomment-758451877 suggests
h
And I have used the stitchdata service with this database, and it worked fine.
d
I don't think you're doing anything wrong
h
Copy code
loaders:
  - name: target-postgres
    variant: datamill-co
    pip_url: singer-target-postgres
This collection only has 14 records
so, it’s strange that it didn’t figure out the schema 😄
d
The issue appears to be that tap-mongodb sends one empty SCHEMA message before it starts sending real ones, and the target is tripping over that empty one
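To make the failure mode concrete, here is a hypothetical pass-through filter (not part of Meltano or the tap; `filter_empty_schemas` is an invented name) that shows exactly which message the target chokes on, by dropping SCHEMA messages that declare no properties:

```python
import json

def filter_empty_schemas(lines):
    """Yield Singer messages, skipping SCHEMA messages that declare
    no properties, the kind that trips up target-postgres here.
    Purely illustrative; a real fix belongs in the tap itself."""
    for line in lines:
        try:
            msg = json.loads(line)
        except json.JSONDecodeError:
            yield line  # non-JSON output passes through untouched
            continue
        if msg.get("type") == "SCHEMA" and not msg.get("schema", {}).get("properties"):
            continue  # the empty {"type": "object"} schema from the logs
        yield line

# The empty SCHEMA from the logs above is dropped; records survive:
messages = [
    '{"type": "SCHEMA", "stream": "Allergie", "schema": {"type": "object"}, "key_properties": ["_id"]}',
    '{"type": "RECORD", "stream": "Allergie", "record": {"_id": "celery"}}',
]
survivors = list(filter_empty_schemas(messages))
```

Note that simply dropping the only SCHEMA message would break targets that require one per stream; this sketch only illustrates what the target is tripping over.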
h
```
tap-mongodb (out) | {"type": "STATE", "value": {"bookmarks": {"eattasty-prd-Allergie": {"last_replication_method": "LOG_BASED"}}, "currently_syncing": "eattasty-prd-Allergie"}}
tap-mongodb (out) | {"type": "SCHEMA", "stream": "Allergie", "schema": {"type": "object"}, "key_properties": ["_id"]}
target-postgres (out) | {"bookmarks": {"eattasty-prd-Allergie": {"last_replication_method": "LOG_BASED"}}, "currently_syncing": "eattasty-prd-Allergie"}
meltano | INFO Incremental state has been updated at 2021-03-17 203949.421761.
meltano | DEBUG Incremental state: {'bookmarks': {'eattasty-prd-Allergie': {'last_replication_method': 'LOG_BASED'}}, 'currently_syncing': 'eattasty-prd-Allergie'}
tap-mongodb | INFO Must complete full table sync before starting oplog replication for eattasty-prd-Allergie
tap-mongodb | INFO Starting full table sync for eattasty-prd-Allergie
tap-mongodb (out) | {"type": "STATE", "value": {"bookmarks": {"eattasty-prd-Allergie": {"last_replication_method": "LOG_BASED", "oplog_ts_time": 1616013589, "oplog_ts_inc": 1, "version": 1616013589491}}, "currently_syncing": "eattasty-prd-Allergie"}}
tap-mongodb (out) | {"type": "ACTIVATE_VERSION", "stream": "Allergie", "version": 1616013589491}
target-postgres (out) | {"bookmarks": {"eattasty-prd-Allergie": {"last_replication_method": "LOG_BASED", "oplog_ts_time": 1616013589, "oplog_ts_inc": 1, "version": 1616013589491}}, "currently_syncing": "eattasty-prd-Allergie"}
meltano | INFO Incremental state has been updated at 2021-03-17 203949.497529.
meltano | DEBUG Incremental state: {'bookmarks': {'eattasty-prd-Allergie': {'last_replication_method': 'LOG_BASED', 'oplog_ts_time': 1616013589, 'oplog_ts_inc': 1, 'version': 1616013589491}}, 'currently_syncing': 'eattasty-prd-Allergie'}
target-postgres | ERROR Allergie - Table for stream does not exist
tap-mongodb | INFO Querying eattasty-prd-Allergie with:
tap-mongodb | Find Parameters: {'$lte': 'sulphites'}
tap-mongodb (out) | {"type": "RECORD", "stream": "Allergie", "record": {"_id": "celery", "locales": {"0": {"lang": "en", "name": "celery free"}, "1": {"lang": "pt", "name": "sem aipo"}}}, "version": 1616013589491, "time_extracted": "2021-03-17T203949.526536Z"}
tap-mongodb (out) | {"type": "RECORD", "stream": "Allergie", "record": {"_id": "crustaceans", "locales": {"0": {"lang": "en", "name": "crustaceans free"}, "1": {"lang": "pt", "name": "sem crust\u00e1ceos"}}}, "version": 1616013589491, "time_extracted": "2021-03-17T203949.526536Z"}
tap-mongodb (out) | {"type": "RECORD", "stream": "Allergie", "record": {"_id": "dairy", "locales": {"0": {"lang": "en", "name": "lactose free"}, "1": {"lang": "pt", "name": "sem lactose"}}}, "version": 1616013589491, "time_extracted": "2021-03-17T203949.526536Z"}
tap-mongodb (out) | {"type": "RECORD", "stream": "Allergie", "record": {"_id": "egg", "locales": {"0": {"lang": "en", "name": "egg free"}, "1": {"lang": "pt", "name": "sem ovo"}}}, "version": 1616013589491, "time_extracted": "2021-03-17T203949.526536Z"}
tap-mongodb (out) | {"type": "RECORD", "stream": "Allergie", "record": {"_id": "fish", "locales": {"0": {"lang": "en", "name": "fish free"}, "1": {"lang": "pt", "name": "sem peixe"}}}, "version": 1616013589491, "time_extracted": "2021-03-17T203949.526536Z"}
tap-mongodb (out) | {"type": "RECORD", "stream": "Allergie", "record": {"_id": "gluten", "locales": {"0": {"lang": "en", "name": "gluten free"}, "1": {"lang": "pt", "name": "sem glutten"}}}, "version": 1616013589491, "time_extracted": "2021-03-17T203949.526536Z"}
tap-mongodb (out) | {"type": "RECORD", "stream": "Allergie", "record": {"_id": "lupine", "locales": {"0": {"lang": "en", "name": "lupine free"}, "1": {"lang": "pt", "name": "sem tremo\u00e7o"}}}, "version": 1616013589491, "time_extracted": "2021-03-17T203949.526536Z"}
tap-mongodb (out) | {"type": "RECORD",…
```
That confirms the issue is with the first SCHEMA message:
Copy code
tap-mongodb (out)     | {"type": "SCHEMA", "stream": "Allergie", "schema": {"type": "object"}, "key_properties": ["_id"]}
h
what’s wrong? 🙂 most of the catalog is like that
I’ll try the other plugin variants then
d
Yep, but the target expects the property identified by `key_properties` (`_id`) to actually exist inside the `schema` object, which is empty in this case 😬
So the tap is behaving in a way that's incompatible with what targets expect. This is the kind of inconsistency we're going to address with https://gitlab.com/meltano/singer-sdk 🙂
h
type object, could have ‘any’ key
object => { _id: … }
d
Right, but the target wants to know what columns it should create in the new table, and it gets confused when that's missing 😄 tap-mongodb sends more complete SCHEMA messages once it has seen some records, but the target falls over the first, empty SCHEMA message
h
ok… so discovery is wrong
it’s getting the schema wrong
I’ll test the other variant then
d
👍
h
I’m surprised that anyone got this running. it’s just a simple collection :-|
d
This tap-mongodb definitely isn't documented as clearly as it should be; your issue will help fix that 🙂
h
I hope I can make it work. And with that, the documentation could be improved too.
I also took a look at pipelinewise today… let’s check that out then 😄
d
Their taps and targets are great, but Meltano is the best way to run them 😄
h
😉
this does not work:
Copy code
- name: tap-mongodb
  variant: pipelinewise
  pip_url: git+https://github.com/transferwise/pipelinewise-tap-mongodb
do I need to do `meltano add --custom …` instead?
d
Correct, Meltano doesn't know that variant yet
h
ok
nice, this already has auth_database
d
Great! Sounds like we may want to make that the new default variant in Meltano 🙂
h
maybe..
not started testing yet 😛
ahh… select --list now works
d
Great
h
I was thinking I was dumb 😛
nothing seemed to work
d
Sounds like the issue was with that variant of tap-mongodb, and this one's much better!
h
not sure what this is:
Copy code
[automatic] eattasty-prd-Address._id
	[automatic] eattasty-prd-Address._sdc_deleted_at
	[automatic] eattasty-prd-Address.document
_sdc_* is from stitchdata migration
d
Interesting
The pipelinewise tap probably uses that same prefix for metadata columns
That makes it look like the entire document will be extracted in a single `document` object, which may end up being a single `jsonb` `document` column once loaded with pipelinewise-target-postgres
h
I hope not 😄
do I need to test it?
Copy code
{
  "table_name": "User",
  "stream": "User",
  "metadata": [
    {
      "breadcrumb": [],
      "metadata": {
        "table-key-properties": [
          "_id"
        ],
        "database-name": "eattasty-prd",
        "row-count": 57089,
        "is-view": false,
        "valid-replication-keys": [
          "_id",
          "email"
        ]
      }
    }
  ],
  "tap_stream_id": "eattasty-prd-User",
  "schema": {
    "type": "object",
    "properties": {
      "_id": {
        "type": [
          "string",
          "null"
        ]
      },
      "document": {
        "type": [
          "object",
          "array",
          "string",
          "null"
        ]
      },
      "_sdc_deleted_at": {
        "type": [
          "string",
          "null"
        ]
      }
    }
  }
}
d
Yeah looks like that's the behavior it'd show...
h
I don’t want this
need to check another tap then?
d
Let's go back to the other tap and try the workaround I described in https://gitlab.com/meltano/meltano/-/issues/2517 that uses https://meltano.com/docs/plugins.html#schema-extra to make the schema explicit, if that's an option for you
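For reference, that schema-extra workaround would mean declaring the stream's columns by hand in meltano.yml. A rough sketch (the stream key and property types here are guesses based on the logs above, and may need adjusting to match what the tap actually reports):

```yaml
plugins:
  extractors:
    - name: tap-mongodb
      schema:
        eattasty-prd-Allergie:
          _id:
            type: ["string", "null"]
          locales:
            type: ["object", "null"]
```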
h
hum, this means that I need to explicitly put in the schema by hand?
d
OK, easiest next option: try out the original tap-mongodb with https://meltano.com/plugins/loaders/postgres--meltano.html (the `meltano` variant of `target-postgres`), which doesn't care about an empty initial SCHEMA
h
or fork the tap-mongodb and fix it 🙂
d
Or that 😄
h
I’ve got an output of the catalog, and some tables are ‘ok’, some not
I’ll try a different target… and then think about the options that I have 😐
I’ll try again later/tomorrow
many thanks
d
Glad I could be of help, we'll figure it out tomorrow!
h
@douwe_maan it seems that pipelinewise handles ‘document’ and creates multiple columns and metadata columns. https://transferwise.github.io/pipelinewise/user_guide/metadata_columns.html Also, it handles schema changes: https://transferwise.github.io/pipelinewise/user_guide/schema_changes.html
d
Ah, cool! All of that also applies when using their taps and targets with Meltano instead of their own runner
h
but it only works when the extractor and loader are both pipelinewise, right?
d
Yep
h
extractors and loaders should be interchangeable
d
Yep, they should be, and we're arguably looking at a bug in tap-mongodb here since targets aren't expected to be able to handle SCHEMA messages that don't actually list any properties
h
looking at the code from tap-mongodb and what I see is:
Copy code
return {
    'table_name': collection_name,
    'stream': collection_name,
    'metadata': metadata.to_list(mdata),
    'tap_stream_id': "{}-{}".format(collection_db_name, collection_name),
    'schema': {
        'type': 'object'
    }
}
the `schema` property is always just type `object` for all collections
it only looks at ‘keys’ (aka indexes)
d
Right, which breaks the expectations described in https://github.com/singer-io/getting-started/blob/master/docs/DISCOVERY_MODE.md#schemas. Which is why one solution I suggested above is to explicitly define the schema, but that's obviously not ideal. https://github.com/singer-io/tap-mongodb/pull/40 lets the tap dynamically generate `SCHEMA` messages based on sampling, and that logic should really be used in discovery mode as well, and in the very first `SCHEMA` message
h
that’s in the master branch
d
Right, but it's only used for later SCHEMA messages, not for discovery mode or the first SCHEMA message
h
ok, so in theory it could be reused in both
d
Yep, but I haven't looked into the code to see how easy/hard that would be
h
ok, it’s in `common`… and the code for the SCHEMA is in `__init__`
so, it should collect a sample of the database, e.g. 1000 records, and try to figure out the schema by iterating over each row
I would need to take a better look, but in oplog, for example, it does call common.row_to_schema() in the sync_collection method
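That sampling idea can be sketched as a toy inference pass (illustrative only, not the tap's actual `row_to_schema`; real BSON types like ObjectId, Timestamp or Decimal128 would need extra mapping):

```python
def infer_type(value):
    """Map a Python value to a JSON Schema type name (naive)."""
    if isinstance(value, bool):  # check bool before int/float: bool is an int
        return "boolean"
    if isinstance(value, (int, float)):
        return "number"
    if isinstance(value, str):
        return "string"
    if isinstance(value, dict):
        return "object"
    if isinstance(value, list):
        return "array"
    return "null"

def infer_schema(sample_docs):
    """Build a flat JSON Schema by iterating over a sample of documents,
    accumulating every type observed for each top-level field."""
    properties = {}
    for doc in sample_docs:
        for key, value in doc.items():
            types = properties.setdefault(key, {"type": ["null"]})["type"]
            observed = infer_type(value)
            if observed not in types:
                types.append(observed)
    return {"type": "object", "properties": properties}
```

With a sample of, say, 1000 documents this would at least give the target real column names to create, even if nested structure is still summarized as `object`.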
@douwe_maan
d
Is that victory I smell?
h
it’s just a test… but it sort of worked
just added the column for testing
Copy code
'schema': {
    'type': 'object',
    'properties': {
        "_id": { 'type': ['null', 'string'] }
    }
}
d
Was that enough to fix it? 🤔
h
but it only syncs that, I’m afraid
d
Hmm
h
so, I need to check ‘when’ postgres receives the schema and why it’s not updated
d
I think the target may create the table when it receives the first `SCHEMA` message, with only the `_id` field, and ignore any `SCHEMA` messages that follow, as well as any fields with other names in `RECORD` messages
h
I’m new here… it’s the first time I’m using this
d
Which is correct behavior, since each stream is only supposed to have a single `SCHEMA` message, but `tap-mongodb` is not following that rule
h
the schema is updating in the mongodb tap
d
I appreciate your patience 🙂 We'll figure it out!
h
humm… but to figure out the right schema, and because this is mongodb 😛
we need to do a first pass on the rows to figure out the ‘better’ schema
d
Yep
And never send an empty SCHEMA
h
so, the only one that counts is the first message…
d
Correct
h
so, it’s ‘dumb’ to update it for each row like it does?
d
Yeah, some targets may know how to deal with that, but targets are not required to and most don't, since the spec only describes one SCHEMA message per stream
d
So the tap may have worked with the target it was developed alongside, but not the ones we're using
Does that immediately write the SCHEMA as well?
You should be able to see with `meltano --log-level=debug elt ...` how often SCHEMA messages come along
h
if this is the msg:
Copy code
tap-mongodb (out)     | {"type": "SCHEMA", "stream": "Allergie", "schema": {"properties": {"_id": {"type": ["null", "string"]}}, "type": "object"}, "key_properties": ["_id"]}
then… only once
d
No more schemas after that? Then the target never learns about the real properties, even though the "watching rows to determine the schema" approach is implemented
That's weird
h
+++ INFO … it’s mine for ‘debugging’
```
tap-mongodb (out) | {"type": "STATE", "value": {"bookmarks": {"eattasty-prd-Allergie": {"last_replication_method": "LOG_BASED", "oplog_ts_time": 1616023844, "oplog_ts_inc": 1, "version": 1616023844615}}, "currently_syncing": "eattasty-prd-Allergie"}}
tap-mongodb (out) | {"type": "ACTIVATE_VERSION", "stream": "Allergie", "version": 1616023844615}
target-postgres (out) | {"bookmarks": {"eattasty-prd-Allergie": {"last_replication_method": "LOG_BASED", "oplog_ts_time": 1616023844, "oplog_ts_inc": 1, "version": 1616023844615}}, "currently_syncing": "eattasty-prd-Allergie"}
meltano | INFO Incremental state has been updated at 2021-03-17 233044.620836.
meltano | DEBUG Incremental state: {'bookmarks': {'eattasty-prd-Allergie': {'last_replication_method': 'LOG_BASED', 'oplog_ts_time': 1616023844, 'oplog_ts_inc': 1, 'version': 1616023844615}}, 'currently_syncing': 'eattasty-prd-Allergie'}
target-postgres | ERROR Allergie - Table for stream does not exist
tap-mongodb | INFO Querying eattasty-prd-Allergie with:
tap-mongodb | Find Parameters: {'$lte': 'sulphites'}
tap-mongodb | INFO ++++
tap-mongodb | INFO {'type': 'object', 'properties': {}}
tap-mongodb (out) | {"type": "RECORD", "stream": "Allergie", "record": {"_id": "celery", "locales": {"0": {"lang": "en", "name": "celery free"}, "1": {"lang": "pt", "name": "sem aipo"}}}, "version": 1616023844615, "time_extracted": "2021-03-17T233044.650529Z"}
tap-mongodb | INFO ++++
tap-mongodb | INFO {'type': 'object', 'properties': {'locales': {'anyOf': [{}]}}}
tap-mongodb (out) | {"type": "RECORD", "stream": "Allergie", "record": {"_id": "crustaceans", "locales": {"0": {"lang": "en", "name": "crustaceans free"}, "1": {"lang": "pt", "name": "sem crust\u00e1ceos"}}}, "version": 1616023844615, "time_extracted": "2021-03-17T233044.650529Z"}
tap-mongodb | INFO ++++
tap-mongodb | INFO {'type': 'object', 'properties': {'locales': {'anyOf': [{}]}}}
tap-mongodb (out) | {"type": "RECORD", "stream": "Allergie", "record": {"_id": "dairy", "locales": {"0": {"lang": "en", "name": "lactose free"}, "1": {"lang": "pt", "name": "sem lactose"}}}, "version": 1616023844615, "time_extracted": "2021-03-17T233044.650529Z"}
tap-mongodb (out) | {"type": "RECORD", "stream": "Allergie", "record": {"_id": "egg", "locales": {"0": {"lang": "en", "name": "egg free"}, "1": {"lang": "pt", "name": "sem ovo"}}}, "version": 1616023844615, "time_extracted": "2021-03-17T233044.650529Z"}
tap-mongodb | INFO ++++
tap-mongodb | INFO {'type': 'object', 'properties': {'locales': {'anyOf': [{}]}}}
tap-mongodb | INFO ++++
tap-mongodb | INFO {'type': 'object', 'properties': {'locales': {'anyOf': [{}]}}}
tap-mongodb (out) | {"type": "RECORD", "stream": "Allergie", "record": {"_id": "fish", "locales": {"0": {"lang": "en", "name": "fish free"}, "1": {"lang": "pt", "name": "sem peixe"}}}, "version": 1616023844615, "time_extracted": "2021-03-17T233044.650529Z"}
tap-mongodb (out) | {"type": "RECORD", "stream": "Allergie", "record": {"_id": "gluten", "locales": {"0": {"lang": "en", "name": "gluten free"}, "1": {"lang": "pt", "name": "sem glutten"}}}, "version": 1616023844615, "time_extracted": "2021-03-17T233044.650529Z"}
tap-mongodb | INFO ++++
tap-mongodb | INFO {'type': 'object', 'properties': {'locales': {'anyOf': [{}]}}}
tap-mongodb | INFO ++++
tap-mongodb | INFO {'type': 'object', 'properties': {'locales': {'anyOf': [{}]}}}
tap-mongodb (out) | {"type": "RECORD", "stream": "Allergie", "record": {"_id": "lupine", "locales": {"0": {"lang": "en", "name": "lupine free"}, "1": {"lang": "pt", "name": "sem tremo\u00e7o"}}}, "version": 1616023844615, "time_extracted": "2021-03-17T233044.650529Z"}
tap-mongodb (out) | {"type": "RECORD", "stream": "Allergie", "rec…
```
so, the ‘properties’ of the schema var are changing… but I’m not sure what for
d
It did find the `locales` property at some point, but not what was inside it
The target should be able to turn that into a `jsonb` column, or some targets may denest it into a separate joined table
h
this is the output of the tap-mongodb from stitchdata
it creates and flattens the array
d
Ah all right, I'm sure one of the target-postgres variants has the same behavior 😅
h
right… postgres… the names are confusing 😄
d
So if you fix the tap to no longer send an empty schema, it should work fine with the meltano target!
h
ok… I could spend some time on it if it’s ‘only’ that
d
There's a good chance the tap would also need to "fill in" the `'anyOf': [{}]` it found for `locales` with actual details about its properties. But that should "just" be a matter of running the current schema detection logic recursively
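Running that detection recursively, instead of stopping at `'anyOf': [{}]`, could look roughly like this sketch (invented code, not taken from the tap):

```python
def refine(node, value):
    """Recursively refine a schema node in place from one observed value,
    descending into nested objects and arrays instead of leaving
    empty placeholders like {'anyOf': [{}]}."""
    if isinstance(value, bool):  # bool before number: bool is an int subclass
        node["type"] = "boolean"
    elif isinstance(value, (int, float)):
        node["type"] = "number"
    elif isinstance(value, str):
        node["type"] = "string"
    elif isinstance(value, dict):
        node["type"] = "object"
        props = node.setdefault("properties", {})
        for key, child in value.items():
            refine(props.setdefault(key, {}), child)
    elif isinstance(value, list):
        node["type"] = "array"
        items = node.setdefault("items", {})
        for child in value:
            refine(items, child)
    else:
        node["type"] = "null"

# One of the 'locales' documents from the logs, fully described:
schema = {}
refine(schema, {"_id": "celery", "locales": {"0": {"lang": "en", "name": "celery free"}}})
```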
h
isn’t this a message of type SCHEMA?
Copy code
if common.row_to_schema(schema, row):
    singer.write_message(singer.SchemaMessage(
        stream=common.calculate_destination_stream_name(stream),
        schema=schema,
        key_properties=['_id']))
d
it is
h
it’s per row
ok, there’s an initial message with STATE and SCHEMA
then that one never goes to the logs
d
It never gets printed with the `tap-mongodb (out)` prefix indicating it's actual output going to the target? So the target indeed never gets the full schema?
h
in the log that I sent you, I do not see any SCHEMA message
d
Nor do I 😕 I wonder what's stopping that line from actually executing. Maybe additional INFO logs will help figure that out? Or drop into a pdb
h
I could put a LOGGER there
so, it never enters there: `if common.row_to_schema(schema, row):` must always be false
so, row_to_schema is saying that it did not change anything
d
😒
h
it’s a bit odd because it did change:
Copy code
tap-mongodb           | INFO ++++
tap-mongodb           | INFO {'type': 'object', 'properties': {}}
tap-mongodb (out)     | {"type": "RECORD", "stream": "Allergie", "record": {"_id": "celery", "locales": {"0": {"lang": "en", "name": "celery free"}, "1": {"lang": "pt", "name": "sem aipo"}}}, "version": 1616023844615, "time_extracted": "2021-03-17T23:30:44.650529Z"}
tap-mongodb           | INFO ++++
tap-mongodb           | INFO {'type': 'object', 'properties': {'locales': {'anyOf': [{}]}}}
at least once
ok, but the ‘problem’ is the initial SCHEMA that’s empty, right?
d
Correct
h
is there any documentation on how the ‘schema’ should look?
just type and format and properties?
h
ok it’s a JSON Schema…
d
That's right
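Concretely, the `schema` field of a SCHEMA message is plain JSON Schema, and the invariant the target relied on (every name in `key_properties` declared under `properties`) can be checked mechanically. A sketch; `key_properties_declared` is an invented helper, not a Singer API:

```python
def key_properties_declared(msg):
    """True if every key property is actually described in the schema,
    the expectation that the empty initial SCHEMA message violated."""
    declared = msg.get("schema", {}).get("properties", {})
    return all(key in declared for key in msg.get("key_properties", []))

# A well-formed message (property names taken from the logs above):
good = {
    "type": "SCHEMA",
    "stream": "Allergie",
    "schema": {
        "type": "object",
        "properties": {
            "_id": {"type": ["null", "string"]},
            "locales": {"type": ["null", "object"]},
        },
    },
    "key_properties": ["_id"],
}

# The empty message from the logs fails the same check:
bad = {
    "type": "SCHEMA",
    "stream": "Allergie",
    "schema": {"type": "object"},
    "key_properties": ["_id"],
}
```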
h
ok, I’m going to sleep 🙂 I’ll try to check if I can do something tomorrow
d
Sleep well!
h
@douwe_maan sort of working… too many anyOf for me 😄
Copy code
{
      "table_name": "Zone",
      "stream": "Zone",
      "metadata": [
        {
          "breadcrumb": [],
          "metadata": {
            "table-key-properties": [
              "_id"
            ],
            "database-name": "eattasty-prd",
            "row-count": 199,
            "is-view": false,
            "valid-replication-keys": [
              "_id"
            ]
          }
        }
      ],
      "tap_stream_id": "eattasty-prd-Zone",
      "schema": {
        "type": "object",
        "properties": {
          "_id": {
            "type": [
              "null",
              "string"
            ]
          },
          "delivery": {
            "anyOf": [
              {}
            ]
          },
          "coordinates": {
            "anyOf": [
              {
                "type": "array",
                "items": {
                  "anyOf": [
                    {
                      "type": "object",
                      "properties": {
                        "lat": {
                          "anyOf": [
                            {
                              "type": "number"
                            },
                            {}
                          ]
                        },
                        "lng": {
                          "anyOf": [
                            {
                              "type": "number"
                            },
                            {}
                          ]
                        }
                      }
                    },
                    {}
                  ]
                }
              },
              {}
            ]
          }
        }
      }
    },
Output on the Allergie table
d
That looks like success!
h
@douwe_maan not sure if I can call this a success.
```
tap-mongodb (out) | {"type": "RECORD", "stream": "Order", "record": {"_id": "5a993a3bfbcb7a00c8190de3", "orderdate": "2018-03-02T000000.000000Z", "modifieddate": "2018-03-02T122355.713000Z", "payment_status": "PAID", "status": "DELIVERED", "alerted": true, "fail": false, "fail_reason": "NONE", "customerId": "597f04ee5431d600a6b35dff", "createddate": "2018-03-02T114926.995000Z", "cutlery": false, "driverId": "5899f0ef9874d1aa7e323a03", "delivered": "2018-03-02T124457.947000Z", "deliveryEnded": "2018-03-02T124500.000000Z", "areaId": "5d13407be54b0000cf7090b6", "organizationId": "594a8aa2acad7f856c48e01e", "routeId": "5ddfa7b81cf91300e36daf9d", "delivery": "lunch"}, "version": 1616080261266, "time_extracted": "2021-03-18T151101.303316Z"}
tap-mongodb (out) | {"type": "RECORD", "stream": "Order", "record": {"_id": "5a993aabbf662e00bb51b019", "orderdate": "2018-03-02T000000.000000Z", "modifieddate": "2018-03-02T122204.560000Z", "payment_status": "PAID", "status": "DELIVERED", "alerted": true, "fail": false, "fail_reason": "NONE", "customerId": "595f669e2052ba00a5fecc9c", "createddate": "2018-03-02T115114.652000Z", "cutlery": true, "driverId": "5a464e64ad17d8a721e63135", "delivered": "2018-03-02T122204.560000Z", "deliveryEnded": "2018-03-02T124500.000000Z", "areaId": "5d13407be54b0000cf7090b6", "organizationId": "58170b1d28c9082c04ff0fde", "routeId": "5d2d99691a1c8000c98ead5f", "delivery": "lunch"}, "version": 1616080261266, "time_extracted": "2021-03-18T151101.303316Z"}
tap-mongodb (out) | {"type": "RECORD", "stream": "Order", "record": {"_id": "5a993aaefbcb7a00c8190de7", "orderdate": "2018-03-06T000000.000000Z", "modifieddate": "2018-03-06T123450.919000Z", "payment_status": "PAID", "status": "DELIVERED", "alerted": true, "fail": false, "fail_reason": "NONE", "customerId": "5a70a186e769e200c075879d", "promocodes": ["5a9937debf662e00bb51b012"], "discount": 5.9, "createddate": "2018-03-02T115115.711000Z", "cutlery": true, "driverId": "59f70fd1acad7f856ccdfaed", "delivered": "2018-03-06T123450.919000Z", "deliveryEnded": "2018-03-06T131500.000000Z", "areaId": "5d13407be54b0000cf7090b6", "organizationId": "576baf2524748e8060000000", "routeId": "5cd461f6799fa00152397fcf", "delivery": "lunch"}, "version": 1616080261266, "time_extracted": "2021-03-18T151101.303316Z"}
tap-mongodb (out) | {"type": "RECORD", "stream": "Order", "record": {"_id": "5a993ac0fbcb7a00c8190de9", "orderdate": "2018-03-07T000000.000000Z", "modifieddate": "2018-03-07T124504.929000Z", "payment_status": "PAID", "status": "DELIVERED", "alerted": true, "fail": false, "fail_reason": "NONE", "customerId": "5a70a186e769e200c075879d", "promocodes": ["5a9937debf662e00bb51b012"], "discount": 5.9, "createddate": "2018-03-02T115136.369000Z", "cutlery": true, "driverId": "59f70fd1acad7f856ccdfaed", "delivered": "2018-03-07T124504.928000Z", "deliveryEnded": "2018-03-07T131500.000000Z", "areaId": "5d13407be54b0000cf7090b6", "organizationId": "576baf2524748e8060000000", "routeId": "5cd461f6799fa00152397fcf", "delivery": "lunch"}, "version": 1616080261266, "time_extracted": "2021-03-18T151101.303316Z"}
meltano | DEBUG Deleted configuration at /Users/kimus/Develop/eattasty/meltano/mongo2pg/mongo2pg/.meltano/run/elt/2021-03-18T151057--tap-mongodb--target-postgres/02a4934f-5255-4762-9475-fb1a22bfb4e0/target.config.json
meltano | DEBUG Deleted configuration at /Users/kimus/Develop/eattasty/meltano/mongo2pg/mongo2pg/.meltano/run/elt/2021-03-18T151057--tap-mongodb--target-postgres/02a4934f-5255-4762-9475-fb1a22bfb4e0/tap.config.json
meltano | ERROR Loading failed (1): target_postgres.exceptions.SingerStreamError: ('Invalid records detected above threshold: 0. See `.args` for details.', [(<ValidationError: 'False is not valid under any of the given schemas'>, {'type': 'RECORD', 'stream': 'Order', 'record': {'_id': '5a9842308257…
```
d
All right:
Copy code
target_postgres.exceptions.SingerStreamError: ('Invalid records detected above threshold: 0. See `.args` for details.', [(<ValidationError: 'False is not valid under any of the given schemas'>, {'type': 'RECORD', 'stream': 'Order', 'record': {'_id': '5a9842308257a200c3fa5e8b', 'orderdate': '2018-03-02T00:00:00.000000Z', 'modifieddate': '2018-03-02T12:49:25.978000Z', 'payment_status': 'PAID', 'status': 'DELIVERED', 'alerted': True, 'fail': False, 'fail_reason': 'NONE', 'customerId': '5a7840418d350100c22e6445', 'promocodes': ['5a9696bfc270f700c13b9d95'], 'discount': Decimal('5.9'), 'createddate': '2018-03-01T18:11:05.739000Z', 'cutlery': True, 'driverId': '59f70fd1acad7f856ccdfaed', 'delivered': '2018-03-02T12:49:25.978000Z', 'deliveryEnded': '2018-03-02T12:45:00.000000Z', 'areaId': '5d13407be54b0000cf7090b6', 'organizationId': '58f79b4e325a7145ba47e6ce', 'routeId': '58a2fe719874d1aa7e482e95', 'delivery': 'lunch'}, 'version': 1616080261266, 'time_extracted': '2021-03-18T15:11:01.303316Z', '__raw_line_size': 797})])
h
no, that’s fine… they need to be valid records
fixed… running again
I see plenty of ‘record’ messages, where do they go? no queue or database?
success!
Copy code
target-postgres       | INFO Writing table batch with 109874 rows for `('Order__1616080549258',)`...
d
Meltano passes the RECORD message straight from the tap to the target. Targets typically do their own batching
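That target-side batching can be pictured with a toy loop (illustrative only; real targets also validate records against the schema and flush on time as well as size):

```python
def batch_records(messages, batch_size=100):
    """Collect RECORD messages into batches and yield each full batch,
    plus a final partial batch, mimicking how a target buffers rows
    before writing them to the database in one statement."""
    batch = []
    for msg in messages:
        if msg.get("type") != "RECORD":
            continue  # STATE/SCHEMA handling omitted in this sketch
        batch.append(msg["record"])
        if len(batch) >= batch_size:
            yield batch
            batch = []
    if batch:
        yield batch

# Three records with a batch size of two give two writes:
stream = [{"type": "RECORD", "record": {"_id": i}} for i in range(3)]
batches = list(batch_records(stream, batch_size=2))
```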
h
still need to check these errors to get this done:
Copy code
target_postgres.exceptions.SingerStreamError: ('Invalid records detected above threshold: 0. See `.args` for details.', [(<ValidationError: "{'reason': 'WRONG_DAY', 'observations': '', 'promoId': '5d66ee3b0807d200cc647cc5', 'promoCodeValue': '5,90'} is not valid under any of the given schemas">,
so, increasing the sample of records to 1000 will resolve this. It’s better than yesterday… but…
and I’m guessing target-postgres does not handle schema changes
d
It does
h
I don’t see the ‘new’ columns in the table… but ok, good
@douwe_maan I’m migrating all the databases, and so far so good. Millions of records are getting to postgres from mongodb! Thanks a lot for your help. Some questions still need to be clarified, like this one, but I’m much more confident now 😄 I would also like to get `select --list` working, it wasn’t. I’ll get into it in my fork. Thanks again!
a
Wow, this is great!
I spent a few hours trying to figure out `tap-mongodb`, wish I had read this first!
Another fork you might be interested in is https://github.com/Tolsto/tap-mongodb , which adds the option to specify a database URL instead of individual config components:
Copy code
{
  "database_url": "mongodb+srv://user:myRealPassword@cluster0.mongodb.net/test?w=majority&tls=true"
}
It also installs `dnspython`, which looks like a necessary dependency for certain mongodb hosts (like Atlas)