Hello everyone I m trying to apply INCREMENTAL loading to my Meltano #troubleshooting

Afonso Diniz

04/07/2024, 3:03 PM

Hello everyone, I'm trying to apply INCREMENTAL loading to my tap-jira extractor. To do so, I've added a `metadata` section to my tap-jira configuration:

- name: tap-jira
  config:
    auth:
      flow: password
      username: karel@rauva.com
    domain: rauva.atlassian.net
    stream_maps:
      issues:
        __filter__: key.startswith('DATA')
        updated_test: fields.updated
  select:
    - issues.key
    - issues.fields
    - issues.fields.updated
    - updated_test
  metadata:
    issues:
      replication-method: INCREMENTAL
      # replication-key: updated_test           # no
      # replication-key: fields.updated         # no
      # replication-key: fields__updated        # no
      replication-key: issues.fields.updated    # no

`issues.fields.updated` is the field I want to use to track the state of the load. Initially I tried this:

replication-key: issues.fields.updated

2024-04-07T14:59:54.185389Z [info     ] singer_sdk.exceptions.InvalidReplicationKeyException: Field 'issues.fields.updated' is not in schema for stream 'issues' cmd_type=elb consumer=False name=tap-jira producer=True stdio=stderr string_id=tap-jira

Then I thought the problem was that the column I'm trying to use for the INCREMENTAL loading is not flattened. To fix that, I created a new column named `updated_test` (using `stream_maps`). When I select it (without the metadata step), I get exactly what I wanted: a copy of the `fields.updated` column. But when I try to use that new column in the metadata step, I get the same error as before:

2024-04-07T14:55:32.185389Z [info     ] singer_sdk.exceptions.InvalidReplicationKeyException: Field 'updated_test' is not in schema for stream 'issues' cmd_type=elb consumer=False name=tap-jira producer=True stdio=stderr string_id=tap-jira

What am I doing wrong here? Is the column not being created correctly? If not, how should it be done? And is there a way to set replication-key to a non-flattened column? Let me know. Thanks in advance, all help is welcome 🙂

Reuben (Matatika)

04/07/2024, 7:25 PM

A replication key has to be a top-level stream property (i.e. not nested), as far as I am aware. The invalid replication key error is pretty self-explanatory to me: you have created a new property `updated_test` that isn't defined in the tap's `issues` stream schema and are trying to set it as the replication key. You probably want to provide a schema override for `updated_test`.

👍 1

Afonso Diniz

04/07/2024, 7:42 PM

After reading the docs, I've come to the same understanding: 'A replication key has to be a top-level stream property (i.e. not nested), as far as I am aware.' I'll follow your advice: override the schema to add `updated_test`, assign `updated_test: fields.updated` with `stream_maps`, and then apply the replication-key. I'll keep you posted. Thanks a lot for the help.

👍 1

Afonso Diniz

04/07/2024, 7:58 PM

But maybe the stream_map executes after everything, just before the loader

Afonso Diniz

04/07/2024, 8:17 PM

- name: tap-jira
  schema:
    issues:
      updated_test_1:
        type: ["string", "null"]
      updated_test_2:
        type: ["string", "null"]
  config:
    stream_maps:
      issues:
        updated_test: fields.updated
  select:
    - issues.key
    - updated_test
    - updated_test_1
    - updated_test_2

When I run `meltano select tap-jira --list --all > list_jira.txt`, I get this on the console:

2024-04-07T20:12:39.108944Z [warning  ] Stream `updated_test` was not found in the catalog
2024-04-07T20:12:39.109025Z [warning  ] Stream `updated_test_1` was not found in the catalog
2024-04-07T20:12:39.109082Z [warning  ] Stream `updated_test_2` was not found in the catalog

Then in `list_jira.txt`:

Enabled patterns:
	issues.key
	updated_test
	updated_test_1
	updated_test_2

...

	[automatic] issues.updated_test_1
	[automatic] issues.updated_test_2

Which indicates that `updated_test_1` and `updated_test_2` were added to the schema, right? And `updated_test` was not, because it was only set in `stream_maps`. But then, when I add

metadata:
  issues:
    replication-method: INCREMENTAL
    replication-key: updated_test_1     # no

I keep getting the error:

2024-04-07T20:17:11.599170Z [info     ] singer_sdk.exceptions.InvalidReplicationKeyException: Field 'updated_test_1' is not in schema for stream 'issues' cmd_type=extractor name=tap-jira run_id=2d93dcda-09f8-4c0e-8d97-67c35b1fb51c state_id=2024-04-07T201709--tap-jira--target-s3 stdio=stderr

Afonso Diniz

04/07/2024, 8:25 PM

'An extractor's `schema` extra holds an object describing Singer stream schema override rules that are applied to the extractor's discovered catalog file when the extractor is run using `meltano elt` or `meltano invoke`. These rules are not applied when a catalog is provided manually.' From what I read here, it should override the catalog file. It's strange that it's not working 😕

Reuben (Matatika)

04/07/2024, 9:00 PM

Have a look at https://sdk.meltano.com/en/latest/stream_maps.html#automatic-schema-detection and https://sdk.meltano.com/en/latest/stream_maps.html#schema-detection-capabilities-are-limited? Sounds like it could be related.

Afonso Diniz

04/07/2024, 9:44 PM

I'm not having issues assigning `updated_test: fields.updated`. The issue I'm having is that when

metadata:
  issues:
    replication-method: INCREMENTAL
    replication-key: updated_test_1     # no
    updated_test_1:
      is-replication-key: true

tries to use `updated_test_1` as the replication-key, it says it does not exist in the catalog. It does not exist there originally, but I'm creating it with this schema override:

- name: tap-jira
  schema:
    issues:
      updated_test_1:
        type: ["string", "null"]

😕

Reuben (Matatika)

04/07/2024, 10:32 PM

Yeah sorry, I didn't really read those docs links before sending.

"If a schema is specified for a property that does not yet exist in the discovered stream's schema, the property (and its schema) will be added to the catalog. This allows you to define a full schema for taps such as `tap-dynamodb` that do not themselves have the ability to discover the schema of their streams."

This makes it sound like it should work...

Reuben (Matatika)

04/08/2024, 12:00 AM

https://meltano.slack.com/archives/C069CQNHDNF/p1712521060260589?thread_ts=1712502185.897449&cid=C069CQNHDNF In your example here, your `select` rules are also wrong. Currently, it is identifying `updated_test`, `updated_test_1` and `updated_test_2` as streams to select, not properties of a stream (hence the `Stream was not found in the catalog` warnings). So everything that's not the `key` property of the `issues` stream will be ignored, most likely including your stream map property `updated_test_1`. You probably want

select:
- issues.key
- issues.updated_test_1
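
For reference, a sketch of how the earlier plugin entry might look with stream-qualified select rules. It reuses only names from this thread and assumes the entry sits under `plugins.extractors` in `meltano.yml`, which the thread never shows. It addresses the `Stream ... was not found in the catalog` warnings only, not the replication-key error.

```yaml
plugins:
  extractors:
    - name: tap-jira
      schema:
        issues:
          # Schema override that adds a property to the discovered `issues` schema.
          updated_test_1:
            type: ["string", "null"]
      select:
        # Each rule is <stream>.<property>; a bare name is read as a stream pattern.
        - issues.key
        - issues.updated_test_1
```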

Afonso Diniz

04/08/2024, 9:20 AM

Hi @Reuben (Matatika) Yes, yes. Just changed it, and the following message does not show up anymore:

2024-04-07T20:12:39.108944Z [warning  ] Stream `updated_test` was not found in the catalog
2024-04-07T20:12:39.109025Z [warning  ] Stream `updated_test_1` was not found in the catalog
2024-04-07T20:12:39.109082Z [warning  ] Stream `updated_test_2` was not found in the catalog

The question here is the order of execution of `schema`, `select`, `metadata` and `stream_maps`, and how to configure `metadata` to use the newly created columns and not only what is already in the catalog. Because I'm still getting:

2024-04-07T20:17:11.599170Z [info     ] singer_sdk.exceptions.InvalidReplicationKeyException: Field 'updated_test_1' is not in schema for stream 'issues' cmd_type=extractor name=tap-jira run_id=2d93dcda-09f8-4c0e-8d97-67c35b1fb51c state_id=2024-04-07T201709--tap-jira--target-s3 stdio=stderr

when I use:

metadata:
  issues:
    replication-method: INCREMENTAL
    replication-key: updated_test_1     # no

Reuben (Matatika)

04/08/2024, 10:07 AM

This is about at my Meltano/Singer knowledge limit, sorry - @Edgar Ramírez (Arch.dev) might have more to say.

🙏 1

Afonso Diniz

04/08/2024, 11:25 AM

Thanks a lot for all the help provided @Reuben (Matatika) 🙌 Let's see if @Edgar Ramírez (Arch.dev) can help us here 😄

Edgar Ramírez (Arch.dev)

04/08/2024, 3:34 PM

Catching up! Is `fields.updated` always present, or is it particular to your Jira installation?

Afonso Diniz

04/08/2024, 3:35 PM

Hi Edgar!

Edgar Ramírez (Arch.dev)

04/08/2024, 3:35 PM

(I wouldn't rely on `stream_maps` for setting a replication key, so I'm trying to see if we can change it upstream in the tap itself)

Afonso Diniz

04/08/2024, 3:36 PM

`fields.updated` is always present, yes

Afonso Diniz

04/08/2024, 3:37 PM

"(I wouldn't rely on `stream_maps` for setting a replication key, so I'm trying to see if we can change it upstream in the tap itself)"

Got it.

Edgar Ramírez (Arch.dev)

04/08/2024, 3:42 PM

https://github.com/MeltanoLabs/tap-jira/pull/71

Afonso Diniz

04/08/2024, 3:43 PM

`issues.updated` -> this column is empty (at least on my side)

`issues.fields.updated` -> this is the column I'm talking about, the one I want to use as replication-key

Edgar Ramírez (Arch.dev)

04/08/2024, 3:44 PM

Yup, the PR is ensuring the column is not empty
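
If the PR does populate the top-level `updated` property, a metadata override along these lines would satisfy the top-level requirement discussed earlier in the thread. This is only a sketch, not a confirmed outcome: it assumes `updated` is present in the discovered `issues` schema (Afonso sees an `issues.updated` column) and that an override is still needed at all, which may not be the case if the tap sets this replication key itself.

```yaml
metadata:
  issues:
    replication-method: INCREMENTAL
    # Assumption: `updated` is a top-level property of the `issues` stream
    # and is populated by the tap (the stated goal of the PR).
    replication-key: updated
```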

Afonso Diniz

04/08/2024, 3:45 PM

Got it

Afonso Diniz

04/08/2024, 3:46 PM

And the `id` column cannot be null, right? You're removing it from `replication_key`, but it is still the `primary_key`, and is therefore never `null`.

Edgar Ramírez (Arch.dev)

04/08/2024, 3:46 PM

Correct

Afonso Diniz

04/08/2024, 3:46 PM

Ok perfect

Afonso Diniz

04/08/2024, 3:46 PM

I'll approve then

Edgar Ramírez (Arch.dev)

04/08/2024, 3:46 PM

Can you try that PR with `pip_url: git+https://github.com/MeltanoLabs/tap-jira@refs/pull/71/head`?
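
As a sketch of where this setting goes: `pip_url` sits on the plugin entry in `meltano.yml`, next to `name`. The surrounding `plugins.extractors` nesting is an assumption based on a standard Meltano project layout, since the thread only ever shows the entry itself.

```yaml
plugins:
  extractors:
    - name: tap-jira
      # Install the tap from the pull request branch instead of a release.
      pip_url: git+https://github.com/MeltanoLabs/tap-jira@refs/pull/71/head
```

The plugin has to be reinstalled before a changed `pip_url` takes effect.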

Afonso Diniz

04/08/2024, 3:46 PM

ah ok

Afonso Diniz

04/08/2024, 3:47 PM

I can't do it right now, only later. Is that okay for you?

Afonso Diniz

04/08/2024, 3:47 PM

Will the PR stay open until then?

Edgar Ramírez (Arch.dev)

04/08/2024, 3:47 PM

Yeah no prob

🙌 1

Afonso Diniz

04/08/2024, 3:48 PM

Thanks a lot @Edgar Ramírez (Arch.dev) for the help. I'll try it out later on and keep you posted. 🙌

Afonso Diniz

04/09/2024, 7:22 AM

hello @Edgar Ramírez (Arch.dev) like this?

Afonso Diniz

04/09/2024, 7:25 AM

2024-04-09T07:22:10.277277Z [debug    ] {"type": "RECORD", "stream": "issues", "record": {"id": "__", "self": "<https://___.atlassian.net/rest/api/3/issue/__>", "key": "__-3795", "fields": {"parent": {"id": "__", "key": "___", "fields": {"summary": "___"}}, "status": {"description": "", "name": "Done", "id": "10041"}, "creator": {"accountId": "__", "displayName": "__", "active": true, "timeZone": "Europe/__", "accountType": "atlassian"}, "reporter": {"accountId": "___", "displayName": "__", "active": true, "accountType": "atlassian"}, "issuetype": {"id": "10014", "name": "Sub-task"}, "project": {"id": "10008", "key": "__", "name": "__", "projectTypeKey": "software", "simplified": false}, "resolutiondate": "2024-03-21T16:47:44.769+0000", "updated": "2024-03-21T16:47:44.775+0000", "summary": "__", "duedate": null}}, "time_extracted": "2024-04-09T07:22:10.277080+00:00"} cmd_type=extractor name=tap-jira (out) run_id=ed16db38-22d1-4426-bd35-1541d5bf58eb state_id=2024-04-09T072105--tap-jira--target-s3-csv stdio=stdout

From what I can see, the top-level `updated` column is not coming through with values; only `fields.updated` is.

Edgar Ramírez (Arch.dev)

04/09/2024, 1:41 PM

Make sure you run `meltano install extractor tap-jira --clean`

Afonso Diniz

04/09/2024, 4:47 PM

Hi Edgar, I'll try that later on. Thanks a lot for your help!

np 1
