# troubleshooting
a
Hello everyone, I'm trying to apply INCREMENTAL loading to my tap-jira extractor. To do so, I've added the `metadata` step to my tap-jira configuration:
      - name: tap-jira
        config:
          auth:
            flow: password
            username: karel@rauva.com
          domain: rauva.atlassian.net
          stream_maps:
            issues:
              __filter__: key.startswith('DATA')
              updated_test: fields.updated
        select:
          - issues.key
          - issues.fields
          - issues.fields.updated
          - updated_test
        metadata:
          issues:
            replication-method: INCREMENTAL
            # replication-key: updated_test       # no
            # replication-key: fields.updated         # no
            # replication-key: fields__updated        # no
            replication-key: issues.fields.updated    # no
`issues.fields.updated` is the field I want to use to track the state of the load. Initially I tried this:
replication-key: issues.fields.updated
2024-04-07T14:59:54.185389Z [info     ] singer_sdk.exceptions.InvalidReplicationKeyException: Field 'issues.fields.updated' is not in schema for stream 'issues' cmd_type=elb consumer=False name=tap-jira producer=True stdio=stderr string_id=tap-jira
Then I thought the problem was that the column I'm trying to use for `INCREMENTAL` loading is not flattened. To fix that, I created a new column named `updated_test` (using `stream_maps`). When I select it (without the metadata step), I get exactly what I wanted: a copy of the `fields.updated` column. But when I try to use that new column in the metadata step, I get the same error as before:
2024-04-07T14:55:32.185389Z [info     ] singer_sdk.exceptions.InvalidReplicationKeyException: Field 'updated_test' is not in schema for stream 'issues' cmd_type=elb consumer=False name=tap-jira producer=True stdio=stderr string_id=tap-jira
What am I doing wrong here? Is the column not being created correctly? If not, how should it be done? And is there a way to set replication-key to a column that is not flattened? Let me know. Thanks in advance, all help is welcome 🙂
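For reference, here is a rough Python sketch of what the two `stream_maps` entries in the config above are meant to do to a single `issues` record. It is only an illustration of the intent, not how the Singer SDK evaluates stream maps; the function name `apply_issues_map` and the sample records are made up.

```python
from typing import Optional


def apply_issues_map(record: dict) -> Optional[dict]:
    """Toy equivalent of the `issues` stream map (hypothetical helper)."""
    # __filter__: key.startswith('DATA') -> drop records that do not match
    if not record["key"].startswith("DATA"):
        return None
    mapped = dict(record)
    # updated_test: fields.updated -> copy the nested value to the top level
    mapped["updated_test"] = record["fields"]["updated"]
    return mapped


# Made-up sample records, for illustration only
kept = apply_issues_map(
    {"key": "DATA-1", "fields": {"updated": "2024-03-21T16:47:44.775+0000"}}
)
dropped = apply_issues_map(
    {"key": "OPS-7", "fields": {"updated": "2024-03-22T09:00:00.000+0000"}}
)
```

Here `kept` carries a top-level `updated_test` next to the untouched nested `fields`, while `dropped` is `None` because its key does not start with `DATA`.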
r
A replication key has to be a top-level stream property (i.e. not nested), as far as I am aware. The invalid replication key error is pretty self-explanatory to me: you have created a new property `updated_test` that isn't defined in the tap's `issues` stream schema and are trying to set it as the replication key. You probably want to provide a schema override for `updated_test`.
👍 1
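To illustrate the point about top-level properties: judging by the error message, the replication key seems to be looked up by name among the stream schema's properties, so a dotted path never matches. A minimal sketch, assuming a heavily simplified `issues` schema; `check_replication_key` and `InvalidReplicationKey` are hypothetical stand-ins, not the Singer SDK's actual code.

```python
class InvalidReplicationKey(Exception):
    """Hypothetical stand-in for singer_sdk's InvalidReplicationKeyException."""


def check_replication_key(stream_name: str, schema: dict, replication_key: str) -> dict:
    """Return the property's schema if the key is a top-level property, else raise."""
    properties = schema.get("properties", {})
    if replication_key not in properties:
        raise InvalidReplicationKey(
            f"Field '{replication_key}' is not in schema for stream '{stream_name}'"
        )
    return properties[replication_key]


# Simplified, assumed schema: `updated` only exists nested under `fields`
issues_schema = {
    "properties": {
        "id": {"type": ["string", "null"]},
        "key": {"type": ["string", "null"]},
        "fields": {
            "type": ["object", "null"],
            "properties": {"updated": {"type": ["string", "null"]}},
        },
    }
}


def is_valid_key(key: str) -> bool:
    """True if `key` passes the top-level lookup for the `issues` stream."""
    try:
        check_replication_key("issues", issues_schema, key)
    except InvalidReplicationKey:
        return False
    return True
```

With this model, `is_valid_key("key")` passes, while `fields.updated`, `issues.fields.updated` and `updated_test` all fail for the same reason: none of them is a name in the top-level `properties`.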
a
After reading the docs, I now get the same idea: 'A replication key has to be a top-level stream property (i.e. not nested), as far as I am aware.' I'll follow your advice: override the schema to add `updated_test`, then use `stream_maps` to assign `updated_test: fields.updated`, then apply the replication-key. I'll keep you posted. Thanks a lot for the help
👍 1
But maybe the stream_map executes after everything, just before the loader
      - name: tap-jira
        schema:
          issues:
            updated_test_1:
              type: ["string", "null"]
            updated_test_2:
              type: ["string", "null"]
        config:
          stream_maps:
            issues:
              updated_test: fields.updated
        select:
          - issues.key
          - updated_test
          - updated_test_1
          - updated_test_2
When I run `meltano select tap-jira --list --all > list_jira.txt`, I get this on the console:
2024-04-07T20:12:39.108944Z [warning  ] Stream `updated_test` was not found in the catalog
2024-04-07T20:12:39.109025Z [warning  ] Stream `updated_test_1` was not found in the catalog
2024-04-07T20:12:39.109082Z [warning  ] Stream `updated_test_2` was not found in the catalog
Then in `list_jira.txt`:
Enabled patterns:
	issues.key
	updated_test
	updated_test_1
	updated_test_2

...

	[automatic] issues.updated_test_1
	[automatic] issues.updated_test_2
This indicates that `updated_test_1` and `updated_test_2` were added to the schema, right? And `updated_test` was not, because it was only set in the `stream_maps`. But then, when I add:
metadata:
          issues:
            replication-method: INCREMENTAL
            replication-key: updated_test_1     # no
I keep getting the error:
2024-04-07T20:17:11.599170Z [info     ] singer_sdk.exceptions.InvalidReplicationKeyException: Field 'updated_test_1' is not in schema for stream 'issues' cmd_type=extractor name=tap-jira run_id=2d93dcda-09f8-4c0e-8d97-67c35b1fb51c state_id=2024-04-07T201709--tap-jira--target-s3 stdio=stderr
'An extractor's `schema` extra holds an object describing Singer stream schema override rules that are applied to the extractor's discovered catalog file when the extractor is run using `meltano elt` or `meltano invoke`. These rules are not applied when a catalog is provided manually.' From what I read here, it should override the catalog file. It's strange that it's not working 😕
a
I'm not having issues assigning `updated_test: fields.updated`. The issue I'm having is that when
metadata:
          issues:
            replication-method: INCREMENTAL
            replication-key: updated_test_1     # no
            updated_test_1:
              is-replication-key: true
tries to get `updated_test_1` as replication-key, it says it does not exist in the catalog. Originally it does not, but I'm creating it with the schema override:
      - name: tap-jira
        schema:
          issues:
            updated_test_1:
              type: ["string", "null"]
😕
r
Yeah sorry, I didn't really read those docs links before sending.
If a schema is specified for a property that does not yet exist in the discovered stream's schema, the property (and its schema) will be added to the catalog. This allows you to define a full schema for taps such as `tap-dynamodb` that do not themselves have the ability to discover the schema of their streams.
This makes it sound like it should work...
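As a toy model of the behaviour that docs passage describes (a property missing from the discovered schema gets added to the catalog), something like the sketch below. This is not Meltano's implementation; `apply_schema_overrides` and the sample discovered schema are assumptions for illustration.

```python
import copy


def apply_schema_overrides(stream_schema: dict, overrides: dict) -> dict:
    """Return a copy of the stream schema with each override added or replaced."""
    merged = copy.deepcopy(stream_schema)
    properties = merged.setdefault("properties", {})
    for name, property_schema in overrides.items():
        properties[name] = copy.deepcopy(property_schema)
    return merged


# Assumed discovered schema, plus an override like the one in the `schema` extra
discovered = {"properties": {"key": {"type": ["string", "null"]}}}
overrides = {"updated_test_1": {"type": ["string", "null"]}}
catalog_schema = apply_schema_overrides(discovered, overrides)
```

Note this only models the catalog handed to the tap. Whether the tap checks the replication key against that catalog or against its own built-in schema is a separate question that this sketch does not answer.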
https://meltano.slack.com/archives/C069CQNHDNF/p1712521060260589?thread_ts=1712502185.897449&cid=C069CQNHDNF
In your example here, your `select` rules are also wrong. Currently, it is identifying `updated_test`, `updated_test_1` and `updated_test_2` as streams to select, not properties of a stream (hence the `Stream was not found in the catalog` warnings). So everything that's not the `key` property of the `issues` stream will be ignored - most likely including your stream map property `updated_test_1`. You probably want:
select:
- issues.key
- issues.updated_test_1
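A small sketch of why the un-prefixed rules produce those warnings, assuming the part before the first dot of a select rule is treated as the stream pattern. The helper `unmatched_stream_patterns` is hypothetical and ignores details such as negated rules; it is not Meltano's own matching code.

```python
from fnmatch import fnmatch


def unmatched_stream_patterns(select_rules: list, catalog_streams: list) -> list:
    """Return the stream part of each rule that matches no stream in the catalog."""
    missing = []
    for rule in select_rules:
        # Everything before the first dot is treated as the stream pattern
        stream_pattern = rule.split(".", 1)[0]
        if not any(fnmatch(stream, stream_pattern) for stream in catalog_streams):
            missing.append(stream_pattern)
    return missing


catalog_streams = ["issues"]  # assumed: only the stream relevant to this thread

before = unmatched_stream_patterns(
    ["issues.key", "updated_test", "updated_test_1", "updated_test_2"],
    catalog_streams,
)
after = unmatched_stream_patterns(
    ["issues.key", "issues.updated_test_1"],
    catalog_streams,
)
```

`before` lists the three names from the warnings, and `after` is empty once each property is prefixed with its stream.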
a
Hi @Reuben (Matatika) Yes, yes. I just changed it, and the following messages no longer show up:
2024-04-07T20:12:39.108944Z [warning  ] Stream `updated_test` was not found in the catalog
2024-04-07T20:12:39.109025Z [warning  ] Stream `updated_test_1` was not found in the catalog
2024-04-07T20:12:39.109082Z [warning  ] Stream `updated_test_2` was not found in the catalog
The question here is the order in which `schema`, `select`, `metadata` and `stream_maps` are applied, and how to configure `metadata` to use the newly created columns and not only what is already in the `catalog`. Because I'm still getting:
2024-04-07T20:17:11.599170Z [info     ] singer_sdk.exceptions.InvalidReplicationKeyException: Field 'updated_test_1' is not in schema for stream 'issues' cmd_type=extractor name=tap-jira run_id=2d93dcda-09f8-4c0e-8d97-67c35b1fb51c state_id=2024-04-07T201709--tap-jira--target-s3 stdio=stderr
when using:
metadata:
          issues:
            replication-method: INCREMENTAL
            replication-key: updated_test_1     # no
r
This is about at my Meltano/Singer knowledge limit, sorry - @Edgar Ramírez (Arch.dev) might have more to say.
🙏 1
a
Thanks a lot for all the help provided @Reuben (Matatika) 🙌 Let's see if @Edgar Ramírez (Arch.dev) can help us here 😄
e
Catching up! Is `fields.updated` always present or is it particular to your Jira installation?
a
Hi Edgar!
e
(I wouldn't rely on `stream_maps` for setting a replication key, so I'm trying to see if we can change it upstream in the tap itself)
a
`fields.updated` is always present, yes
Got it.
e
a
`issues.updated` -> this column is empty (at least on my side)
`issues.fields.updated` -> this is the column I'm talking about, and the one I want to use as replication-key
e
Yup, the PR is ensuring the column is not empty
a
Got it
And column `id` cannot come in as null, right? You're removing it from `replication_key`, but it is the `primary_key`, and therefore is never `null`
e
Correct
a
Ok perfect
I'll approve then
e
Can you try that PR with `pip_url: git+https://github.com/MeltanoLabs/tap-jira@refs/pull/71/head`?
a
ah ok
hm
I can't do it right now, only later. Is that okay for you?
The PR opened until then
e
Yeah no prob
🙌 1
a
Thanks a lot @Edgar Ramírez (Arch.dev) for the help. I'll try it out later on, and keep you posted. 🙌
hello @Edgar Ramírez (Arch.dev) like this?
2024-04-09T07:22:10.277277Z [debug    ] {"type": "RECORD", "stream": "issues", "record": {"id": "__", "self": "<https://___.atlassian.net/rest/api/3/issue/__>", "key": "__-3795", "fields": {"parent": {"id": "__", "key": "___", "fields": {"summary": "___"}}, "status": {"description": "", "name": "Done", "id": "10041"}, "creator": {"accountId": "__", "displayName": "__", "active": true, "timeZone": "Europe/__", "accountType": "atlassian"}, "reporter": {"accountId": "___", "displayName": "__", "active": true, "accountType": "atlassian"}, "issuetype": {"id": "10014", "name": "Sub-task"}, "project": {"id": "10008", "key": "__", "name": "__", "projectTypeKey": "software", "simplified": false}, "resolutiondate": "2024-03-21T16:47:44.769+0000", "updated": "2024-03-21T16:47:44.775+0000", "summary": "__", "duedate": null}}, "time_extracted": "2024-04-09T07:22:10.277080+00:00"} cmd_type=extractor name=tap-jira (out) run_id=ed16db38-22d1-4426-bd35-1541d5bf58eb state_id=2024-04-09T072105--tap-jira--target-s3-csv stdio=stdout
From what I can see, the `updated` column is not coming through with values, only `fields.updated`
e
Make sure you run `meltano install extractor tap-jira --clean`
a
Hi Edgar, I'll try that later on. Thanks a lot for your help!
np 1