# troubleshooting
c
Trying to run tap-dbt and I'm running into some issues that I think would apply more generally to other taps too, but I couldn't find an answer in the forums or Slack:

1. Allowing NULL types globally: The expected schema for most of the fields in tap-dbt does not allow NULL values. For example, in the runs stream the field dbt_version is configured to expect a type of string, like:
"dbt_version": {
            "type": "string"
          }
The config above results in an error whenever meltano encounters a null value in that field, like the below run:

```
2023-09-06T160752.170412Z [info ] 2023-09-06 160752,169 | INFO | singer_sdk.metrics | METRIC: {"type": "timer", "metric": "http_request_duration", "value": 0.229858, "tags": {"stream": "jobs", "endpoint": "/accounts/{account_id}/jobs", "http_status_code": 200, "status": "succeeded", "context": {"account_id": "10"}}} cmd_type=elb consumer=False name=tap-dbt producer=True stdio=stderr string_id=tap-dbt
2023-09-06T160752.209022Z [info ] 2023-09-06 160752,208 | WARNING | tap-dbt | Properties ('execution', 'run_generate_sources', 'raw_dbt_version', 'created_at', 'updated_at', 'deactivated', 'run_failure_count', 'deferring_job_definition_id', 'deferring_environment_id', 'lifecycle_webhooks', 'lifecycle_webhooks_url', 'is_deferrable', 'job_type', 'triggers_on_draft_pr', 'generate_sources', 'cron_humanized', 'next_run', 'next_run_humanized') were present in the 'jobs' stream but not found in catalog schema. Ignoring. cmd_type=elb consumer=False name=tap-dbt producer=True stdio=stderr string_id=tap-dbt
2023-09-06T160752.210651Z [info ] Traceback (most recent call last): cmd_type=elb consumer=True name=target-jsonl producer=False stdio=stderr string_id=target-jsonl
2023-09-06T160752.211157Z [info ] File "/project/.meltano/loaders/target-jsonl/venv/bin/target-jsonl", line 8, in <module> cmd_type=elb consumer=True name=target-jsonl producer=False stdio=stderr string_id=target-jsonl
2023-09-06T160752.211339Z [info ] sys.exit(main()) cmd_type=elb consumer=True name=target-jsonl producer=False stdio=stderr string_id=target-jsonl
2023-09-06T160752.211741Z [info ] File "/project/.meltano/loaders/target-jsonl/venv/lib/python3.9/site-packages/target_jsonl.py", line 92, in main cmd_type=elb consumer=True name=target-jsonl producer=False stdio=stderr string_id=target-jsonl
2023-09-06T160752.212592Z [info ] state = persist_messages( cmd_type=elb consumer=True name=target-jsonl producer=False stdio=stderr string_id=target-jsonl
2023-09-06T160752.212920Z [info ] File "/project/.meltano/loaders/target-jsonl/venv/lib/python3.9/site-packages/target_jsonl.py", line 54, in persist_messages cmd_type=elb consumer=True name=target-jsonl producer=False stdio=stderr string_id=target-jsonl
2023-09-06T160752.213027Z [info ] validators[o['stream']].validate((o['record'])) cmd_type=elb consumer=True name=target-jsonl producer=False stdio=stderr string_id=target-jsonl
2023-09-06T160752.213166Z [info ] File "/project/.meltano/loaders/target-jsonl/venv/lib/python3.9/site-packages/jsonschema/validators.py", line 130, in validate cmd_type=elb consumer=True name=target-jsonl producer=False stdio=stderr string_id=target-jsonl
2023-09-06T160752.213307Z [info ] raise error cmd_type=elb consumer=True name=target-jsonl producer=False stdio=stderr string_id=target-jsonl
2023-09-06T160752.213450Z [info ] jsonschema.exceptions.ValidationError: None is not of type 'string' cmd_type=elb consumer=True name=target-jsonl producer=False stdio=stderr string_id=target-jsonl
2023-09-06T160752.213837Z [info ] cmd_type=elb consumer=True name=target-jsonl producer=False stdio=stderr string_id=target-jsonl
2023-09-06T160752.214001Z [info ] Failed validating 'type' in schema['propert…
```
2. Incremental replication where the replication key is sometimes null: Again in tap-dbt, the runs stream is set to replicate incrementally using `finished_at` as the replication key. However, this field is sometimes NULL for our runs. Are there any workarounds for this, aside from tweaking our local code to replicate the runs table with full table replication?
e
Hi @chris_schmid! Sorry for the delay in responding. I'm the creator and maintainer of MeltanoLabs/tap-dbt, but I no longer pay for dbt Cloud so I don't have access to the API for testing.
1. Allowing NULL types globally
We probably need to either go and manually add the `"null"` type in the schemas, or refactor them to use the SDK typing helpers, which make fields nullable by default.
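As a rough sketch of that second option (illustrative only, not tap-dbt's actual schema code), declaring the runs stream properties with the SDK typing helpers would look something like this; per the note above, properties declared this way allow null unless marked required:

```python
# Sketch only - not tap-dbt's real schema definitions. Declaring properties
# with the Singer SDK typing helpers; non-required properties are nullable
# by default, so null dbt_version values would pass schema validation.
from singer_sdk import typing as th

RUNS_SCHEMA = th.PropertiesList(
    th.Property("id", th.IntegerType, required=True),
    th.Property("dbt_version", th.StringType),
    th.Property("finished_at", th.DateTimeType),
).to_dict()
```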
2. Incremental replication where the replication key is sometimes null:
I'm not sure what the best approach is here, and whether it's really unexpected that `finished_at` is null or whether it simply can't be used as a replication key. Regardless, PRs are more than welcome!
c
Thanks for the reply @edgar_ramirez_mondragon! I'm not too confident in my ability to submit a PR for this, in particular for part 2 (replication method). For us it's relatively common to have a null finished_at timestamp. I think this occurs when a run is queued but is cancelled or errors before it actually starts. In that case a run_id is created, but it's never tied to a started_at or finished_at timestamp. The runs stream also has an updated_at timestamp, which I think would be the more logical option for incremental replication: it's never null, and it allows you to capture post-run revisions to records if that occurs. But I don't know how easy it would be to switch that in the code.
e
No problem.
> Is there a way to inject logic into the metadata file from the meltano.yml file that replaces all instances of `"type": "string"` with `"type": ["string", "null"]`?
There's no way to replace all instances of a type, but you can use the schema setting to override the types.
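A sketch of that approach (assuming the stream is named runs in the catalog and reusing the dbt_version field from above; exact names may differ in your catalog):

```yaml
# meltano.yml (sketch): use the extractor's `schema` extra to override a
# property's JSON schema so that null values are allowed.
plugins:
  extractors:
    - name: tap-dbt
      schema:
        runs:
          dbt_version:
            type: ["string", "null"]
```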
https://github.com/MeltanoLabs/tap-dbt/issues/212 and https://github.com/MeltanoLabs/tap-dbt/issues/213. I can probably get to #212 some time next week, but #213 needs a deeper dive. In the meantime, if you're using Meltano (which I assumed in my comment above, btw) you could try overriding the replication method with the metadata setting.
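For example (a sketch only; the stream name in the catalog is assumed to be runs), forcing full-table replication from meltano.yml might look like:

```yaml
# meltano.yml (sketch): override the replication method for the runs stream
# via the extractor's `metadata` extra.
plugins:
  extractors:
    - name: tap-dbt
      metadata:
        runs:
          replication-method: FULL_TABLE
```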
m
Hi @chris_schmid - thanks for raising those issues, I coded up the incremental replication method for the dbt runs endpoint based on observations with our implementation. I'm not sure if we've encountered any jobs that behave in the way you've described, so it's good to have some more information about how jobs run on other users' platforms. I'm keen to help out with both of these issues.
I've remembered the problem with using `updated_at` - the API does not allow ordering by that key, so it is not possible to perform the reverse-sort method that I describe in the README.md. I will add a more detailed description on: https://github.com/MeltanoLabs/tap-dbt/issues/213
@chris_schmid - The way the incremental replication on `finished_at` works, I think all records with `finished_at=null` would be extracted first. This might lead to a lot of records being extracted repeatedly if you have a lot of 'runs' where that is the case. The example in the README.md illustrates how the replication works: https://github.com/MeltanoLabs/tap-dbt#incremental-run-stream. I'm trying to work out how the null values coming through are causing issues in the replication so we can work on a fix.
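To make that behaviour concrete, here's a rough sketch of the reverse-sort idea as described above (an illustration only, not tap-dbt's actual code; timestamps are treated as ISO-8601 strings):

```python
# Illustration only - not tap-dbt's implementation. Walk runs ordered by
# finished_at descending and stop once records fall behind the bookmark.
# Runs with finished_at=None sort ahead of everything else, so they are
# re-extracted on every sync.

def extract_new_runs(runs_desc, bookmark=None):
    """Yield runs until one is older than the bookmark.

    runs_desc: runs sorted by finished_at descending, null-finished first.
    bookmark:  ISO-8601 string of the newest finished_at already synced.
    """
    for run in runs_desc:
        finished_at = run.get("finished_at")
        if finished_at is None:
            # No position in the sort order: emitted again on every sync.
            yield run
        elif bookmark is not None and finished_at <= bookmark:
            # Everything from here on is older than the bookmark; stop.
            break
        else:
            yield run
```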
e
Thanks for chiming in @mark_johnston! I've published v0.8.2 of tap-dbt with your `null` fix.
c
Thanks everyone! We've only been dipping our toes into Meltano here and there, but I love how collaborative the community is. I suspected that dbt didn't support ordering API calls by the updated_at timestamp based on their docs. It seems like an odd choice on their side, and perhaps something they would consider adjusting; I'll reach out to our support team there. @mark_johnston Replicating all runs where finished_at=null is the ideal behavior in my mind. That allows you to pull information on ongoing runs before they complete. In our dbt instance we don't have many runs where finished_at remains NULL, and I assume that would be the same for most other people as well. This only happens when there's a problem with our queue, which preferably wouldn't happen at all.
Let me know if it would help for me to provide additional logs on failed runs when finished_at is null. I can post to the GitHub issue or here.
m
Thanks @chris_schmid - it would be good to know if dbt Cloud goes and tidies up these records later, or whether they just stay failed with `finished_at=null` forever. If they do stay like that forever then, the way the incremental method is written, the set of runs where `finished_at=null` will appear first on each run, but you could deduplicate them based on records you've already 'seen' by `id` - we do this with target-snowflake using `id` as the primary_key. As you say, `updated_at` would be the ideal field to use for the replication_key; thanks for reaching out to dbt - keep us updated if they come back with anything.
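As a tiny illustration of that dedup idea (a sketch only; target-snowflake itself does this by merging on the primary key, this is not its code):

```python
# Sketch: keep one copy of each run, keyed by its 'id' primary key, so a run
# re-extracted on a later sync (e.g. because its finished_at was still null)
# overwrites the earlier copy instead of accumulating as a duplicate.

def upsert_by_id(existing, new_records):
    """Merge new_records into the `existing` dict, keyed by 'id'."""
    for record in new_records:
        existing[record["id"]] = record
    return existing
```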
c
dbt got back to me. They said their Discovery API allows for sorting results by updated_at, and that we should be using that instead: https://docs.getdbt.com/docs/dbt-cloud-apis/discovery-api
m
Looks like we might have a bit of a refactor on our hands here. I'll see if our implementation supports this tomorrow if I get a chance.
Our platform does seem to support this, but I've had no joy obtaining any data from the GraphQL-based API. Apologies, I can't devote much time to this at the moment; I have made a very early attempt at pulling from the GraphQL API here: https://github.com/mjsqu/tap-dbt/tree/feat/discovery_api_graphql
c
Understood! We tabled this on our side for now, using webhooks to capture the events we care about instead. But we would definitely be interested in using a GraphQL version of tap-dbt once it's supported.
e
Thanks for getting the ball rolling on the GraphQL implementation @mark_johnston! I've logged https://github.com/MeltanoLabs/tap-dbt/issues/223 and invited you to be a member of the repo 😁 (no time or work commitments expected, just wanted to let you create branches in the repo 🙂)