# troubleshooting
c
Trying to run tap-dbt and I'm running into some issues that I think would apply more generally to other taps too, but I couldn't find an answer in the forums or Slack:

1. Allowing NULL types globally: The expected schema for most of the fields in tap-dbt does not allow NULL values. For example, in the runs stream the field dbt_version is configured to expect a type of string, like:
"dbt_version": {
            "type": "string"
          }
The config above results in an error whenever meltano encounters a null value in that field, like the below run:

```
2023-09-06T160752.170412Z [info ] 2023-09-06 160752,169 | INFO | singer_sdk.metrics | METRIC: {"type": "timer", "metric": "http_request_duration", "value": 0.229858, "tags": {"stream": "jobs", "endpoint": "/accounts/{account_id}/jobs", "http_status_code": 200, "status": "succeeded", "context": {"account_id": "10"}}} cmd_type=elb consumer=False name=tap-dbt producer=True stdio=stderr string_id=tap-dbt
2023-09-06T160752.209022Z [info ] 2023-09-06 160752,208 | WARNING | tap-dbt | Properties ('execution', 'run_generate_sources', 'raw_dbt_version', 'created_at', 'updated_at', 'deactivated', 'run_failure_count', 'deferring_job_definition_id', 'deferring_environment_id', 'lifecycle_webhooks', 'lifecycle_webhooks_url', 'is_deferrable', 'job_type', 'triggers_on_draft_pr', 'generate_sources', 'cron_humanized', 'next_run', 'next_run_humanized') were present in the 'jobs' stream but not found in catalog schema. Ignoring. cmd_type=elb consumer=False name=tap-dbt producer=True stdio=stderr string_id=tap-dbt
2023-09-06T160752.210651Z [info ] Traceback (most recent call last): cmd_type=elb consumer=True name=target-jsonl producer=False stdio=stderr string_id=target-jsonl
2023-09-06T160752.211157Z [info ] File "/project/.meltano/loaders/target-jsonl/venv/bin/target-jsonl", line 8, in <module> cmd_type=elb consumer=True name=target-jsonl producer=False stdio=stderr string_id=target-jsonl
2023-09-06T160752.211339Z [info ] sys.exit(main()) cmd_type=elb consumer=True name=target-jsonl producer=False stdio=stderr string_id=target-jsonl
2023-09-06T160752.211741Z [info ] File "/project/.meltano/loaders/target-jsonl/venv/lib/python3.9/site-packages/target_jsonl.py", line 92, in main cmd_type=elb consumer=True name=target-jsonl producer=False stdio=stderr string_id=target-jsonl
2023-09-06T160752.212592Z [info ] state = persist_messages( cmd_type=elb consumer=True name=target-jsonl producer=False stdio=stderr string_id=target-jsonl
2023-09-06T160752.212920Z [info ] File "/project/.meltano/loaders/target-jsonl/venv/lib/python3.9/site-packages/target_jsonl.py", line 54, in persist_messages cmd_type=elb consumer=True name=target-jsonl producer=False stdio=stderr string_id=target-jsonl
2023-09-06T160752.213027Z [info ] validators[o['stream']].validate((o['record'])) cmd_type=elb consumer=True name=target-jsonl producer=False stdio=stderr string_id=target-jsonl
2023-09-06T160752.213166Z [info ] File "/project/.meltano/loaders/target-jsonl/venv/lib/python3.9/site-packages/jsonschema/validators.py", line 130, in validate cmd_type=elb consumer=True name=target-jsonl producer=False stdio=stderr string_id=target-jsonl
2023-09-06T160752.213307Z [info ] raise error cmd_type=elb consumer=True name=target-jsonl producer=False stdio=stderr string_id=target-jsonl
2023-09-06T160752.213450Z [info ] jsonschema.exceptions.ValidationError: None is not of type 'string' cmd_type=elb consumer=True name=target-jsonl producer=False stdio=stderr string_id=target-jsonl
2023-09-06T160752.213837Z [info ] cmd_type=elb consumer=True name=target-jsonl producer=False stdio=stderr string_id=target-jsonl
2023-09-06T160752.214001Z [info ] Failed validating 'type' in schema['propert…
```
2. Incremental replication where the replication key is sometimes null: Again in tap-dbt, the runs stream is set to replicate incrementally using `finished_at` as the replication key. However, this field is sometimes NULL for our runs. Are there any workarounds for this, aside from tweaking our local code to replicate the runs table with full table replication?
e
Hi @chris_schmid! Sorry for the delay in responding. I'm the creator and maintainer of MeltanoLabs/tap-dbt, but I no longer pay for dbt Cloud so I don't have access to the API for testing.
1. Allowing NULL types globally
We probably need to either go and manually add the `"null"` type in the schemas, or refactor them to use the SDK typing helpers, which make fields nullable by default.
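As a rough sketch of that second option (illustrative only, not tap-dbt's actual schema code), declaring the runs stream properties with the SDK typing helpers would look something like this; per the note above, properties declared this way allow null unless marked required:

```python
# Sketch only - not tap-dbt's real schema definitions. Declaring properties
# with the Singer SDK typing helpers; non-required properties are nullable
# by default, so null dbt_version values would pass schema validation.
from singer_sdk import typing as th

RUNS_SCHEMA = th.PropertiesList(
    th.Property("id", th.IntegerType, required=True),
    th.Property("dbt_version", th.StringType),
    th.Property("finished_at", th.DateTimeType),
).to_dict()
```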
2. Incremental replication where the replication key is sometimes null:
I'm not sure what the best approach is here, and whether it's really unexpected that `finished_at` is null or whether it simply can't be used as a replication key. Regardless, PRs are more than welcome!
c
Thanks for the reply @edgar_ramirez_mondragon! I'm not too confident in my ability to submit a PR for this, in particular for part 2 (replication method). For us it's relatively common to have a null finished_at timestamp. I think this occurs when a run is queued but is cancelled or errors before it actually starts. In that case a run_id is created, but it's never tied to a started_at or finished_at timestamp. The runs stream also has an updated_at timestamp, which I think would be the more logical option for incremental replication: it's never null, and it allows you to capture post-run revisions to records if that occurs. But I don't know how easy it would be to switch that in the code.
e
No problem.
> Is there a way to inject logic into the metadata file from the meltano.yml file that replaces all instances of `"type": "string"` with `"type": ["string", "null"]`?
There's no way to replace all instances of a type, but you can use the schema setting to override the types.
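A sketch of that approach (assuming the stream is named runs in the catalog and reusing the dbt_version field from above; exact names may differ in your catalog):

```yaml
# meltano.yml (sketch): use the extractor's `schema` extra to override a
# property's JSON schema so that null values are allowed.
plugins:
  extractors:
    - name: tap-dbt
      schema:
        runs:
          dbt_version:
            type: ["string", "null"]
```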
https://github.com/MeltanoLabs/tap-dbt/issues/212 and https://github.com/MeltanoLabs/tap-dbt/issues/213. I can probably get to #212 some time next week, but #213 needs a deeper dive. In the meantime, if you're using Meltano (which I assumed in my comment above, btw) you could try overriding the replication method with the metadata setting.
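For example (a sketch only; the stream name in the catalog is assumed to be runs), forcing full-table replication from meltano.yml might look like:

```yaml
# meltano.yml (sketch): override the replication method for the runs stream
# via the extractor's `metadata` extra.
plugins:
  extractors:
    - name: tap-dbt
      metadata:
        runs:
          replication-method: FULL_TABLE
```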
m
Hi @chris_schmid - thanks for raising those issues, I coded up the incremental replication method for the dbt runs endpoint based on observations with our implementation. I'm not sure if we've encountered any jobs that behave in the way you've described, so it's good to have some more information about how jobs run on other users' platforms. I'm keen to help out with both of these issues.
I've remembered the problem with using `updated_at` - the API does not allow ordering by that key, so it is not possible to perform the reverse-sort method that I describe in the README.md. I will add a more detailed description on: https://github.com/MeltanoLabs/tap-dbt/issues/213
@chris_schmid - The way the incremental replication on `finished_at` works, I think all records with `finished_at=null` would be extracted first. This might lead to a lot of records being extracted repeatedly if you have a lot of 'runs' where that is the case. The example in the README.md illustrates how the replication works: https://github.com/MeltanoLabs/tap-dbt#incremental-run-stream. I'm trying to work out how the null values coming through are causing issues in the replication so we can work on a fix.
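To make that behaviour concrete, here's a rough sketch of the reverse-sort idea as described above (an illustration only, not tap-dbt's actual code; timestamps are treated as ISO-8601 strings):

```python
# Illustration only - not tap-dbt's implementation. Walk runs ordered by
# finished_at descending and stop once records fall behind the bookmark.
# Runs with finished_at=None sort ahead of everything else, so they are
# re-extracted on every sync.

def extract_new_runs(runs_desc, bookmark=None):
    """Yield runs until one is older than the bookmark.

    runs_desc: runs sorted by finished_at descending, null-finished first.
    bookmark:  ISO-8601 string of the newest finished_at already synced.
    """
    for run in runs_desc:
        finished_at = run.get("finished_at")
        if finished_at is None:
            # No position in the sort order: emitted again on every sync.
            yield run
        elif bookmark is not None and finished_at <= bookmark:
            # Everything from here on is older than the bookmark; stop.
            break
        else:
            yield run
```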
e
Thanks for chiming in @mark_johnston! I've published v0.8.2 of tap-dbt with your `null` fix.
c
Thanks everyone! We've only been dipping our toes into Meltano here and there, but I love how collaborative the community is. I suspected that dbt didn't support ordering API calls by the updated_at timestamp based on their docs. It seems like an odd choice on their side, and perhaps something they would consider adjusting; I'll reach out to our support team there. @mark_johnston Replicating all runs where finished_at=null is the ideal behavior in my mind. That allows you to pull information on ongoing runs before they complete. In our dbt instance we don't have many runs where finished_at remains NULL, and I assume that would be the same for most other people as well. This only happens when there's a problem with our queue, which preferably wouldn't happen at all.
Let me know if it would help for me to provide additional logs on failed runs when finished_at is null. I can post to the GitHub issue or here.
m
Thanks @chris_schmid - it would be good to know if dbt Cloud goes and tidies up these records later, or whether they just stay failed with `finished_at=null` forever. If they do stay like that forever then, the way the incremental method is written, the set of runs where `finished_at=null` will appear first on each run, but you could deduplicate them based on records you've already 'seen' by `id` - we do this with target-snowflake using `id` as the primary_key. As you say, `updated_at` would be the ideal field to use for the replication_key; thanks for reaching out to dbt - keep us updated if they come back with anything.
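As a tiny illustration of that dedup idea (a sketch only; target-snowflake itself does this by merging on the primary key, this is not its code):

```python
# Sketch: keep one copy of each run, keyed by its 'id' primary key, so a run
# re-extracted on a later sync (e.g. because its finished_at was still null)
# overwrites the earlier copy instead of accumulating as a duplicate.

def upsert_by_id(existing, new_records):
    """Merge new_records into the `existing` dict, keyed by 'id'."""
    for record in new_records:
        existing[record["id"]] = record
    return existing
```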
c
dbt got back to me. They said their Discovery API allows for sorting results by updated_at, and that we should be using that instead: https://docs.getdbt.com/docs/dbt-cloud-apis/discovery-api
m
Looks like we might have a bit of a refactor on our hands here. I'll see if our implementation supports this tomorrow if I get a chance.
Our platform does seem to support this, but I've had no joy obtaining any data from the GraphQL-based API. Apologies, I can't devote much time to this at the moment; I have made a very early attempt at pulling from the GraphQL API here: https://github.com/mjsqu/tap-dbt/tree/feat/discovery_api_graphql
c
Understood! We tabled this on our side for now, using webhooks to capture the events we care about instead. But we would definitely be interested in using a GraphQL version of tap-dbt once it's supported.
e
Thanks for getting the ball rolling on the GraphQL implementation @mark_johnston! I've logged https://github.com/MeltanoLabs/tap-dbt/issues/223 and invited you to be a member of the repo 😁 (no time or work commitments expected, just wanted to let you create branches in the repo 🙂)