# singer-tap-development
v
Data Tests for Taps (Didn't want to blow up https://meltano.slack.com/archives/C01PKLU5D1R/p1635450531125300 )
I have a need to run tests on data that comes in. I'm thinking data tests, like those defined in dbt. I'm honestly thinking the easiest way to do this might be with dbt. Idea is:
1. Run `pytest`
2. Spin up a local container with Postgres
3. Run `meltano elt tap-name target-postgres --transform dbt:test` (wrong command for dbt, I'm sure)
4. Profit?
Anyone done this / have pointers? The big thing for me is making sure the code is testable without needing a server set up beforehand.
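Roughly, I'm picturing the pytest side like this -- a sketch only, with placeholder tap/target names, assuming the Postgres container from step 2 is already up and the Meltano command syntax is close to right:
```python
import subprocess

def test_elt_then_dbt_tests():
    # Step 3: run the sync into the local Postgres (flags are a guess).
    subprocess.run(
        ["meltano", "elt", "tap-name", "target-postgres"],
        check=True,
    )
    # Then run the dbt tests against the loaded data
    # (exact invocation is a guess, per the thread).
    subprocess.run(["meltano", "invoke", "dbt:test"], check=True)
```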
s
i am doing this now! literally testing out dbt-like tests last night
v
nice, are you using a container to spin up a local postgres?
s
no, running a limited sync and doing data tests on the records coming in: https://gitlab.com/meltano/sdk/-/merge_requests/197
v
Does `meltano elt tap-name target-postgres --transform dbt:test` work? I could just run `meltano dbt:test` as well
s
are you trying to run tests as part of the tap development, or post-sync?
i am working more on the tap development side
v
tap development actually. But I was thinking the overlap here with dbt is large; if the overhead isn't too much, why not just run the same tests I'd run in production?
s
because you'd have to test after load somewhere, and the load configuration could change the way the data is represented
v
True, you're coupling your testing to something you don't control: `target-*` and `dbt`.
s
but i think the ergonomics of the dbt testing are right -- the tests should just live with the schema definition. it's very natural to think of this:
```python
th.Property("updated_at", th.DateTimeType, expectations=["is_valid_timestamp", "not_null"])
```
and then run `pytest` and it runs the `not_null` and `is_valid_timestamp` tests
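to make that concrete, here's a minimal sketch of how named expectations like these could resolve to record-level checks at test time -- the `EXPECTATIONS` mapping and `check_attribute` helper are hypothetical illustrations, not the SDK's actual API:
```python
from datetime import datetime

def _is_valid_timestamp(value) -> bool:
    """Check that a value parses as an ISO-8601 timestamp."""
    try:
        datetime.fromisoformat(str(value))
        return True
    except ValueError:
        return False

# Registry mapping expectation names to per-value checks.
EXPECTATIONS = {
    "not_null": lambda value: value is not None,
    "is_valid_timestamp": _is_valid_timestamp,
}

def check_attribute(records, attribute_name, expectation_names):
    """Assert every named expectation holds for the attribute in every record."""
    for name in expectation_names:
        check = EXPECTATIONS[name]
        bad = [r for r in records if not check(r.get(attribute_name))]
        assert not bad, f"{len(bad)} records failed {name!r} on {attribute_name!r}"
```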
v
Schema tests, that's nice! My use case at the moment is data tests, e.g. `select count(*) from tasks where archived = 'true'`; the count should be >= 1
s
is that a custom data test in dbt?
v
I haven't written it yet, was thinking about just doing that as a dbt test, yes!
Debating on doing that in `dbt` or not as part of the test run with the tap
s
with the MR above, you could write:
```python
def test_archived_tasks_greater_than_zero(test_util):
    records = test_util.records["tasks"]  # records from the limited sync
    assert len([r for r in records if r["archived"]]) > 0
```
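for a sense of the shape, the `test_util` fixture there runs a limited sync and buckets records by stream, something like this -- a guess at the idea, not the MR's actual code, with `TapExample` as a placeholder:
```python
import pytest

class TapTestUtil:
    """Hypothetical stand-in for the MR's test utility."""

    def __init__(self, tap):
        # Run each stream and bucket the emitted records by stream name.
        self.records = {
            stream.name: list(stream.get_records(context=None))
            for stream in tap.streams.values()
        }

@pytest.fixture(scope="session")
def test_util():
    from tap_example.tap import TapExample  # placeholder tap class
    return TapTestUtil(TapExample(config={}))
```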
t
The work @florian.hines has been doing will enable this workflow soon! (I know I keep promising - it’s coming!)
At GitLab I had a DAG that basically did this: ran tests on the source data post-load, ran dbt to build the base models, then ran tests on the built models. I think that's a pattern Meltano could really enable
v
Nice! @stephen_bailey that looks pretty nice. @taylor what do you think about running tests right in the tap for tap developers?
t
with dbt? that’s harder to imagine. I could maybe see Great Expectations… are you thinking this is part of the tap codebase?
v
I think it's a bit nutty, after @stephen_bailey called me on how much easier it'd be not to. But yeah, in the tests dir, throw a `meltano elt tap-name target-postgres`, then run your dbt tests against it
the trick is something like podman / docker to spin up an ephemeral postgres instance, e.g.
```bash
podman run -e POSTGRES_PASSWORD=postgres -p 5432:5432 -h postgres -d postgres
```
🤷
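wrapped in a pytest fixture, the ephemeral instance could look roughly like this -- a sketch assuming podman on the path and the default Postgres image and port, with a crude startup wait:
```python
import subprocess
import time

import pytest

@pytest.fixture(scope="session")
def postgres_container():
    """Start a throwaway Postgres container for the test session."""
    container_id = subprocess.run(
        [
            "podman", "run", "-e", "POSTGRES_PASSWORD=postgres",
            "-p", "5432:5432", "-h", "postgres", "-d", "postgres",
        ],
        check=True, capture_output=True, text=True,
    ).stdout.strip()
    time.sleep(5)  # crude; polling `pg_isready` would be more robust
    yield container_id
    subprocess.run(["podman", "rm", "-f", container_id], check=True)
```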
t
I’d almost rather have a SQLite or DuckDB instance
don’t know if dbt runs against those lol
v
dbt no likey
Yeah 😕
t
😬
But when it gets more complicated it seems like dbt would be the way to go
s
so i was playing around with this over the weekend, and there's a really nice pattern that could be implemented to build "self-testing taps". basically, all tests could be based off of the tap / stream / schema attributes and dynamically generated at test run time, with each test being run as a separate test. it's pretty neat! currently have it implemented as a `TEST_MANIFEST` list and a `pytest.mark.parametrize` decorator. the manifest is a list of (`test_name`, `params`) tuples:
```python
TEST_MANIFEST = [
    ("tap__cli", {}),
    ("tap__discovery", {}),
    ("tap__stream_connections", {}),
    ("stream__catalog_schema_matches_record", {"stream_name": "channels"}),
    ("stream__record_schema_matches_catalog", {"stream_name": "channels"}),
    ...
    ("stream__returns_record", {"stream_name": "users"}),
    ("stream__primary_key", {"stream_name": "users"}),
    ("attribute__unique", {"stream_name": "channels", "attribute_name": "id"}),
    ("attribute__not_null", {"stream_name": "channels", "attribute_name": "id"}),
]
```
and then you parameterize a single function that pulls the test function from the test utility and runs it.
```python
@pytest.mark.parametrize("test_config", TEST_MANIFEST)
def test_builtin_tap_tests(test_util, test_config):
    test_name, params = test_config
    test_func = test_util.available_tests[test_name]
    test_func(**params)
```
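for a sense of the other half, `available_tests` would map those manifest names to callables closed over the synced records -- a hypothetical sketch, not the actual implementation:
```python
def build_available_tests(records):
    """Build the name -> test-callable registry from synced records."""

    def stream__returns_record(stream_name):
        assert len(records[stream_name]) > 0

    def attribute__not_null(stream_name, attribute_name):
        assert all(
            r.get(attribute_name) is not None for r in records[stream_name]
        )

    def attribute__unique(stream_name, attribute_name):
        values = [r.get(attribute_name) for r in records[stream_name]]
        assert len(values) == len(set(values))

    return {
        "stream__returns_record": stream__returns_record,
        "attribute__not_null": attribute__not_null,
        "attribute__unique": attribute__unique,
        # ... one entry per built-in test name in the manifest
    }
```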
will carve out some time and demo it next week. it works surprisingly well
wouldn't handle the custom data tests, but it could handle a lot of things really well, and it would require basically no changes to a cookiecutter test template
v
Curious stuff! Still debating how I want to test my tap 😕
s
what are the main things you want to test?