# singer-tap-development
v
Data Tests for Taps (Didn't want to blow up https://meltano.slack.com/archives/C01PKLU5D1R/p1635450531125300 )
I have a need to run tests on data that comes in. I'm thinking data tests, like those defined in dbt. I'm honestly thinking the easiest way to do this might be with dbt. Idea is:
1. Run `pytest`
2. Spin up a local container with Postgres
3. Run `meltano elt tap-name target-postgres --transform dbt:test` (wrong command for dbt, I'm sure)
4. Profit?
Anyone done this / have pointers? The big thing for me is making sure the code is testable without needing a server set up beforehand.
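Roughly, I'm picturing the pytest side like this -- a sketch only, with placeholder tap/target names, assuming the Postgres container from step 2 is already up and the Meltano command syntax is close to right:
```python
import subprocess

def test_elt_then_dbt_tests():
    # Step 3: run the sync into the local Postgres (flags are a guess).
    subprocess.run(
        ["meltano", "elt", "tap-name", "target-postgres"],
        check=True,
    )
    # Then run the dbt tests against the loaded data
    # (exact invocation is a guess, per the thread).
    subprocess.run(["meltano", "invoke", "dbt:test"], check=True)
```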
s
i am doing this now! literally testing out dbt-like tests last night
v
nice, are you using a container to spin up a local postgres?
s
no, running a limited sync and doing data tests on the records coming in: https://gitlab.com/meltano/sdk/-/merge_requests/197
v
Does `meltano elt tap-name target-postgres --transform dbt:test` work? I could just run `meltano dbt:test` as well
s
are you trying to run tests as part of the tap development, or post-sync?
i am working more on the tap development side
v
tap development actually. But I was thinking the overlap here with dbt is large; if the overhead isn't too much, why not just run the same tests I'd run in production?
s
because you'd have to test after load somewhere, and the load configuration could change the way the data is represented
v
True, you're coupling your testing to something you don't control: `target-*` and `dbt`.
s
but i think the ergonomics of the dbt testing are right -- the tests should just live with the schema definition. it's very natural to think of this:
```python
th.Property("updated_at", th.DateTimeType, expectations=["is_valid_timestamp", "not_null"])
```
and then run `pytest` and it runs the `not_null` and `is_valid_timestamp` tests
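to make that concrete, here's a minimal sketch of how named expectations like these could resolve to record-level checks at test time -- the `EXPECTATIONS` mapping and `check_attribute` helper are hypothetical illustrations, not the SDK's actual API:
```python
from datetime import datetime

def _is_valid_timestamp(value) -> bool:
    """Check that a value parses as an ISO-8601 timestamp."""
    try:
        datetime.fromisoformat(str(value))
        return True
    except ValueError:
        return False

# Registry mapping expectation names to per-value checks.
EXPECTATIONS = {
    "not_null": lambda value: value is not None,
    "is_valid_timestamp": _is_valid_timestamp,
}

def check_attribute(records, attribute_name, expectation_names):
    """Assert every named expectation holds for the attribute in every record."""
    for name in expectation_names:
        check = EXPECTATIONS[name]
        bad = [r for r in records if not check(r.get(attribute_name))]
        assert not bad, f"{len(bad)} records failed {name!r} on {attribute_name!r}"
```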
v
Schema tests, that's nice! My use case at the moment is data tests, e.g. `select count(*) from tasks where archived = 'true'`; the count should be >= 1
s
is that a custom data test in dbt?
v
I haven't written it yet, was thinking about just doing that as a dbt test, yes!
Debating on doing that in `dbt` or not as part of the test run with the tap
s
with the MR above, you could write:
```python
def test_archived_tasks_greater_than_zero(test_util):
    records = test_util.records["tasks"]  # records from the limited sync
    assert len([r for r in records if r["archived"]]) > 0
```
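for a sense of the shape, the `test_util` fixture there runs a limited sync and buckets records by stream, something like this -- a guess at the idea, not the MR's actual code, with `TapExample` as a placeholder:
```python
import pytest

class TapTestUtil:
    """Hypothetical stand-in for the MR's test utility."""

    def __init__(self, tap):
        # Run each stream and bucket the emitted records by stream name.
        self.records = {
            stream.name: list(stream.get_records(context=None))
            for stream in tap.streams.values()
        }

@pytest.fixture(scope="session")
def test_util():
    from tap_example.tap import TapExample  # placeholder tap class
    return TapTestUtil(TapExample(config={}))
```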
t
The work @florian.hines has been doing will enable this workflow soon! (I know I keep promising - it’s coming!)
At GitLab I had a DAG that basically did this: ran tests on the source data post-load, ran dbt to build the base models, then ran tests on the built models. I think that's a pattern Meltano could really enable
v
Nice! @stephen_bailey that looks pretty nice. @taylor what do you think about running tests right in the tap for tap developers?
t
with dbt? that’s harder to imagine. I could maybe see Great Expectations… are you thinking this is part of the tap codebase?
v
I think it's a bit nutty, after @stephen_bailey called me on how much easier it'd be not to. But yeah, in the tests dir, throw a `meltano elt tap-name target-postgres`, then run your dbt tests against it
the trick is something like podman / docker to spin up an ephemeral postgres instance, e.g.
```bash
podman run -e POSTGRES_PASSWORD=postgres -p 5432:5432 -h postgres -d postgres
```
🤷
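wrapped in a pytest fixture, the ephemeral instance could look roughly like this -- a sketch assuming podman on the path and the default Postgres image and port, with a crude startup wait:
```python
import subprocess
import time

import pytest

@pytest.fixture(scope="session")
def postgres_container():
    """Start a throwaway Postgres container for the test session."""
    container_id = subprocess.run(
        [
            "podman", "run", "-e", "POSTGRES_PASSWORD=postgres",
            "-p", "5432:5432", "-h", "postgres", "-d", "postgres",
        ],
        check=True, capture_output=True, text=True,
    ).stdout.strip()
    time.sleep(5)  # crude; polling `pg_isready` would be more robust
    yield container_id
    subprocess.run(["podman", "rm", "-f", container_id], check=True)
```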
t
I’d almost rather have a SQLite or DuckDB instance
don’t know if dbt runs against those lol
v
dbt no likey
Yeah 😕
t
😬
But when it gets more complicated it seems like dbt would be the way to go
s
so i was playing around with this over the weekend, and there's a really nice pattern that could be implemented to build "self-testing taps". basically, all tests could be based off of the tap / stream / schema attributes and dynamically generated at test run time, with each test being run as a separate test. it's pretty neat! currently have it implemented as a `TEST_MANIFEST` list and a `pytest.mark.parametrize` decorator. the manifest is a list of (`test_name`, `params`) tuples:
```python
TEST_MANIFEST = [
    ("tap__cli", {}),
    ("tap__discovery", {}),
    ("tap__stream_connections", {}),
    ("stream__catalog_schema_matches_record", {"stream_name": "channels"}),
    ("stream__record_schema_matches_catalog", {"stream_name": "channels"}),
    ...
    ("stream__returns_record", {"stream_name": "users"}),
    ("stream__primary_key", {"stream_name": "users"}),
    ("attribute__unique", {"stream_name": "channels", "attribute_name": "id"}),
    ("attribute__not_null", {"stream_name": "channels", "attribute_name": "id"}),
]
```
and then you parameterize a single function that pulls the test function from the test utility and runs it.
```python
@pytest.mark.parametrize("test_config", TEST_MANIFEST)
def test_builtin_tap_tests(test_util, test_config):
    test_name, params = test_config
    test_func = test_util.available_tests[test_name]
    test_func(**params)
```
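for a sense of the other half, `available_tests` would map those manifest names to callables closed over the synced records -- a hypothetical sketch, not the actual implementation:
```python
def build_available_tests(records):
    """Build the name -> test-callable registry from synced records."""

    def stream__returns_record(stream_name):
        assert len(records[stream_name]) > 0

    def attribute__not_null(stream_name, attribute_name):
        assert all(
            r.get(attribute_name) is not None for r in records[stream_name]
        )

    def attribute__unique(stream_name, attribute_name):
        values = [r.get(attribute_name) for r in records[stream_name]]
        assert len(values) == len(set(values))

    return {
        "stream__returns_record": stream__returns_record,
        "attribute__not_null": attribute__not_null,
        "attribute__unique": attribute__unique,
        # ... one entry per built-in test name in the manifest
    }
```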
will carve out some time and demo it next week. it works surprisingly well
wouldn't handle the custom data tests, but it could handle a lot of things really well, and it would require basically no changes to a cookiecutter test template
v
Curious stuff! Still debating how I want to test my tap 😕
s
what are the main things you want to test?