I have run into an issue with two taps this week where the s Meltano #singer-tap-development


stephen_bailey

10/07/2021, 12:00 PM

I have run into an issue with two taps this week where the streams are outputting records that have null values in primary key columns. They are primary key columns (`id`, for example), but the source system apparently does not enforce strict constraints on them, or has special cases where they might be omitted. The tap completes successfully, but there are errors on load due to the null primary keys. I'd like to just remove these from the yielded records, and was wondering if others have suggestions. Right now, I'm just adding a filter into `parse_response`:

def parse_response(self, response: requests.Response) -> Iterable[dict]:
    records = extract_jsonpath(self.records_jsonpath, input=response.json())
    yield from (
        row for row in records
        if all(row.get(k) is not None for k in self.primary_keys)
    )

But wondering if others have tackled this before, or if this would be a generally useful tap feature?
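For readers who want to try the filter outside a tap, here is a minimal standalone sketch of the same idea. The function name `drop_null_primary_keys` and the sample rows are hypothetical, chosen only for illustration; nothing here is part of the Meltano SDK.

```python
from typing import Any, Dict, Iterable, Iterator, Sequence


def drop_null_primary_keys(
    records: Iterable[Dict[str, Any]],
    primary_keys: Sequence[str],
) -> Iterator[Dict[str, Any]]:
    """Yield only records whose primary key fields are all present and non-null."""
    for row in records:
        # row.get(key) returns None for a missing key as well as an explicit null.
        if all(row.get(key) is not None for key in primary_keys):
            yield row


# Hypothetical sample data: one valid row, one null key, one missing key.
rows = [
    {"id": 1, "name": "a"},
    {"id": None, "name": "b"},
    {"name": "c"},
]
kept = list(drop_null_primary_keys(rows, ["id"]))
```

Because it is a generator, records are filtered one at a time rather than collected into a list first.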

stephen_bailey

10/07/2021, 12:03 PM

For a generalizable approach, I was thinking that having some default enforcement on the `required=True` attribute in the catalog may make sense, with something like a `filter_records_with_schema_violations=True` flag.
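To make the proposal concrete, here is a rough sketch of what such enforcement could look like. It is purely hypothetical: the flag does not exist in the SDK, the helper names are invented, and it assumes the catalog schema is a plain JSON Schema dict whose `required` list names the mandatory fields. It also treats an explicit null the same as a missing key, which is stricter than JSON Schema's own definition of `required`.

```python
from typing import Any, Dict, Iterable, Iterator, List


def missing_required(record: Dict[str, Any], schema: Dict[str, Any]) -> List[str]:
    """Return the required field names that are absent or null in the record."""
    return [name for name in schema.get("required", []) if record.get(name) is None]


def filter_schema_violations(
    records: Iterable[Dict[str, Any]],
    schema: Dict[str, Any],
    enabled: bool = True,  # stands in for the proposed (hypothetical) flag
) -> Iterator[Dict[str, Any]]:
    """Drop records that lack a required field when the flag is enabled."""
    for record in records:
        if enabled and missing_required(record, schema):
            continue
        yield record


# Hypothetical schema and rows for illustration.
schema = {
    "type": "object",
    "properties": {
        "id": {"type": "integer"},
        "name": {"type": ["string", "null"]},
    },
    "required": ["id"],
}
rows = [{"id": 1, "name": None}, {"id": None, "name": "x"}]
filtered = list(filter_schema_violations(rows, schema))
unfiltered = list(filter_schema_violations(rows, schema, enabled=False))
```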

edgar_ramirez_mondragon

10/07/2021, 4:10 PM

Hi @stephen_bailey! Have you looked at stream maps? As for validating every record against a schema and dropping mismatches, would you mind creating an issue?

stephen_bailey

10/07/2021, 4:12 PM

No, I haven't looked at that -- is this a config-level setting?

stephen_bailey

10/07/2021, 4:15 PM

whaaaaaaaaaaaaaaaaaaat

stephen_bailey

10/07/2021, 4:15 PM

this is amazing

stephen_bailey

10/07/2021, 4:22 PM

`stream_maps` definitely would have solved my initial "broken pipe" problem and is a great tool for the end user. I'll make an issue that calls out the idea that maybe we could use `stream_maps` + the catalog `required=True` to do this automatically.

aaronsteers

10/07/2021, 4:26 PM

@stephen_bailey - As @edgar_ramirez_mondragon mentions, Stream Maps should give you filtering capability. Since you own this tap, you could also return `None` from `post_process()` to always filter out certain records based on a condition.

aaronsteers

10/07/2021, 4:27 PM

The behavior of `post_process()` is that you can alter the record or just not return it.

stephen_bailey

10/07/2021, 4:28 PM

Oh, I was thinking that it would return an empty record if I went the `post_process` route. But if I return `None` and that simply omits the record, that would be perfect.

aaronsteers

10/07/2021, 4:29 PM

Yeah - sorry, that might not be 100% clear from the docs. But returning `None` has the effect of filtering out the record.

aaronsteers

10/07/2021, 4:32 PM

Definitely not clear from the docs. I'll open an issue.

stephen_bailey

10/07/2021, 4:55 PM

This works nicely

def post_process(self, row: dict, context: Optional[dict] = None) -> Optional[dict]:
    # Returning None drops the record from the stream.
    if any(row.get(k) is None for k in self.primary_keys):
        return None
    return row
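To illustrate the behavior described above without installing the SDK, here is a small self-contained sketch. `SketchStream` and its `emit` loop are hypothetical stand-ins, not the real `singer_sdk` classes. The loop only mimics the stated behavior that a `None` return value from `post_process()` causes the record to be skipped.

```python
from typing import Iterable, Iterator, List, Optional


class SketchStream:
    """Hypothetical stand-in for an SDK stream; not the real singer_sdk class."""

    primary_keys: List[str] = ["id"]

    def post_process(self, row: dict, context: Optional[dict] = None) -> Optional[dict]:
        # Drop any row with a missing or null primary key.
        if any(row.get(key) is None for key in self.primary_keys):
            return None
        return row

    def emit(self, raw_rows: Iterable[dict]) -> Iterator[dict]:
        """Assumed calling loop: skip rows for which post_process returns None."""
        for raw in raw_rows:
            processed = self.post_process(raw)
            if processed is None:
                continue
            yield processed


# Hypothetical input: one valid row, one null key, one missing key.
emitted = list(SketchStream().emit([{"id": 7}, {"id": None}, {}]))
```

Note the `is None` comparison and the `Optional[dict]` return annotation, which match what the method can actually return.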

aaronsteers

10/07/2021, 7:06 PM

Undocumented feature: filter records from stream by returning `None` from `post_process()` (#233) · Issues · Meltano / Meltano SDK for Singer Taps and Targets · GitLab
