# singer-tap-development
j
Does anyone have documentation or a code example showing how to copy RDS/PostgreSQL column comments over to another target (Snowflake, in our case)?
s
I don't think it's implemented in the Postgres tap or the Snowflake target, but theoretically, if the tap emitted a schema message with
comment: <column_comment>
and the target had logic for adding the comment based on that schema property, then this would work out of the box. We did just add support for descriptions in the Singer SDK, so maybe this is a future opportunity? cc/ @aaronsteers
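The idea above can be sketched as a SCHEMA message. This is a minimal, hypothetical example, assuming the column comment rides along as the JSON Schema `description` property; neither tap-postgres nor target-snowflake implements this today:

```python
import json

# Hypothetical Singer SCHEMA message in which the tap forwards the
# Postgres column comment via the JSON Schema "description" property.
# The placement of the comment is an assumption, not current behavior.
schema_message = {
    "type": "SCHEMA",
    "stream": "users",
    "key_properties": ["id"],
    "schema": {
        "type": "object",
        "properties": {
            "id": {"type": "integer"},
            "email": {
                "type": ["string", "null"],
                # Column comment as it might be copied from pg_description
                "description": '{"tags": ["classification:confidential"]}',
            },
        },
    },
}

# Singer messages are emitted as one JSON object per line on stdout.
print(json.dumps(schema_message))
```

A target that understood this convention could then read the `description` off each property and attach it to the corresponding warehouse column.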
j
@stephen_bailey Thanks for the information, I was thinking along the same lines in case there wasn't something for this already. We'll explore this further! For context: our product engineers are responsible for data sensitivity annotations. These annotations are stored in the Postgres column comments. We're planning to replicate them into our data warehouse so that we can anonymize the data according to that schema.
s
I have been thinking about this problem for a long time, and have annoyed the Meltano team about it before 🙂. It would be great to have a standard vocabulary and implementation so that tap developers can annotate the catalog and data loaders can pass the annotations along in a standard format. I know that Postgres does not have column tags, but are you thinking of using object tags in Snowflake?
Do your Postgres comments look something like
person address. @PII
or some similar annotation to flag tags?
j
It comes from this gem: https://github.com/Hacker0x01/data_classification. We store it in a JSON blob in the comments, like:
```json
{
  "tags": [
    "classification:confidential"
  ]
}
```
I was thinking to simply copy the comments to Snowflake and then parse them in an Airflow DAG or with dbt (although I'm not sure how trivial table reflection is with dbt). We ingest the entire database into Snowflake and then create an anonymized dataset for our product and engineering members to visualize data in Looker. For other pipelines, like machine learning and our "gold" tier data, we use manual code review to keep sensitive data from ending up there (since we're in full control of that).
Early thinking though!
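The parsing step could be a small helper, whether it runs in an Airflow DAG, dbt, or elsewhere. A sketch, assuming the comment arrives in Snowflake verbatim as the JSON blob above (`parse_classification_tags` is a hypothetical name; plain-text or missing comments just yield no tags):

```python
import json
from typing import Optional


def parse_classification_tags(comment: Optional[str]) -> list:
    """Extract classification tags from a column comment holding the
    JSON blob written by the data_classification gem. Returns [] when
    the comment is missing, is not valid JSON (e.g. a plain-text
    comment), or has no "tags" key."""
    if not comment:
        return []
    try:
        payload = json.loads(comment)
    except json.JSONDecodeError:
        return []
    if not isinstance(payload, dict):
        return []
    return payload.get("tags", [])


print(parse_classification_tags('{"tags": ["classification:confidential"]}'))
# → ['classification:confidential']
```

Columns whose tags include something like `classification:confidential` could then be routed through the anonymization logic before landing in the Looker-facing dataset.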
a
Hi, @jobert_abma. Would you mind logging an issue on this topic if we don't have one already, or perhaps a couple? 😉 A few things we can do:
1. Tap developers can annotate their JSON Schema during discovery.
2. Target developers can look for markers in the JSON Schema and use those markers to annotate the target columns.
3. We can optionally start with the `description` property in JSON Schema, since we've recently improved support for it in the SDK, and/or we can look at other types of markers.
4. The tap's catalog could also (optionally) be used to autogenerate sources.yml for dbt. (This one already has an issue, but I don't have the link handy.)
The first and second items seem like the best next steps if we want to get this rolling; 3 and 4 are probably just icing on the cake. Wdyt?
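The target-side piece (item 2) could look something like this: walk the incoming JSON Schema and emit one comment statement per annotated column. This is a sketch only; `column_comment_ddl` is a hypothetical helper, and the exact Snowflake `COMMENT ON COLUMN` syntax plus proper identifier quoting would need to be verified against the real target implementation:

```python
def column_comment_ddl(table: str, schema: dict) -> list:
    """Generate Snowflake comment statements for every column whose
    JSON Schema entry carries a "description". A real target would run
    these through its connection; quoting and identifier escaping would
    need more care than this simple single-quote doubling."""
    statements = []
    for column, spec in schema.get("properties", {}).items():
        description = spec.get("description")
        if description:
            escaped = description.replace("'", "''")
            statements.append(
                f"COMMENT ON COLUMN {table}.{column} IS '{escaped}'"
            )
    return statements
```

For example, a schema with a single annotated `email` column would yield one statement for that column and nothing for the unannotated ones.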