# random
e
I wonder aloud sometimes if the next step for a product like Meltano.. or the data ecosystem.. is building the community in an enterprise around datasets..? What I mean is, building a way to first connect to data.. tap it.. but also share it, convey compliance.. or does this maybe already exist at some level in the open source world
A lot of the next difficulty I am dealing with is the legality and lineage of a dataset.. being able to handle it quickly and easily in terms of governance and discoverability.. I know right now it's maybe starting with the data catalog.. that's a very tabular-data concept, but I'm wondering how I'd build around image data next.. sharing and discovering, say, image datasets in S3
s
I think that's fairly in line with a "data as a product" architecture. So, if you set up a pipe from Salesforce to Snowflake, it's the `db.salesforce.*` tables in Snowflake that I would want to document heavily, tag with lineage, purpose restrictions, etc. There has been discussion around building a `singer-catalog-to-dbt-documentation` utility, and now that taps can have descriptions in them, it might be good to revisit that
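As a rough sketch of what such a utility might look like (names and output layout are illustrative, not the proposed design): read a standard Singer `catalog.json` and emit a dbt `schema.yml`-style sources file, carrying any `description` fields along. Assumes PyYAML is available.

```python
# Hypothetical sketch: Singer catalog.json -> dbt sources YAML,
# so stream/property descriptions flow into dbt documentation.
import json

import yaml  # PyYAML, assumed available


def catalog_to_dbt_sources(catalog_path: str, source_name: str) -> str:
    with open(catalog_path) as f:
        catalog = json.load(f)

    tables = []
    for stream in catalog.get("streams", []):
        props = stream.get("schema", {}).get("properties", {})
        columns = [
            {
                "name": name,
                # Singer schemas may carry a "description"; fall back to empty.
                "description": schema.get("description", ""),
            }
            for name, schema in props.items()
        ]
        tables.append({
            "name": stream.get("stream") or stream.get("tap_stream_id"),
            "columns": columns,
        })

    return yaml.safe_dump(
        {"version": 2, "sources": [{"name": source_name, "tables": tables}]},
        sort_keys=False,
    )


if __name__ == "__main__":
    print(catalog_to_dbt_sources("catalog.json", "salesforce"))
```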
e
I guess from my end, working in a regulated environment, we're almost needing to place the products into an ecosystem that speaks permissions
and that cannot come from a tap alone nor a target.. but maybe I am still coming up to speed with what taps can do
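For reference, Singer catalogs do already carry per-property metadata, which is roughly where such signals could live. In the sketch below, `inclusion` and `selected` are standard Singer metadata keys; `pii` and `purpose` are hypothetical custom tags a governance layer could read.

```python
# One entry from a Singer catalog's "metadata" array (sketch).
entry = {
    "breadcrumb": ["properties", "author_name"],
    "metadata": {
        "inclusion": "available",        # standard Singer key
        "selected": True,                # standard Singer key
        "pii": True,                     # hypothetical governance tag
        "purpose": "support-analytics",  # hypothetical governance tag
    },
}
```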
a
@emcp - For governance and monitoring, you might consider committing your catalogs to source control. As an alternative, though, you could also use the output of something like `meltano select <TAP_ID> --list --all` to create a more grokkable/auditable artifact for code reviews and drift detection. I think something like this could make a nice git artifact 🙂:
```
Enabled patterns:
    tags.*
    commits.id
    commits.project_id
    commits.created_at
    commits.author_name
    commits.message
    !*.*_url

Selected attributes:
    [selected ] commits.author_name
    [selected ] commits.created_at
    [automatic] commits.id
    [selected ] commits.message
    [selected ] commits.project_id
    [automatic] tags.commit_id
    [selected ] tags.message
    [automatic] tags.name
    [automatic] tags.project_id
    [selected ] tags.target
```
wdyt?
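As one way to operationalize that, here's a minimal drift-check sketch that could run in CI: regenerate the artifact and fail if it no longer matches the committed copy. The tap name and artifact path are illustrative.

```python
# Sketch of a selection drift check for CI.
import pathlib
import subprocess
import sys

TAP = "tap-gitlab"  # illustrative tap name
ARTIFACT = pathlib.Path("selected_attributes.txt")  # hypothetical committed artifact

# Regenerate the selection listing from the current project state.
result = subprocess.run(
    ["meltano", "select", TAP, "--list", "--all"],
    capture_output=True,
    text=True,
    check=True,
)

committed = ARTIFACT.read_text() if ARTIFACT.exists() else ""
if result.stdout != committed:
    print("Selection drift detected; regenerate the artifact and review the diff.")
    sys.exit(1)
```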
...building the community in an enterprise around datasets...
I do think this is important in the industry. When I was at Slalom, we treated our data project as an open source project that everyone in the company had read/fork access to, and anyone could submit PRs. We published something very much like the above so everyone in the company could see which fields and tables were available, which we currently imported into our project and which were "left on the table" (pun intended!) 😅
Then if anyone wanted us to add or remove something, we didn't need to spend engineering hours on finding it. We just asked the product manager, or analyst, or whoever: "Here's everything the source declares. Let us know if you need anything we're not pulling."
e
problem with that model of access is, when you work with GDPR/PII-sensitive data, a lot of the friction in an enterprise becomes the legal or risk dept.. I'm thinking about how Meltano can interface with this area
a
Can you say more about how you see the GDPR/PII topic come into this? When we started openly publishing the include/exclude lists in my past life, the Security team loved it. We basically just included them in the PR conversation of "here's what's changing from excluded to included", and we'd schedule a monthly review with Security and the Business stakeholders to let them hash it out. 😅
We called it the "DUR" - Data Use Review 🙂
e
for the financial institution I work at, I want to find a way that governance of data or datasets can be managed a bit better.. it seems Meltano tap configuration is one piece.. another is something like Immuta (I see some employees here and there working in the Immuta Slack, which I take as a good sign). I need to learn more about this sector.. but the resistance from our internal teams is that data scientists want R&R, or Rock & Roll.. all data.. all the time.. and the compliance dept basically needs to know data was masked.. that it's being used for a proper business purpose.. and the traceability to know, say, that Meltano is helping feed that data lineage story
s
nice @emcp! I'm a big believer in exactly what you're describing: a separation of duties between "compliance policy" and "data processing" workflows. For example, an approach where Meltano pulls in as much data as possible and exposes all the metadata necessary to implement compliance policies, but then compliance can actually go in and implement or update compliance policy as needed. The metadata is the interface layer. I've actually written about this before; I call it the separation of "value" and "governance" transforms: https://www.immuta.com/articles/separating-value-governance-transformations-with-dbt-immuta/
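To make the "metadata is the interface" idea concrete, here's a minimal sketch (not Immuta's actual implementation): the value pipeline pulls everything, and a separate governance step masks any field whose catalog metadata carries a hypothetical `pii` tag. All names are illustrative.

```python
# Sketch: a governance transform driven purely by metadata tags.
import hashlib


def governance_transform(record: dict, field_tags: dict) -> dict:
    """Mask fields tagged "pii" (hypothetical tag); pass everything else through."""
    masked = {}
    for field, value in record.items():
        if "pii" in field_tags.get(field, set()):
            # Deterministic hash keeps joins possible without exposing raw values.
            masked[field] = hashlib.sha256(str(value).encode()).hexdigest()[:12]
        else:
            masked[field] = value
    return masked


record = {"author_name": "Jane Doe", "message": "fix: typo"}
tags = {"author_name": {"pii"}}  # would come from the catalog metadata
print(governance_transform(record, tags))
# -> {'author_name': '<12-char hash>', 'message': 'fix: typo'}
```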
i'll note that an interesting trend is that we're actually seeing the big vendors start implementing things like this (Google BigQuery policy tags, Snowflake access policies). Immuta's Snowflake integration, for example, now leverages the built-in `row access policy` and `masking policies` and will basically create/apply these things based on catalog metadata. But the challenge of getting quality, up-to-date metadata about all your data sources can't be overstated. That's why I'm such a big advocate of having that quality information live as far upstream (i.e., in Meltano taps) as possible
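As an illustration of that pattern, here's a sketch that generates Snowflake masking-policy DDL from PII-tagged catalog metadata. `CREATE MASKING POLICY` and `ALTER TABLE ... SET MASKING POLICY` are real Snowflake statements; the metadata shape, policy body, and role name are assumptions, not how any particular vendor does it.

```python
# Sketch: emit Snowflake masking-policy DDL for columns tagged "pii".
def masking_policy_ddl(table: str, column: str, col_type: str = "string") -> str:
    policy = f"{table.replace('.', '_')}_{column}_mask"  # illustrative naming scheme
    return (
        f"CREATE MASKING POLICY IF NOT EXISTS {policy} AS "
        f"(val {col_type}) RETURNS {col_type} ->\n"
        f"  CASE WHEN CURRENT_ROLE() IN ('COMPLIANCE') THEN val\n"
        f"       ELSE '***MASKED***' END;\n"
        f"ALTER TABLE {table} MODIFY COLUMN {column} "
        f"SET MASKING POLICY {policy};"
    )


# In practice this list would be derived from catalog metadata tags.
pii_columns = [("db.salesforce.contacts", "email")]
for table, column in pii_columns:
    print(masking_policy_ddl(table, column))
```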