# random
e
I wonder aloud sometimes if the next step for a product like Meltano.. or the data ecosystem.. is building the community in an enterprise around datasets..? What I mean is, building a way to first connect to data.. tap it.. but also share it, convey compliance.. or does this maybe already exist at some level in the open source world
A lot of the next difficulty I am dealing with is the legality and lineage of a dataset.. being able to handle it quickly and easily in terms of governance and discoverability.. I know right now it's maybe starting with the data catalog.. that's a very tabular-data concept, but I'm wondering how I'd build around image data next.. sharing and discovering, say, image datasets in S3
s
I think that's fairly in line with a "data as a product" architecture. So, if you set up a pipe from Salesforce to Snowflake, it's the `db.salesforce.*` tables in Snowflake that I would want to document heavily, tag with lineage, purpose restrictions, etc. There has been discussion around building a `singer-catalog-to-dbt-documentation` utility, and now that taps can have descriptions in them, it might be good to revisit that
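As a rough sketch of what such a utility might look like (names and output layout are illustrative, not the proposed design): read a standard Singer `catalog.json` and emit a dbt `schema.yml`-style sources file, carrying any `description` fields along. Assumes PyYAML is available.

```python
# Hypothetical sketch: Singer catalog.json -> dbt sources YAML,
# so stream/property descriptions flow into dbt documentation.
import json

import yaml  # PyYAML, assumed available


def catalog_to_dbt_sources(catalog_path: str, source_name: str) -> str:
    with open(catalog_path) as f:
        catalog = json.load(f)

    tables = []
    for stream in catalog.get("streams", []):
        props = stream.get("schema", {}).get("properties", {})
        columns = [
            {
                "name": name,
                # Singer schemas may carry a "description"; fall back to empty.
                "description": schema.get("description", ""),
            }
            for name, schema in props.items()
        ]
        tables.append({
            "name": stream.get("stream") or stream.get("tap_stream_id"),
            "columns": columns,
        })

    return yaml.safe_dump(
        {"version": 2, "sources": [{"name": source_name, "tables": tables}]},
        sort_keys=False,
    )


if __name__ == "__main__":
    print(catalog_to_dbt_sources("catalog.json", "salesforce"))
```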
e
I guess from my end, working in a regulated environment, we're almost needing to place the products into an ecosystem that speaks permissions
and that cannot come from a tap alone nor a target.. but maybe I am still coming up to speed with what taps can do
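For reference, Singer catalogs do already carry per-property metadata, which is roughly where such signals could live. In the sketch below, `inclusion` and `selected` are standard Singer metadata keys; `pii` and `purpose` are hypothetical custom tags a governance layer could read.

```python
# One entry from a Singer catalog's "metadata" array (sketch).
entry = {
    "breadcrumb": ["properties", "author_name"],
    "metadata": {
        "inclusion": "available",        # standard Singer key
        "selected": True,                # standard Singer key
        "pii": True,                     # hypothetical governance tag
        "purpose": "support-analytics",  # hypothetical governance tag
    },
}
```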
a
@emcp - For governance and monitoring, you might consider committing your catalogs to source control. As an alternative, though, you could also use the output of something like `meltano select <TAP_ID> --list --all` to create a more grokkable/auditable artifact for code reviews and drift detection. I think something like this could make a nice git artifact 🙂:
```
Enabled patterns:
    tags.*
    commits.id
    commits.project_id
    commits.created_at
    commits.author_name
    commits.message
    !*.*_url

Selected attributes:
    [selected ] commits.author_name
    [selected ] commits.created_at
    [automatic] commits.id
    [selected ] commits.message
    [selected ] commits.project_id
    [automatic] tags.commit_id
    [selected ] tags.message
    [automatic] tags.name
    [automatic] tags.project_id
    [selected ] tags.target
```
wdyt?
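As one way to operationalize that, here's a minimal drift-check sketch that could run in CI: regenerate the artifact and fail if it no longer matches the committed copy. The tap name and artifact path are illustrative.

```python
# Sketch of a selection drift check for CI.
import pathlib
import subprocess
import sys

TAP = "tap-gitlab"  # illustrative tap name
ARTIFACT = pathlib.Path("selected_attributes.txt")  # hypothetical committed artifact

# Regenerate the selection listing from the current project state.
result = subprocess.run(
    ["meltano", "select", TAP, "--list", "--all"],
    capture_output=True,
    text=True,
    check=True,
)

committed = ARTIFACT.read_text() if ARTIFACT.exists() else ""
if result.stdout != committed:
    print("Selection drift detected; regenerate the artifact and review the diff.")
    sys.exit(1)
```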
...building the community in an enterprise around datasets...
I do think this is important in the industry. When I was at Slalom, we treated our data project as an open source project that everyone in the company had read/fork access to, and anyone could submit PRs. We published something very much like the above so everyone in the company could see which fields and tables were available, which we currently imported into our project and which were "left on the table" (pun intended!) 😅
Then if anyone wanted us to add or remove something, we didn't need to spend engineering hours on finding it. We just asked the product manager, or analyst, or whoever: "Here's everything the source declares. Let us know if you need anything we're not pulling."
e
problem with that model of access is, when you work with GDPR/PII-sensitive data, a lot of the friction in an enterprise becomes the legal or risk dept.. I'm thinking about how Meltano can interface with this area
a
Can you say more about how you see the GDPR/PII topic come into this? When we started openly publishing the include/exclude lists in my past life, the Security team loved it. We basically just included them in the PR conversation of "here's what's changing from excluded to included", and we'd schedule a monthly review with Security and the Business stakeholders to let them hash it out. 😅
We called it the "DUR" - Data Use Review 🙂
e
for the financial institution I work at, I want to find a way that governance of data or datasets can be managed a bit better.. it seems Meltano tap configuration is one piece.. another is something like Immuta (I see some employees here and there working in the Immuta Slack, which I take as a good sign). I need to learn more about this sector.. but the resistance from our internal teams is that data scientists want R&R, or Rock & Roll.. all data.. all the time.. and the compliance dept basically needs to know data was masked.. that it's being used for a proper business purpose.. and the traceability to know, say, that Meltano is helping feed that data lineage story
s
nice @emcp! I'm a big believer in exactly what you're describing: a separation of duties between "compliance policy" and "data processing" workflows. For example, an approach where Meltano pulls in as much data as possible and exposes all the metadata necessary to implement compliance policies, but then compliance can actually go in and implement or update compliance policy as needed. The metadata is the interface layer. I've actually written about this before; I call it the separation of "value" and "governance" transforms: https://www.immuta.com/articles/separating-value-governance-transformations-with-dbt-immuta/
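To make the "metadata is the interface" idea concrete, here's a minimal sketch (not Immuta's actual implementation): the value pipeline pulls everything, and a separate governance step masks any field whose catalog metadata carries a hypothetical `pii` tag. All names are illustrative.

```python
# Sketch: a governance transform driven purely by metadata tags.
import hashlib


def governance_transform(record: dict, field_tags: dict) -> dict:
    """Mask fields tagged "pii" (hypothetical tag); pass everything else through."""
    masked = {}
    for field, value in record.items():
        if "pii" in field_tags.get(field, set()):
            # Deterministic hash keeps joins possible without exposing raw values.
            masked[field] = hashlib.sha256(str(value).encode()).hexdigest()[:12]
        else:
            masked[field] = value
    return masked


record = {"author_name": "Jane Doe", "message": "fix: typo"}
tags = {"author_name": {"pii"}}  # would come from the catalog metadata
print(governance_transform(record, tags))
# -> {'author_name': '<12-char hash>', 'message': 'fix: typo'}
```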
i'll note that an interesting trend is that we're actually seeing the big vendors start implementing things like this (Google BigQuery policy tags, Snowflake access policies). Immuta's Snowflake integration, for example, now leverages the built-in `row access policy` and `masking policies` and will basically create/apply these things based on catalog metadata. But the challenge of getting quality, up-to-date metadata about all your data sources can't be overstated. That's why I'm such a big advocate of having that quality information live as far upstream (i.e., in Meltano taps) as possible
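As an illustration of that pattern, here's a sketch that generates Snowflake masking-policy DDL from PII-tagged catalog metadata. `CREATE MASKING POLICY` and `ALTER TABLE ... SET MASKING POLICY` are real Snowflake statements; the metadata shape, policy body, and role name are assumptions, not how any particular vendor does it.

```python
# Sketch: emit Snowflake masking-policy DDL for columns tagged "pii".
def masking_policy_ddl(table: str, column: str, col_type: str = "string") -> str:
    policy = f"{table.replace('.', '_')}_{column}_mask"  # illustrative naming scheme
    return (
        f"CREATE MASKING POLICY IF NOT EXISTS {policy} AS "
        f"(val {col_type}) RETURNS {col_type} ->\n"
        f"  CASE WHEN CURRENT_ROLE() IN ('COMPLIANCE') THEN val\n"
        f"       ELSE '***MASKED***' END;\n"
        f"ALTER TABLE {table} MODIFY COLUMN {column} "
        f"SET MASKING POLICY {policy};"
    )


# In practice this list would be derived from catalog metadata tags.
pii_columns = [("db.salesforce.contacts", "email")]
for table, column in pii_columns:
    print(masking_policy_ddl(table, column))
```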