adam_rudd
09/24/2021, 11:35 AM
{ "baseid": 001, "datestamp": "20210924T00:00:00Z", "users": [<user array here>], "groups": [<groups array here>] }
would become:
//company entity (root)
{ "baseid": 001, "datestamp": "20210924T00:00:00Z" } --> s3://<my_s3_bucket>/company/<timestamp>_company.json (or csv)
//users entity
{ "company_id": 001, <other user data here for each in array> } --> s3://<my_s3_bucket>/company/<timestamp>_users.json (or csv)
//groups entity
{ "company_id": 001, <other user data here for each in array> } --> s3://<my_s3_bucket>/company/<timestamp>_groups.json (or csv)
My understanding is there may be 2 approaches we could take to get this outcome:
• Child streams which do the appropriate transformations + sample specific values from the schema. This looks promising, but I’m seeing some challenges unnesting (e.g. $.users[*].<field>) without a specific transformation in the get_records() function on each child stream (see the sketch just after this message).
• Using inline stream maps, which I confess I’m a bit in the dark on, since I haven’t found an example project I can pick apart where it’s working. From the docs it looks like they might be more appropriate for transformation, but I’m unsure how I’d split out the streams so I can use an out-of-the-box sink like jsonl.
TL;DR: If I’m looking at splitting a parent dataset into multiple streams which are sent to multiple locations, should I be looking at using child streams, inline stream maps, both, or something else entirely?
Thanks!
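(For reference, a rough sketch of the child-stream approach from the first bullet above, assuming the Meltano Singer SDK; the payload is hard-coded and every class, stream, and field name here is illustrative, not the actual tap.)

```python
"""Rough sketch: one company payload split into company/users/groups streams.

Assumes the Meltano Singer SDK (singer_sdk). The payload is hard-coded and
all names here are illustrative, not the real tap.
"""
from typing import Iterable, Optional

from singer_sdk import Stream
from singer_sdk import typing as th


class CompanyStream(Stream):
    """Parent stream: one record per company payload."""

    name = "company"
    primary_keys = ["baseid"]
    # users/groups are intentionally left out of the parent schema so they
    # don't land in the company entity; the child streams pick them up.
    schema = th.PropertiesList(
        th.Property("baseid", th.IntegerType),
        th.Property("datestamp", th.DateTimeType),
    ).to_dict()

    def get_records(self, context: Optional[dict]) -> Iterable[dict]:
        # A real tap would call the API here; hard-coded for the sketch.
        yield {
            "baseid": 1,
            "datestamp": "2021-09-24T00:00:00Z",
            "users": [{"name": "alice"}, {"name": "bob"}],
            "groups": [{"name": "admins"}],
        }

    def get_child_context(self, record: dict, context: Optional[dict]) -> dict:
        # Hand each child stream the parent key plus the nested arrays.
        return {
            "company_id": record["baseid"],
            "users": record.get("users", []),
            "groups": record.get("groups", []),
        }


class UsersStream(Stream):
    """Child stream: one record per item in the parent's `users` array."""

    name = "users"
    parent_stream_type = CompanyStream
    primary_keys = ["company_id", "name"]
    schema = th.PropertiesList(
        th.Property("company_id", th.IntegerType),
        th.Property("name", th.StringType),
    ).to_dict()

    def get_records(self, context: Optional[dict]) -> Iterable[dict]:
        # Combine each array item with the parent key from the context.
        for user in (context or {}).get("users", []):
            yield {"company_id": context["company_id"], **user}


# A GroupsStream would look the same, iterating context["groups"]; the target
# then lands each stream (company/users/groups) in its own location.
```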
visch
09/24/2021, 1:06 PM

visch
09/24/2021, 1:07 PM

adam_rudd
09/24/2021, 9:10 PM

visch
09/25/2021, 12:27 AM

adam_rudd
10/01/2021, 3:29 AM

adam_rudd
10/01/2021, 3:30 AM
```
Tap parent stream -> 4 child streams ---> target for each {stream} write to location with specific path
```
visch
10/01/2021, 2:27 PM
> generating new GUIDs for content which needs it
You're getting into the land of transformations, which we try to keep until after the load into the DB. This is almost 100% something we wouldn't want a tap/target to do.
> primary & foreign keys need to be confirmed
The transformation layer (DBT) is where we try to keep this stuff. It starts to get a bit fuzzy here, as you technically have the primary/foreign key information in your tap (normally) that you'd at least want to provide to the target DB.
> Does the business logic required to keep data joinable if required + optimal for downstream tools (like Tableau Cloud)
- I'd love to understand this point a bit more! Could you expand on what you mean here, maybe this is tied to GUIDs? A data example would probably help me grok it!

visch
10/01/2021, 2:30 PM

edgar_ramirez_mondragon
10/01/2021, 3:05 PM
Have you looked at the get_child_context() method? That'd allow you to pass the nested array as well as any primary keys from the parent record as the context for a child stream, which would then combine each item in the array with the right primary keys to produce the child records.
adam_rudd
10/03/2021, 1:12 AM
> generating new GUIDs for content which needs it
- 100% agree with your point @visch
I had a play with some different scenarios, and as soon as a child table in the DB needs to be rerun or changes, any GUIDs generated as part of the source could be quite dicey.
Agree that any GUID generation should be part of a transform step, and it should live in the target DB, which should handle WID generation etc.
Action here: Tap is pulling in known parent keys only. Any time there is a lack of PK for any child stream, we need to look into the scenario manually.
> primary & foreign keys need to be confirmed
We were motivated to investigate this mostly because we don't have access to DBT just yet as part of our POC, and because our agreed perspective of a data producer is one who delivers immediately usable data. I personally really liked the idea of a tap being able to provide some level of customisation as to how the data is presented, because it's nice and modular.
In the specific cases we looked into (Currently we’re looking at Genesys Cloud and are building off the existing purecloud singer tap) there is quite a lot of documentation around how specific payloads should be used.
To deal with that, in our child streams we're defining the PK & FKs in the stream definition, and then we just tack them onto the unnested or cross-joined entry of the child stream.
I'm keen to see how this works with DBT as part of the pipeline. I really liked the look of the GitLab example with models included, though it appeared the models listed were more for transformation for reporting rather than transformation for relational database compatibility.
> Does the business logic required to keep data joinable if required + optimal for downstream tools (like Tableau Cloud)
Happy to expand on this.
_I'm not sure I've been clear by using the term business logic. Basically I just mean identifying fields we care about and pulling them in so the entity created is valuable and joinable. If a child array we're going to turn into an entity doesn't have a parent key to link to, we work out what it is and push it in as part of the child stream get_records() step._
The guiding requirement we've been using when approaching this tap creation has been relational database compatibility. The data we're extracting is deeply nested, and the tools and reporting platforms in place in the company don't like working with nested data.
In particular, in my experience, Tableau Cloud basically falls over when you try giving it anything that isn't a flat, relational (i.e. CSV-style) table. Because we're reporting with Tableau, this is the baseline requirement.
So with that in mind, we knew we wanted to transform the data into its relational entities, and we wanted to ensure all entities created have a way to join back in.
@edgar_ramirez_mondragon Thanks! I'll look into get_child_context() some more. We're using it, but I'm basically pulling in the entire record so we have the most flexibility, and because we're pushing to 4-6 (sorry, brain fart - can't remember the specific number) child streams.
adam_rudd
10/03/2021, 1:40 AM
{
  "id": "ID001",
  "address": {
    "street_address": "123 Fake street",
    "suburb": "Fakeland",
    "state": "VIC",
    "country": "Australia"
  },
  "inspection_times": [
    {
      "id": "IID001",
      "description": "First inspection date on Sunday"
    },
    {
      "id": "IID002",
      "description": "Second inspection date on Tuesday"
    },
    {
      "id": "IID003",
      "description": "Final inspection date on Friday"
    }
  ]
}
should become:
// `properties` stream (json line delimited)
{
  "id": "ID001",
  "address_street_address": "123 Fake street",
  "address_suburb": "Fakeland",
  "address_state": "VIC",
  "address_country": "Australia"
}

// `inspection_times` stream (formatted for readability, but assume json line delimited)
{
  "properties_id": "ID001",
  "id": "IID001",
  "description": "First inspection date on Sunday"
},
{
  "properties_id": "ID001",
  "id": "IID002",
  "description": "Second inspection date on Tuesday"
},
{
  "properties_id": "ID001",
  "id": "IID003",
  "description": "Final inspection date on Friday"
}
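(To make that mapping concrete, a plain-Python sketch of the split, independent of any SDK; the function name and the hard-coded sample are made up for illustration:)

```python
"""Plain-Python sketch of the properties / inspection_times split above.
Function and variable names are made up for illustration."""
import json
from typing import List, Tuple


def split_property(record: dict) -> Tuple[dict, List[dict]]:
    """Return (properties_row, inspection_times_rows) for one raw record."""
    # Flatten the nested address object by prefixing its keys.
    properties_row = {
        "id": record["id"],
        **{f"address_{k}": v for k, v in record.get("address", {}).items()},
    }
    # Explode the array, carrying the parent id as a foreign key.
    inspection_rows = [
        {"properties_id": record["id"], **item}
        for item in record.get("inspection_times", [])
    ]
    return properties_row, inspection_rows


if __name__ == "__main__":
    raw = {
        "id": "ID001",
        "address": {
            "street_address": "123 Fake street",
            "suburb": "Fakeland",
            "state": "VIC",
            "country": "Australia",
        },
        "inspection_times": [
            {"id": "IID001", "description": "First inspection date on Sunday"},
        ],
    }
    properties_row, inspection_rows = split_property(raw)
    print(json.dumps(properties_row))   # the `properties` stream record
    for row in inspection_rows:
        print(json.dumps(row))          # the `inspection_times` records
```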
visch
10/03/2021, 3:30 PM
I'd load this:
{
  "id": "ID001",
  "address": {
    "street_address": "123 Fake street",
    "suburb": "Fakeland",
    "state": "VIC",
    "country": "Australia"
  },
  "inspection_times": [
    {
      "id": "IID001",
      "description": "First inspection date on Sunday"
    },
    {
      "id": "IID002",
      "description": "Second inspection date on Tuesday"
    },
    {
      "id": "IID003",
      "description": "Final inspection date on Friday"
    }
  ]
}
into your DB as is.
Then in your transform layer (to be more clear than saying DBT, I literally mean SQL) you can run SQL against this data set to get it into the format you listed next.
The point is that you need a transform layer somewhere. Yes, you could put the transform in the extract portion or the loader portion, but I prefer to push that to the target; then when that requirement changes for some reason there are no changes on the extract/load portion, it's just in the DB layer, which tends to change based on business requirements more anyway.
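(A hedged sketch of what that post-load SQL could look like, assuming the raw record landed in a Postgres table called raw_properties with a jsonb column called raw; the table, column, and connection details are made up, and in a DBT project the two SELECTs would simply be models:)

```python
"""Sketch: flatten the raw JSON after load, in SQL.

Assumes Postgres with the raw record in raw_properties.raw (jsonb); every
name here is made up. In DBT these SELECTs would just be models."""
import os

import psycopg2

PROPERTIES_SQL = """
select
    raw->>'id'                       as id,
    raw#>>'{address,street_address}' as address_street_address,
    raw#>>'{address,suburb}'         as address_suburb,
    raw#>>'{address,state}'          as address_state,
    raw#>>'{address,country}'        as address_country
from raw_properties
"""

INSPECTION_TIMES_SQL = """
select
    p.raw->>'id'             as properties_id,
    it.value->>'id'          as id,
    it.value->>'description' as description
from raw_properties p
cross join lateral jsonb_array_elements(p.raw->'inspection_times') as it
"""

if __name__ == "__main__":
    # e.g. DATABASE_URL=postgresql://user:pass@host:5432/dbname (made up)
    with psycopg2.connect(os.environ["DATABASE_URL"]) as conn:
        with conn.cursor() as cur:
            cur.execute(PROPERTIES_SQL)
            print(cur.fetchall())  # flattened `properties` rows
            cur.execute(INSPECTION_TIMES_SQL)
            print(cur.fetchall())  # exploded `inspection_times` rows
```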
https://gitlab.com/gitlab-data/analytics/-/blob/master/transform/snowflake-dbt/models/sources/discourse/daily_engaged_users.sql
is one example of many which may help give you an idea!

visch
10/03/2021, 7:21 PM