# best-practices
a
Hi there, we're experimenting with Meltano as a way to consolidate a bunch of bespoke ETL adapters our team has built in the past. Currently I'm looking at splitting a JSON document up into multiple relational 'tables', which we want to write out to separate S3 paths (which end up in our warehouses). Simply unnesting everything and putting it into a single table is not something we want to pursue. I've read about stream mappings and parent-child jobs, and I'm not sure what the recommended approach here would be. We'll be splitting out every array into its own table and pulling a parent key in so all child entities know how to map back to the parent. E.g. this "company" table:
```
{ "baseid": 001, "datestamp": "20210924T00:00:00Z", "users": [<user array here>], "groups": [<groups array here>] }
```
would become:
```
// company entity (root)
{ "baseid": 001, "datestamp": "20210924T00:00:00Z" } --> s3://<my_s3_bucket>/company/<timestamp>_company.json (or csv)

// users entity
{ "company_id": 001, <other user data here for each in array> } --> s3://<my_s3_bucket>/company/<timestamp>_users.json (or csv)

// groups entity
{ "company_id": 001, <other group data here for each in array> } --> s3://<my_s3_bucket>/company/<timestamp>_groups.json (or csv)
```
My understanding is there may be two approaches we could take to get this outcome:
• Child streams which do the appropriate transformations + sample specific values from the schema. This looks promising, but I'm seeing some challenges unnesting, e.g. `$.users[*].<field>`, without a specific transformation in the `get_records()` function on each child stream.
• Using inline stream maps, which I confess I'm a bit in the dark on, since I haven't found an example project I can pick apart where it's working. From the docs it looks like it might be more appropriate for transformation, but I'm unsure how I'd split out the streams so I can use an out-of-the-box sink like `json-ld`.
TL;DR: If I'm looking at splitting a parent dataset into multiple streams which are sent to multiple locations, should I be looking at using child streams, inline stream maps, both, or something else entirely? Thanks!
v
You want to do this dynamically, right? Sounds like a new target-s3, or maybe an expansion on target-csv, is what you're after. You can actually bypass a bunch of work if your taps can already get the data; it's fairly common for targets to denest the data for you.
Take a look at how target-postgres / target-snowflake / target-bigquery do this. I think some may use S3 as an intermediary anyway (Snowflake?). Maybe starting there would be helpful.
a
Thanks Derek. Sounds like I should be doing this in the target then.
v
Yeah, denesting tends to happen there. I'm not super fluent in denesting as of now! Eventually 😄
a
So I had a look into target design, and I wasn't sure it felt like the right domain for anything beyond automated denesting. Because we're pulling parent IDs and generating new GUIDs for content which needs it, this really feels like a feature the tap should offer, or which should sit somewhere tightly coupled to the expected data source. The thinking is mostly based around the unpredictability of which entities would be pulled out: if a JSON object has, say, 4 arrays which we want to pull out into new 'tables' which go into bespoke buckets or RDS tables, primary & foreign keys need to be confirmed, as the data must be joinable to be useful. For now we've opted to keep the relational entities for this specific data (Genesys Cloud data) within the tap, and to provide the option to choose how the presented data should be passed to the target. This is how we're approaching the Meltano features:
Tap:
• Provides the source of truth for the source
• Does the business logic required to keep data joinable if required + optimal for downstream tools (like Tableau Cloud)
• Provides schema info, metadata, etc.
• Outputs predictable datasets from the source
Target:
• Expects data of a specific type
• Takes in 1 or more streams
• Can denest or not; if denesting, it'll be simple and all associated keys are present
• Does all the authentication and writing to the expected single location
So our flow is:
```
Tap parent stream -> 4 child streams ---> target: for each {stream}, write to location with specific path
```
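To make that last step concrete, here's a hypothetical sketch of per-stream routing on the target side. The bucket name, key layout, and the `write_stream` helper are assumptions for illustration, not an existing target implementation:

```python
# Write each incoming stream to its own S3 prefix as JSON Lines.
import json
from datetime import datetime, timezone

import boto3

BUCKET = "my_s3_bucket"  # placeholder bucket name
s3 = boto3.client("s3")


def write_stream(stream_name: str, records: list[dict]) -> str:
    """Write one stream's records to s3://BUCKET/<stream>/<timestamp>_<stream>.jsonl."""
    timestamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    key = f"{stream_name}/{timestamp}_{stream_name}.jsonl"
    body = "\n".join(json.dumps(record) for record in records)
    s3.put_object(Bucket=BUCKET, Key=key, Body=body.encode("utf-8"))
    return f"s3://{BUCKET}/{key}"
```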
v
Most of this sounds good! I have a few things I'll comment on (someone else could take a stab too!)
> generating new GUIDs for content which needs it
You're getting into the land of transformations, which we try to keep after the load into the DB. This is almost 100% something we wouldn't want a tap/target to do.
primary & foreign keys need to be confirmed
The transformation layer (dbt) is where we try to keep this stuff. It starts to get a bit fuzzy here, as you technically have the primary/foreign key information in your tap (normally) that you'd at least want to provide to the target DB.
> Does the business logic required to keep data joinable if required + optimal for downstream tools (like Tableau Cloud)
I'd love to understand this point a bit more! Could you expand on what you mean here? Maybe this is tied to GUIDs? A data example would probably help me grok it!
I reread your first question and it looks like I need to take a deeper dive. Again, if someone else wants to hop in here, that'd be great; this is a lot!
e
@adam_rudd It looks like child streams are indeed the way to solve your problem here. Did you take a look at the `get_child_context()` method? That would allow you to pass the nested array, as well as any primary keys from the parent record, as the context for a child stream, which would then combine each item in the array with the right primary keys to produce the child records.
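For reference, here is a minimal, hypothetical sketch of that pattern using the Meltano Singer SDK (`singer_sdk`). The stream names, field names, and hard-coded sample record are illustrative assumptions, not the real Genesys Cloud tap:

```python
# Parent/child streams: the parent hands the nested arrays and its primary key
# to the children via get_child_context().
import typing as t

from singer_sdk import Stream
from singer_sdk import typing as th


class CompaniesStream(Stream):
    """Parent stream: emits only the flat root record."""

    name = "companies"
    primary_keys = ["baseid"]
    # Only the flat parent fields are declared; the nested arrays are passed
    # to the child streams through get_child_context() instead.
    schema = th.PropertiesList(
        th.Property("baseid", th.StringType),
        th.Property("datestamp", th.DateTimeType),
    ).to_dict()

    def get_records(self, context: t.Optional[dict]) -> t.Iterable[dict]:
        # In a real tap this record would come from the API.
        yield {
            "baseid": "001",
            "datestamp": "20210924T00:00:00Z",
            "users": [{"id": "U1", "name": "Alice"}],
            "groups": [{"id": "G1", "name": "Admins"}],
        }

    def get_child_context(self, record: dict, context: t.Optional[dict]) -> dict:
        # Whatever is returned here is handed to each child stream as `context`.
        return {
            "company_id": record["baseid"],
            "users": record.get("users", []),
            "groups": record.get("groups", []),
        }


class UsersStream(Stream):
    """Child stream: one record per item in the parent's `users` array."""

    name = "users"
    parent_stream_type = CompaniesStream
    primary_keys = ["id"]
    schema = th.PropertiesList(
        th.Property("id", th.StringType),
        th.Property("name", th.StringType),
        th.Property("company_id", th.StringType),
    ).to_dict()

    def get_records(self, context: t.Optional[dict]) -> t.Iterable[dict]:
        # Stamp each child record with the parent key so it stays joinable.
        for user in (context or {}).get("users", []):
            yield {"company_id": context["company_id"], **user}
```

A `GroupsStream` would follow the same shape, reading `context["groups"]` instead.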
a
Hey guys, thanks for the excellent thought-provoking points from all of you!
> generating new GUIDs for content which needs it
100% agree with your point @visch. I had a play with some different scenarios, and as soon as a child table in the DB needs to be rerun or changes, any GUIDs generated as part of the source could be quite dicey. Agree that any GUID generation should be part of a transform step, and it should be in the target DB, which should handle WID generation etc. Action here: the tap is pulling in known parent keys only. Any time a child stream lacks a PK, we need to look into the scenario manually.
primary & foreign keys need to be confirmed
We were motivated to investigate this mostly because we don't have access to dbt just yet as part of our POC, and because our agreed perspective of a _data producer_ is one who delivers immediately usable data. I personally really liked the idea of a tap being able to provide some level of customisation as to how the data is presented, because it's nice and modular. In the specific cases we looked into (currently we're looking at Genesys Cloud and are building off the existing purecloud Singer tap), there is quite a lot of documentation around how specific payloads should be used. To deal with that, we're defining the PK & FKs in each child stream's definition, then we just tack them onto the unnested or cross-joined entry of the child stream. I'm keen to see how this works with dbt as part of the pipeline. I really liked the look of the GitLab example with models included, though it appeared like the models listed were more for transformation for reporting rather than transformation for relational database compatibility.
> Does the business logic required to keep data joinable if required + optimal for downstream tools (like Tableau Cloud)
Happy to expand on this. _I'm not sure I've been clear by using the term business logic. Basically I just mean identifying fields we care about and pulling them in, so the entity created is valuable and joinable. If a child array we're going to turn into an entity doesn't have a parent key to link to, we work out what it is and push it in as part of the child stream's `get_records()` step._ The guiding requirement we've been using when approaching this tap creation has been relational database compatibility. The data we're extracting is highly structured (deeply nested), and the tools and reporting platforms in place in the company don't like working with nested data. In particular, in my experience, Tableau Cloud basically falls over when you try giving it anything that isn't a flat, relational (i.e. CSV-like) table. Because we're reporting with Tableau, this is the baseline requirement. So with that in mind, we knew we wanted to transform the data into its relational entities, and we wanted to ensure all entities created have a way to join back in. @edgar_ramirez_mondragon Thanks! I'll look into `get_child_context()` some more. We're using it, but I'm basically pulling in the entire record so we have the most flexibility, and because we're pushing to 4-6 (sorry, brain fart; can't remember the specific number) child streams.
In terms of sample data, I'll see if I can anonymise some extracts to demo it. As a stopgap, I've included a dummy set of data which mostly illustrates the scenario:
```
{
  "id": "ID001",
  "address": {
    "street_address": "123 Fake street",
    "suburb": "Fakeland",
    "state": "VIC",
    "country": "Australia"
  },
  "inspection_times": [
    {
      "id": "IID001",
      "description": "First inspection date on Sunday"
    },
    {
      "id": "IID002",
      "description": "Second inspection date on Tuesday"
    },
    {
      "id": "IID003",
      "description": "Final inspection date on Friday"
    }
  ]
}
```
should become:
```
// `properties` stream (json line delimited)

{
         "id": "ID001",
         "address_street_address": "123 Fake street",
         "address_suburb": "Fakeland",
         "address_state": "VIC",
         "address_country": "Australia"
}


// `inspection_times` stream (formatted for readability, but assume json line delimited)

{
         "properties_id": "ID001",
         "id": "IID001",
         "description":"First inspection date on Sunday"
},
{
         "properties_id": "ID001",
         "id": "IID002",
         "description":"Second inspection date on Tuesday"
},
{
         "properties_id": "ID001",
         "id": "IID003", 
         "description":"Final inspection date on Friday"
}
```
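As a rough illustration of that mapping (assuming only the example field names above), a nested object is flattened into prefixed columns and each array item becomes a child record carrying the parent id:

```python
# Split one property payload into the two flat streams shown above.
def split_property(payload: dict) -> dict[str, list[dict]]:
    prop = {"id": payload["id"]}
    # Flatten the nested `address` object into address_* columns.
    for field, value in payload.get("address", {}).items():
        prop[f"address_{field}"] = value
    # Each inspection time becomes its own record with a `properties_id` FK.
    inspections = [
        {"properties_id": payload["id"], **item}
        for item in payload.get("inspection_times", [])
    ]
    return {"properties": [prop], "inspection_times": inspections}
```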
v
The lean for ELT is to load
```
{
  "id": "ID001",
  "address": {
    "street_address": "123 Fake street",
    "suburb": "Fakeland",
    "state": "VIC",
    "country": "Australia"
  },
  "inspection_times": [
    {
      "id": "IID001",
      "description": "First inspection date on Sunday"
    },
    {
      "id": "IID002",
      "description": "Second inspection date on Tuesday"
    },
    {
      "id": "IID003",
      "description": "Final inspection date on Friday"
    }
  ]
}
```
into your DB as-is, then reshape it in your transform layer (to be more clear than saying dbt: I literally mean SQL; you can run SQL against this dataset to get it into the format you listed). The point is that you need a transform layer somewhere. Yes, you could put the transform in the Extract portion or the Load portion, but I prefer to push it all the way to the target DB: when that requirement changes for some reason there are no changes to the Extract/Load portion, it's just in the DB layer, which tends to change based on business requirements more anyway. https://gitlab.com/gitlab-data/analytics/-/blob/master/transform/snowflake-dbt/models/sources/discourse/daily_engaged_users.sql is one example of many which may help give you an idea!
There's tons of conflicting advice on this point, so I guess my advice here is pretty opinionated! So the answer is: it depends! Adding a GUID, I think most would agree, belongs in the transformation layer. On the denesting piece, it looks like a lot of people say to just denest at the beginning, as otherwise you have to denest later and it costs more money/time.