# singer-target-development
j
Good new name. We run a SaaS and are trying to push data into our own system. My most promising proof-of-concept idea using Singer so far puts a transformer between the tap and target, so `input_tap | saas_transformer | output_target`. In our case the transformer does things like inserting a tenant id into the schemas and records, or remapping the stream names to match our internal tables. Of course we could write our own custom target, but this way we can use ANY tap and ANY target. We use PostgreSQL, so for instance our sync processes might run `tap-pipedrive | saas_transformer | target-postgres` to get our clients' Pipedrive data into our multi-tenant Postgres DB. And then we run dbt to clean it up even a bit further if needed (or create a consistent set of views for our SaaS app to use). But you could e.g. create a `target-rest` or something as well to push into an API. Thoughts?
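For illustration, a minimal sketch of what such a `saas_transformer` could look like, assuming Singer messages arrive as JSON lines on stdin; the `TENANT_ID` value and the `acme_` stream prefix are hypothetical placeholders:

```python
#!/usr/bin/env python3
"""Pass-through Singer transformer: inject a tenant id and namespace streams."""
import json
import sys

TENANT_ID = "tenant-42"    # hypothetical: would come from config in practice
STREAM_PREFIX = "acme_"    # hypothetical namespace matching internal table names

for line in sys.stdin:
    msg = json.loads(line)
    msg_type = msg.get("type")

    if msg_type == "SCHEMA":
        # Declare the tenant_id column in the stream's JSON schema.
        msg["schema"]["properties"]["tenant_id"] = {"type": "string"}
        msg["stream"] = STREAM_PREFIX + msg["stream"]
    elif msg_type == "RECORD":
        # Stamp every record with the tenant it belongs to.
        msg["record"]["tenant_id"] = TENANT_ID
        msg["stream"] = STREAM_PREFIX + msg["stream"]
    # STATE and other message types pass through untouched.

    sys.stdout.write(json.dumps(msg) + "\n")
```

It would sit in the middle of the pipe exactly as described: `tap-pipedrive | ./saas_transformer.py | target-postgres`.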
a
Re:
Of course we could write our own custom target, but this way we can use ANY tap and ANY target .. [and] might run `tap-pipedrive | saas_transformer | target-postgres`…
If you could keep the ability to have a custom transformer in the middle, would you be interested in swapping the target for a custom one that requires your properly shaped data? Like `tap-pipedrive | saas_transformer | target-wink-reports`? Do I understand correctly that the transformer in the middle would need to be flexible, and the target in either case requires a standard input shape?
j
No, I'm thinking that if there's a transformer in the middle then there's no need to have a custom tap. I'd like to maintain as few code bases as possible. TransferWise already has a transformer which is configuration-driven. It only does some simple "hashing" type stuff, but I don't see why something similar can't be built to "insert column", "modify field", etc., all config-driven.
So from Meltano's perspective, I'm not quite sure if it's possible to specify some intermediary process yet? It might be a quick win, because I'm sure people will come up with plenty of use cases which can be solved by an "EmTLT" process: "Extract, miniTransform, Load, Transform". Heard it here first! 🤣
Re: Do I understand correctly that the transformer in the middle would need to be flexible, and the target in either case requires a standard input shape?
I think most of the targets will shape whatever comes in - they're quite flexible. E.g. target-postgres will flatten objects, create sub-tables, etc. So if the schema massaging in the middle is good enough, then you can control what the target will do, along with the config options of the target itself.
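For illustration, roughly what that flattening looks like; this is a sketch of the general idea, not target-postgres's exact implementation (the `__` separator is a common convention in such targets):

```python
def flatten(record: dict, parent: str = "", sep: str = "__") -> dict:
    """Flatten nested objects into column-like keys, e.g. address.city -> address__city."""
    out = {}
    for key, value in record.items():
        name = f"{parent}{sep}{key}" if parent else key
        if isinstance(value, dict):
            # Recurse into nested objects, prefixing child keys.
            out.update(flatten(value, parent=name, sep=sep))
        else:
            out[name] = value
    return out

print(flatten({"id": 1, "address": {"city": "Sydney", "zip": "2000"}}))
# {'id': 1, 'address__city': 'Sydney', 'address__zip': '2000'}
```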
a
😄 I like the `mT`! But yes, today you'd have to implement that `mT` using dbt, or a custom target or tap that supports at least minor config-driven inline transforms as I describe here. The universal inline fix is as you describe, with a transformer in the middle accepting some dynamic transformation mappings, like described here, but that's not yet available.
@johann_du_toit - To confirm a few things… for the inline transformer use case, we likely would not be able to join across streams, and each stream type would just be able to transform using its own data. The cases where you would need to aggregate, or join and look up between related stream IDs and their properties, would probably get us into dbt territory. Does that sound right to you, and is that still viable for your intended use cases?
j
Ah yes - I should've been clear - I'm not talking about any kind of aggregation or joining. That is handled by dbt at the end. I'm talking about things that the typical SaaS might need to think of because they are dealing with multiple tenants and need to split up the data somehow. Which means injecting additional metadata into streams, and namespacing streams - that kind of thing. The number of streams and number of records remain exactly the same. That's why I said it's "mini-T" rather than proper T. So I believe we're on the same page, yes. 🙂
a
Great. Thanks for confirming the expectations. Whenever you have time, I'm curious about your thoughts regarding the `simpleeval` library: https://github.com/danthedeckie/simpleeval One idea we're evaluating (rather than writing our own expression language) would be to use a library like this one. The transformer in the middle would take a setting with a list/dict of stream transformations, and the transformations might be defined using simpleeval expressions. Curious to hear your thoughts on this approach...
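For illustration, a minimal sketch of how a value transformation might be expressed with simpleeval; the field names (`amount`, `invoice_id`) are hypothetical:

```python
from simpleeval import simple_eval

# A hypothetical record from an "invoices" stream.
record = {"invoice_id": 17, "amount": 12.5}

# Each transformation is just an expression evaluated against the
# record's fields, which are exposed as names to the evaluator.
new_amount = simple_eval("amount * 100", names=record)           # 1250.0
label = simple_eval("'INV-' + str(invoice_id)", names=record)    # 'INV-17'

print(new_amount, label)
```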
j
That sounds fine. I don't think transforming a value is the hard part - simpleeval will do. It's the filtering - quick and easy ways to decide which records to operate on - that might be trickier. Let's say I want to insert a new field into a stream. The transformer needs to update schema and record messages - in completely different ways. There's no expression to evaluate at all in that example. But you need to be able to say "if type=schema, and stream=invoices, and whatever else, then add a field at a certain spot in the object with a certain value". I imagine we should also make sure it's not slowing things down significantly when ingesting large volumes of data.
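A rough sketch of what such config-driven filtering could look like; the rule format and the `add_field` action are hypothetical, not an existing Singer or Meltano feature:

```python
import json
import sys

# Hypothetical config: match criteria plus an action per rule. The same
# logical rule needs two entries because SCHEMA and RECORD messages must
# be updated in completely different ways.
RULES = [
    {
        "match": {"type": "SCHEMA", "stream": "invoices"},
        "action": {"add_field": "tenant_id", "json_schema": {"type": "string"}},
    },
    {
        "match": {"type": "RECORD", "stream": "invoices"},
        "action": {"add_field": "tenant_id", "value": "tenant-42"},
    },
]

def matches(msg: dict, criteria: dict) -> bool:
    # Every criterion must equal the corresponding message attribute.
    return all(msg.get(key) == expected for key, expected in criteria.items())

for line in sys.stdin:
    msg = json.loads(line)
    for rule in RULES:
        if not matches(msg, rule["match"]):
            continue
        action = rule["action"]
        if msg["type"] == "SCHEMA":
            # Declare the new field in the stream's JSON schema.
            msg["schema"]["properties"][action["add_field"]] = action["json_schema"]
        elif msg["type"] == "RECORD":
            # Add the new field with its configured value.
            msg["record"][action["add_field"]] = action["value"]
    sys.stdout.write(json.dumps(msg) + "\n")
```

Keeping the matching to plain dict lookups (rather than evaluating an expression per message) also speaks to the throughput concern, since every message passes through this loop.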
v
I'm curious if you're pushing directly into your target DB today or using your own API? An API would make something like tenant id fairly trivial. Today, for doing any transformation work, the recommendation is really to load the data into your DW and then use dbt, which really does/can work for all the use cases you've mentioned. And it also seems like a great fit for Wink Reports.
I like the idea of the transformation layer, but it gets complicated pretty fast. Not to mention you end up wanting a low-code front end, which means you really should architect from the beginning how a UI would build the transformations and output them as readable YAML. Then there's this big line between caring about readable YAML and keeping your UI clean, and the decisions made around that can bog things down if your focus is too much on the UI. GitLab's approach with .gitlab-ci.yml is good, I think, but it's not easy for anyone to use via a UI.
At the end of the day I think it needs to be code, otherwise you end up as the N+1th SaaS app going for no/low-code. While those have their place, they're not broadly useful to everyone.
p
Jumping in here - I agree with @visch. We've been using this pattern of Tap -> DW -> dbt transform/view -> SaaS target. This allows us to keep the targets relatively simple, doing some basic field name mapping, and also pass everything through our data platform/DW before sending to a SaaS target.
v
Nice, an idea we've been tossing around is to do Tap -> DW -> dbt transform/view/etc. -> Tap (select just the stream you care about) -> Target (real strict schema for the SaaS target).
Same idea though, but we think we could do SaaS targets with Singer.
The main benefit there is that all your data movement is done with the same framework, version controlled in the same place, etc.
p
yeah, we have a custom Singer manager thing we built so we don't use Meltano yet, but that's the exact pattern we landed on too. Usually a SaaS target sync job is a single dbt view to a single SaaS endpoint, not multiple streams. It was faster and simpler for us to just make them their own sync jobs.
v
Nice! Glad to hear about someone doing this with Singer in the wild. I have been dabbling, and want to make the jump to try to bring IT folks into this framework. Would you / could you show this off in the next office hours?
p
sure, I'm happy to join the next office hours and chime in about what we've been doing. what's the schedule for those?
never mind - found it on the website
a
👍 Excited to have you share, @pnadolny. We do also have the #C01QS0RV78D channel where we share announcements, links, recaps, etc.
j
I guess for some context on where we're coming from - we're not using Singer / Meltano at all. But we're a reporting platform and we've written our own 30 custom integrations already over the years, and the maintenance burden etc. is high. So being able to leverage Singer (which didn't even exist when we started) would be great. However, we have our own scaling, scheduling and orchestration setup, transformation queries, etc. All our own syncing goes straight into Postgres. The last few days I was able to get a Singer tap and target-postgres to run inside our own architecture (mostly Celery-based for distributing tasks). It doesn't use Meltano, but it runs the tap, transforms data slightly, and then feeds it into the target - all from inside a Celery task which pulls the configurations etc. from the app database and also stores and manages the tap state. It seems to be running well in my simple tests so far. Obviously far off from being production-ready though.
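A rough sketch of that kind of setup, assuming the tap, transformer, and target are installed as console scripts; the broker URL, task signature, and state handling here are hypothetical simplifications:

```python
import subprocess

from celery import Celery

app = Celery("sync", broker="redis://localhost:6379/0")  # hypothetical broker URL

@app.task
def run_sync(tap_config_path: str, target_config_path: str) -> str:
    """Run tap | transformer | target as one pipeline inside a worker."""
    tap = subprocess.Popen(
        ["tap-pipedrive", "--config", tap_config_path],
        stdout=subprocess.PIPE,
    )
    transformer = subprocess.Popen(
        ["./saas_transformer.py"],  # the mini-transformer sketched earlier
        stdin=tap.stdout,
        stdout=subprocess.PIPE,
    )
    tap.stdout.close()  # let SIGPIPE propagate if the transformer exits
    target = subprocess.Popen(
        ["target-postgres", "--config", target_config_path],
        stdin=transformer.stdout,
        stdout=subprocess.PIPE,
    )
    transformer.stdout.close()

    # The target emits updated STATE on stdout; a real task would persist
    # this back to the app database for the next run.
    state, _ = target.communicate()
    tap.wait()
    transformer.wait()
    return state.decode()
```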
h
@nico_marlon_haesler