# singer-tap-development
p
I'm losing faith in the singer ecosystem. Perhaps others have managed to find a happy path with singer? How can I find the happy path? What I'm worried about:
• The original singer-python is unmaintained. Taps using it become more and more outdated. It is also flawed in that it pins its deps (a library should not).
• Transferwise's pipelinewise keeps a few well-maintained taps, but outside of that you're on your own. It's not really an SDK.
• Meltano's singer-sdk is an improvement over original singer sdk, but the flaws of the original spec are unfortunately also present. Eg. transforms are part of destinations, hard to pass runtime or source originated params/metadata to destination (for partitioning), scaling is an afterthought.
I'm exploring alternatives to singer-based ETL, and all I can say is that all alternatives are flawed in their own ways. Help me restore my faith 🙂.
e
Let's work together to improve it?
v
The original singer-python is unmaintained.
It works pretty well still, although the singer-sdk by meltano is much better (you're here right?)
Taps using it become more and more outdated. It is also flawed in that it pins its deps (a library should not).
I don't think you're thinking very deeply about this one. There is a reason for this; it's not as simple as "a library should not".
Transferwise's pipelinewise keeps a few well-maintained taps, but outside of that you're on your own. It's not really an SDK.
You're in the Meltano community, take a look around 😄 #C0489728QFL looks more active than you're saying
Meltano's singer-sdk is an improvement over original singer sdk,
Doesn't this counter your previous points?
but the flaws of the original spec are unfortunately also present. Eg. transforms are part of destinations, hard to pass runtime or source originated params/metadata to destination (for partitioning), scaling is an afterthought.
The spec isn't so bad
transforms are part of destinations
Where in the spec does it say this? (I don't believe it has an opinion on this, as it really shouldn't.)
hard to pass runtime or source originated params/metadata to destination (for partitioning)
Not sure what exactly you're talking about. The source can pass any metadata it wants as a stream if it wishes; the spec doesn't say what should / shouldn't happen here (smartly, I think). There is an issue board going over potential singer spec changes as well.
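To make the "pass metadata as a stream" idea concrete, here is a minimal sketch of a tap attaching the originating file name to every record it emits. The `SCHEMA` and `RECORD` message types come from the Singer spec; the `_source_file_name` property, the `orders` stream and the helper names are assumptions made up for illustration, not something the spec or any particular tap defines.

```python
import json
import sys
from typing import Any, Dict, Iterable, Iterator


def with_source_metadata(
    stream: str, rows: Iterable[Dict[str, Any]], file_name: str
) -> Iterator[Dict[str, Any]]:
    """Yield a SCHEMA message, then one RECORD message per row.

    Each record carries the file it came from in a hypothetical
    `_source_file_name` property (an assumption, not part of the spec).
    """
    yield {
        "type": "SCHEMA",
        "stream": stream,
        "key_properties": [],
        "schema": {
            "type": "object",
            "properties": {
                "id": {"type": "integer"},
                "_source_file_name": {"type": "string"},
            },
        },
    }
    for row in rows:
        yield {
            "type": "RECORD",
            "stream": stream,
            "record": {**row, "_source_file_name": file_name},
        }


def write_messages(messages: Iterable[Dict[str, Any]], out=sys.stdout) -> None:
    """Write one JSON message per line, as Singer taps do on stdout."""
    for message in messages:
        out.write(json.dumps(message) + "\n")
```

A target that knows about this property could read it from each record, but nothing in the spec obliges it to, which is the consistency gap raised in the thread.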
scaling is an afterthought.
What does the spec have to do with this exactly? I've scaled pretty well with this 🤷. The spec does handle some things very well and others (billions of records) not so well, although it's doable.
I'm exploring alternatives to singer-based ETL, and all I can say is that all alternatives are flawed in their own ways.
Yep, most of them aren't open. If they are open, a lot of them tend to couple a limited number of taps/targets to their implementations, which is much worse than separate apps for taps/targets, as proven by https://hub.meltano.com/. There's a reason no commercial company supports 600+ sources/targets.
e
Taps using it become more and more outdated. It is also flawed in that it pins its deps (a library should not).
@visch I actually do agree with this, and it's the reason why I've been trying to remove any upper dependency constraints from the sdk. It's downstream taps and targets that should pin dependencies, at least to the minor (semver) level. I'm curious what reasons there would be for using a different approach.
transforms are part of destinations
I'm also curious what is meant by this
hard to pass runtime or source originated params/metadata to destination (for partitioning)
IIUC what the expectation is here, either
• singer packages would have to be rethought as importable artifacts, making them no longer language-agnostic, or
• have the orchestrator do the work of sourcing those params and passing them to an instance of the singer application
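The second option (the orchestrator sources the params) could look roughly like the sketch below: the orchestrator merges a runtime value such as the run date into each plugin's config file and then launches the tap and target with `--config`. The `--config` flag is the usual Singer CLI convention; the `tap-sftp` executable name and the config keys (`host`, `start_date`, `bucket`, `partition_date`) are assumptions for illustration, and real plugins would need to support equivalent settings.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict, List


def render_config(
    base: Dict[str, Any], runtime: Dict[str, Any], directory: Path, name: str
) -> Path:
    """Merge runtime parameters over the static config and write it as JSON."""
    path = directory / f"{name}.json"
    path.write_text(json.dumps({**base, **runtime}))
    return path


def build_commands(run_date: str, directory: Path) -> List[List[str]]:
    """Return the tap and target command lines for one run date.

    Executable names and config keys are hypothetical placeholders.
    """
    tap_cfg = render_config(
        {"host": "sftp.example.com"}, {"start_date": run_date}, directory, "tap"
    )
    target_cfg = render_config(
        {"bucket": "my-bucket"}, {"partition_date": run_date}, directory, "target"
    )
    return [
        ["tap-sftp", "--config", str(tap_cfg)],
        ["target-s3", "--config", str(target_cfg)],
    ]
```

An orchestrator would then pipe the first command's stdout into the second (for example with `subprocess.Popen`). That step is left out so the sketch runs without the executables installed.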
v
@edgar_ramirez_mondragon
@visch I actually do agree with this, and it's the reason why I've been trying to remove any upper dependency constraints from the sdk. It's downstream taps and targets that should pin dependencies, at least to the minor (semver) level. I'm curious what reasons there would be for using a different approach.
I understand generally. For jsonschema, I assumed there was some thought about Draft 4 being the default, re https://github.com/MeltanoLabs/Singer-Working-Group/pull/24
jsonschema version is a thorny portion of the spec imo, I'm good with most of it 🤷
e
oh yeah, though "Draft 4" can be enforced without pinning your dependencies
p
Thanks for the positive responses! I admit that my experience with singer is limited to POCs, and I lack a long-term perspective on this.
Let's work together to improve it?
For sure!
You're in the Meltano community, take a look around 😄 #C0489728QFL looks more active than you're saying
Did not know about this channel. Thanks.
> Meltano's singer-sdk is an improvement over original singer sdk,
Doesn't this counter your previous points?
Absolutely, for the taps and targets using it. For the rest, the worries are still relevant.
> transforms are part of destinations
I'm also curious what is meant by this
Sorry for not being clear. There are multiple things I'm talking about, but I'll need to think about which ones are relevant here. I won't expand on this point now, but to give you an example, "we have both `target-s3` and `target-s3-csv`".
> hard to pass runtime or source originated params/metadata to destination (for partitioning)
Not sure what exactly you're talking about. The source can pass any metadata it wants as a stream if it wishes; the spec doesn't say what should / shouldn't happen here (smartly, I think). There is an issue board going over potential singer spec changes as well.
Regarding my specific use cases, we are mainly missing:
• "runtime parameters", such as `date` to run loading for.
• "source originated" metadata parameters, such as `file_name`. Some extractors support this and that's great, but there's no consistency. Filename would make sense for local filesystem, SFTP and S3; similarly, table name makes sense for databases.
• a common way of defining partitioning, in source and in destination.
A trivial example is that I'm loading data that is partitioned by date from, say, SFTP to S3. I want to mirror the source partitioning in the destination. I can't seem to figure out a nice way to do it. Another trivial example is that, in the above scenario, I want to re-run a pipeline for a specific date in the past. Extractors and loaders barely have a concept of partitioning. Some targets (at least s3) are able to partition data by date, but they only use the current datetime (load date). Overriding this is hard and requires me to use workarounds, such as passing this data in "stream_name" or similar.
But I'll make another POC and look more into `context`, to see if I can figure out some other way. The good thing is: the positive response here prompted me to give the singer ecosystem a second try 🙂.
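As a rough illustration of the partitioning behaviour asked for here, the sketch below derives the destination prefix from a date carried in the record itself and only falls back to the load date when that field is missing. The `_partition_date` field name, the Hive-style `year=/month=/day=` layout and the `partition_prefix` helper are all assumptions for illustration; no existing target is claimed to work this way.

```python
from datetime import date, datetime, timezone
from typing import Any, Dict, Optional


def partition_prefix(
    stream: str,
    record: Dict[str, Any],
    partition_field: str = "_partition_date",  # hypothetical field name
    today: Optional[date] = None,
) -> str:
    """Build an object-store key prefix mirroring the source partition.

    Uses the ISO date found in `partition_field` when present, otherwise
    the load date (`today`, defaulting to the current UTC date).
    """
    raw = record.get(partition_field)
    if raw is not None:
        day = date.fromisoformat(raw)
    else:
        day = today or datetime.now(timezone.utc).date()
    return f"{stream}/year={day.year:04d}/month={day.month:02d}/day={day.day:02d}/"
```

With something like this, re-running the pipeline for a past date would write to that date's partition as long as the extractor sets the field, rather than to the partition of the day the job happened to run.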