# best-practices
a
Hi @Sven Balnojan, I read your article here (https://meltano.com/blog/5-helpful-extract-load-practices-for-high-quality-raw-data/) and it makes a LOT of sense, thank you. I'm hoping to implement these practices and avoid nasty issues in the future. As a relative newcomer here, I wonder how Meltano can help with the issues you mention. Are there some simple dos and don'ts for `meltano.yml` for each point? For example, for point 1, can I add a stream map to generate a timestamp column or identify the Meltano run? In my limited experience, different taps handle the `_sdc` fields in different ways (or not at all). Can I take control of that?
s
@Andy Carter Good question 🙂 To be honest, I wrote them without thinking about Meltano, but of course Meltano does offer more than other tools in this direction:
1. Yes, use the metadata columns when available (always!). Otherwise use a stream map (which can also insert timestamps) on the built-in tap, and as a last resort use an extra map-transformer. That way you can get these for EVERY data import. (If you need to use the mapper, it might be easier to just use it in all your pipelines.)
2. Deduplication: try to turn off deduplication features on taps that support it. Use log-based replication if available, then consider full syncs or key-based replication.
3. Don't flatten on ingestion. Turn that off as well; some taps support that 🙂
4. Indeed, try to avoid mappers if possible. However, mappers are already implemented in a way that you usually only consider them when the use case is a good one according to this guide.
I'm preparing a bunch of examples for the mappers to help with some of these questions. I'll mention you once they are out. Hope this helps! And thanks for asking.
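A minimal sketch of the idea behind point 1, written in plain Python rather than as a Meltano stream map: every record of one load gets the same run identifier and the same load timestamp. The function name `stamp_records` and the column names `_run_id` and `_loaded_at` are hypothetical and chosen only for illustration; they are not part of Meltano or any tap.

```python
from datetime import datetime, timezone
from uuid import uuid4


def stamp_records(records, run_id, loaded_at):
    """Yield copies of the records with run-level metadata columns added."""
    for record in records:
        # `_run_id` and `_loaded_at` are hypothetical column names.
        yield {**record, "_run_id": run_id, "_loaded_at": loaded_at.isoformat()}


# Capture both values once, before the load starts,
# so they stay constant for every record of this run.
run_id = str(uuid4())
loaded_at = datetime.now(timezone.utc)

rows = list(stamp_records([{"id": 1}, {"id": 2}], run_id, loaded_at))
```

Because the two values are created outside the loop, they identify the run as a whole instead of the moment each record was processed.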
a
Thanks for the response 🙂 Good to hear that Meltano can handle a lot of these, another feather in its cap. I look forward to seeing the mapper examples. It's an area I don't really understand well, so any examples are welcome.
p
Regarding “1. Make each EL run uniquely identifiable … Ingestion time: the timestamp indicating when the load process started.”: The target that I’m using populates `_sdc_batched_at` with the extraction time of each record, so its value differs slightly for each record within the same load process. (The taps that I use have the option to do the same thing.) But I’d prefer to have the same timestamp attached to all records from the same load process. Any recommendations for how to do that with Meltano? For example, could a stream map somehow reference a timestamp that’s constant for all records belonging to the same load process (but differs between load processes)?
s
@pat_nadolny it should, right?
@peter_s
1. You can take a general look at https://github.com/MeltanoLabs/meltano-map-transform/tree/main/examples#example-3-dropping-columns-aliasing-streams I just added a bunch of examples of using stream maps.
2. One solution would be to simply timestamp not to the microsecond but to the hour or day (depending on your workload), e.g. like https://github.com/MeltanoLabs/meltano-map-transform/tree/main/examples#example-3-dropping-columns-aliasing-streams.
p
Thanks for that info. I’d still be a bit concerned about a long-running load process spanning an hour or day boundary, and so getting multiple values per load for that timestamp column.
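To make the boundary concern concrete, here is a small Python sketch (not Meltano code). The helper `truncate_to_hour` and the two example timestamps are assumptions picked for illustration: one record is processed just before an hour boundary and one just after it.

```python
from datetime import datetime, timezone


def truncate_to_hour(ts):
    """Drop minutes, seconds and microseconds from a timestamp."""
    return ts.replace(minute=0, second=0, microsecond=0)


# Arbitrary example values: one load that straddles 10:00 UTC.
first_record_at = datetime(2023, 3, 1, 9, 59, 59, tzinfo=timezone.utc)
last_record_at = datetime(2023, 3, 1, 10, 0, 1, tzinfo=timezone.utc)

# Per-record truncation still produces two distinct values for one load.
truncated = {truncate_to_hour(first_record_at), truncate_to_hour(last_record_at)}

# A timestamp captured once at the start of the load stays constant.
run_started_at = first_record_at
stamped = {run_started_at for _ in (first_record_at, last_record_at)}
```

Here `truncated` ends up with two values while `stamped` has one, which is why truncation reduces the problem but does not remove it.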
s
@peter_s I am too 😉 I already logged a feature request that would enable that in an easy way: https://github.com/meltano/sdk/issues/1532
p
@Sven Balnojan I think those are all record-level timestamps. I'm not sure about sync-level timestamps or IDs. I found https://github.com/meltano/sdk/issues/1199 and asked for clarification on `_sdc_table_version`, which might be what we're looking for.