Meltano #best-practices


Andy Carter

03/22/2023, 9:00 AM

Hi @Sven Balnojan, I read your article here (https://meltano.com/blog/5-helpful-extract-load-practices-for-high-quality-raw-data/) and it makes a LOT of sense, thank you. I'm hoping to implement these practices and avoid nasty issues in the future. As a relative newcomer here, I wonder how Meltano can help with the issues you mention? Are there some simple dos and don'ts for meltano.yml for each point? For example, for point 1, can I add a stream map to generate a timestamp column or identify the Meltano run? In my limited experience, different taps handle the _sdc fields differently (or not at all). Can I take control of that?

Sven Balnojan

03/23/2023, 8:42 AM

@Andy Carter Good question 🙂 To be honest, I wrote them without thinking about Meltano, but of course Meltano does offer more than other tools in this direction:

1. Yes, use the metadata columns when available (always!). Otherwise use a stream map on the tap's built-in support (stream maps can also insert timestamps), and as a last resort use an extra map-transformer. That way you can get these for EVERY data import. (If you need to use the mapper, it might be easier to just use it in all your pipelines.)

2. Deduplication: try to turn off deduplication features on taps that support them. Use log-based replication if available; after that, consider full syncs or key-based replication.

3. Don't flatten on ingestion. Turn that off as well; some taps support that 🙂

4. Indeed, try to avoid mappers if possible. However, mappers are already implemented in a way that you usually only consider them when the use case is a good one according to this guide.

I'm preparing a bunch of examples for the mappers to help with some of these questions. I'll mention you once they are out. Hope this helps! And thanks for asking.

Andy Carter

03/23/2023, 9:13 AM

Thanks for the response 🙂 good to hear that meltano can handle a lot of these, another feather in its cap. I look forward to seeing the mapper examples, it's an area I don't really understand well so any examples welcome.

peter_s

03/24/2023, 3:41 PM

Regarding “1. Make each EL run uniquely identifiable … Ingestion time: the timestamp indicating when the load process started.”: The target that I’m using populates _sdc_batched_at with the extraction time of each record, so its value differs slightly for each record within the same load process. (The taps that I use have the option to do the same thing.) But I’d prefer to have the same timestamp attached to all records from the same load process. Any recommendations for how to do that with Meltano? For example, could a stream map somehow reference a timestamp that’s constant for all records belonging to the same load process (but differs between load processes)?

Sven Balnojan

03/27/2023, 7:32 AM

@pat_nadolny it should, right?

Sven Balnojan

03/27/2023, 7:41 AM

@peter_s

1. You can take a general look at https://github.com/MeltanoLabs/meltano-map-transform/tree/main/examples#example-3-dropping-columns-aliasing-streams where I just added a bunch of examples of using stream maps.

2. One solution would be to simply timestamp not to the microsecond but to the hour or day (depending on your workload), e.g. like https://github.com/MeltanoLabs/meltano-map-transform/tree/main/examples#example-3-dropping-columns-aliasing-streams.

peter_s

03/28/2023, 1:06 AM

Thanks for that info. I’d still be a bit concerned about a long-running load process spanning an hour or day boundary, and so getting multiple values per load for that timestamp column.
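
To make the concern above concrete, here is a minimal Python sketch. It is not Meltano or SDK code: the column names _loaded_at, _run_started_at and _run_id, and the helper functions, are hypothetical and exist only for illustration. It compares truncating each record's _sdc_batched_at to the hour with stamping every record with a value captured once when the run starts.

```python
from datetime import datetime, timedelta, timezone
import uuid


def truncate_to_hour(ts):
    """Drop minutes, seconds and microseconds from a timestamp."""
    return ts.replace(minute=0, second=0, microsecond=0)


def stamp_truncated(records):
    """Hypothetical helper: derive a load timestamp per record by truncation."""
    return [
        {**rec, "_loaded_at": truncate_to_hour(rec["_sdc_batched_at"])}
        for rec in records
    ]


def stamp_run(records, run_started_at, run_id):
    """Hypothetical helper: attach values captured once per run to every record."""
    return [
        {**rec, "_run_started_at": run_started_at, "_run_id": run_id}
        for rec in records
    ]


# A run that starts at 00:58 UTC and emits one record per minute,
# so it crosses the 01:00 hour boundary.
start = datetime(2023, 3, 28, 0, 58, tzinfo=timezone.utc)
records = [
    {"id": i, "_sdc_batched_at": start + timedelta(minutes=i)}
    for i in range(5)
]

truncated = stamp_truncated(records)
per_run = stamp_run(records, run_started_at=start, run_id=str(uuid.uuid4()))

truncated_values = {r["_loaded_at"] for r in truncated}
run_values = {r["_run_started_at"] for r in per_run}

# Truncation yields one value per hour touched; the run-level stamp yields one.
print(len(truncated_values), len(run_values))
```

In this sketch the truncated column ends up with two distinct values for a single run, while the run-level columns stay constant. How to get such a run-level value into a real pipeline is exactly the open question in this thread.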

Sven Balnojan

03/28/2023, 6:03 AM

@peter_s I am too 😉 I already logged a feature req that would enable that in an easy way. https://github.com/meltano/sdk/issues/1532

pat_nadolny

04/03/2023, 2:42 PM

@Sven Balnojan I think those are all record-level timestamps. I'm not sure about sync-level timestamps or IDs. I found https://github.com/meltano/sdk/issues/1199 and asked for clarification on _sdc_table_version, which might be what we're looking for.
