# best-practices
h
Hi, we are building a reporting delivery app to query a warehouse database powered by Meltano. Meltano ELT jobs run every N intervals to update the warehouse database. Some of the reports require a guarantee of "fresh enough" data, up to the past N minutes. Curious what the common practices are to check data freshness or the "last successful ELT run timestamp" before executing a query against the warehouse. Is there a Meltano built-in metadata table in the warehouse database that we can query?
p
@huiming the system database has a runs table that you could probably get this information from. Or there's also the `state get` command to check on your job bookmarks if they're all running with incremental state, but the system database is probably a better bet.
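The runs-table idea above could be sketched roughly like this. This is a hypothetical simulation, not the real Meltano schema: the actual system database is typically Postgres or SQLite and its table and column names vary by Meltano version, so `runs`, `job_name`, `state`, and `started_at` here are stand-ins. The point is just the shape of the query: latest successful run, compared against a freshness window.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

# Simulate a system database with a runs table (names are assumptions,
# not the exact Meltano schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE runs (job_name TEXT, state TEXT, started_at TEXT)")

now = datetime.now(timezone.utc)
conn.execute(
    "INSERT INTO runs VALUES (?, ?, ?)",
    ("elt-main", "SUCCESS", (now - timedelta(minutes=5)).isoformat()),
)

# Latest successful run for the job we care about.
(latest,) = conn.execute(
    "SELECT MAX(started_at) FROM runs WHERE job_name = ? AND state = 'SUCCESS'",
    ("elt-main",),
).fetchone()

last_success = datetime.fromisoformat(latest)
is_fresh = (now - last_success) <= timedelta(minutes=15)
```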
h
Our reporting delivery app currently has access to query the warehouse database only, not the Meltano system database. Is there a common alternative to get this information directly by querying the warehouse database?
p
One option is to enable the loader to add metadata columns (example for target-snowflake); then you can query the max value of `_sdc_extracted_at` or `_sdc_batched_at` in the warehouse tables to see when the most recent load happened.
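The metadata-column check described above might look like this. The `_sdc_*` column names come from the Singer SDK targets mentioned in the thread; the `orders` table and the 30-minute window are invented for illustration, and SQLite stands in for the real warehouse.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

FRESHNESS_WINDOW = timedelta(minutes=30)  # assumed SLA, adjust per report

# Simulate a warehouse table that the loader populated with an
# _sdc_batched_at metadata column.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, _sdc_batched_at TEXT)")

now = datetime.now(timezone.utc)
conn.execute(
    "INSERT INTO orders VALUES (1, ?)",
    ((now - timedelta(minutes=10)).isoformat(),),
)

# Freshness check: max metadata timestamp vs. the allowed window.
(latest,) = conn.execute("SELECT MAX(_sdc_batched_at) FROM orders").fetchone()
fresh = (now - datetime.fromisoformat(latest)) <= FRESHNESS_WINDOW
```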
h
ah okay, this option will let us check "source table" freshness, correct? is there a way to determine "mart model" freshness, or "overall warehouse" freshness? some context: we need to generate a few "high-accuracy" summary reports only when all mart models that the reports depend on are "fresh enough"; otherwise, the reporting delivery app will pause and check again in a few minutes.
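The "all dependencies fresh, else pause and retry" gate described here could be sketched as a simple all-or-nothing check. Everything below is hypothetical: the mart names, the window, and the idea that each mart's last-updated timestamp has already been fetched (e.g. via the metadata columns or a freshness table) are assumptions.

```python
from datetime import datetime, timedelta, timezone

FRESHNESS_WINDOW = timedelta(minutes=30)  # assumed per-report SLA

def all_marts_fresh(last_updated_by_mart, now=None):
    """True only when *every* dependency was refreshed within the window."""
    now = now or datetime.now(timezone.utc)
    return all(
        now - ts <= FRESHNESS_WINDOW for ts in last_updated_by_mart.values()
    )

now = datetime.now(timezone.utc)
timestamps = {
    "mart_sales": now - timedelta(minutes=5),
    "mart_customers": now - timedelta(minutes=50),  # stale dependency
}

# One stale mart blocks the report; the app would sleep and re-check.
ready = all_marts_fresh(timestamps, now)
```

In the real app this check would sit in a loop: if `ready` is false, sleep a few minutes and query the timestamps again, which matches the idempotent "keep trying until it's ready" design discussed later in the thread.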
t
If you're using dbt you can use something like this to record when dbt models were run and then look at the run start/end timestamps to see if they're "recent enough". We use that approach to show when the source data in our reports was last updated.
m
We use https://hub.meltano.com/extractors/tap-meltano to pull Meltano run data into the warehouse where we can query it more easily
h
@thomas_briggs Do you reuse a specific dbt package to record the start/end timestamp?
I came across this post about maintaining a "freshness" table: https://bostata.com/staying-fresh-with-freshness-tables/ Do we think a model freshness check is a common must-have feature? It looks like everyone uses a slightly different solution to this problem currently.
a
New idea: `meltano run tap target newutility:insert`. Use the last step of the Meltano tasks to insert the metadata into a warehouse table. Feels the same as the tap-meltano solution. (We never imagined it being used for this!)
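The utility step proposed above could be little more than a script that writes one row per completed pipeline into a freshness table in the warehouse. This is a sketch under assumptions: `newutility` is the hypothetical plugin name from the command, and the `elt_freshness` table name and columns are invented.

```python
import sqlite3
from datetime import datetime, timezone

def record_run(conn, pipeline_name):
    """Record a completed pipeline run in a freshness table.

    Intended to be invoked as the final step of a pipeline, e.g. the
    `newutility:insert` task in `meltano run tap target newutility:insert`.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS elt_freshness (pipeline TEXT, finished_at TEXT)"
    )
    conn.execute(
        "INSERT INTO elt_freshness VALUES (?, ?)",
        (pipeline_name, datetime.now(timezone.utc).isoformat()),
    )
    conn.commit()

# SQLite stands in for the warehouse here.
conn = sqlite3.connect(":memory:")
record_run(conn, "daily-elt")
count = conn.execute("SELECT COUNT(*) FROM elt_freshness").fetchone()[0]
```

The reporting app then only ever reads `elt_freshness`, so it never needs access to the Meltano system database, which was the original constraint.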
h
Do we think this could become an official turn-key feature in Meltano? Or should it be the transformer's responsibility?
a
Either way, it's a good requirement to raise an issue for. Seems like a good approach to reporting. I like the idempotent nature, keep trying until it's ready. I'd guess the more typical approach is pipelines with dependencies.
h
> typical approach is pipelines with dependencies.
@aaron_phethean do you refer to the reporting scenario? just wanna pick your brain on this: do you mean another common option is to trigger the report exports as the last step of an ELT job?
a
Yeah. Essentially a sequence of pipelines, or one big pipeline. I like your design a lot more!