huiming
03/28/2023, 1:24 AM

Henning Holgersen
03/28/2023, 1:41 PM

andy_carter
03/28/2023, 8:17 PM

huiming
03/28/2023, 11:39 PM

aaronsteers
03/28/2023, 11:44 PM
> I'm also curious if Meltano is interested to eventually provide a built-in reusable solution to address the freshness check use cases?
Yes. 100%.
aaronsteers
03/28/2023, 11:44 PM

aaronsteers
03/28/2023, 11:47 PM

huiming
03/29/2023, 12:54 AM

aaronsteers
03/29/2023, 1:19 AM

huiming
03/29/2023, 1:29 AM

huiming
03/29/2023, 1:32 AM

huiming
03/29/2023, 1:34 AM

huiming
03/29/2023, 1:36 AM

aaronsteers
03/29/2023, 1:38 AM

aaronsteers
03/29/2023, 1:40 AM

aaronsteers
03/29/2023, 1:43 AM
best/latest, current, and dated/versioned. In that approach, current might have omissions if some datasets have not completed.

huiming
03/29/2023, 2:09 AM
current "domain" in Postgres, will it be a new Postgres schema? It's not clear to me how we would copy the user segmentation mart between different domains.

huiming
03/29/2023, 2:18 AM

aaronsteers
03/29/2023, 6:35 AM
finance_20230321.users and finance_20230321.orders. Then, if the users table doesn't exist, it's obvious that some part of the reverse EL has not completed yet. I believe Postgres also supports renaming schemas, so at the very end of your process you could optionally do something like this to make sure users still have a single place to query the "latest" datasets:
drop schema if exists finance_yesterday_bak;
alter schema finance_latest rename to finance_yesterday_bak;
alter schema finance_20230321 rename to finance_latest;
Since these rename operations are purely logical and don't require any data processing, the "switch" can complete very quickly.

aaronsteers
03/29/2023, 6:44 AM
> btw, given there are several options, if Meltano were to design a built-in solution, which approach is more likely to be considered?
This is hard to say. Something like what I describe above is pretty close to being doable with a pre-script to initialize env vars and a post-command to do the swaps if needed. But there are other ways to solve this problem, such as connecting with dbt's "freshness" feature, or Meltano maintaining its own end-to-end DAG as a superset of the DAG that dbt maintains. We've discussed these internally, but there's a lot of complexity in finding a good solution, likely including the need to integrate with, and expand upon, the related core dbt features. (And we'd want to make sure the task isn't already achievable with dbt-level features, since these are a lot of effort and time to implement.)
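[Editor's note: the pre-script/post-command idea described above can be sketched as follows. This is an illustrative sketch only, not Meltano's implementation; the function names, the schema naming, and the TARGET_SCHEMA env var are all hypothetical.]

```python
import os
from datetime import date

def dated_schema(base: str, run_date: date) -> str:
    # Build the dated schema name, e.g. finance_20230321 (hypothetical naming).
    return f"{base}_{run_date:%Y%m%d}"

def swap_statements(base: str, run_date: date) -> list[str]:
    # Mirrors the rename sequence shown earlier in the thread.
    # These are logical renames only -- no data movement, so the switch is fast.
    return [
        f"drop schema if exists {base}_yesterday_bak;",
        f"alter schema {base}_latest rename to {base}_yesterday_bak;",
        f"alter schema {dated_schema(base, run_date)} rename to {base}_latest;",
    ]

# Pre-script: point the EL job at the dated schema via an env var.
os.environ["TARGET_SCHEMA"] = dated_schema("finance", date(2023, 3, 21))

# Post-command: run the swap once every table has landed.
for stmt in swap_statements("finance", date(2023, 3, 21)):
    print(stmt)
```

The swap statements would be executed against Postgres only after all expected tables exist in the dated schema, which is exactly what makes the dated schema a completion signal.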
aaronsteers
03/29/2023, 6:47 AM

huiming
03/29/2023, 7:03 AM
1. Multiple reporting services ---(consume)--> Warehouse (Redshift): the warehouse's mart tables are usually huge, and we use an incremental materialization strategy. If we use the "schema swap" approach, does it mean we need to re-materialize full tables into a new schema on every pipeline run? I assume this approach will not work for huge incremental mart tables, correct?
2. Meltano job ---(publish marts)--> Operational DB (Postgres) (Reverse EL): the mart tables to publish are relatively small, so I think the "create a new schema and replicate the marts as full tables" approach will still work. Ideally, though, we want a general solution that covers both use cases.

huiming
03/29/2023, 7:04 AM
Or are you thinking of adding an "updated_as_of" column to the published tables? This seems like a general solution that solves both use cases, wdyt?
aaronsteers
03/29/2023, 3:28 PM
> Or are you thinking of adding an "updated_as_of" column to the published tables? This seems like a general solution that solves both use cases, wdyt?
Yes, this is a good solution. It's hard to implement robustly, though - and the only solutions that readily come to mind are ones implemented (primarily) in the dbt layer (as macros or SQL transforms). As noted, it is easier to make freshness a blocking condition than to merge statuses across different tables, which might have different levels of adherence to a freshness policy.
> If we use the "schema swap" approach, does it mean we need to re-materialize full tables into a new schema on every pipeline run?
The options here depend on the backend database. For Snowflake you have zero-copy clones that can mitigate the duplication, but for Redshift/Postgres I don't know if that's an option.
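[Editor's note: the "merge statuses" difficulty mentioned above can be made concrete with a small sketch. This assumes each published table carries an updated_as_of watermark; all names here are hypothetical. The key point: a set of tables is only jointly consistent as of its oldest watermark, and a blocking freshness policy must hold for every member.]

```python
from datetime import datetime, timedelta

def consistent_as_of(watermarks: dict[str, datetime]) -> datetime:
    # The table set is only jointly consistent as of the *oldest* watermark.
    return min(watermarks.values())

def all_fresh(watermarks: dict[str, datetime], max_age: timedelta,
              now: datetime) -> bool:
    # Blocking-style check: every table must meet the freshness policy.
    return all(now - wm <= max_age for wm in watermarks.values())

now = datetime(2023, 3, 29, 12, 0)
marks = {
    "finance_latest.users": datetime(2023, 3, 29, 11, 0),
    "finance_latest.orders": datetime(2023, 3, 29, 6, 0),  # lagging table
}
print(consistent_as_of(marks))                    # 2023-03-29 06:00:00
print(all_fresh(marks, timedelta(hours=2), now))  # False: orders is stale
```

A single lagging table drags the whole set back, which is why merging per-table statuses is harder than a single blocking gate on the publish step.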
aaronsteers
03/29/2023, 3:31 PM

huiming
03/29/2023, 11:53 PM
> and the only solutions that readily come to mind are ones that are implemented (primarily) in the dbt layer (as macros or sql transforms)
Update: @adrian_soltesz from our team built a simple working PoC yesterday using dbt macros. We can share it with the community once we have a production-ready version.
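[Editor's note: the PoC itself is not shown in the thread. As a rough approximation of what such a macro does - stamping published tables with an updated_as_of watermark so consumers can verify freshness - here is a hypothetical sketch using SQLite in place of Postgres; table and column names are illustrative only.]

```python
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")
conn.execute("create table finance_users (id integer, name text)")
conn.executemany("insert into finance_users values (?, ?)",
                 [(1, "ann"), (2, "bob")])

# After the publish step completes, stamp every row with the publish time
# so downstream consumers can check freshness before reading.
stamp = datetime.now(timezone.utc).isoformat()
conn.execute("alter table finance_users add column updated_as_of text")
conn.execute("update finance_users set updated_as_of = ?", (stamp,))

stamped = conn.execute(
    "select count(*) from finance_users where updated_as_of is not null"
).fetchone()[0]
print(stamped)  # 2: every published row carries the watermark
```

In a dbt-layer implementation the stamping would typically live in a macro or post-hook rather than application code, but the consumer-side contract is the same: read only when updated_as_of satisfies the freshness policy.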