nicholas_van_kuren (09/16/2022, 9:23 PM):
--full-refresh flag, but on each run all of the records from my source are loaded again, so I end up with duplicates. I do not have reliable replication IDs as described in the Singer documentation: there is no modified date or incremental ID in the source tables to track the latest record for upserts.

nicholas_van_kuren (09/16/2022, 10:38 PM):
_sdc_extracted_at timestamp from the loaded table, but I am concerned about tables ballooning in size after multiple runs (if records are just going to continue to be appended). Is a separate scripting solution needed that deletes records on some scheduled basis?
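[Editor's sketch] One way to use that timestamp without a separate cleanup script is to deduplicate in a dbt staging model, keeping only the newest extraction per key. A minimal sketch, assuming a natural-key column (user_id here is hypothetical) and the standard Singer _sdc_extracted_at metadata column:

```sql
-- models/staging/stg_users.sql (hypothetical model and column names)
-- Keep only the most recently extracted copy of each record.
with ranked as (
    select
        *,
        row_number() over (
            partition by user_id             -- assumed natural key
            order by _sdc_extracted_at desc  -- Singer metadata timestamp
        ) as rn
    from {{ source('tap_azuread', 'users') }}
)

select * from ranked where rn = 1
```

This keeps downstream models clean, though the raw table itself still grows until something prunes it.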
edgar_ramirez_mondragon (09/16/2022, 10:51 PM):

visch (09/18/2022, 3:18 AM):

nicholas_van_kuren (09/19/2022, 2:55 PM):
source tables? My issue is that I have some source tables that are >600K records, and I must do a full load on these as there is no replication key that works. This means that if I just let the data duplicate in the table and take the latest by date, the table will quickly grow to millions of unneeded rows. I am looking at drop_schema and other dbt commands, but it also feels pretty risky to just delete my source without knowing the data load will be successful, especially since the drop would need to be cascading and would remove any dbt-generated tables as well. This means that if there was an issue upstream and data was not ready to load, I would not be able to build my models that day. I suppose you could work around this by adding checks to make sure records exist. Any thoughts on these issues? Is this what you are doing?
visch (09/19/2022, 2:57 PM):
select * from source, that way the order looks like:
1. Drop the source table (everything in the DW keeps working)
2. meltano run tap-name target-name loads data into the source table (if there's an error, our pipeline stops here)

visch (09/19/2022, 2:58 PM):

visch (09/19/2022, 2:58 PM):
dbt:test before dbt:run, with whatever makes you feel better in dbt:test
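[Editor's sketch] That check can be a singular dbt test that fails when the freshly loaded source is empty, so a dbt:test step run before dbt:run halts the pipeline before any models are rebuilt. A sketch (the file name is illustrative; the source reference matches the tap_azuread source mentioned later in the thread):

```sql
-- tests/assert_users_source_not_empty.sql (hypothetical singular test)
-- dbt treats any returned row as a test failure, so this fails
-- exactly when the source table has no rows (e.g. the load never ran).
select 'users source is empty' as failure
where not exists (
    select 1
    from {{ source('tap_azuread', 'users') }}
)
```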
nicholas_van_kuren (09/19/2022, 3:03 PM):
select * from source: won't that get dropped when you run #1? I am using Postgres and assuming I need to do a drop CASCADE to actually be able to drop the source tables. At least I have seen messages like this in development work when dropping/replacing sources that are being referenced by dbt models.

nicholas_van_kuren (09/19/2022, 3:03 PM):

visch (09/19/2022, 3:03 PM):

visch (09/19/2022, 3:05 PM):

nicholas_van_kuren (09/19/2022, 3:05 PM):

visch (09/19/2022, 3:06 PM):

nicholas_van_kuren (09/19/2022, 3:06 PM):
visch (09/19/2022, 3:07 PM):
{%- macro autoidm_dropy() -%}
  {%- set drop_query -%}
    DROP TABLE IF EXISTS {{ source('tap_azuread', 'users') }} CASCADE;
  {%- endset -%}
  {% do run_query(drop_query) %}
{%- endmacro -%}
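[Editor's note] For anyone following along, a macro like this can also be invoked ad hoc with dbt's standard run-operation command, outside of any Meltano wiring:

```shell
# Runs the macro directly against the warehouse via the dbt CLI.
dbt run-operation autoidm_dropy
```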
nicholas_van_kuren (09/19/2022, 3:07 PM):
visch (09/19/2022, 3:09 PM):
transformers:
- name: dbt
  namespace: dbt_postgres
  pip_url: dbt-postgres==1.0.4
  commands:
    pre_python:
      args: run -m +stg_azuread +stg_bamboohr
    post_python:
      args: run
    test:
      args: test
      description: Runs tests on data in deployed models.
    description: Serve docs
    autoidm_dropy: run-operation autoidm_dropy
I also do this in meltano.yml, so I can use meltano run to run everything and make a full pipeline.
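[Editor's sketch] With commands defined like that, the whole pipeline can be chained in one invocation. A sketch, assuming the extractor and loader are named tap-azuread and target-postgres (those plugin names are guesses; the dbt commands come from the config above):

```shell
# Hypothetical end-to-end pipeline: drop raw tables, reload, test, build.
# If the tap/target step fails, the dbt steps after it never run,
# so existing models in the warehouse are left intact.
meltano run dbt:autoidm_dropy tap-azuread target-postgres dbt:test dbt:run
```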
visch (09/19/2022, 3:11 PM):

nicholas_van_kuren (09/20/2022, 2:39 AM):

visch (09/20/2022, 11:42 AM):