# plugins-general
n
Hello, I am using the transferwise target-postgres plugin and trying to understand how I can truncate tables in my target postgres database prior to loading data on each run. I have the tap set up for full-table replication and have also tried using the --full-refresh flag, but on each run all of the records from my source are loaded again, so I end up with duplicates. I do not have reliable replication keys as described in the Singer documentation; there is no modified date or incremental ID in the source tables to track the latest record for upserts.
If truncating is not built-in / easy to accomplish, what is the preferred strategy for managing table size over time? I could see working around the duplicate issue by grouping on a unique key and selecting the latest _sdc_extracted_at timestamp from the loaded table, but I am concerned about tables ballooning in size after multiple runs (if records just continue to be appended). Is a separate scripting solution needed that deletes records on some scheduled basis?
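For context, the dedup workaround I have in mind is roughly this (assuming a users table with an id unique key; names are just placeholders):

    -- keep only the most recently extracted copy of each record
    select *
    from (
        select
            *,
            row_number() over (
                partition by id
                order by _sdc_extracted_at desc
            ) as row_num
        from users
    ) ranked
    where row_num = 1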
e
Hi @nicholas_degiacomo. The Singer community came up with ACTIVATE_VERSION messages to help with that use case, though not every target has adopted them, and that seems to be the case for transferwise target-postgres. We're working on https://github.com/MeltanoLabs/target-postgres (still experimental), based on the Singer SDK, which will support the activate_version spec.
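For reference, an ACTIVATE_VERSION message is just another line in the Singer stream: the tap stamps its RECORD messages with a version and then emits something like the following at the end of the sync, which tells the target it can clear out rows belonging to older versions (the stream name and version value here are only illustrative):

    {"type": "ACTIVATE_VERSION", "stream": "users", "version": 1656000000000}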
v
For full table stuff (most of what I do) I happen to also use dbt, so I just drop the tables before the run: meltano run dbt:droptables tapname targetname dbt:run
n
@edgar_ramirez_mondragon - Thank you for this suggestion. I am looking around and not having any luck finding an example of how to implement this in a Meltano project on both the tap and target side. It seems like I may have to wait anyway, given that the existing postgres target does not support this functionality?
@visch - Thanks Derek! Are you saying you use dbt to drop source tables? My issue is that I have some source tables that are >600K records and I must do a full load on these, as there is no replication key that works. This means that if I just let the data duplicate in this table and take the latest by date, the table will quickly grow to millions of unneeded rows. I am looking at drop_schema and other dbt commands, but it also feels pretty risky to just delete my source without knowing the data load will be successful, especially since it would need to be cascading and would remove any dbt-generated tables as well. This means that if there was an issue upstream and data was not ready to load, I would not be able to build my models that day. I suppose you could work around this by adding checks to make sure records exist. Any thoughts on these issues? Is this what you are doing?
v
@nicholas_van_kuren Yeah, to deal with that I have a staging model that's materialized as a table, which is pretty much just a select * from source. That way the order looks like:
1. Drop the source table (everything in the DW keeps working)
2. meltano run tap-name target-name loads data into the source table (if there's an error, our pipeline stops here)
3. dbt runs (I include this in step 2)
If you're more worried, run dbt:test before dbt:run with whatever checks make you feel better in dbt:test
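To be concrete, the staging model is about as small as it gets, something like this (the model and source names are just examples):

    -- models/staging/stg_azuread_users.sql
    {{ config(materialized='table') }}

    select * from {{ source('tap_azuread', 'users') }}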
n
I guess what I am wondering is, for that staging model (the select * from source), won't that get dropped when you run #1? I am using postgres and assuming I need to do a DROP ... CASCADE to actually be able to drop the source tables. At least I have seen messages like this in development work when dropping/replacing sources that are referenced by dbt models.
Maybe the materialized option removes that issue?
v
A model can be materialized in a number of ways. Views cascade through each other; a table stands alone. Just materialize it as a table 🙂
I'm assuming a lot of things here. CASCADE can technically delete tables when you have FKs set with other tables in Postgres, so tread carefully, but with dbt we normally don't set constraints between tables, so you're probably good.
n
Yeah, I must have had things set to views when I experienced that issue before. Thanks for your help. Would you mind sharing your code for dropping the table or schema?
v
lol prepare yourself, it's super complicated (not 🙂)
n
😄, I am sure it's simple. I am just having issues getting it working and wanted another reference. I am getting ROLLBACK in the log, so something doesn't seem to be working.
v
Copy code
{%- macro autoidm_dropy() -%}
    {%- set drop_query -%}
        DROP TABLE IF EXISTS {{ source('tap_azuread', 'users') }} CASCADE;
    {%- endset -%}
    {% do run_query(drop_query) %}
{%- endmacro -%}
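If you want to try it outside of meltano first, dbt run-operation autoidm_dropy from the dbt project directory should run it directly.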
n
Nice, thanks. Will try that out.
v
Copy code
transformers:
  - name: dbt
    namespace: dbt_postgres
    pip_url: dbt-postgres==1.0.4
    commands:
      pre_python:
        args: run -m +stg_azuread +stg_bamboohr
      post_python:
        args: run
      test:
        args: test
        description: Runs tests on data in deployed models.
      autoidm_dropy: run-operation autoidm_dropy
I also do this in meltano.yml so I can use meltano run to run everything and make a full pipeline.
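So an end-to-end invocation looks something like this (the tap/target names are placeholders for whatever you're running, and pick whichever dbt command fits your models):

    meltano run dbt:autoidm_dropy tap-azuread target-postgres dbt:post_python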
I'd probably remove CASCADE for your use case.
n
Just want to send a big thanks to you @visch! Really appreciate the time you took to answer my questions. It's really helped me move past some of the issues I was facing, and I'm feeling much better about where I am with my project. Thank you!
v
@nicholas_van_kuren no problem! Great questions, welcome to Meltano :D