# plugins-general
n
Hello, I am using the transferwise target-postgres plugin and trying to understand how I can truncate tables in my target postgres database prior to loading data on each run. I have the tap set up for full-table replication and have also tried using the --full-refresh flag, but on each run all of the records from my source are loaded again, so I end up with duplicates. I do not have reliable replication keys as described in the Singer documentation; there is no modified date or incremental ID in the source tables to track the latest record for upserts.
If truncating is not built-in / easy to accomplish, what is the preferred strategy for managing table size over time? I could see working around the duplicate issue by grouping on a unique key and selecting the latest _sdc_extracted_at timestamp from the loaded table, but I am concerned about tables ballooning in size after multiple runs (if records just continue to be appended). Is a separate scripting solution needed that deletes records on some scheduled basis?
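For context, the dedup workaround I have in mind is roughly this (assuming a users table with an id unique key; names are just placeholders):

    -- keep only the most recently extracted copy of each record
    select *
    from (
        select
            *,
            row_number() over (
                partition by id
                order by _sdc_extracted_at desc
            ) as row_num
        from users
    ) ranked
    where row_num = 1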
e
Hi @nicholas_degiacomo. The Singer community came up with ACTIVATE_VERSION messages to help with that use case, though not every target has adopted them, and that seems to be the case for transferwise target-postgres. We're working on https://github.com/MeltanoLabs/target-postgres (still experimental), based on the Singer SDK, which will support the activate_version spec.
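For reference, an ACTIVATE_VERSION message is just another line in the Singer stream: the tap stamps its RECORD messages with a version and then emits something like the following at the end of the sync, which tells the target it can clear out rows belonging to older versions (the stream name and version value here are only illustrative):

    {"type": "ACTIVATE_VERSION", "stream": "users", "version": 1656000000000}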
v
For full table stuff (most of what I do) I happen to also use dbt, so I just drop the tables before the run: meltano run dbt:droptables tapname targetname dbt:run
n
@edgar_ramirez_mondragon - Thank you for this suggestion. I am looking around and not having any luck finding an example of how to implement this in a Meltano project on both the tap and target side. It seems like I may have to wait anyway, given that the existing postgres target does not support this functionality?
@visch - Thanks Derek! Are you saying you use dbt to drop source tables? My issue is that I have some source tables that are >600K records and I must do a full load on these, as there is no replication key that works. This means that if I just let the data duplicate in this table and take the latest by date, the table will quickly grow to millions of unneeded rows. I am looking at drop_schema and other dbt commands, but it also feels pretty risky to just delete my source without knowing the data load will be successful, especially since it would need to be cascading and would remove any dbt-generated tables as well. This means that if there was an issue upstream and data was not ready to load, I would not be able to build my models that day. I suppose you could work around this by adding checks to make sure records exist. Any thoughts on these issues? Is this what you are doing?
v
@nicholas_van_kuren Yeah, to deal with that I have a staging model that's materialized as a table, which is pretty much just a select * from source. That way the order looks like:
1. Drop the source table (everything in the DW keeps working)
2. meltano run tap-name target-name loads data into the source table (if there's an error, our pipeline stops here)
3. dbt runs (I include this in step 2)
If you're more worried, run dbt:test before dbt:run with whatever checks make you feel better in dbt:test
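To be concrete, the staging model is about as small as it gets, something like this (the model and source names are just examples):

    -- models/staging/stg_azuread_users.sql
    {{ config(materialized='table') }}

    select * from {{ source('tap_azuread', 'users') }}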
n
I guess what I am wondering is, for that staging model (the select * from source), won't that get dropped when you run #1? I am using postgres and assuming I need to do a DROP ... CASCADE to actually be able to drop the source tables. At least I have seen messages like this in development work when dropping/replacing sources that are referenced by dbt models.
Maybe the materialized option removes that issue?
v
A model can be materialized in a number of ways. Views cascade through each other; a table stands alone. Just materialize it as a table 🙂
I'm assuming a lot of things here. CASCADE can technically delete tables when you have FKs set with other tables in Postgres, so tread carefully, but with dbt we normally don't set constraints between tables, so you're probably good.
n
Yeah, I must have had things set to views when I experienced that issue before. Thanks for your help. Would you mind sharing your code for dropping the table or schema?
v
lol prepare yourself, it's super complicated (not 🙂)
n
😄, I am sure it's simple. I am just having issues getting it working and wanted another reference. I am getting ROLLBACK in the log, so something doesn't seem to be working.
v
Copy code
{%- macro autoidm_dropy() -%}
    {%- set drop_query -%}
        DROP TABLE IF EXISTS {{ source('tap_azuread', 'users') }} CASCADE;
    {%- endset -%}
    {% do run_query(drop_query) %}
{%- endmacro -%}
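If you want to try it outside of meltano first, dbt run-operation autoidm_dropy from the dbt project directory should run it directly.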
n
Nice, thanks. Will try that out.
v
Copy code
transformers:
  - name: dbt
    namespace: dbt_postgres
    pip_url: dbt-postgres==1.0.4
    commands:
      pre_python:
        args: run -m +stg_azuread +stg_bamboohr
      post_python:
        args: run
      test:
        args: test
        description: Runs tests on data in deployed models.
      autoidm_dropy: run-operation autoidm_dropy
I also do this in meltano.yml so I can use meltano run to run everything and make a full pipeline.
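So an end-to-end invocation looks something like this (the tap/target names are placeholders for whatever you're running, and pick whichever dbt command fits your models):

    meltano run dbt:autoidm_dropy tap-azuread target-postgres dbt:post_python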
I'd probably remove CASCADE for your use case.
n
Just want to send a big thanks to you @visch! Really appreciate the time you took to answer my questions. It's really helped me move past some of the issues I was facing, and I'm feeling much better about where I am with my project. Thank you!
v
@nicholas_van_kuren no problem! Great questions, welcome to Meltano :D