# troubleshooting
f
Hi, I have a quick question about an issue that I'm running into. I'm using meltano to load and transform (with dbt) data from an external database into a local postgres database. This runs on a schedule with the integrated airflow scheduler. Then I have an independent program that uses the data synced into this db, and that also runs on a schedule. Sometimes those schedules overlap, and the independent program seems to see an empty table while dbt is running. Is there a way to prevent this, i.e. have dbt replace all data in the target table at once, in a transaction or similar, without using hand-rolled locks? Or am I misunderstanding something here?
v
Hmm, what would cause the table to become empty? Is it during the meltano run?
f
Yes, that's what it looks like.
v
Could you share your meltano.yml? And the command you're using to run it?
p
Have you been able to isolate whether the table that's empty is a raw table written by a meltano loader or a table generated by dbt? If it's generated by dbt, is it a table or a view? I have a few ideas depending on the situation.
But also +1 to Derek's ask to see the meltano.yml config. For me, I'm curious which loader variant you're using.
f
Hi, here's my (redacted) config:
```yaml
version: 1
default_environment: dev
project_id: REDACTED
environments:
- name: dev
  config:
    plugins:
      extractors:
      - name: tap-REDACTED_TAP_1
        config:
          host: REDACTED
          port: REDACTED
          user: REDACTED
          dbname: REDACTED
          ssl: false
        metadata:
          "*":
            replication-method: FULL_TABLE
        select:
        - REDACTED.*
        - REDACTED.*
      - name: tap-REDACTED_TAP_2
        config:
          host: REDACTED
          user: REDACTED
          database: REDACTED
          ssl: true
        metadata:
          "*":
            replication-method: FULL_TABLE
        select:
        - REDACTED.*
        - REDACTED.*
        - REDACTED.*
        - REDACTED.*
        - REDACTED.*
        - REDACTED.*
        - REDACTED.*
        - REDACTED.*
      - name: tap-REDACTED_TAP_3
        config:
          host: REDACTED
          port: REDACTED
          user: REDACTED
          database: REDACTED
          ssl: true
        metadata:
          "*":
            replication-method: FULL_TABLE
        select:
        - REDACTED.*
        - REDACTED.*
        - REDACTED.*
        - REDACTED.*
      loaders:
      - name: target-postgres
        config:
          host: REDACTED
          port: REDACTED
          user: REDACTED
          dbname: REDACTED
          ssl: false
          add_metadata_columns: true
          hard_delete: true
      transformers:
      - name: dbt-postgres
        config:
          host: REDACTED
          user: REDACTED
          port: REDACTED
          dbname: REDACTED
          schema: REDACTED
          ssl: false
- name: prod
  config:
    plugins:
      extractors:
      - name: tap-REDACTED_TAP_1
        config:
          host: REDACTED
          port: REDACTED
          user: REDACTED
          dbname: REDACTED
          ssl: false
        metadata:
          "*":
            replication-method: FULL_TABLE
        select:
        - REDACTED.*
        - REDACTED.*
      - name: tap-REDACTED_TAP_2
        config:
          host: REDACTED
          user: REDACTED
          database: REDACTED
          ssl: true
        metadata:
          "*":
            replication-method: FULL_TABLE
        select:
        - REDACTED.*
        - REDACTED.*
        - REDACTED.*
        - REDACTED.*
      - name: tap-REDACTED_TAP_3
        config:
          host: REDACTED
          user: REDACTED
          database: REDACTED
          ssl: true
        metadata:
          "*":
            replication-method: FULL_TABLE
        select:
        - REDACTED.*
        - REDACTED.*
        - REDACTED.*
        - REDACTED.*
      loaders:
      - name: target-postgres
        config:
          host: REDACTED
          port: REDACTED
          user: REDACTED
          dbname: REDACTED
          ssl: false
          add_metadata_columns: true
          hard_delete: true
      transformers:
      - name: dbt-postgres
        config:
          host: REDACTED
          user: REDACTED
          port: REDACTED
          dbname: REDACTED
          schema: REDACTED
          ssl: false
plugins:
  extractors:
  - name: tap-REDACTED_TAP_1
    inherit_from: tap-postgres
    variant: transferwise
    pip_url: pipelinewise-tap-postgres
  - name: tap-REDACTED_TAP_2
    inherit_from: tap-mysql
    variant: transferwise
    pip_url: pipelinewise-tap-mysql
  - …
```
Also, I'm currently using the default variants for everything in meltano.
v
Based on:
> the table that I'm seeing the issue with is a result of dbt, so I think it is created by dbt
It's likely not the tap/target that's the issue. From your initial description it sounded like it could be an activate-version thing from the default target-postgres.

The actual error from dbt would probably have cleared up most of my confusion above, just FYI for next time! https://stackoverflow.com/questions/70383118/using-dbt-and-meltano-how-can-i-prevent-multiple-dbt-job-runs-from-conflicting is one possible solution: basically, don't allow a dbt job to run at the same time as another job. Locks on the DB can help you here.
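For reference, that Stack Overflow approach boils down to a Postgres advisory lock shared between the two jobs. A minimal sketch, assuming you run dbt via the `dbt-postgres` transformer against the same database your independent program reads from; the lock key `724528` is an arbitrary illustrative value that both sides must agree on, not anything from this thread:
```yaml
# dbt_project.yml (sketch): serialize dbt runs against other readers.
# Advisory locks are session-scoped, so even if the unlock hook happens to run
# on a different connection, the lock is released when dbt's session closes.
on-run-start:
  - "select pg_advisory_lock(724528)"
on-run-end:
  - "select pg_advisory_unlock(724528)"
```
Your independent program would then wrap its reads in the same key (`SELECT pg_advisory_lock(724528)` before reading, `SELECT pg_advisory_unlock(724528)` after), so it blocks until dbt finishes instead of seeing a half-built table.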
f
Thanks for your reply. I found the linked Stack Overflow post, but that uses a hand-rolled lock on the entire DB, hence my question whether this is possible without that; since I couldn't find any other solutions and it seems like quite a simple thing, I figured I was doing something wrong. Maybe I should ask that question in the dbt community?

Also, while I'm here, another question: can you tell from my meltano.yml why the deletion of records in the target db does not work? I'm using `FULL_TABLE`, `add_metadata_columns` as well as `hard_delete`, but when I delete a record in the source (tap) db and re-run meltano, those records are neither marked as deleted (`_sdc_deleted_at`) nor hard-deleted; they just stay there until I delete them manually or drop the table before running meltano. I tried finding existing posts about this issue but couldn't find anything, so I guess I'm doing something wrong here?

And sorry for so many questions; I'm quite new to data science, and given that meltano combines a lot of different tools and techniques it can get a bit confusing 😄
v
> but that uses a hand-rolled lock on the entire DB, hence my question whether this is possible without that; maybe I should ask that question in the dbt community?

What do you think a "hand-rolled" lock means / does that's negative here?
> it seems like quite a simple thing, I figured I was doing something wrong

It's not simple, just like this isn't a quick question 😄. dbt is really designed to be run all together a few times a day.
> Also, while I'm here, another question: can you tell from my meltano.yml why the deletion of records in the target db does not work?

Assuming you're talking about `REDACTED_JOB_1`, which refers to `tap-REDACTED_TAP_3`, which is `pipelinewise-tap-mysql`. I think what you want here is for the target table to match the source table 1:1. For records to be deleted properly using purely the tap/target combo, we recommend using `activate_version`. Note that another simple solution for full-table syncs is to just drop the target table before you load your new tap data, but that requires you to be OK with the target table being in a broken state for a period of time.

So I'm going to look at the question: "why doesn't `pipelinewise-tap-mysql` with `pipelinewise-target-postgres` properly use activate version to allow my source/target tables to match?" I believe `pipelinewise-tap-mysql` supports activate version, and I also believe `pipelinewise-target-postgres` supports it, so it "should" be working given that `hard_delete` is set to true.

What I think is that you should debug this by running something like `meltano invoke tap-REDACTED_TAP_3 > out` and looking at the output file for some activate-version messages. If they are there, cool! Then run your target with that data (`cat out | meltano invoke target-postgres`), and then it's a matter of debugging why it's not happening properly. You should open a separate thread if you're not sure whether it's a tap/target issue at this point.
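To make that debugging loop concrete, here's a sketch of the shell session; the tap name follows the redacted config above, and the grep pattern assumes the standard Singer JSON message format that pipelinewise taps emit:
```sh
# Dump the tap's Singer output to a file, with no target involved yet.
meltano invoke tap-REDACTED_TAP_3 > out

# Count activate-version messages; a non-zero count means the tap emits them.
grep -c '"type": "ACTIVATE_VERSION"' out

# Replay the same stream into the loader and watch its logs to see whether
# the new version is activated and stale rows actually get removed.
cat out | meltano invoke target-postgres
```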
f
> What do you think a "hand-rolled" lock means / does that's negative here?

Sorry if that came across as negative, I didn't mean it that way. What I meant by "hand-rolled" is doing it via a hook and macros and then writing my own locking logic, rather than using dbt or meltano "native" functionality. But I will look into this 🙂

About `ACTIVATE_VERSION`: yes, those messages are indeed present in the tap output. I will continue debugging this and open a separate issue if needed. Thanks a lot for your help, that cleared things up for me!
v
> Sorry if that came across as negative, I didn't mean it that way. What I meant by "hand-rolled" is doing it via a hook and macros and then writing my own locking logic, rather than using dbt or meltano "native" functionality. But I will look into this 🙂

Totally OK, no offense taken at all; I was just curious whether there was a technical reason you didn't like it, and that reason makes sense to me. Good luck with activate version! It "should" work 😄
f
Thanks! Yes, the "technical" reason is that I'm lazy and didn't want to maintain my own locking logic 😜
v
If you're being lazy, I recommend not running dbt constantly; just run it once or twice a day and call it good!
f
Yeah, the problem is that the resulting data is synced into business analytics tools where we need relatively up-to-date data, so I guess I'll have to work around the issues (I'm aware that I could completely forgo dbt and do it differently, but I quite like dbt by now...). The cleanest solution would probably be to incorporate my "after-dbt" processing into the Airflow jobs, so that I have a single schedule instead of two independent schedules that clash.
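For what it's worth, one way to get that single schedule is to chain the EL step, dbt, and the post-processing into one Meltano job that the bundled Airflow scheduler triggers as a unit. A rough sketch, assuming Meltano's `jobs`/`schedules` syntax; `post-processing` is a hypothetical utility plugin wrapping the independent program, and the names and interval are illustrative:
```yaml
# meltano.yml (sketch): one pipeline, one schedule, no clashing runs.
jobs:
- name: sync-transform-publish
  tasks:
  - tap-REDACTED_TAP_1 target-postgres  # EL step
  - dbt-postgres:run                    # transform
  - post-processing:run                 # hypothetical utility plugin command
schedules:
- name: sync-transform-publish-hourly
  interval: '@hourly'
  job: sync-transform-publish
```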