# troubleshooting
kanstantin_karaliov:
Hello! I set up an incremental migration pipeline from Postgres to Snowflake for one huge table. There was so much data that after a few weeks I decided to stop that process and dump the data manually. Once that was done, I wanted to turn the pipeline back on, but when I started it I noticed that it kept loading data from the point where it had stopped. I tried to find a way to reset the pipeline's cached state so it would pick up the actual state of the destination table. I ran the `meltano elt` process in the terminal instead of the UI, providing an extra state parameter, hoping that after it finished it would automatically update the replication key value. It did not. I also tried providing an extra state directly to that pipeline in the `.meltano.yml` file, but it still continued from its own point. Then I found out that Meltano stores job metadata in `.meltano/meltano.db`, so I updated the last job's payload state with the new replication key value, and then it worked. My questions:
1. Is there a more correct way of resetting a pipeline's cached state values? Disregarding the inconvenience of this method, it leaves an incorrect pipeline log behind.
2. Is it OK that pipelines don't check the last replication key value in the destination table? It seems they don't even reset the process cache within Meltano's own processes. What if I set the incremental process as my basic update method and do a full table upload once in a while? How would those processes synchronize with each other? And what if one of those upload processes is not served by Meltano at all?
t:
@kanstantin_karaliov I was dealing with something similar yesterday. I used the `--dump` switch to `meltano elt` to get the state for the job, tweaked it, then ran `meltano elt` with the `--state` switch to use the new state. Not pretty, but it works.
There's a `meltano state` command in the works that will make this a bit easier too, I think... see https://gitlab.com/meltano/meltano/-/issues/2754
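As a concrete sketch of that workflow: the tap, target, and job id below are placeholders, and flag spellings may vary by Meltano version, so check `meltano elt --help` on yours.

```sh
# 1. Dump the stored state for the job to a file
meltano elt tap-postgres target-snowflake \
  --job_id=postgres-to-snowflake --dump=state > state.json

# 2. Edit state.json by hand (e.g. bump replication_key_value)

# 3. Re-run the job, feeding the tweaked state back in
meltano elt tap-postgres target-snowflake \
  --job_id=postgres-to-snowflake --state=state.json
```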
kanstantin_karaliov:
I checked `meltano elt` with the `--state` parameter and it started the job I needed, but that run did not affect the existing pipeline: the pipeline would still start copying data from the same point after that job finished. What really helps is running that specific pipeline with the command `meltano schedule run <pipeline_name> --state <state_file>`. This instantly runs a job in the selected pipeline with a pre-defined state, and future runs of that pipeline refer to the latest job, which contains the updated state, so they continue from the new point.
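A minimal sketch of that last approach, for anyone landing here later. The pipeline name, stream name, and replication key are placeholders, and the state file layout assumes the usual Singer bookmark format:

```sh
# state.json -- hand-written Singer state pointing at the manually dumped position
cat > state.json <<'JSON'
{
  "bookmarks": {
    "public-huge_table": {
      "replication_key": "updated_at",
      "replication_key_value": "2021-06-01T00:00:00+00:00"
    }
  }
}
JSON

# Run the scheduled pipeline once with the pre-defined state; later scheduled
# runs pick up from the state this run leaves behind.
meltano schedule run postgres-to-snowflake --state state.json
```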