# troubleshooting
kanstantin_karaliov:
Hello! I set up an incremental migration pipeline from Postgres to Snowflake for one huge table. There was so much data that after a few weeks I decided to stop that process and dump the data manually. Once that was done, I wanted to turn the pipeline back on, but when I started it I noticed that it kept loading data from the point where it had stopped. I tried to find a way to reset the pipeline's cached state so it would pick up the actual state of the destination table. I ran the `meltano elt` process in the terminal instead of the UI, providing an extra state parameter, hoping that after it finished it would automatically update the replication key value. It did not. I also tried providing an extra state directly to that pipeline in the `.meltano.yml` file, but it still continued from its own point. Then I found out that Meltano stores job metadata in `.meltano/meltano.db`, so I updated the last job's payload state with the new replication key value, and then it worked. My questions:
1. Is there a more correct way of resetting a pipeline's cached state values? Disregarding the inconvenience of this method, it leaves an incorrect pipeline log behind.
2. Is it OK that pipelines don't check the last replication key value in the destination table? It seems they don't even reset the process cache within Meltano's own processes. What if I set the incremental process as my basic update method and do a full table upload once in a while? How would those processes synchronize with each other? And what if one of those upload processes is not served by Meltano at all?
t:
@kanstantin_karaliov I was dealing with something similar yesterday. I used the `--dump` switch to `meltano elt` to get the state for the job, tweaked it, then ran `meltano elt` with the `--state` switch to use the new state. Not pretty, but it works.
There's a `meltano state` command in the works that will make this a bit easier too, I think... see https://gitlab.com/meltano/meltano/-/issues/2754
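As a concrete sketch of that workflow: the tap, target, and job id below are placeholders, and flag spellings may vary by Meltano version, so check `meltano elt --help` on yours.

```sh
# 1. Dump the stored state for the job to a file
meltano elt tap-postgres target-snowflake \
  --job_id=postgres-to-snowflake --dump=state > state.json

# 2. Edit state.json by hand (e.g. bump replication_key_value)

# 3. Re-run the job, feeding the tweaked state back in
meltano elt tap-postgres target-snowflake \
  --job_id=postgres-to-snowflake --state=state.json
```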
kanstantin_karaliov:
I checked `meltano elt` with the `--state` parameter and it started the job I needed, but that run did not affect the existing pipeline: the pipeline would still start copying data from the same point after that job finished. What really helps is running that specific pipeline with the command `meltano schedule run <pipeline_name> --state <state_file>`. This instantly runs a job in the selected pipeline with a pre-defined state, and future runs of that pipeline refer to the latest job, which contains the updated state, so they continue from the new point.
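A minimal sketch of that last approach, for anyone landing here later. The pipeline name, stream name, and replication key are placeholders, and the state file layout assumes the usual Singer bookmark format:

```sh
# state.json -- hand-written Singer state pointing at the manually dumped position
cat > state.json <<'JSON'
{
  "bookmarks": {
    "public-huge_table": {
      "replication_key": "updated_at",
      "replication_key_value": "2021-06-01T00:00:00+00:00"
    }
  }
}
JSON

# Run the scheduled pipeline once with the pre-defined state; later scheduled
# runs pick up from the state this run leaves behind.
meltano schedule run postgres-to-snowflake --state state.json
```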