# troubleshooting
aditya_penmesta
Hello, I am trying to run an initial historical sync on one of my larger Postgres tables (~50 GB), but it keeps failing. Is there any way to speed up the load for a table this large, or a way for the sync to pick up where it left off after a failure so it can eventually finish?
thomas_briggs
We've always had to do initial loads of large tables manually. 😕 Basically: export the data using the database's tools, import it into the target using the target's tools, then set Meltano's state to pick up changes going forward. I don't know how to get state from a Postgres database, though, unfortunately. @alexander_rougellis asked the same question not long ago; maybe he has some advice?
I guess I assumed you were using log-based replication, but if not, that would make things simpler...
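A minimal sketch of that export step, assuming psycopg2 and a plain CSV dump (pg_dump or psql's \copy work just as well); the connection string, table name, and output path are placeholders:

```python
# Sketch: export a large Postgres table with COPY and record the current WAL
# position so log-based replication can later be seeded from this point.
# Assumes psycopg2; DSN, table, and output path are placeholders.
import psycopg2

PG_DSN = "postgresql://user:pass@pg-host:5432/mydb"  # placeholder
TABLE = "public.big_table"                           # placeholder
OUT_PATH = "/tmp/big_table.csv"

conn = psycopg2.connect(PG_DSN)
with conn, conn.cursor() as cur:
    # Capture the WAL LSN first; replaying from here may re-send a few rows
    # that are also in the dump, which upsert-style loading tolerates.
    cur.execute("SELECT pg_current_wal_lsn()")
    lsn = cur.fetchone()[0]
    print(f"WAL LSN at export time: {lsn}")

    # COPY streams the whole table out as CSV, far faster than row-by-row reads.
    with open(OUT_PATH, "w") as f:
        cur.copy_expert(
            f"COPY {TABLE} TO STDOUT WITH (FORMAT csv, HEADER true)", f
        )
conn.close()
```

How that LSN gets turned into Meltano/tap state varies by tap and version, so it's worth checking what a successful run actually stores before writing it by hand.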
aditya_penmesta
Interesting
Do you know what the best way would be to manually load a Postgres DB into Snowflake, as you suggested? Looking online, it seems like a pretty complicated procedure.
thomas_briggs
I've only done it with a MySQL source, unfortunately. I expect the procedure is roughly the same, but I don't know how to get the Postgres equivalent of MySQL's "binlog position".
See this thread for some background.
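For what it's worth, the Postgres counterpart of MySQL's binlog position is the WAL LSN (pg_current_wal_lsn(), captured in the export sketch above), and the Snowflake half of a manual load usually comes down to staging the CSV and running COPY INTO. A rough sketch assuming the snowflake-connector-python package; account details, warehouse, and table are placeholders:

```python
# Sketch: bulk-load the exported CSV into Snowflake via the table's internal stage.
# Assumes snowflake-connector-python; all credentials and objects are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account",     # placeholder credentials
    user="loader",
    password="********",
    warehouse="LOAD_WH",
    database="ANALYTICS",
    schema="RAW",
)
cur = conn.cursor()

# Upload the local file to the table's internal stage (compressed on the way up).
cur.execute("PUT file:///tmp/big_table.csv @%BIG_TABLE AUTO_COMPRESS=TRUE")

# Bulk load from the stage; much faster than INSERT statements.
cur.execute("""
    COPY INTO BIG_TABLE
    FROM @%BIG_TABLE
    FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1 FIELD_OPTIONALLY_ENCLOSED_BY = '"')
""")

cur.close()
conn.close()
```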
alexander_rougellis
Hi @aditya_penmesta. The biggest table I've migrated was ~25 GB, and it did take a while. In my experience with Meltano, the only way I was able to get the state was through a successful initial full-table migration. Any time an initial full-table dump failed, I had to restart that pipeline (wipe the state file and start over), because continuing off a failed initial dump would throw off my WAL pointer and render the state almost useless (it would go back to rows that had already been migrated and migrate them again). What errors pop up when you get failed runs? Also, when I ran the initial full-table dump on my large table, I relied on my unsteady internet connection, which made it likely that the initial run would fail (over the long time the migration took, the chance of internet instability, and therefore a failed migration, was high). I now have my pipeline set up in a tmux session on an AWS instance to combat that problem, which gives me a much better chance of a successful run.
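For the wipe-and-restart step, a minimal sketch using Meltano's state commands (assuming Meltano 2.x, where state lives in the system database; the state ID below is a placeholder, and `meltano state list` shows the real ones):

```python
# Sketch: reset a pipeline's stored state so the next run re-extracts everything.
# Assumes Meltano 2.x with the `meltano` CLI on PATH; the state ID is a placeholder.
import subprocess

# List the state IDs Meltano is tracking, to find the right one.
subprocess.run(["meltano", "state", "list"], check=True)

STATE_ID = "dev:tap-postgres-to-target-snowflake"  # placeholder

# Clear the bookmark; --force skips the interactive confirmation prompt
# (check `meltano state clear --help` for your version).
subprocess.run(["meltano", "state", "clear", STATE_ID, "--force"], check=True)
```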
aditya_penmesta
Hi Alexander, thanks for your help. I'm currently running the initial sync in Airflow, and the error I ran into was:
```
This Session's transaction has been rolled back due to a previous exception during flush. To begin a new transaction with this Session, first issue Session.rollback(). Original exception was: Instance '****' has been deleted, or its row is otherwise not present.
```
Is this kind of what you were saying about restarting the pipeline? My Airflow job runs on a daily cadence in the hope of eventually pushing all the data in.
alexander_rougellis
Ah, for orchestration, my manager at the time refused to use Airflow and wanted a cron job to orchestrate instead. But a question: were you able to complete a successful initial full-table migration? In my experience, I was only able to gather the state for log-based replication, and to orchestrate that replication, AFTER a successful initial full-table dump. I am assuming that is what @thomas_briggs was talking about when he said “We’ve always had to do initial loads of large tables manually”: the initial migration has to be done manually, and then it can be orchestrated. Another question: what sort of replication are you using/wanting to use? Full table, key-based, or log-based?
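Once an initial full-table run does succeed, the stored bookmark can be checked before switching the schedule over to log-based replication, along the lines of this sketch (again assuming Meltano 2.x; the state ID is a placeholder):

```python
# Sketch: print the Singer state Meltano stored after a successful run, to
# confirm it contains a per-table WAL bookmark before relying on scheduled
# log-based replication. Assumes Meltano 2.x; the state ID is a placeholder.
import subprocess

STATE_ID = "dev:tap-postgres-to-target-snowflake"  # placeholder

result = subprocess.run(
    ["meltano", "state", "get", STATE_ID],
    check=True, capture_output=True, text=True,
)
# Expect per-table bookmarks; for log-based tap-postgres setups these should
# carry a WAL position rather than a replication-key value.
print(result.stdout)
```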