Hello how can I force multiple taps to use the same state ID Meltano #troubleshooting

Adam Wegscheid

09/23/2024, 4:43 PM

Hello, how can I force multiple taps to use the same state ID? I am on Windows and cannot find a method to define a specific state ID when using meltano run.

$ meltano state list
test:tap-oracle1-to-target-redshift
test:tap-oracle0-to-target-redshift
test:tap-oracle2-to-target-redshift

I would like tap-oracle0 (and 1 and 2) to all use a shared state ID.

Edgar Ramírez (Arch.dev)

09/23/2024, 5:11 PM

Hi @Adam Wegscheid! It's not currently possible, but you can manually merge them with a combination of meltano state subcommands. I'm curious, what's the use case of having two taps use the same state ID?

Adam Wegscheid

09/23/2024, 5:59 PM

@Edgar Ramírez (Arch.dev) I need to incrementally replicate about 200 tables from an Oracle database into a Redshift database every few hours. To do so, I am using Python and multiprocessing to run several streams at once, since running everything sequentially would take forever. I initially tried to use just one tap but ran into the issue logged at https://github.com/meltano/meltano/issues/8763, so instead I am using multiple taps as "workers" that each get randomly assigned a stream to distribute the workload. I want one state for all these taps so they know when a stream was last loaded, regardless of which tap actually handled it. You may ask "Why?" about several of these design decisions, and the answer is because I am learning 😄
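
For readers following along, the "worker taps" idea described above could be sketched roughly as follows. This is only an illustration: the function name assign_streams, the worker names, and the stream names are hypothetical, and round-robin assignment is swapped in for the random assignment mentioned in the message so that the result is reproducible.

```python
from collections import defaultdict


def assign_streams(streams, workers):
    """Distribute stream names across worker taps in round-robin order.

    Hypothetical helper for illustration; not part of Meltano.
    """
    if not workers:
        raise ValueError("at least one worker tap is required")
    assignment = defaultdict(list)
    for index, stream in enumerate(streams):
        # Worker i receives streams i, i + n, i + 2n, ... for n workers.
        assignment[workers[index % len(workers)]].append(stream)
    return dict(assignment)


# Assumed names: three worker taps and seven placeholder streams.
workers = [f"tap-oracle{i}" for i in range(3)]
streams = [f"SCHEMA-TABLE_{n}" for n in range(7)]
plan = assign_streams(streams, workers)
```

Each entry in plan could then be handed to one process, which is where the question of a shared state ID comes in: any worker may end up handling any stream.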

👀 1

Reuben (Matatika)

09/23/2024, 8:28 PM

You can set a fully-defined state ID using the --state-id option with meltano el. I don't know if you'll run into file locking issues when writing state in parallel though...

👍 2

Edgar Ramírez (Arch.dev)

09/23/2024, 9:14 PM

Yeah, I wonder if a better approach would be to use meltano state merge to combine all states into a single object, and then set all state IDs to that value.

👍 1

Adam Wegscheid

09/24/2024, 1:09 PM

Unfortunately, I am limited to Windows (sadly), so el is not an option; otherwise this would have been easy using --state-id. I believe I will have to merge all of the states into one after each run and then copy that combined state over the others.
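
A rough sketch of that merge step in Python, under several assumptions: each worker's state has been read beforehand (for example with meltano state get), the payload has the shape {"singer_state": {"bookmarks": {...}}}, and each bookmark carries a replication_key_value holding ISO-8601 timestamps in one consistent format, so string comparison matches chronological order. The bookmark layout depends on the tap, and merge_bookmarks is a hypothetical helper, not a Meltano API.

```python
def _bookmark_position(bookmark):
    # Assumption: replication_key_value is an ISO-8601 timestamp string,
    # so lexicographic order equals chronological order.
    return str(bookmark.get("replication_key_value", ""))


def merge_bookmarks(states):
    """Combine several state payloads, keeping the most advanced bookmark per stream."""
    merged = {}
    for state in states:
        bookmarks = state.get("singer_state", {}).get("bookmarks", {})
        for stream, bookmark in bookmarks.items():
            current = merged.get(stream)
            if current is None or _bookmark_position(bookmark) > _bookmark_position(current):
                merged[stream] = bookmark
    return {"singer_state": {"bookmarks": merged}}


# Two worker states that each advanced a different stream in the last run.
worker_states = [
    {"singer_state": {"bookmarks": {
        "A": {"replication_key_value": "2024-09-23T10:00:00"},
        "B": {"replication_key_value": "2024-09-24T08:00:00"},
    }}},
    {"singer_state": {"bookmarks": {
        "A": {"replication_key_value": "2024-09-24T09:00:00"},
        "B": {"replication_key_value": "2024-09-23T10:00:00"},
    }}},
]
combined = merge_bookmarks(worker_states)
```

Comparing per stream matters if every worker state still carries a bookmark for every stream after the previous copy: a plain last-write-wins merge could then replace a fresh bookmark with a stale one. The combined object could afterwards be written to each worker's state ID, for example with meltano state set.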

Edgar Ramírez (Arch.dev)

09/24/2024, 2:06 PM

We could also think of supporting a pattern in meltano state merge to merge multiple states in one go.

Adam Wegscheid

09/24/2024, 2:33 PM

That would certainly make this more elegant. Thank you both for your opinions and help!

np 2

Edgar Ramírez (Arch.dev)

09/24/2024, 3:29 PM

https://github.com/meltano/meltano/issues/8797

🙌 1

Steve Clarke

10/07/2024, 6:13 PM

I follow this same pattern with a meltano run command. For this problem, I requested a new switch, which was added to Meltano: --state-id-suffix. This works well, as you can use the same tap and have several loads run in parallel against the same database. We have never had any issues with this pattern 😊.

➕ 1

Adam Wegscheid

10/07/2024, 7:55 PM

@Steve Clarke Unfortunately, I immediately get the error I faced before. I am running 32 processes in parallel.

Run invocation could not be completed as block failed: Cannot start plugin tap-oracle: [WinError 32] The process cannot access the file because it is being used by another process: 'E:\\meltano_poc\\.meltano\\run\\tap-oracle\\tap.properties.json'

steve_clarke

10/07/2024, 9:00 PM

Hmmm, I don't have this issue. I would imagine that the command lines would look something like this. Note: These two commands below should have different state entries because of the state id suffix.

meltano run tap-oracle target-snowflake --state-id-suffix=batcha
meltano run tap-oracle target-snowflake --state-id-suffix=batchb
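
Since the original setup drives Meltano from Python, the two commands above could be generated and launched along these lines. This is a sketch only: build_run_commands and launch_all are hypothetical helper names, and whether the runs can safely share one project directory on a single machine is exactly what the rest of the thread discusses.

```python
import subprocess


def build_run_commands(tap, target, suffixes):
    """Build one `meltano run` argument list per state ID suffix."""
    return [
        ["meltano", "run", tap, target, f"--state-id-suffix={suffix}"]
        for suffix in suffixes
    ]


def launch_all(commands):
    """Start every command at once and return the exit codes in order."""
    processes = [subprocess.Popen(command) for command in commands]
    return [process.wait() for process in processes]


commands = build_run_commands("tap-oracle", "target-snowflake", ["batcha", "batchb"])
# launch_all(commands) would start both runs in parallel; it is left
# uncalled here because it requires a working Meltano project.
```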

Adam Wegscheid

10/08/2024, 1:13 PM

You are correct and I can see two different state entries (I run the exact same commands as you listed). However, it seems that a tap, no matter which state suffix is used, reads from the same set of local files which causes the issue (which is why I use multiple taps rather than state suffix). How many processes do you run in parallel? It is completely possible (and likely) that this is user error on my end 😁

steve_clarke

10/08/2024, 7:57 PM

Ah, I think I understand what is going on here. Because you are using the same tap name, discovery generates a catalog file with the same name for every run. I think a good idea would be to take the value of --state-id-suffix and concatenate it onto the directory name, i.e.

'E:\\meltano_poc\\.meltano\\run\\tap-oracle-batcha\\tap.properties.json'

'E:\\meltano_poc\\.meltano\\run\\tap-oracle-batchb\\tap.properties.json'

I realise that I run each tap in a separate, isolated, ephemeral container, so that is why I haven't seen this issue. @Edgar Ramírez (Arch.dev), what do you think about appending the --state-id-suffix value (with a run command) or the --state-id value (with an el command) to the name of the tap's run directory? This would provide the ability to truly run two instances of a tap in parallel.
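
To make the proposal concrete, the suggested directory naming could look something like the sketch below. This illustrates the idea only and is not how Meltano currently lays out its run directory; proposed_run_dir is a hypothetical function, and the project path is taken from the error message earlier in the thread.

```python
from pathlib import PureWindowsPath


def proposed_run_dir(project_root, plugin_name, state_id_suffix=None):
    """Return a per-suffix run directory, as proposed in the thread.

    Hypothetical: without a suffix this falls back to the shared
    directory that appears in the WinError 32 message.
    """
    name = f"{plugin_name}-{state_id_suffix}" if state_id_suffix else plugin_name
    return PureWindowsPath(project_root) / ".meltano" / "run" / name


catalog_a = proposed_run_dir(r"E:\meltano_poc", "tap-oracle", "batcha") / "tap.properties.json"
catalog_b = proposed_run_dir(r"E:\meltano_poc", "tap-oracle", "batchb") / "tap.properties.json"
```

With distinct suffixes the two catalog paths no longer collide, so two processes would not contend for the same tap.properties.json file.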

👀 1

Edgar Ramírez (Arch.dev)

10/08/2024, 9:19 PM

I do want to change how and where we put runtime files to make parallel runs work. I put a bit of time into it a few weeks ago but didn't get far: https://github.com/meltano/meltano/pull/8794

👀 1

👍 1
