# troubleshooting
a
Hello, how can I force multiple taps to use the same state ID? I am on Windows and cannot find a method to define a specific state ID when using `meltano run`.
$ meltano state list
test:tap-oracle1-to-target-redshift
test:tap-oracle0-to-target-redshift
test:tap-oracle2-to-target-redshift
I would like tap-oracle0 (and 1 and 2) to all use a shared state ID.
e
Hi @Adam Wegscheid! It's not currently possible, but you can manually merge them with a combination of `meltano state` subcommands. I'm curious, what's the use case of having two taps use the same state ID?
a
@Edgar Ramírez (Arch.dev) I need to incrementally replicate ~200 tables from an Oracle database into a Redshift database every few hours. To do so, I am using Python and multiprocessing to run several streams at once, since running everything sequentially would take forever. I initially attempted to use just one tap but ran into the issue logged here: https://github.com/meltano/meltano/issues/8763 So instead, I am attempting to use multiple taps as "workers" that each get randomly assigned a stream, to distribute the workload. I therefore want one state for all these taps, so they know when a stream was last loaded regardless of which tap actually handled it. You may ask "Why?" about several of these design decisions, and the answer is that I am learning 😄
👀 1
r
You can set a fully-defined state ID using the `--state-id` option with `meltano el`. I don't know if you'll run into file locking issues when writing state in parallel though...
👍 2
e
Yeah, I wonder if a better approach would be to use `meltano state merge` to combine all states into a single object, and then set all state IDs to that value.
👍 1
a
Unfortunately, I am limited to Windows (sadly), so `el` is not an option; otherwise this would have been easy using `--state-id`. I believe I will have to merge all of the states into one after each run and then copy that combined state over the others.
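A minimal sketch of that merge-then-copy step, assuming the `meltano state merge --from-state-id <src> <dst>` and `meltano state copy <src> <dst>` subcommands behave as documented. The shared state ID `test:oracle-shared` and the helper name `build_state_sync_commands` are hypothetical, introduced only for illustration. The function only builds the command lines; running them (for example with `subprocess.run`) is left to the caller, and some `meltano state` subcommands may ask for confirmation before overwriting an existing state.

```python
from typing import List


def build_state_sync_commands(worker_ids: List[str], shared_id: str) -> List[List[str]]:
    """Build the meltano CLI calls that fold worker states into one shared
    state and then push that shared state back to every worker.

    `shared_id` is a hypothetical state ID chosen by the caller.
    """
    commands: List[List[str]] = []
    # Step 1: merge each worker's state into the shared state ID.
    # Assumption: overlapping bookmarks may be overwritten by the source
    # state, so the merge order can matter if two workers hold different
    # bookmarks for the same stream.
    for worker_id in worker_ids:
        commands.append(
            ["meltano", "state", "merge", "--from-state-id", worker_id, shared_id]
        )
    # Step 2: copy the combined state over each worker's own state ID.
    for worker_id in worker_ids:
        commands.append(["meltano", "state", "copy", shared_id, worker_id])
    return commands


if __name__ == "__main__":
    workers = [
        "test:tap-oracle0-to-target-redshift",
        "test:tap-oracle1-to-target-redshift",
        "test:tap-oracle2-to-target-redshift",
    ]
    for cmd in build_state_sync_commands(workers, "test:oracle-shared"):
        print(" ".join(cmd))
```

This would run after all worker processes have finished, so that no tap is writing state while the states are being merged.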
e
We could also think of supporting a pattern in `meltano state merge` to merge multiple states in one go.
a
That would certainly make this more elegant. Thank you both for your opinions and help!
s
I follow this same pattern with a `meltano run` command. For this problem, I requested a switch that was then added to Meltano: `--state-id-suffix`. It lets you use the same tap and have several loads run in parallel against the same database. We have never had any issues with this pattern, it works well 😊 .
a
@Steve Clarke Unfortunately, I immediately get the error I faced before. I am running 32 processes in parallel.
Run invocation could not be completed as block failed: Cannot start plugin tap-oracle: [WinError 32] The process cannot access the file because it is being used by another process: 'E:\\meltano_poc\\.meltano\\run\\tap-oracle\\tap.properties.json'
s
Hmmm, I don't have this issue. I would imagine that the command lines would look something like this. Note: These two commands below should have different state entries because of the state id suffix.
meltano run tap-oracle target-snowflake --state-id-suffix=batcha
meltano run tap-oracle target-snowflake --state-id-suffix=batchb
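Since the loads in this thread are driven from Python, the same pattern could be sketched as below. This is only an illustration: the helper names `build_run_commands` and `launch_all` are hypothetical, and the only Meltano feature relied on is the `--state-id-suffix` option shown in the two commands above. Note that launching several of these against one project directory is what produced the `WinError 32` file-locking error reported earlier in the thread, so this sketch does not by itself solve that problem.

```python
import subprocess
from typing import List


def build_run_commands(tap: str, target: str, suffixes: List[str]) -> List[List[str]]:
    """Build one `meltano run` command per suffix, so that each parallel
    load gets its own state entry."""
    return [
        ["meltano", "run", tap, target, f"--state-id-suffix={suffix}"]
        for suffix in suffixes
    ]


def launch_all(commands: List[List[str]]) -> List[int]:
    """Start every command at once and wait for all of them to finish.

    Returns the exit codes in the same order as `commands`.
    """
    processes = [subprocess.Popen(cmd) for cmd in commands]
    return [process.wait() for process in processes]


if __name__ == "__main__":
    # Print the commands instead of launching them; call launch_all(...)
    # from inside a Meltano project to actually run the loads.
    for cmd in build_run_commands("tap-oracle", "target-snowflake", ["batcha", "batchb"]):
        print(" ".join(cmd))
```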
a
You are correct and I can see two different state entries (I run the exact same commands as you listed). However, it seems that a tap, no matter which state suffix is used, reads from the same set of local files which causes the issue (which is why I use multiple taps rather than state suffix). How many processes do you run in parallel? It is completely possible (and likely) that this is user error on my end 😁
s
Ah, I think I understand what is going on here. Because you are using the same tap name, discovery generates a catalog file with the same name for every run. I think a good idea would be to take the value of `--state-id-suffix` and concatenate it onto the directory name, i.e.
'E:\\meltano_poc\\.meltano\\run\\tap-oracle-batcha\\tap.properties.json'
'E:\\meltano_poc\\.meltano\\run\\tap-oracle-batchb\\tap.properties.json'
I realise that when I run each tap it is in a separate, isolated, ephemeral container, so that is why I haven't seen this issue. @Edgar Ramírez (Arch.dev), what do you think about appending the value of `--state-id-suffix` (with a `run` command) or `--state-id` (with an `el` command) onto the name of the tap? This would provide the ability to truly run two instances of a tap in parallel.
👀 1
e
I do want to change how and where we put runtime files to make parallel runs work. I put a bit of time into it a few weeks ago but didn't get far: https://github.com/meltano/meltano/pull/8794
👀 1
👍 1