# getting-started
f
Hi, we want to set up a pipeline to extract data from a MongoDB database. However, there are concerns about the historical replication and the load on the DB. Is it possible to run the historical replication from a clone of the DB? E.g. spinning up a clone from the backups, loading most of the data from there, and then switching to the live DB? Or should there be no problem with load in the first place?
c
What are the specific concerns? A common practice for reporting and analytics purposes is to have the analytics process (e.g. an EL pipeline like Meltano) read from a secondary replica. I'm very much the complete opposite of a MongoDB expert, but I understand that there are different consistency settings for secondary replicas in MongoDB.
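For illustration (not from the thread), this is roughly what routing reads to a secondary looks like with pymongo; the host names, credentials, and database name are all placeholders:

```python
# A rough sketch of pointing a client at secondary replicas with pymongo.
# Hosts, credentials, and database name below are made up for illustration.
from pymongo import MongoClient, ReadPreference

client = MongoClient(
    "mongodb://analytics:secret@rs0-a.example.com,rs0-b.example.com/?replicaSet=rs0",
    readPreference="secondaryPreferred",  # prefer a secondary, fall back to primary
)

# The read preference can also be set per database handle:
db = client.get_database("app", read_preference=ReadPreference.SECONDARY_PREFERRED)
```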
f
Well, the DBs have some performance problems, and their usage is very write-heavy. Setting up a read replica sounds like a good option too, but it's much more effort (or in other words, budget from IT). Reading the initial replication from a clone is simple in comparison, but as a complete newcomer to Meltano I cannot determine whether it would be possible. (Setting aside the open bug in the MongoDB tap about log-based replication.)
c
I don't know much about MongoDB performance problems, but for an EL workload I would imagine that your initial replication throughput is bound by overall IO performance, since it should effectively result in a scan operation in your DB that returns all records. What were your specific concerns?

If your concern is the runtime of the initial load of historical records, then the recommended strategy really depends on your hosting infrastructure. I.e. if you have a constraint on disk IO performance, there may not be a huge difference between first cloning your disk with the MongoDB data and reading from that clone vs. reading from the existing MongoDB replica. You would gain an advantage if your cloned disk removed any disk IO (and possibly network IO) constraints, though, and if that clone allowed you to test and develop your initial load strategy much faster, especially when testing multiple attempts of the initial EL. Again, it really depends on your hosting environment, I would say.

Regarding log-based replication: personally, I tend to avoid it where possible and use incremental replication based on a replication key instead, as this is typically a more generic approach compared to having to set up the source-system-specific configuration that log-based replication needs.

What does the data that you need to load look like? If all the documents in the collections you need to extract have a date field that tracks their "last modified" date, you could look at the Meltano SDK based MongoDB tap instead, which can perform incremental replication using "last modified" fields: https://github.com/z3z1ma/tap-mongodb/

Lastly, I imagine that log-based replication would also present a challenge for your "clone" scenario, as there are likely some intricacies in how the MongoDB oplog works. But I'm not 100% sure about that. It might just work.
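To make the idea concrete, here is a minimal sketch of what replication-key-based incremental extraction boils down to, assuming each document carries a "last modified" date field. The collection and field names are illustrative, not taken from the actual tap:

```python
# Minimal sketch of incremental extraction keyed on a "last_modified" field.
# Collection name, field name, and connection string are assumptions.
from datetime import datetime, timezone

from pymongo import ASCENDING, MongoClient

client = MongoClient("mongodb://localhost:27017")
collection = client["app"]["orders"]

# Bookmark left behind by the previous run (a tap persists this in its state).
bookmark = datetime(2023, 1, 1, tzinfo=timezone.utc)

# Fetch only documents modified since the bookmark, oldest first, so the
# bookmark can be advanced safely as records stream out.
cursor = collection.find({"last_modified": {"$gt": bookmark}}).sort(
    "last_modified", ASCENDING
)
for doc in cursor:
    bookmark = max(bookmark, doc["last_modified"])
    # ... emit the record downstream ...
```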
f
The concern is that the many read queries slow down the critical writes to the point that things time out or clog up the memory somewhere. Yes, the clone would run on different infrastructure. Thanks for the hint about incremental replication, I will look into it. But overall, am I right to understand that switching the db connection after the first replication is not impossible?
c
> But overall, am I right to understand that switching the db connection after the first replication is not impossible?
Switching the db connection would be easier with incremental replication, because the incremental replication logic is essentially abstracted away into the application layer (speaking in OSI terms). It does not depend on database-specific protocols like the oplog in MongoDB. So the Singer bookmark information in the Meltano state files is unlikely to cause disruption when you switch out the db connection in the tap's configuration.
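To illustrate why (the stream and field names here are made up): a Singer bookmark for incremental replication typically looks something like the dict below. Note that nothing in it is host- or oplog-specific, which is why swapping the connection string in the tap config shouldn't invalidate it:

```python
# Illustrative shape of a Singer state bookmark for incremental replication.
# Stream and field names are hypothetical; only the replication key value
# is tracked, not anything tied to a particular server or oplog position.
state = {
    "bookmarks": {
        "app-orders": {
            "replication_key": "last_modified",
            "replication_key_value": "2023-06-01T12:34:56+00:00",
        }
    }
}
```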
> The concern is that the many read queries slow down the critical writes to the point that things time out or clog up the memory somewhere.
Sounds like a "clone" will give you increased flexibility to perform multiple repeated test loads and refine your initial load strategy, then.