# best-practices
l
Hi everyone, I have a question about best practices for managing incremental state for, let's say, 500 tables (this spans multiple database systems and countries, so please let me know if that changes anything). At that scale I don't think a single state.yml file is enough, so I would really love to hear how you all manage this. Thanks a lot!
I'm not really familiar with the state file yet, but as far as I understand, it stores a table and its bookmarks. Or, instead of incremental replication, would log-based replication be a good workaround, since then we wouldn't need to care about state at all?
a
Generally speaking, the state file will be inclusive of all streams you run in a single pass. Do you have an estimate for how many passes you want to make? Generally, folks who want to run the same/similar sync operation on 10 regional databases would have 10 passes, aka 10 invocation contexts (perhaps each with different endpoints and creds, for instance). Furthermore, you might choose to batch all the “fast/small tables” together in one “job” and put the longer-running tables in one or two other jobs. Assuming 10 regions, and a split into 2 jobs per region for slow/fast table extracts, this would be 10x2 = 20 “jobs” corresponding to 20 state files. Internally, Meltano manages state in its systemdb, so you don’t have to worry about size or scalability when managing those state files. 20 state files and 20 passes might sound like a lot, but really it’s just whatever stream grouping you want to do, times the number of regions or endpoints; you then automate the injection of config for each region or endpoint.

Is this helpful at all? If you say more about the use case and/or what re-use or common traits you expect across the 500 tables, that might produce slightly different guidance. In short, though, 500 tables should be no problem. It’s just a question of how you want to manage them and whether there are common clusters or groups of streams that will help in their management/orchestration.
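To make that concrete, here's a minimal sketch of that pattern. The tap/target names, region codes, and the host env var are all illustrative assumptions, and the actual fast/slow stream grouping would live in your select rules rather than in this loop:

```bash
# A sketch of the "10 regions x 2 job groups = 20 jobs/state files" idea.
# Tap/target names, regions, and the host env var are assumptions.
for region in us uk de fr sg my th vn id ph; do
  for group in fast slow; do
    # Inject region-specific config (endpoint, creds, etc.).
    export TAP_MYSQL_HOST="mysql.${region}.example.internal"
    # Each distinct job_id gets its own state, kept in Meltano's systemdb.
    meltano elt tap-mysql target-postgres --job_id="mysql_${region}_${group}"
  done
done
```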
l
Thank you so much for your prompt response; I'm trying to process this bit of information. You're right about the per-country design; I'd love to go that route. The use case isn't much different from what you described. However, from what I understand, the Meltano db stores the job's timestamp checkpoint, right? So what if, for each connector, I put in all the tables I want to ingest (let's zone in on the MySQL tap so we're on the same page)? Where will Meltano manage the incremental points of each table? With your suggestion, the state of each table will still be stored in the state file, and the next run will pick up the timestamp from that state file, right? Thanks
Should we version control the state file at all? Daily should be okay, right?
a
The Meltano arg that handles this distinction is `--job_id`, as described here: https://meltano.com/docs/command-line-interface.html#elt Each job ID has its own state, so if you had one job per country (for instance), you'd never have to worry about one country's state overriding the state for another country.
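For instance (the tap and target names here are just placeholders), two runs of the same pipeline with different job IDs keep fully independent bookmarks:

```bash
# Same tap/target, isolated state per country via distinct job IDs.
meltano elt tap-salesforce target-postgres --job_id=salesforce_us_prod
meltano elt tap-salesforce target-postgres --job_id=salesforce_uk_prod
# Each run resumes from the bookmark stored under its own job_id,
# so neither run overwrites the other's state.
```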
If you choose to export the state to version control, you can do this with an extra step: `meltano elt … --job_id <myjob> --dump=state`, which outputs the state to a file you could add to git. (Most users do not store state in git, but certainly you can with a couple extra steps.)
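As a sketch of that flow, assuming the dump writes to stdout and using placeholder tap/target names and paths:

```bash
# Snapshot the state for one job and commit it (most users skip this).
mkdir -p state
meltano elt tap-salesforce target-postgres --job_id=salesforce_us_prod \
  --dump=state > state/salesforce_us_prod.json
git add state/salesforce_us_prod.json
git commit -m "Snapshot state for salesforce_us_prod"
```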
l
Can I tap in for more detail here? So I have 3 taps for each country, with 10 tables under each. When a job runs successfully, it surely stores the job state. But each table's state will also be stored in the Meltano db; can I trouble you to point out which table it's in, if it is? If it's in the Meltano db already, then there's no need to export the state anymore. Also, I had thought we always needed the job-id, from my first impression of the Meltano front page, lol. If you've already covered some of this, I'll reread it tomorrow; we work on Singapore time, so I'm kind of worn out for the day. Sorry for any slowness here. Thank you
a
You might already know this, but just to be sure: the Meltano systemdb I’m referring to is an internal database that Meltano uses to hold state, execution logs, etc. By default it’s a local SQLite db created automatically, but you can also migrate it to a Postgres db on a remote RDS server (for instance). To your question about which internal table the state lives in, I can’t recall off the top of my head, but @douwe_maan might know that one. (It’s designed so you generally shouldn’t need to query it directly.)

As for your scenario of 3 taps per country, each with 10 tables, you’d probably create a job-id naming convention like this:

`<tap-name>_<country-code>_<prod-or-test>`

Which might result in these job_id names:
• salesforce_us_prod
• salesforce_us_test
• salesforce_uk_prod
• salesforce_uk_test
• …
• pardot_us_prod
• pardot_us_test
• …
Most likely you’d auto-generate the job IDs based on your inputs.
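A minimal sketch of that generation step, with hypothetical tap names, countries, and environments:

```bash
# Generate job IDs following <tap-name>_<country-code>_<prod-or-test>.
for tap in salesforce pardot; do
  for country in us uk; do
    for env in prod test; do
      job_id="${tap}_${country}_${env}"
      # Print the command rather than run it, for illustration.
      echo "meltano elt tap-${tap} target-postgres --job_id=${job_id}"
    done
  done
done
```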
> we work in Singapore time
Oh my! Have a good night then. No worries about responding quickly. We are very used to async conversations. 🙂
d
https://meltano.com/docs/project.html#system-database: `job` table: One row for each `meltano elt` pipeline run, holding started/ended timestamps and incremental replication state.
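If you do want to peek at it, something like this should work against the default local SQLite systemdb. The path follows the documented default, but the column names are an assumption on my part and may differ across Meltano versions:

```bash
# Inspect recent pipeline runs in the default systemdb at .meltano/meltano.db.
# Column names here (e.g. payload holding the replication state) are an
# assumption and may vary by Meltano version.
sqlite3 .meltano/meltano.db \
  "SELECT job_id, started_at, ended_at, payload
   FROM job ORDER BY started_at DESC LIMIT 5;"
```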
l
Thanks guys, a lot of useful info. It helps me a lot.