# best-practices
l
Hi everyone, I have a question about best practices for managing incremental state for, let's say, 500 tables (this spans multiple database systems and countries, so please let me know if that changes anything). At that scale I don't think a single state.yml file is enough, so I would really love to hear how you all manage this. Thanks a lot!
I'm not really familiar with the state file yet, but as far as I understand, it stores a table and its bookmarks. Or, instead of incremental replication, would log-based replication be a good workaround, since then we wouldn't need to care about state at all?
a
Generally speaking, the state file will be inclusive of all streams you run in a single pass. Do you have an estimate for how many passes you want to make? Generally, folks who want to run the same/similar sync operation on 10 regional databases would have 10 passes, aka 10 invocation contexts (perhaps each with different endpoints and creds, for instance). Furthermore, you might choose to batch all the “fast/small tables” together in one “job” and put the longer-running tables in one or two other jobs. Assuming 10 regions, and a split into 2 jobs per region for slow/fast table extracts, this would be 10x2 = 20 “jobs” corresponding to 20 state files. Internally, Meltano manages state in its systemdb, so you don’t have to worry about size or scalability when managing those state files. 20 state files and 20 passes might sound like a lot, but really it’s just whatever stream grouping you want to do, times the number of regions or endpoints; you then automate the injection of config for each region or endpoint.

Is this helpful at all? If you say more about the use case and/or what re-use or common traits you expect across the 500 tables, that might produce slightly different guidance. In short, though, 500 tables should be no problem. It’s just a question of how you want to manage them and whether there are common clusters or groups of streams that will help in their management/orchestration.
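To make that concrete, here's a minimal sketch of that pattern. The tap/target names, region codes, and the host env var are all illustrative assumptions, and the actual fast/slow stream grouping would live in your select rules rather than in this loop:

```bash
# A sketch of the "10 regions x 2 job groups = 20 jobs/state files" idea.
# Tap/target names, regions, and the host env var are assumptions.
for region in us uk de fr sg my th vn id ph; do
  for group in fast slow; do
    # Inject region-specific config (endpoint, creds, etc.).
    export TAP_MYSQL_HOST="mysql.${region}.example.internal"
    # Each distinct job_id gets its own state, kept in Meltano's systemdb.
    meltano elt tap-mysql target-postgres --job_id="mysql_${region}_${group}"
  done
done
```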
l
Thank you so much for your prompt response; I'm trying to process this bit of information. You're right about the per-country design; I'd love to go that route. The use case isn't much different from what you described. However, from what I understand, the Meltano db stores the job's timestamp checkpoint, right? So what if, for each connector, I put in all the tables I want to ingest (let's zone in on the MySQL tap so we're on the same page)? Where will Meltano manage the incremental points of each table? With your suggestion, the state of each table will still be stored in the state file, and the next run will pick up the timestamp from that state file, right? Thanks
Should we version control the state file at all? Daily should be okay, right?
a
The Meltano arg that handles this distinction is `--job_id`, as described here: https://meltano.com/docs/command-line-interface.html#elt Each job ID has its own state, so if you had one job per country (for instance), you'd never have to worry about one country's state overriding the state for another country.
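For instance (the tap and target names here are just placeholders), two runs of the same pipeline with different job IDs keep fully independent bookmarks:

```bash
# Same tap/target, isolated state per country via distinct job IDs.
meltano elt tap-salesforce target-postgres --job_id=salesforce_us_prod
meltano elt tap-salesforce target-postgres --job_id=salesforce_uk_prod
# Each run resumes from the bookmark stored under its own job_id,
# so neither run overwrites the other's state.
```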
If you choose to export the state to version control, you can do this with an extra step: `meltano elt … --job_id <myjob> --dump=state`, which outputs the state to a file you could add to git. (Most users do not store state in git, but certainly you can with a couple extra steps.)
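As a sketch of that flow, assuming the dump writes to stdout and using placeholder tap/target names and paths:

```bash
# Snapshot the state for one job and commit it (most users skip this).
mkdir -p state
meltano elt tap-salesforce target-postgres --job_id=salesforce_us_prod \
  --dump=state > state/salesforce_us_prod.json
git add state/salesforce_us_prod.json
git commit -m "Snapshot state for salesforce_us_prod"
```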
l
Can I tap in for more detail here? So I have 3 taps for each country, with 10 tables under each. When a job runs successfully, it surely stores the job state. But each table's state will also be stored in the Meltano db; can I trouble you to point out which table it's in, if it is? If it's in the Meltano db already, then there's no need to export the state anymore. Also, I had thought we always needed the job-id, from my first impression of the Meltano front page, lol. If you've already covered some of this, I'll reread it tomorrow; we work on Singapore time, so I'm kind of worn out for the day. Sorry for any slowness here. Thank you
a
You might already know this, but just to be sure: the Meltano systemdb I’m referring to is an internal database that Meltano uses to hold state, execution logs, etc. By default it’s a local SQLite db created automatically, but you can also migrate it to a Postgres db on a remote RDS server (for instance). To your question about which internal table the state lives in, I can’t recall off the top of my head, but @douwe_maan might know that one. (It’s designed so you generally shouldn’t need to query it directly.)

As for your scenario of 3 taps per country, each with 10 tables, you’d probably create a job-id naming convention like this:

`<tap-name>_<country-code>_<prod-or-test>`

Which might result in these job_id names:
• salesforce_us_prod
• salesforce_us_test
• salesforce_uk_prod
• salesforce_uk_test
• …
• pardot_us_prod
• pardot_us_test
• …
Most likely you’d auto-generate the job IDs based on your inputs.
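A minimal sketch of that generation step, with hypothetical tap names, countries, and environments:

```bash
# Generate job IDs following <tap-name>_<country-code>_<prod-or-test>.
for tap in salesforce pardot; do
  for country in us uk; do
    for env in prod test; do
      job_id="${tap}_${country}_${env}"
      # Print the command rather than run it, for illustration.
      echo "meltano elt tap-${tap} target-postgres --job_id=${job_id}"
    done
  done
done
```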
> we work in Singapore time
Oh my! Have a good night then. No worries about responding quickly. We are very used to async conversations. 🙂
d
https://meltano.com/docs/project.html#system-database: `job` table: One row for each `meltano elt` pipeline run, holding started/ended timestamps and incremental replication state.
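If you do want to peek at it, something like this should work against the default local SQLite systemdb. The path follows the documented default, but the column names are an assumption on my part and may differ across Meltano versions:

```bash
# Inspect recent pipeline runs in the default systemdb at .meltano/meltano.db.
# Column names here (e.g. payload holding the replication state) are an
# assumption and may vary by Meltano version.
sqlite3 .meltano/meltano.db \
  "SELECT job_id, started_at, ended_at, payload
   FROM job ORDER BY started_at DESC LIMIT 5;"
```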
l
Thanks guys, a lot of useful info. It helps me a lot.