# singer-tap-development
j
I have two streams with a parent/child relationship. The parent returns a list of records, and the child (unfortunately) has to hit the API for each individual record returned by the parent. The state for the child stream has partition bookmarks for all ~50k parent records, and the size of my `meltano.db` is increasing by ~2MB with each subsequent run. My questions:

1. Can I use `state_partitioning_keys` to reduce the state (ideally to zero) that's stored between invocations of the tap, and if so, how?
2. Are there ways I can purge some of the history from `meltano.db` to reduce its size to something more sensible?
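For context, a minimal sketch of the kind of parent/child setup being described, using the Meltano Singer SDK. The stream names, endpoint paths, and `account_id` key are hypothetical placeholders, not from the original tap; the point is that `get_child_context` produces one context per parent record, and the SDK bookmarks state per context by default.

```python
from typing import Optional

from singer_sdk import typing as th
from singer_sdk.streams import RESTStream


class AccountsStream(RESTStream):
    """Parent stream: one request returning the full list of records."""

    name = "accounts"
    path = "/accounts"
    url_base = "https://api.example.com"  # placeholder
    primary_keys = ["id"]
    schema = th.PropertiesList(th.Property("id", th.StringType)).to_dict()

    def get_child_context(self, record: dict, context: Optional[dict]) -> dict:
        # Each parent record becomes a partition context for the child stream.
        # By default the SDK keeps a separate state bookmark per context, which
        # is where the ~50k partition bookmarks come from.
        return {"account_id": record["id"]}


class AccountDetailsStream(RESTStream):
    """Child stream: one API call per parent record."""

    name = "account_details"
    parent_stream_type = AccountsStream
    path = "/accounts/{account_id}/details"
    url_base = "https://api.example.com"  # placeholder
    primary_keys = ["id"]
    schema = th.PropertiesList(th.Property("id", th.StringType)).to_dict()
```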
It looks like the answer to the first might be `state_partitioning_keys = []`. Maybe I could update the docs with something like: "If you wish to disable this behavior, you can set `state_partitioning_keys` to `[]`."
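Applied to the hypothetical child stream sketched above, that fix would look roughly like this. Only the `state_partitioning_keys` attribute is the new piece; everything else is the same placeholder stream.

```python
from singer_sdk import typing as th
from singer_sdk.streams import RESTStream


class AccountDetailsStream(RESTStream):
    """Same hypothetical child stream as above, with partitioned state disabled."""

    name = "account_details"
    path = "/accounts/{account_id}/details"  # still filled in from the parent context
    url_base = "https://api.example.com"  # placeholder
    primary_keys = ["id"]
    schema = th.PropertiesList(th.Property("id", th.StringType)).to_dict()

    # An empty list tells the SDK not to partition state by context, so it stops
    # writing one bookmark per parent record and the emitted STATE stays small.
    state_partitioning_keys: list = []
```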
t
@joshuadevlin I’m sure we’d accept an MR with that advice if you’re willing 🙂
j
Will put one in! Any advice on purging the system database while retaining state?
t
The new `meltano state` command might help with that (https://docs.meltano.com/reference/command-line-interface#state), but not sure what else you're wanting to purge
j
I'll take a look, thanks! My understanding is that the job table in the system database holds the state at the end of every job that's ever run. Because we've had ~2MB of state for every job for a while, the size of the file has ballooned, and I'd (ideally) like to find a way to remove some of the old cruft without losing the current incremental state. The scenario is made trickier by the fact that we're operating inside GitHub Actions, so the system database is stored as an artifact there, which makes it tricky to manually edit the SQLite db (which is probably how I would otherwise fix this).
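One possible way to trim the file is sketched below. It is only a rough sketch under an explicit assumption: that the system database uses a layout where a `job` table carries `id`, `job_id`, `state`, and `payload` columns, with the payload of the latest successful run holding the incremental state. Verify the schema first and run this only against a backed-up copy of `meltano.db`.

```python
# Rough sketch: prune everything except the latest successful run per job_id
# from a *backup copy* of meltano.db, then VACUUM to shrink the file.
# Assumes a `job` table with id, job_id, state, and payload columns, where the
# payload of the most recent SUCCESS row holds the current incremental state.
import sqlite3

DB_PATH = "meltano.db"  # point this at a backed-up copy, not the live artifact

conn = sqlite3.connect(DB_PATH, isolation_level=None)  # autocommit; VACUUM needs no open transaction
conn.execute(
    """
    DELETE FROM job
    WHERE id NOT IN (
        SELECT MAX(id) FROM job WHERE state = 'SUCCESS' GROUP BY job_id
    )
    """
)
conn.execute("VACUUM")  # rewrite the file so freed pages are actually released
conn.close()
```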
t
Right, and even doing `meltano state clear` would just append a new record with an empty payload, so that wouldn't achieve your goal. The only thing I can think of would be to print the latest state for each job_id in the CI logs and then just delete the artifact. Then on the next run you can use that state to start the job?
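A read-only sketch of that workaround, under the same schema assumption as the pruning sketch above: dump the latest state payload per job_id to stdout so it lands in the CI logs before the `meltano.db` artifact is deleted; the printed JSON is what would be fed back in on the next run (e.g. via `meltano state set`).

```python
# Read-only sketch: print the most recent state payload for each job_id so it
# lands in the CI logs before the meltano.db artifact is removed.
# Same schema assumption as the pruning sketch above.
import sqlite3

DB_PATH = "meltano.db"

conn = sqlite3.connect(DB_PATH)
rows = conn.execute(
    """
    SELECT job_id, payload
    FROM job
    WHERE id IN (
        SELECT MAX(id) FROM job WHERE state = 'SUCCESS' GROUP BY job_id
    )
    """
).fetchall()
conn.close()

for job_id, payload in rows:
    print(f"=== {job_id} ===")
    print(payload)  # the JSON state blob stored at the end of that run
```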
j
That may work — thanks for the guidance!