# singer-tap-development
j
I have two streams with a parent/child relationship. The parent returns a list of records, and the child (unfortunately) has to hit the API for each individual record returned by the parent. The state for the child stream has partition bookmarks for all ~50k parent records, and the size of my `meltano.db` is increasing by ~2MB with each subsequent run. My questions:

1. Can I use `state_partitioning_keys` to reduce the state (ideally to zero) that's stored between invocations of the tap, and if so, how?
2. Are there ways I can purge some of the history from `meltano.db` to reduce its size to something more sensible?
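For context, a minimal sketch of the kind of parent/child setup being described, using the Meltano Singer SDK. The stream names, endpoint paths, and `account_id` key are hypothetical placeholders, not from the original tap; the point is that `get_child_context` produces one context per parent record, and the SDK bookmarks state per context by default.

```python
from typing import Optional

from singer_sdk import typing as th
from singer_sdk.streams import RESTStream


class AccountsStream(RESTStream):
    """Parent stream: one request returning the full list of records."""

    name = "accounts"
    path = "/accounts"
    url_base = "https://api.example.com"  # placeholder
    primary_keys = ["id"]
    schema = th.PropertiesList(th.Property("id", th.StringType)).to_dict()

    def get_child_context(self, record: dict, context: Optional[dict]) -> dict:
        # Each parent record becomes a partition context for the child stream.
        # By default the SDK keeps a separate state bookmark per context, which
        # is where the ~50k partition bookmarks come from.
        return {"account_id": record["id"]}


class AccountDetailsStream(RESTStream):
    """Child stream: one API call per parent record."""

    name = "account_details"
    parent_stream_type = AccountsStream
    path = "/accounts/{account_id}/details"
    url_base = "https://api.example.com"  # placeholder
    primary_keys = ["id"]
    schema = th.PropertiesList(th.Property("id", th.StringType)).to_dict()
```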
It looks like the answer to the first might be `state_partitioning_keys = []`. Maybe I could update the docs with something like: "If you wish to disable this behavior, you can set `state_partitioning_keys` to `[]`."
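Applied to the hypothetical child stream sketched above, that fix would look roughly like this. Only the `state_partitioning_keys` attribute is the new piece; everything else is the same placeholder stream.

```python
from singer_sdk import typing as th
from singer_sdk.streams import RESTStream


class AccountDetailsStream(RESTStream):
    """Same hypothetical child stream as above, with partitioned state disabled."""

    name = "account_details"
    path = "/accounts/{account_id}/details"  # still filled in from the parent context
    url_base = "https://api.example.com"  # placeholder
    primary_keys = ["id"]
    schema = th.PropertiesList(th.Property("id", th.StringType)).to_dict()

    # An empty list tells the SDK not to partition state by context, so it stops
    # writing one bookmark per parent record and the emitted STATE stays small.
    state_partitioning_keys: list = []
```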
t
@joshuadevlin I’m sure we’d accept an MR with that advice if you’re willing 🙂
j
Will put one in! Any advice on purging the system database while retaining state?
t
The new `meltano state` command might help with that (https://docs.meltano.com/reference/command-line-interface#state), but not sure what else you're wanting to purge
j
I'll take a look, thanks! My understanding is that the job table in the system database holds the state at the end of every job that's ever run. Because we've had ~2MB of state for every job for a while, the size of the file has ballooned, and I'd (ideally) like to find a way to remove some of the old cruft without losing the current incremental state. The scenario is made trickier by the fact that we're operating inside GitHub Actions, so the system database is stored as an artifact there, which makes it tricky to manually edit the SQLite db (which is probably how I would otherwise fix this).
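One possible way to trim the file is sketched below. It is only a rough sketch under an explicit assumption: that the system database uses a layout where a `job` table carries `id`, `job_id`, `state`, and `payload` columns, with the payload of the latest successful run holding the incremental state. Verify the schema first and run this only against a backed-up copy of `meltano.db`.

```python
# Rough sketch: prune everything except the latest successful run per job_id
# from a *backup copy* of meltano.db, then VACUUM to shrink the file.
# Assumes a `job` table with id, job_id, state, and payload columns, where the
# payload of the most recent SUCCESS row holds the current incremental state.
import sqlite3

DB_PATH = "meltano.db"  # point this at a backed-up copy, not the live artifact

conn = sqlite3.connect(DB_PATH, isolation_level=None)  # autocommit; VACUUM needs no open transaction
conn.execute(
    """
    DELETE FROM job
    WHERE id NOT IN (
        SELECT MAX(id) FROM job WHERE state = 'SUCCESS' GROUP BY job_id
    )
    """
)
conn.execute("VACUUM")  # rewrite the file so freed pages are actually released
conn.close()
```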
t
Right, and even doing `meltano state clear` would just append a new record with an empty payload, so that wouldn't achieve your goal. The only thing I can think of would be to print the latest state for each job_id in the CI logs and then just delete the artifact. Then on the next run you can use that state to start the job?
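A read-only sketch of that workaround, under the same schema assumption as the pruning sketch above: dump the latest state payload per job_id to stdout so it lands in the CI logs before the `meltano.db` artifact is deleted; the printed JSON is what would be fed back in on the next run (e.g. via `meltano state set`).

```python
# Read-only sketch: print the most recent state payload for each job_id so it
# lands in the CI logs before the meltano.db artifact is removed.
# Same schema assumption as the pruning sketch above.
import sqlite3

DB_PATH = "meltano.db"

conn = sqlite3.connect(DB_PATH)
rows = conn.execute(
    """
    SELECT job_id, payload
    FROM job
    WHERE id IN (
        SELECT MAX(id) FROM job WHERE state = 'SUCCESS' GROUP BY job_id
    )
    """
).fetchall()
conn.close()

for job_id, payload in rows:
    print(f"=== {job_id} ===")
    print(payload)  # the JSON state blob stored at the end of that run
```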
j
That may work — thanks for the guidance!