Hi all. Do I understand it correctly that state is...
# getting-started
t
Hi all. Do I understand correctly that state is a snapshot of all the most recent unique data, based on the "replication key"? Let's say I have 1M unique IDs (PKs), each with an updated_at date. Will the state JSON file then hold 1M id/updated_at pairs (plus all the other data we have for them)? For very big data sets this state file would be enormous, am I right?
j
Which tap are you using? State is managed a bit differently between taps. I assume you are using a database tap of some sort.
t
Hi @jaye_howell. I'm developing a custom tap based on the RESTStream base class. We have a tap.py with a subclass for each stream (REST endpoint), each containing a schema definition and a replication_key (and value). This is all related to my post above: https://meltano.slack.com/archives/CMN8HELB0/p1671787011078509 Now I can see the state file, but it contains all the PKs, which seems weird. Why doesn't it contain only the last updated_at date from the previous ELT run, so that the next run fetches only records with updated_at > that date? Is this something I need to develop myself in my tap?
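To make it concrete, here is a minimal sketch of the behavior I would expect. It is plain Python, not singer_sdk code: the `sync` function, the `items` stream name, the sample records, and the state layout are all made up for illustration.

```python
# Hypothetical records, shaped like what a REST endpoint might return.
RECORDS = [
    {"id": 1, "updated_at": "2022-12-01T00:00:00Z"},
    {"id": 2, "updated_at": "2022-12-15T00:00:00Z"},
    {"id": 3, "updated_at": "2022-12-20T00:00:00Z"},
]


def sync(records, state, stream="items", replication_key="updated_at"):
    """Return (emitted_records, new_state), keeping one bookmark per stream.

    Assumes replication key values are ISO-8601 strings in the same format
    and timezone, so plain string comparison matches chronological order.
    """
    # Read the previous bookmark, if any (assumed state layout).
    bookmark = (
        state.get("bookmarks", {}).get(stream, {}).get("replication_key_value")
    )

    # Emit only records strictly newer than the bookmark.
    emitted = [
        r for r in records if bookmark is None or r[replication_key] > bookmark
    ]

    # The new bookmark is the largest value seen, or the old one if nothing new.
    new_value = max((r[replication_key] for r in emitted), default=bookmark)

    new_state = {
        "bookmarks": {
            stream: {
                "replication_key": replication_key,
                "replication_key_value": new_value,
            }
        }
    }
    return emitted, new_state


if __name__ == "__main__":
    first_run, state = sync(RECORDS, state={})
    print(len(first_run), state)
    second_run, state = sync(RECORDS, state=state)
    print(len(second_run), state)
```

With this approach the state holds a single value per stream no matter how many PKs there are, so it would not grow with the size of the data set.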
Can we define our own state file structure (still as JSON) in a custom tap? I can see a tap_state property in the Stream class of singer_sdk, but it comes with the note: "This method is internal to the SDK and should not need to be overridden".