Is there a documentation on how taps work to deter...
# documentation
k
Is there a documentation on how taps work to determine the latest record read from the source so that we only pull incremental data? Or is that something that's stored on job/stream? Is it that 'state' table I see behind meltano backend state db? https://sdk.meltano.com/en/latest/context_object.html it says 'state file' but I assume this is the backend state table, if I set postgres? I only see much fewer records than the number of streams being sycned (one per schedule), so is it stored somewhere else or maybe it's not tracking it properly?
c
as far as state goes if you store state in PostgreSQL you’ll have a state table with a couple of JSON columns for
partial_state
and
completed_state
and within those JSON columns will contain all of the bookmark data for each configured tap. If you have state that has an entry for a table with no bookmark information then double check your configuration since the default for some taps may be full table. You should have one record in the state table per tap and the tap can have many streams.
when it comes to configuring both a database and object store (S3) I am not sure which one wins out, I’ve only ever configured state to be stored in Postgres
k
ok, that's fine. I'm using Postgres anyways. I just noticed both of them gets updated, weirdly. Then, that's really odd. So we expect there to be one row per tap, but I only see one stream data in there
Copy code
{
  "singer_state": {
    "stream_descriptor": {
      "name": "carrier_payments",
      "namespace": null
    },
    "stream_state": {
      "updatedAt": "2024-08-28T19:21:22.151401Z"
    }
  }
}
So.. there should be multiple, one per stream? There are like 20 streams, and only one datapoint in there, which iss odd.
c
one row per tap
k
that's under
completed_state
column.
Right, one row per tap. But,
carrier_payments
is only one of the 20 or so streams. And I don't see other stream timestamps.
c
I can’t be certain how other taps store their state but at least for tap-postgres it is just a single row (it’d make sense that all taps follow this pattern)
double check your configuration, I am unsure if you’ll have any state if you’ve left the default OR its misconfigured and the default happens to not require state (example:
FULL_TABLE
for tap-postgres is the default, however I’ll see state entries with no bookmark information; but each tap could handle this differently)
k
what's the configuration key for that? nvm. I looked up. 'Replication Method' for postgres
c
at the start of the execution of the tap you may see some warnings if you have declared streams that aren’t part of the catalog… like you’ve forgotten the schema name or something when you configured the stream (which is exactly what I did last week and that user created error is what sent me down the path of looking at meltano state storage closer because I thought it was a bug)
my real world error was this:
Copy code
extractors:
  - name: tap-postgres--primary-facade-svc-db
    inherit_from: tap-postgres
    metadata:
      # primary facade
      company:
        replication-method: INCREMENTAL
        replication-key: modified_date
I had accidentally left the schema name off from
company
so when the stream
primary_facade-company
was copied it was
FULL_TABLE
the default and I was like… why? I configured this! But meltano was actually giving me a warning message that “stream company does not exist in source” and I was missing it, flummoxed by the full table copy behavior
I only noticed the problem when what I expected to be incremental executions were taking as long as the initial load, so I started investigating state and I saw that I had a bunch of stream names in my state but with no bookmarks. So, if you are using other taps they may not even bother to record the stream name if there’s no bookmark information to retain - implementation is tap specific, as I understand.
k
oh.. maybe I need to specify that for each stream.. 😞 I don't have those replication-key, replication-method. I would have expected them to be automatic from my tap as it's common streams there. hmm
c
if you have warnings, they will appear right after the tap launches so pay particular attention at the start/examine the head of the log file
k
my tap is hubspot, so not different from use case to use case. 😞
c
something I discovered is that wildcard stream configuration was not working for me so I have to configure each stream independently even if they all have the same exact settings (the table schema for my source database has a very common pattern of
modified_date
being present and reliable on every row)
k
well, I guess I got myself a new night project tonight. thank you.
c
example:
Copy code
primary_facade-company:
        replication-method: INCREMENTAL
        replication-key: modified_date
      primary_facade-investment:
        replication-method: INCREMENTAL
        replication-key: modified_date
      primary_facade-investor_profile:
        replication-method: INCREMENTAL
        replication-key: modified_date
I originally tried
primary_facade-*
with the replication-method and replication-key but that did not work - instead I got
FULL_TABLE
replication for everything, all of the time which was no good
and remember your stream configuration is separate from your selected streams, but you probably already know this if you have something that is somewhat functioning
k
yes. so, I guess the default is full_table replication. It now makes sense because I was wondering why all these subsequent runs were taking so long, given that there aren't any new data. maybe this is the answer.
c
yeah I kind of wish the default was an error message requiring that you explicitly choose the replication method via configuration, it would have saved me a lot of time
k
Hmm.. it doesn't seem like the state is being saved, so it always pulls from the oldest data, i.e. ends up being full table sync, even though it does say something like "initiating incremental sync". Odd.