Is there a documentation on how taps work to determine the l Meltano #documentation

Is there a documentation on how taps work to deter...

Kyunghwan Choi

08/28/2024, 7:26 PM

Is there a documentation on how taps work to determine the latest record read from the source so that we only pull incremental data? Or is that something that's stored on job/stream? Is it that 'state' table I see behind meltano backend state db? https://sdk.meltano.com/en/latest/context_object.html it says 'state file' but I assume this is the backend state table, if I set postgres? I only see much fewer records than the number of streams being sycned (one per schedule), so is it stored somewhere else or maybe it's not tracking it properly?

Charles Feduke

08/28/2024, 7:55 PM

as far as state goes if you store state in PostgreSQL you’ll have a state table with a couple of JSON columns for

partial_state

and

completed_state

and within those JSON columns will contain all of the bookmark data for each configured tap. If you have state that has an entry for a table with no bookmark information then double check your configuration since the default for some taps may be full table. You should have one record in the state table per tap and the tap can have many streams.

Charles Feduke

08/28/2024, 7:55 PM

when it comes to configuring both a database and object store (S3) I am not sure which one wins out, I’ve only ever configured state to be stored in Postgres

Kyunghwan Choi

08/28/2024, 8:14 PM

ok, that's fine. I'm using Postgres anyways. I just noticed both of them gets updated, weirdly. Then, that's really odd. So we expect there to be one row per tap, but I only see one stream data in there

Copy code

{
  "singer_state": {
    "stream_descriptor": {
      "name": "carrier_payments",
      "namespace": null
    },
    "stream_state": {
      "updatedAt": "2024-08-28T19:21:22.151401Z"
    }
  }
}

So.. there should be multiple, one per stream? There are like 20 streams, and only one datapoint in there, which iss odd.

Charles Feduke

08/28/2024, 8:15 PM

one row per tap

Kyunghwan Choi

08/28/2024, 8:15 PM

that's under

completed_state

column.

Kyunghwan Choi

08/28/2024, 8:15 PM

Right, one row per tap. But,

carrier_payments

is only one of the 20 or so streams. And I don't see other stream timestamps.

Charles Feduke

08/28/2024, 8:16 PM

I can’t be certain how other taps store their state but at least for tap-postgres it is just a single row (it’d make sense that all taps follow this pattern)

Charles Feduke

08/28/2024, 8:18 PM

double check your configuration, I am unsure if you’ll have any state if you’ve left the default OR its misconfigured and the default happens to not require state (example:

FULL_TABLE

for tap-postgres is the default, however I’ll see state entries with no bookmark information; but each tap could handle this differently)

Kyunghwan Choi

08/28/2024, 8:19 PM

~~what's the configuration key for that?~~ nvm. I looked up. 'Replication Method' for postgres

Charles Feduke

08/28/2024, 8:19 PM

at the start of the execution of the tap you may see some warnings if you have declared streams that aren’t part of the catalog… like you’ve forgotten the schema name or something when you configured the stream (which is exactly what I did last week and that user created error is what sent me down the path of looking at meltano state storage closer because I thought it was a bug)

Charles Feduke

08/28/2024, 8:22 PM

my real world error was this:

Copy code

extractors:
  - name: tap-postgres--primary-facade-svc-db
    inherit_from: tap-postgres
    metadata:
      # primary facade
      company:
        replication-method: INCREMENTAL
        replication-key: modified_date

I had accidentally left the schema name off from

company

so when the stream

primary_facade-company

was copied it was

FULL_TABLE

the default and I was like… why? I configured this! But meltano was actually giving me a warning message that “stream company does not exist in source” and I was missing it, flummoxed by the full table copy behavior

Charles Feduke

08/28/2024, 8:23 PM

I only noticed the problem when what I expected to be incremental executions were taking as long as the initial load, so I started investigating state and I saw that I had a bunch of stream names in my state but with no bookmarks. So, if you are using other taps they may not even bother to record the stream name if there’s no bookmark information to retain - implementation is tap specific, as I understand.

Kyunghwan Choi

08/28/2024, 8:26 PM

oh.. maybe I need to specify that for each stream.. 😞 I don't have those replication-key, replication-method. I would have expected them to be automatic from my tap as it's common streams there. hmm

Charles Feduke

08/28/2024, 8:26 PM

if you have warnings, they will appear right after the tap launches so pay particular attention at the start/examine the head of the log file

Kyunghwan Choi

08/28/2024, 8:27 PM

my tap is hubspot, so not different from use case to use case. 😞

Charles Feduke

08/28/2024, 8:27 PM

something I discovered is that wildcard stream configuration was not working for me so I have to configure each stream independently even if they all have the same exact settings (the table schema for my source database has a very common pattern of

modified_date

being present and reliable on every row)

Kyunghwan Choi

08/28/2024, 8:27 PM

well, I guess I got myself a new night project tonight. thank you.

Charles Feduke

08/28/2024, 8:28 PM

example:

Copy code

primary_facade-company:
        replication-method: INCREMENTAL
        replication-key: modified_date
      primary_facade-investment:
        replication-method: INCREMENTAL
        replication-key: modified_date
      primary_facade-investor_profile:
        replication-method: INCREMENTAL
        replication-key: modified_date

I originally tried

primary_facade-*

with the replication-method and replication-key but that did not work - instead I got

FULL_TABLE

replication for everything, all of the time which was no good

Charles Feduke

08/28/2024, 8:30 PM

and remember your stream configuration is separate from your selected streams, but you probably already know this if you have something that is somewhat functioning

Kyunghwan Choi

08/28/2024, 8:39 PM

yes. so, I guess the default is full_table replication. It now makes sense because I was wondering why all these subsequent runs were taking so long, given that there aren't any new data. maybe this is the answer.

Charles Feduke

08/28/2024, 8:40 PM

yeah I kind of wish the default was an error message requiring that you explicitly choose the replication method via configuration, it would have saved me a lot of time

Kyunghwan Choi

08/28/2024, 10:22 PM

Hmm.. it doesn't seem like the state is being saved, so it always pulls from the oldest data, i.e. ends up being full table sync, even though it does say something like "initiating incremental sync". Odd.

3 Views

Open in Slack

Previous Next