# getting-started
m
Hey I’m a bit confused how to set the
state-id
when using
meltano run
or
meltano elt
. I have a docker image that builds and calls
meltano run tap-mongodb target-snowflake
once per hour in production. When I push an update to that repository, the docker image rebuilds, causing it to run the entire pipeline from scratch even though INCREMENTAL and LOG_BASED are set (for different collections). I don’t see it loading the data into the target, making me think that meltano did not keep track of the
state-id
. What are the steps I should take to keep track of the state-id even if the docker image rebuilds?
e
The default state backend is a SQLite db in
.meltano/meltano.db
, so it's most likely getting lost after the container is removed. You may want to use an external state backend: https://docs.meltano.com/concepts/state_backends
m
OK thanks, that's a very helpful link. I should reach out to devops and ask for:
• state_backend.uri: azure://<your container_name>/<prefix for state JSON blobs>
• the connection string, state_backend.azure.connection_string
Once I have these, I should be able to authenticate the Azure connection by setting the state backend in meltano.yml? Are there any other steps I need to take?
extractors:
  - name: tap-mongodb
    namespace: tap_mongodb
    state_backend:
      type: remote
      uri: azure://<your container_name>/<prefix for state JSON blobs> # get this from .env ${TAP_MONGODB_AZURE_MELTANO_STATE}
      connection_string: ${TAP_MONGODB_STATE_CONNECTION_STRING}
a
@edgar_ramirez_mondragon where does the setting changed by
meltano config meltano set state_backend.uri
get persisted? They're not in my
meltano.yml
, somewhere else?
m
Would it work if I added to the Dockerfile
RUN meltano config meltano set state_backend.uri azure://<your container_name>/<prefix for state JSON blobs>
and then added the connection string in .env as
AZURE_STORAGE_CONNECTION_STRING
? It looks like I don't need to do it this way, and should theoretically be able to do this in meltano.yml using my example above (but I'm not sure).
e
@Andy Carter It's persisted to
.env
by that command (see the setting definition).
@matt_elgazar You can pass both as env vars to the container:
• MELTANO_STATE_BACKEND_URI
• MELTANO_STATE_BACKEND_AZURE_CONNECTION_STRING
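Since the container only keeps state if both variables actually reach it, a small preflight check before `meltano run` can catch a missing one early. The following is a minimal sketch, assuming the two variable names from this thread (`MELTANO_STATE_BACKEND_URI` and `MELTANO_STATE_BACKEND_AZURE_CONNECTION_STRING`); the helper name `missing_state_backend_vars` is hypothetical and not part of Meltano.

```python
import os

# Variable names taken from the thread; adjust if your backend differs.
REQUIRED_VARS = (
    "MELTANO_STATE_BACKEND_URI",
    "MELTANO_STATE_BACKEND_AZURE_CONNECTION_STRING",
)


def missing_state_backend_vars(environ=None):
    """Return the required variable names that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARS if not env.get(name)]


# Example with an explicit mapping instead of the real environment:
example_env = {"MELTANO_STATE_BACKEND_URI": "azure://container/prefix"}
print(missing_state_backend_vars(example_env))
# prints ['MELTANO_STATE_BACKEND_AZURE_CONNECTION_STRING']
```

In a container entrypoint, calling `missing_state_backend_vars()` with no arguments and exiting non-zero when the list is non-empty would stop the job before it silently falls back to the local SQLite state.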
a
Ah that explains it, when I do local runs I don't have that env var set, but in docker on Azure I do. Mystery solved 🙂
m
@edgar_ramirez_mondragon so is my example for the chunk in meltano.yml correct? Will that work?
e
@matt_elgazar Ah, didn't see that. No. The settings should be placed at the top level:
project_id: ...
state_backend:
  uri: azure://<your container_name>/<prefix for state JSON blobs>
  azure:
    connection_string: ...
Or just pass the env vars I mentioned above. That should work too.
a
@edgar_ramirez_mondragon last one from me I promise. Looking at this code, https://github.com/meltano/meltano/blob/52320054a273d8f41eae86612c2a3e8c502a1d9e/src/meltano/core/state_store/azure.py#L98C39-L98C39 if
connection_string
is not present, will it default back to using DefaultAzureCredential? Is that what
return BlobServiceClient()
suggests? I would love to turn off the access via connection string to make my IT team happier.
e
@Andy Carter Hmm in fact that part of the code doesn't look ok. It'd crash:
>>> from azure.storage.blob import BlobServiceClient
>>> BlobServiceClient()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: BlobServiceClient.__init__() missing 1 required positional argument: 'account_url'
Can you log an issue? I can add details to it later 🙂
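For illustration, a fallback to `DefaultAzureCredential` could look roughly like the sketch below. This is not Meltano's code: the function names `pick_auth_mode` and `build_blob_service_client` are hypothetical, and the sketch assumes the `azure-storage-blob` and `azure-identity` packages are installed and that an account URL is available, since `BlobServiceClient` requires one when no connection string is used.

```python
def pick_auth_mode(connection_string=None, account_url=None):
    """Decide how to authenticate; prefer the connection string if given."""
    if connection_string:
        return "connection_string"
    if account_url:
        return "default_credential"
    raise ValueError("Provide either a connection string or an account URL")


def build_blob_service_client(connection_string=None, account_url=None):
    """Build a BlobServiceClient (hypothetical helper, not Meltano's API)."""
    # Imported lazily so pick_auth_mode works without the Azure SDK installed.
    from azure.storage.blob import BlobServiceClient

    mode = pick_auth_mode(connection_string, account_url)
    if mode == "connection_string":
        return BlobServiceClient.from_connection_string(connection_string)

    # No connection string: rely on the ambient Azure identity instead.
    from azure.identity import DefaultAzureCredential

    return BlobServiceClient(
        account_url=account_url,
        credential=DefaultAzureCredential(),
    )
```

The account URL here would be something like `https://<storage account>.blob.core.windows.net`; where that value would come from in Meltano's settings is an open question for the issue.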
u
m
@edgar_ramirez_mondragon got it working thanks! Very helpful 👍 One last question - I have my meltano.yml file set up to run different environments, but I want to deselect one of the collections (say
collection3
) under the select but it doesn’t appear to be working.
version: 1
send_anonymous_usage_stats: true
project_id: tap-mongodb

default_environment: dev

state_backend:
  type: remote
  uri: ${AZURE_TAP_MONGODB_STATE_URI}
  azure:
    connection_string: ${AZURE_TAP_MONGODB_STATE_CONNECTION_STRING}

plugins:
  extractors:
  - name: tap-mongodb
    namespace: tap_mongodb
    pip_url: git+https://github.com/melgazar9/tap-mongodb.git@20738c1272ff12eb403abb9f8019200e5acd573f
    capabilities:
      - state
      - catalog
      - discover
      - about
      - stream-maps

    config:
      add_record_metadata: true
      allow_modify_change_streams: true

    select:
      - 'collection1.*'
      - 'collection2.*'
      - 'collection3.*'

environments:
  - name: testing
    config:
      plugins:
        extractors:
          - name: tap-mongodb
            config:
              mongodb_connection_string: ${TESTING_MONGODB_CONNECTION_STRING}
              database: Testing
            select:
              - '!collection3.*'  # bug in mongodb data - I want to disregard collection3 when environment = 'testing'
        loaders:
          - name: target-snowflake
            env:
              TARGET_SNOWFLAKE_DEFAULT_TARGET_SCHEMA: MONGODB_TESTING
u
Hey @matt_elgazar, glad you got it working! So,
select
arrays are not additive across environments, which means you have to be explicit about what you want selected in that environment. Base plugin def:
select:
  - 'collection1.*'
  - 'collection2.*'
  - 'collection3.*'
testing
environment:
select:
  - 'collection1.*'
  - 'collection2.*'
m
Ah man yea the problem is I have about 80 collections listed at the top level, so I don’t want to copy all of that just for env testing. Figured it would be much easier to just exclude one collection. There’s no way to do this?
e
I see. Do log a feature request. Some folks from the community may even have suggestions.
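In the meantime, one possible workaround is to generate the explicit list for the testing environment from the base list rather than copying 80 entries by hand. The sketch below is a hypothetical helper, not a Meltano feature: `resolve_select` drops every base pattern that the override list negates with `!`. It prints JSON on the assumption that the result is passed through the extractor's select environment variable (`TAP_MONGODB__SELECT`, assumed from Meltano's `<EXTRACTOR>__SELECT` convention) or pasted into the environment's `select` block.

```python
import json


def resolve_select(base, overrides):
    """Apply '!pattern' removals and plain additions on top of a base list."""
    excluded = {p[1:] for p in overrides if p.startswith("!")}
    added = [p for p in overrides if not p.startswith("!")]
    # Keep base order, dropping any pattern that was negated exactly.
    merged = [p for p in base if p not in excluded]
    # Append new positive patterns that are not already present.
    merged += [p for p in added if p not in merged]
    return merged


base_select = ["collection1.*", "collection2.*", "collection3.*"]
testing_select = resolve_select(base_select, ["!collection3.*"])
print(json.dumps(testing_select))
# prints ["collection1.*", "collection2.*"]
```

Matching is by exact pattern string, so `!collection3.*` only removes a base entry spelled `collection3.*`; glob-aware matching is deliberately left out to keep the sketch small.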