# infra-deployment
Conner:
Is anyone deploying Meltano on Azure? I'd love to hear about your experiences. I am currently working with Azure Container Apps, but I am wondering how others have been deploying in the Azure ecosystem.
Andy:
@Conner Panarella if you let me know where you're up to so far, perhaps I can help. What do you have so far by way of meltano.yml, Dockerfile, etc.? How do you plan to interact with the container? SSH in to run `meltano` commands, or with an orchestrator like Airflow or Dagster?
Conner:
Sure thing! The goal is to use Dagster as an orchestrator for scheduling jobs. Right now my meltano.yml is pretty straightforward: a few extractions from an MSSQL database that are being loaded into a Postgres database. The Dockerfile is similarly simple, just the one provided by `meltano add files files-docker`; the only addition I've made is an extra `meltano invoke dagster:start` as a CMD in the Dockerfile.
@Andy Carter The end goal is to deploy this to App Service so that the Dagster UI can be accessed.
Andy:
Are you planning to integrate dbt too? Or just Meltano runs?
Conner:
@Andy Carter Yes, we have a separate dbt project that I would like to integrate in the future as well.
Andy:
I would first look at the `dagster-ext` extension and get that up and running with your Meltano streams. There's a code example for `repository.py` here:
https://github.com/quantile-development/dagster-meltano/issues/28#issuecomment-1655283987
Conner:
@Andy Carter Yes, I have the utility installed, so I've gotten that far.
@Andy Carter What does your Dockerfile look like?
Andy:
```dockerfile
# registry.gitlab.com/meltano/meltano:latest is also available in GitLab Registry
ARG MELTANO_IMAGE=meltano/meltano:v2.20.0-python3.10
FROM $MELTANO_IMAGE

ARG ado_token
ENV ADO_TOKEN $ado_token

WORKDIR /project

# Install any additional requirements
COPY ./requirements.txt .
RUN pip install -r requirements.txt

COPY meltano.yml logging.yaml ./
ADD meltano-yml meltano-yml
ADD plugins plugins
# Copy over Meltano project directory
# COPY . .

RUN meltano install

COPY ga4_reports.json ./

# Then copy dbt models in the orchestrate folder
ADD orchestrate orchestrate

# Install dbt's required dependencies and extra packages
RUN meltano invoke dagster:dbt_deps

# Overwrite dagster.yaml with the contents of dagster_azure.yaml
RUN rm -rf ./orchestrate/dagster/dagster.yaml
COPY ./orchestrate/dagster/dagster_azure.yaml ./orchestrate/dagster/dagster.yaml

# Don't allow changes to containerized project files
ENV MELTANO_PROJECT_READONLY 1

# Expose default port used by `meltano ui`
EXPOSE 5000
# Expose port used for the Postgres connection
EXPOSE 5432

# Expose port used by dagit
EXPOSE 3000

ENTRYPOINT ["meltano"]
```
The `dagster.yaml` lines are there because I want different behaviour in the cloud than locally (writing logs, etc.), and there's no native way to do this with a single `dagster.yaml` file.
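As a sketch of what the cloud-specific `dagster_azure.yaml` might change (the storage account, container, and env var names here are placeholders, and shipping compute logs to Blob Storage is just one plausible cloud-only setting, not necessarily what was used in this thread), `dagster-azure` provides a Blob-backed compute log manager:

```yaml
# dagster_azure.yaml — copied over dagster.yaml in the cloud image.
# Ship op stdout/stderr to Azure Blob Storage instead of local disk.
compute_logs:
  module: dagster_azure.blob.compute_log_manager
  class: AzureBlobComputeLogManager
  config:
    storage_account: mystorageaccount   # placeholder
    container: dagster-logs             # placeholder
    secret_key:
      env: AZURE_STORAGE_KEY            # placeholder env var
    prefix: dagster-compute-logs
    local_dir: /tmp/dagster-logs
```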
Conner:
@Andy Carter Got it! What does your App Service look like? What is the startup command?
Andy:
`meltano invoke dagster:dev`
```yaml
utilities:
  - name: dagster
    variant: quantile-development
    pip_url: git+https://github.com/quantile-development/dagster-ext@v0.1.1 dagster-postgres dagster-dbt dbt-postgres<1.8.0 dagster-azure dagster_msteams elementary-data
    settings:
    - name: dagster_home
      env: DAGSTER_HOME
      value: $MELTANO_PROJECT_ROOT/orchestrate/dagster
    - name: dbt_load
      env: DAGSTER_DBT_PARSE_PROJECT_ON_LOAD
      value: 1
    commands:
      dev:
        args: dev --workspace $REPOSITORY_DIR/workspace.yml --dagit-host 0.0.0.0
        executable: dagster_invoker
```
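The `workspace.yml` passed to `--workspace` isn't shown in the thread; a minimal sketch of what it could look like (the `repository.py` path is an assumption based on the `DAGSTER_HOME` setting above):

```yaml
# workspace.yml — tells dagit/dagster where to find the repository
load_from:
  - python_file:
      relative_path: orchestrate/dagster/repository.py
```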
I know `dev` shouldn't really be used for production, but it works for me 🙂
@Conner Panarella are you using bicep for your app service?
```bicep
resource appService 'Microsoft.Web/sites@2022-09-01' = {
  name: '${env}-dagster-${uniqueString(resourceGroup().id)}'
  location: location
  properties: {
    serverFarmId: appServicePlan.id
    siteConfig: {
      alwaysOn: true
      linuxFxVersion: 'DOCKER|${containerRegistryName}.azurecr.io/myimage:latest'
      appSettings: appSettings
      appCommandLine: 'invoke dagster:dev'
      ipSecurityRestrictions: securityRestrictions
      ipSecurityRestrictionsDefaultAction: 'Deny'
      publicNetworkAccess: null
    }
    virtualNetworkSubnetId: subnetID
  }
}
```
And you need `SERVER_PORT=3000` in the environment variables for dagit.
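The `appSettings` variable referenced in the bicep above isn't shown in the thread; a minimal sketch of what it could contain (the `WEBSITES_PORT` entry is an addition not mentioned here — it is the App Service setting that tells the platform which container port to route traffic to):

```bicep
var appSettings = [
  {
    name: 'SERVER_PORT'   // read by dagit, per the note above
    value: '3000'
  }
  {
    name: 'WEBSITES_PORT' // tells App Service which container port to route to
    value: '3000'
  }
]
```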
Conner:
@Andy Carter Perfect! I was able to get it up and running 🙂 Now I am looking at scheduling. How are you managing scheduling? I want to limit the concurrency of jobs, but the jobs are being loaded via `dagster-meltano` here:
`meltano_jobs = load_jobs_from_meltano_project(MELTANO_PROJECT_DIR)`
so I'm not sure if I can further customize them with tags to limit concurrency.
Andy:
I just define crons for each asset job, nothing fancy like freshness.
Conner:
@Andy Carter Thank you! I guess I'm just struggling with adding concurrency control, since I am not actually defining the jobs; they are being imported via the dagster-meltano plugin.
Andy:
Can you define a bit more what you mean by 'concurrency control'? In `dagster.yaml` you can set global concurrency limits, and limits specifically for certain tags:
```yaml
run_queue:
  max_concurrent_runs: 15
  tag_concurrency_limits:
    - key: "database"
      value: "redshift" # applies when the `database` tag has a value of `redshift`
      limit: 4
    - key: "dagster/backfill" # applies when the `dagster/backfill` tag is present, regardless of value
      limit: 10
```
Are you talking about job-level concurrency?
Conner:
@Andy Carter Correct. I was hoping I could tag jobs in the `meltano.yml` and update the `run_queue` as you showed above. That way the `repository.py` could just use `load_jobs_from_meltano_project`. However, I don't think this is possible, so I will need to define each job and schedule in Dagster instead of Meltano.
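If each job ends up being defined in Dagster anyway, one pattern is to derive the job name and concurrency tags from the tap name and splat them into Dagster's `@job` decorator (e.g. `@job(**job_kwargs_for_tap("tap-shopify--store1"))`), with matching keys under `run_queue.tag_concurrency_limits`. A stdlib-only sketch — the `source` tag key, the `--` store-suffix convention, and the helper name are all assumptions, not from the thread:

```python
def job_kwargs_for_tap(tap_name: str) -> dict:
    """Build a Dagster-friendly job name plus concurrency tags for one tap.

    The returned dict is meant to be passed to dagster's @job decorator;
    the "source" tag key should match a key under
    run_queue.tag_concurrency_limits in dagster.yaml (an assumption).
    """
    # "tap-shopify--store1" -> shared source "tap-shopify"
    source = tap_name.split("--")[0]
    # Dagster names must be valid identifiers, so swap "-" for "_"
    cleaned = tap_name.replace("-", "_")
    return {
        "name": f"{cleaned}_tap_job",
        "tags": {"source": source},
    }
```

With a matching `tag_concurrency_limits` entry of `key: "source"`, `value: "tap-shopify"`, `limit: 1`, the two per-store jobs would queue rather than run concurrently.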
Andy:
Using `load_jobs_from_meltano_project` doesn't give a full view of the individual streams, no. If you care about that, I think you need to define them using the `multi_asset` approach.
Conner:
@Andy Carter Hey! Have you had any issues with Dagster runs dangling indefinitely?
I am seeing things like this:
Andy:
Hmm, maybe once or twice, but not repeatedly. Can you share any of your Dagster code defining these assets?
Is this just a Meltano run you are trying to orchestrate, or dbt too?
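One option not raised in the thread, but relevant to hung runs: Dagster's instance-level run monitoring can automatically fail runs that stall at startup or during cancellation. A `dagster.yaml` sketch (the timeout values are placeholders):

```yaml
run_monitoring:
  enabled: true
  start_timeout_seconds: 300    # fail runs stuck in STARTING after 5 minutes
  cancel_timeout_seconds: 180   # fail runs stuck in CANCELING after 3 minutes
```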
Conner:
@Andy Carter I started using `execute_shell_command` because of the significant performance hit described here: https://discuss.dagster.io/t/18765172/u0667dnc02y-i-have-a-meltano-job-that-takes-5-mins-when-i-ru
Here's the function that I created to help construct the jobs:
```python
from dagster import OpExecutionContext, job, op
from dagster_shell import execute_shell_command


def make_meltano_job(tap_name: str):
    cleaned_name = tap_name.replace("-", "_")

    @op(name=f"{cleaned_name}_tap_op")
    def meltano_shell_op(context: OpExecutionContext):
        execute_shell_command(
            f"meltano run {tap_name} target-postgres --force",
            output_logging="STREAM",
            log=context.log,
        )

    @job(name=f"{cleaned_name}_tap_job")
    def meltano_run_job():
        meltano_shell_op()

    return meltano_run_job
```
From there I create jobs like this:
```python
all_shopify_taps = [
    'tap-shopify--store1',
    'tap-shopify--store2',
]

shopify_jobs = [make_meltano_job(shopify_tap) for shopify_tap in all_shopify_taps]
```
Andy:
Where are you running this? Locally or in a container? Could it be some sort of memory issue? Not sure I'm going to be much help here, I'm afraid.
Conner:
Yeah, it's in a container. I will continue to debug. Appreciate the help!