# getting-started
c
Hi. I have a couple of questions regarding transformers. Please note that I am just starting out with Meltano: I have a working setup with a custom tap, a Postgres target, and multiple schedules executed by the included Airflow orchestrator, but I have not used dbt before.

1. I see a warning on https://docs.meltano.com/guide/transformation that transform plugins are de-prioritized, but I am not sure what this means. Is every "transformer" a "transform plugin" and I should not use this feature at all, or is only the "dbt" package deprecated and the usage of dbt-postgres still supported, or ...?
2. I have added a `dbt-postgres` transformer to my existing project with `meltano add transformer dbt-postgres`. This added a new section in my `meltano.yml` and created a bunch of files in the transform directory. I have created a custom `source.yml` and `my_new_table.sql` within `./transform/models/<my_project_name>`. I can successfully execute those by running `meltano run dbt-postgres:run` or `meltano invoke dbt-postgres:run` locally, but I am failing to run them as part of my schedules.
   a. I tried enabling it by setting `transform: run` within a schedule. The schedule now fails with this error, even though I have not added the `dbt` plugin, only `dbt-postgres`:
      `{"error": "Plugin 'Transformer 'dbt' not found.\nUse of the legacy 'dbt' Transformer is deprecated in favor of new adapter specific implementations (e.g. 'dbt-snowflake') compatible with the 'meltano run ...' command.\n<https://docs.meltano.com/guide/transformation>\nTo continue using the legacy 'dbt' Transformer, add it to your Project using 'meltano add transformer dbt'.' is not known to Meltano"}`
   b. How can I specify which transformer plugin to run within a schedule? I only find the option to set it to `run`, `skip`, or `only`. I would like to run different transformers on different schedules. Thank you!
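For reference, the legacy EL(T)-style schedule entry in question looks roughly like this (plugin names are placeholders for my actual tap and target):

```yaml
schedules:
- name: my-elt-schedule
  extractor: tap-mycustom      # placeholder for the custom tap
  loader: target-postgres
  transform: run               # the setting mentioned in 2.a
  interval: '@daily'
```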
s
To answer 1 really quickly: you can use all existing transformers (all dbt dialects). They work, and we support them. We just aren't adding new transformers (of a type other than dbt) right now.
p
Also, context for 1: Meltano has `transformers` plugins (only dbt as of today) to manage your data transformations after the data lands in the warehouse. Meltano also had a legacy feature called `transforms`, which is what that warning message is about. Transforms were Meltano-specific dbt packages that included pre-built transformations for a specific tap's data; they were tightly coupled, and we've moved away from that pattern. You can still use normal dbt packages in your project, though.
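For example, a plain dbt `packages.yml` under `transform/` still works as usual (package and version here are just an illustration):

```yaml
# transform/packages.yml -- ordinary dbt packages are still fine; only the
# Meltano-specific, tap-coupled "transform" packages are the deprecated part.
packages:
  - package: dbt-labs/dbt_utils
    version: 1.0.0
```

You would then install them with dbt deps (e.g. `meltano invoke dbt-postgres:deps`, assuming that command is defined for your plugin).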
For 2: there are two ways to schedule pipelines, and it looks like the getting started guide docs are a little unclear right now. If you follow this guide https://docs.meltano.com/guide/orchestration it will show you how to create jobs and schedule them using the new, more flexible syntax.
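As a rough sketch of that newer syntax (job, tap, and schedule names here are placeholders), this is also how you could run different transformers, or no transformer at all, on different schedules:

```yaml
jobs:
- name: el-plus-dbt
  tasks:
  - tap-mycustom target-postgres dbt-postgres:run   # EL followed by the dbt-postgres transformer
- name: el-only
  tasks:
  - tap-mycustom target-postgres                    # EL with no transformation step

schedules:
- name: daily-with-transform
  interval: '@daily'
  job: el-plus-dbt
- name: hourly-el
  interval: '@hourly'
  job: el-only
```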
Let us know if you run into any issues and we can help resolve!
c
Thanks everyone, those explanations already helped in understanding it a bit better. I am now trying to replicate my existing setup (without a transformer) first, but I am running into a problem with env variables in combination with the Airflow orchestrator. I am setting `TARGET_POSTGRES_DBNAME` inside my schedule, like:

```yaml
schedules:
- name: my-schedule-name
  job: my-job-name
  env:
    TARGET_POSTGRES_DBNAME: mydbname
```

This works when I run the schedule with `meltano schedule run my-schedule-name`, but fails when the schedule is run inside the orchestrator (`meltano invoke airflow scheduler`). Setting this env variable worked with the old EL(T)-style schedule in the orchestrator. It also works when I set the `dbname` property directly on the loader, but I want to avoid duplicating (via `inherit_from`) the same Postgres loader many times with only a difference in the dbname. Any ideas?
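What I am trying to avoid is roughly this kind of duplication (names made up):

```yaml
plugins:
  loaders:
  - name: target-postgres--db-one    # one inherited copy per database
    inherit_from: target-postgres
    config:
      dbname: db_one
  - name: target-postgres--db-two
    inherit_from: target-postgres
    config:
      dbname: db_two
```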
This seems to be a known issue; this thread describes the problem: https://meltano.slack.com/archives/C01TCRBBJD7/p1663679958598429
p
@christopher_kintzel it sounds like that thread explains it - with the new job/tasks syntax we opted to split each job into its own Airflow task instead of lumping them all into a single task like the older elt style did. It still seems like a good idea to split out those tasks. Have you tried the suggested tweak to the DAG generator from https://github.com/meltano/files-airflow/issues/32#issue-1379590061? That seems like a reasonable solution. I created a PR for that change: https://github.com/meltano/files-airflow/pull/36
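For reference, the change in that issue/PR boils down to something like this in the generated DAG code (function and variable names here are placeholders, not the exact ones in `orchestrate/dags/meltano.py`); it needs Airflow >= 2.3, where `BashOperator` gained `append_env`:

```python
# Hypothetical sketch of the tweak discussed in meltano/files-airflow#32:
# forward a schedule's `env` block to the generated task and merge it with
# the worker's environment instead of replacing it (Airflow >= 2.3).
from airflow.models import DAG
from airflow.operators.bash import BashOperator


def make_meltano_task(dag: DAG, task_id: str, command: str, schedule: dict) -> BashOperator:
    """Build one Airflow task for a Meltano job, forwarding the schedule's env."""
    return BashOperator(
        task_id=task_id,
        bash_command=command,            # e.g. "cd /project && meltano run my-job"
        env=schedule.get("env", {}),     # the env: block from the meltano.yml schedule
        append_env=True,                 # merge with os.environ instead of replacing it
        dag=dag,
    )
```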
c
Thanks. Yes, I am currently in the process of trying this workaround. I had to additionally use a newer version of Airflow, min. 2.3.4 (`append_env` is not available in previous versions). Is there a specific reason Meltano uses 2.1.2 by default? But I am now getting this error in the log: `Environment variable 'MELTANO_LOAD_SCHEMA' referenced but not set. Make sure the environment variable is set.` I have actually seen this error multiple times already while trying different unrelated things as well, but could not find any good information about it online 🙂

I was on the wrong Meltano version. I can confirm the tweak works with Airflow 2.3.4.

I am currently on meltano==2.8.0. Unrelated to this fix, my current setup stops working when upgrading to 2.10.0 or 2.11.1 because of the `MELTANO_LOAD_SCHEMA` issue.
p
Hmm, I was able to find https://github.com/meltano/meltano/issues/2928, which sounds like it might be related. But it was closed a year ago, so I wouldn't think it would be an issue in 2.10.0 🤔. Although I'm not sure an exact fix was put in place. I see a note at the top that seems relevant: "According to @DouweM, this can occur if a custom loader doesn't define a schema setting, so the env var doesn't get populated and dbt doesn't know what schema to read from."
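If that is the cause in your setup, the fix that note suggests would look roughly like this (loader name and schema value are hypothetical): declare and set a `schema` setting on the loader so Meltano can populate `MELTANO_LOAD_SCHEMA` for dbt.

```yaml
plugins:
  loaders:
  - name: target-mycustom          # hypothetical custom loader
    namespace: target_mycustom
    settings:
    - name: schema                 # declaring the setting lets Meltano export MELTANO_LOAD_SCHEMA
    config:
      schema: analytics            # example value
```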