#announcements

victorious-hydrogen-63155

04/17/2021, 5:31 PM
@ripe-musician-59933, as mentioned in the other thread about the data type issue, I am working on automating my ETL pipeline as much as possible through Meltano. My current project uses sensitive client data: I need to fetch it through different tap configurations and store it separately through different target configurations, both to avoid mixing up data between clients and to manage fine-grained access rights. That means running about 80 to 100 different tap-target configurations, which, to make it even more complex, sometimes change; new ones get added and others get deleted. I used the search box and read through some epics on GitLab and the documentation, and plugin inheritance seems to be the approach you settled on. However, maintaining all of these configurations myself doesn't seem appropriate. I want the project owner or one of their employees to be able to maintain the settings without technical background knowledge, while preventing them from breaking anything. To keep it simple, I set up a Google Sheet with these 100 rows to store and easily maintain everything that varies per client, while keeping all shared settings in a single config file. I like the concept of inheritance, but I am still looking for a way to let the client maintain the config settings themselves without logging into the cloud VM and working on the CLI. I can programmatically read the Google Sheet through its API and call each tap-target combination by iterating over the sheet rows and dynamically generating configuration dictionaries with singer-runner. Additionally, I don't need to schedule every single tap-target combination individually, but rather schedule the orchestration script itself. I also want to run the 100 combinations one after another, as they are not time-critical, only run once a week, and should stay deployed on a low-cost VM.
Running them one after another helps me achieve these goals while also keeping the overall runtime as short as possible. The runs vary in their runtime, so a general buffer time between runs and separate scheduling for each of them would increase the overall runtime. Do you have a suggestion on how to solve this with Meltano?
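A minimal sketch of the sequential orchestration loop described above, in plain Python. The sheet rows, column names, and setting names (e.g. `TAP_POSTGRES_HOST`) are assumptions for illustration; in practice the rows would come from the Google Sheets API (for example via gspread), and the environment variable names must match the actual tap/target settings.

```python
import os
import subprocess

# Hypothetical stand-in for rows fetched from the Google Sheet;
# column names are assumptions.
ROWS = [
    {"client": "acme", "tap": "tap-postgres", "target": "target-snowflake",
     "tap_host": "db.acme.example", "target_schema": "acme_raw"},
]

def build_env(row):
    """Map a sheet row onto the environment variables Meltano reads
    for plugin settings (e.g. TAP_POSTGRES_HOST, TARGET_SNOWFLAKE_SCHEMA).
    The variable names here are illustrative assumptions."""
    env = dict(os.environ)
    env["TAP_POSTGRES_HOST"] = row["tap_host"]
    env["TARGET_SNOWFLAKE_SCHEMA"] = row["target_schema"]
    return env

def run_all(rows, dry_run=False):
    """Run each tap-target combination one after another, with a stable
    job id per client so incremental state is kept separate."""
    commands = []
    for row in rows:
        cmd = ["meltano", "elt", row["tap"], row["target"],
               f"--job_id={row['client']}"]
        commands.append(cmd)
        if not dry_run:
            subprocess.run(cmd, env=build_env(row), check=True)
    return commands
```

Scheduling then only needs to cover this one script (e.g. a weekly cron entry on the VM), not each of the 100 combinations individually.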

ripe-musician-59933

04/19/2021, 3:24 PM
@victorious-hydrogen-63155 I recommend writing a custom Airflow DAG generator that iterates over the rows in the Google Sheet and creates a new DAG for each, using the `BashOperator` with a `meltano elt <tap> <target> --job_id=<identifier>` command and the relevant configuration from the sheet injected through the environment (https://meltano.com/docs/configuration.html#configuring-settings). Your Meltano project will only need one definition for the tap and target with the common settings; all other settings will be applied at runtime. The default DAG generator at https://gitlab.com/meltano/files-airflow/-/blob/master/bundle/orchestrate/dags/meltano.py creates a DAG for each scheduled pipeline in Meltano by iterating over the result of `meltano schedule list --format=json`, and you can modify this to iterate over your sheet instead.

victorious-hydrogen-63155

01/18/2022, 4:43 PM
@ripe-musician-59933, how are you doing? 🙂 I'm digging this up again. I implemented this back then with a little Python script using a library called singer-runner, and it has been running smoothly so far. I now have to get another project started in a very similar fashion, and I came across this post here about Dagster. I'm wondering if your proposed solution is still the best fit for my problem, or if alternatives have emerged? Is it also possible to integrate other tasks into the DAGs, like some Python code snippets for housekeeping? What do you think?
@worried-river-89520, how are you? I’m adding you here as well. I came across your blog post and I’m wondering how you solved the problem of looping through the list of RDS instance connections and databases? I believe this is similar to my use case.

worried-river-89520

01/18/2022, 5:01 PM
It does indeed sound very similar. I have a Dagster op/solid that queries the list of configurations and then triggers another set of ops/solids for each config to do the actual ELT. Originally it was just a simple loop, but eventually I parallelized it because the list was so long.
If you're thinking of using Dagster, I would encourage you to look at this repo for the Dagster-Meltano integration: https://github.com/quantile-development/dagster-meltano

victorious-hydrogen-63155

01/18/2022, 5:16 PM
Thanks a lot, Josh! How much overhead do you think it would be to get this up and running if my custom tap and target implementations already exist? I'm comfortable with singer-runner and could likely get this done in a few hours. What would be the advantages of doing it through Meltano & Dagster? Parallelization is a big topic as well: my previous singer-runner solution runs everything in sequence and takes quite long, so I need to parallelize for sure. How did you do this?
Is your code somewhere public/accessible for reference?

worried-river-89520

01/18/2022, 6:13 PM
Unfortunately, it's not public. Adding parallelization will likely not be the easiest thing you've ever coded, but the visibility into your pipeline and the other benefits of Dagster could outweigh the cost of development. Dagster will require a server with a UI that you log into in order to run your pipeline. The way I achieve parallelization is to have a Dagster graph with an op that creates an Asset Materialization with the configuration details in its metadata. One materialization is produced for each parallel run I want; in my case, that's one parallel run per MySQL instance. Each materialization's metadata contains the list of database names that will run serially within that parallel run. Then I have another graph with a Sensor that listens for those asset materializations and runs the Meltano pipeline based on the config found in the materialization metadata. There are ways of doing this without passing the config through the Asset Materialization metadata, but for my purposes it was secure and the quickest way to achieve it, and I've heard of other people doing similar things.
If you decide to go that route, I'm happy to privately share some of my desensitized code to help speed up your development.
🙏 1

victorious-hydrogen-63155

01/18/2022, 6:29 PM
Thanks, Josh. That's really helpful! To be honest, I'm not sure yet if I want to go that route. I feel like continuing with the lightweight singer-runner might be more time- and cost-effective for me. I guess I'll need to rethink this once I exceed an acceptable execution time with the serial approach, and I'm afraid that might happen sooner rather than later…
What are the actual benefits of using Dagster rather than a simple Python script with singer-runner?

worried-river-89520

01/18/2022, 6:53 PM
Beyond the basic scheduling and orchestration of complex DAGs, Dagster provides some level of observability into your data pipelines (run times, logs, etc.). It also allows you to parametrize your pipelines so that you can run just portions of them, depending on how you've set things up. Deploying into a container environment (K8s, etc.) gives you essentially infinite scalability with very little manual work on your end. Granted, this is all only useful when you have more than one pipeline and your pipelines are more complex than a single operation. You could theoretically get a lot of the same benefits from any orchestrator, like Airflow or Prefect.

victorious-hydrogen-63155

01/18/2022, 7:00 PM
Right. I think I realize that my use case is not complex enough to justify the overhead. But let's see. I'll sleep on it and try a few things, I think. How can I get started and get something up and running quickly with Meltano and Dagster for my case? Where do I need to start?

worried-river-89520

01/18/2022, 8:24 PM
Quick and dirty: my blog post that you found is probably the only introduction to both tools in the same post. There are plenty of other intros that will take you deeper on each tool individually (each tool's docs are the best place to start). Once you feel comfortable with how each tool operates independently, the dagster-meltano repo is probably the best integration out there so far, though it is not complete yet, could change a lot, and I haven't tested it extensively in a production-like environment. If you do choose to use it, note that its approach differs from everything in my blog post.
🙌 2