# getting-started
l
Hello everyone, I’m a junior data engineer doing a POC to see if Meltano can be used for our bronze layer EL (extract + load raw data into Databricks tables/S3) while taking advantage of Spark and distributed compute. Main things I’m trying to find out:
1. Has anyone successfully run Meltano inside Databricks Jobs/Clusters? Was it stable and easy to maintain?
2. Has anyone packaged a Meltano project as a Python wheel and installed it on Databricks clusters? How did you handle dependencies/plugins?
3. Is Databricks Container Service a good option for Meltano, or is there a simpler, proven approach?
4. For orchestration, is Airflow still better, or can Databricks-native workflows handle Meltano well?
Thanks in advance!
Derek
I haven't done things inside of Databricks, but I have run Meltano inside of Snowflake and MSSQL, and with I think 5-6 other orchestrators (Windows Task Scheduler, cron, Prefect, Dagster, ECS, GitLab CI, GitHub Actions, Meltano Cloud, Arch, and other custom schedulers folks roll themselves)
All of which were stable, but I think teams should generally deploy with whatever orchestrator they already use and are familiar with. If this would be your first thing in Databricks, then I wouldn't do it there.
> Has anyone packaged a Meltano project as a Python wheel and installed it on Databricks clusters? How did you handle dependencies/plugins?
Normally for orchestrators we tend to package Meltano projects as containers. A wheel has a number of issues (platform dependent, would have to be pretty large, etc.). It could ofc work, since at the end of the day it's all a package manager, but I wouldn't do it personally.
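For context, here's roughly why per-plugin dependency management doesn't map cleanly onto one wheel — a minimal `meltano.yml` sketch (plugin names/variants are just illustrative picks from MeltanoHub, not a recommendation):

```yaml
# Sketch of a minimal meltano.yml: every plugin declares its own
# pip_url, and `meltano install` builds an isolated virtualenv per
# plugin, so dependencies live per plugin rather than per project.
version: 1
default_environment: dev
environments:
  - name: dev
plugins:
  extractors:
    - name: tap-postgres
      variant: meltanolabs
      pip_url: meltanolabs-tap-postgres
  loaders:
    - name: target-jsonl
      variant: andyh1203
      pip_url: target-jsonl
```

That's also why containers are the more natural unit: the image can bake in all the per-plugin venvs at build time.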
Edgar
> Has anyone packaged a Meltano project as a Python wheel and installed it on Databricks clusters? How did you handle dependencies/plugins?
I can imagine it's doable, but not straightforward. Is that a requirement of the Databricks platform?
l
Thank you for your insights, Derek and Edgar. We’re currently using Databricks with asset bundles and Python notebooks to extract data from various sources. Our goal is to see if Meltano can simplify and speed up the extract-load process for the bronze layer, though custom taps and targets may require additional maintenance. Databricks Container Service requires significant configuration and ongoing maintenance, so we are focusing on simpler deployment approaches.

Supported methods for installing Python libraries to Databricks clusters:
• Workspace File Path: Upload a `.whl`, `wheelhouse.zip`, or `requirements.txt` from the workspace.
• Volumes File Path: Use a `.whl`, `.jar`, or `requirements.txt` stored in Volumes.
• File Path / S3: Accepts JAR files (`.jar`, `.zip`, `.tar`) or Python packages (`.whl`, `.zip`, `.tar`, `.tar.gz`).
• PyPI: Install packages with exact versions using `==` to avoid regressions; optional custom index URL.
• Maven: Install via Maven coordinates (e.g., `com.databricks:spark-csv_2.10:1.0.0`) with optional repository and exclusions for dependencies.

Feedback on deploying Meltano with asset bundles and Databricks Jobs, especially regarding stability, maintainability, and managing custom plugins, would be appreciated.
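To make the asset-bundle route concrete, here is a rough sketch of the kind of bundle job I'm imagining: a packaged Meltano project installed from Volumes plus a pinned PyPI dependency (all names, paths, and versions below are illustrative placeholders, not our real setup):

```yaml
# Hypothetical databricks.yml resource: one job task that installs a
# wheel from a Volumes path and a pinned PyPI package, then runs the
# wheel's entry point.
resources:
  jobs:
    meltano_bronze_el:
      name: meltano-bronze-el
      tasks:
        - task_key: run_meltano
          python_wheel_task:
            package_name: my_meltano_project   # hypothetical wheel name
            entry_point: main                  # hypothetical entry point
          libraries:
            - whl: /Volumes/main/default/libs/my_meltano_project-0.1.0-py3-none-any.whl
            - pypi:
                package: meltano==3.4.0        # pin an exact version
          new_cluster:
            spark_version: 15.4.x-scala2.12
            node_type_id: i3.xlarge
            num_workers: 1
```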
Derek
How is
> Databricks Container Service requires significant configuration and ongoing maintenance
true, but
> • Workspace File Path: Upload a `.whl`, `wheelhouse.zip`, or `requirements.txt` from the workspace.
> • Volumes File Path: Use a `.whl`, `.jar`, or `requirements.txt` stored in Volumes.
> • File Path / S3: Accepts JAR files (`.jar`, `.zip`, `.tar`) or Python packages (`.whl`, `.zip`, `.tar`, `.tar.gz`).
> • PyPI: Install packages with exact versions using `==` to avoid regressions; optional custom index URL.
> • Maven: Install via Maven coordinates (e.g., `com.databricks:spark-csv_2.10:1.0.0`) with optional repository and exclusions for dependencies.
doesn't "require significant configuration and ongoing maintenance"? I'd go with the container service.
Or I'd run inside of the orchestrator you all currently use
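On the cluster side, Container Services itself is a pretty small spec once the image exists — something like this (image URL and secret names are placeholders I made up):

```yaml
# Hypothetical cluster spec using Databricks Container Services: the
# Meltano project and all plugin venvs are baked into the image, so
# the cluster needs no library installation step at all.
new_cluster:
  spark_version: 15.4.x-scala2.12
  node_type_id: i3.xlarge
  num_workers: 1
  docker_image:
    url: myregistry.example.com/meltano-bronze:latest  # hypothetical image
    basic_auth:
      username: "{{secrets/registry/user}}"   # assumes a secret scope named "registry"
      password: "{{secrets/registry/token}}"
```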
l
About installing libraries to the cluster: I was answering Edgar's question; I didn't mean that it doesn't require configuration and maintenance. I'm just exploring and trying to find the best solution at the moment. I'll look into the container service then, if you think it's better. For the orchestrator, what do you mean by running inside it? Instead of Airflow, which is the standard for Meltano, would we use the Databricks Workflows we normally use?
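To clarify what I mean by Databricks-native orchestration: instead of an Airflow DAG we'd just put a schedule on the bundle job itself, something like this (cron expression is an arbitrary example, and `meltano_bronze_el` is the hypothetical job from my earlier sketch):

```yaml
# Hypothetical: scheduling the Meltano job with Databricks Workflows
# instead of Airflow -- a Quartz cron schedule on the bundle job.
resources:
  jobs:
    meltano_bronze_el:
      schedule:
        quartz_cron_expression: "0 0 2 * * ?"   # daily at 02:00
        timezone_id: UTC
```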