# getting-started
d
Hi everyone! just started experimenting with Meltano (driven by the amazing MDS-in-a-box article)! I'm now trying to understand how to integrate python scripts / transformations into the pipeline in the "intended" way for Meltano. Is there such a thing? Should I rely on python integration and adapters only from dbt, or is there another way (e.g. using the SDK to build our own Singer tap/target and plug in the transformations)?
v
for "random" python scripts I'd look at a meltano utility. I do something like
meltano run tap-name target-name dbt:run autoidm-transform tap-name target-name
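(For anyone finding this later: a Meltano utility is basically an arbitrary executable you register in your project and then call by name with meltano invoke or inside meltano run. A minimal sketch of the kind of standalone script such a utility might wrap; the file names and column names are made up, and it assumes pandas plus a parquet engine are installed:)

```python
# transform.py - hypothetical standalone step that a Meltano utility could wrap.
# File paths and column names are placeholders; assumes pandas + pyarrow.
import pandas as pd


def main() -> None:
    # read whatever the EL step just landed
    df = pd.read_parquet("output/raw_orders.parquet")
    # apply an arbitrary Python transformation
    df["amount_usd"] = df["amount_cents"] / 100
    # write the result where the next step in the pipeline expects it
    df.to_parquet("output/orders_clean.parquet", index=False)


if __name__ == "__main__":
    main()
```

You'd then expose that script (or a console-script entry point) as the utility's executable so Meltano can call it by name.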
That run invocation will probably give you what you're after, but there's also the concept of mappers, where a mapper object acts as both a tap and a target, so it'd look something like
meltano run tap-name mapper target-name
https://github.com/MeltanoLabs/meltano-map-transform. Both methods work, I guess it depends on what you're after!
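(To make the mapper idea concrete: a Singer mapper is a process that reads the tap's messages on stdin and writes possibly-modified messages to stdout, which the target then consumes. This is not how meltano-map-transform is implemented, just a toy stdin/stdout filter to show the shape of it:)

```python
# toy_mapper.py - illustrative only: consumes Singer messages on stdin and
# re-emits them on stdout, hashing an "email" field in RECORD messages.
import hashlib
import json
import sys

for line in sys.stdin:
    message = json.loads(line)
    if message.get("type") == "RECORD":
        record = message.get("record", {})
        if "email" in record:
            # e.g. hash PII before it ever reaches the target
            record["email"] = hashlib.md5(record["email"].encode()).hexdigest()
    sys.stdout.write(json.dumps(message) + "\n")
```

If I understand it right, the linked meltano-map-transform plugin gives you this kind of transformation declaratively via stream_maps config, so you usually don't have to write the filter yourself.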
Meltano team can answer better than me 😄 just trying to get you going
a
@dionisio_agourakis - Welcome! Can you say more about the python scripts/transformations you are trying to run? Python scripts often have side effects that are hard to keep isolated to a single environment, but there's probably a way to get what you want done. dbt-fal helps with this, as this article dives into. But also, per that article's title, another option is to run python scripts in your Airflow DAG. As @visch points out, there's also some 'inline map transform' capability if you just want to alter the data from the source (like hashing) before landing it in the target. Hope this starts you in a good direction. Lmk if I am missing your use case.
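(On the "python scripts in your Airflow DAG" option: one common shape is a BashOperator that shells out to meltano run for the EL step, followed by a PythonOperator for the script. Sketch only; the dag_id, paths, tap/target names and the callable are all made up:)

```python
# dags/meltano_plus_python.py - hypothetical Airflow DAG: Meltano EL step, then a Python step.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator


def run_python_step() -> None:
    # arbitrary Python goes here (call an ML API, reshape files, etc.)
    print("post-EL python step")


with DAG(
    dag_id="meltano_plus_python",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    el = BashOperator(
        task_id="meltano_el",
        bash_command="cd /path/to/meltano/project && meltano run tap-name target-name",
    )
    py = PythonOperator(task_id="python_step", python_callable=run_python_step)
    el >> py
```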
e
It's also worth mentioning that dbt >= 1.3 supports Python "models", though I can't find any guide on how to set those up yet
a
Fascinating! Thanks for sharing @edgar_ramirez_mondragon. If I read the referenced discussion properly, it looks like you can basically write any function that returns a dataframe, and with rich access to call 'ref()' to get data from other streams. Very cool.
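(For reference, a dbt Python model is a model(dbt, session) function in a .py file under models/ that returns a dataframe; "stg_orders" below is a hypothetical upstream model, and the concrete dataframe type depends on which adapter you run on:)

```python
# models/orders_enriched.py - minimal dbt Python model sketch (dbt >= 1.3).
def model(dbt, session):
    dbt.config(materialized="table")
    # dbt.ref() returns a dataframe whose concrete type (Snowpark, PySpark,
    # pandas-on-Spark, ...) depends on the adapter in use.
    orders = dbt.ref("stg_orders")
    # any Python transformation goes here; return a dataframe and dbt materializes it
    return orders
```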
j
if that dbt stuff works well enough, you can drop it in a dbt "run-operation". but tbd on how well the python / dbt integration will work. I've actually just added the following make command to the MDS in a box project as an example:
```make
pipeline:
	meltano run tap-spreadsheets-anywhere target-duckdb --full-refresh;\
	meltano invoke dbt-duckdb run-operation elo_rollforward;\
	meltano run dbt-duckdb:build
```
I think you can do something like "meltano run dbt-duckdb:macro" also, but I haven't successfully gotten that to work with the dbt-ext (if you have, DM me, I want to see it)
d
@aaronsteers thanks for the quick reply...I have played a lot with fal, and it is definitely one candidate solution, but now that I know Meltano, I was wondering if there is another way.
@edgar_ramirez_mondragon the issue with the 'official' python support for dbt right now is that all the python code runs in the DW environment, and that currently means only Snowflake (Snowpark), Databricks (workspace API) and GCP (BigQuery + Dataproc)...so no local python support for now, and a lot of adapters are just out of scope (mysql, sqlserver, duckdb)
basically what we are doing (at JAI) today is:
1 - extracting data from different data sources (think legacy DBs and csv files) and converting everything into .parquet - I didn't know Meltano, so we adapted rust/powershell scripts for that 🤫
2 - because the parquet files are now in a remote filesystem, we use duckdb to register/query these files
3 - we then process/sync a subset of these tables (mostly the dimensional ones) with pre-trained ML models (example: use BERT to process the column "product_description") - this step is done using our python SDK, but all the GPU processing and stuff happens on our cloud
4 - the vector embeddings generated from this processing get stored in a vector database (we use milvus for now), and can then be queried for similarity search or used as inputs to other models.
so what I have been thinking is that, in some sense, JAI's parquet filesystem + vector database is possibly a Meltano/Singer target, and maybe a possible tap as well; once it has run at least once, you can apply dbt transformations as well
hahah so what do you people think?
the idea here is to really make machine learning work inside the mds...you know "the real feature engineering is the dimensional tables we make along the way"
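(Side note on step 2 above, for anyone following along: "registering" parquet files in DuckDB so they can be queried like tables can be as small as the snippet below. The database file, view name and glob path are placeholders, and a truly remote object store would additionally need DuckDB's httpfs extension:)

```python
# Sketch: expose a folder of parquet files as a queryable DuckDB view.
# Assumes the duckdb Python package; path and view name are placeholders.
import duckdb

con = duckdb.connect("warehouse.duckdb")
con.execute(
    "CREATE OR REPLACE VIEW products AS "
    "SELECT * FROM read_parquet('datalake/products/*.parquet')"
)
print(con.execute("SELECT count(*) FROM products").fetchone())
```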
j
I guess I have some questions, specifically because I wrote that MDS in a box article.
1. Where do you want to execute the python scripts? (i.e. using which compute?)
2. What are you using for orchestration today?
3. Are you aware that parquet taps and targets already exist, and that duckdb can write to parquet?
d
1 - I want to run python scripts locally, i.e. not at the DW - btw is that what you mean by which compute?
2 - we were trying to make it work with dbt and fal, but it's not ideal, and some processes such as data ingestion were outside of scope
3 - yes, 100%
What I want to orchestrate:
• extraction from different sources to parquet
• applying data model transformations - from parquets to parquets (e.g. to normalize data into a star schema)
• applying ml transformations (jai sdk/api) to some columns (e.g. classifying product categories using their description) - again the origin is parquet files, and the target is parquet files again
j
looks like it's possible, and from what I am seeing there is no "Meltano-preferred" way to do it. https://meltano.slack.com/archives/CFG3C3C66/p1597845357003500?thread_ts=1597831066.001600&cid=CFG3C3C66
d
thanks!!
will try it out for a full pipeline and then will post a write-up on how it went
s
Hey @dionisio_agourakis have you thought about using Jupyter? That might provide a nice wrapper around the lonely script and could integrate with the dev workflow: https://docs.meltano.com/tutorials/jupyter.
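(In that spirit, a notebook cell for poking at a DuckDB warehouse built by the pipeline could be as simple as the following; the database file and table names are placeholders, and it assumes pandas is installed since .df() returns a pandas DataFrame:)

```python
# Hypothetical notebook cell: pull a sample from the DuckDB file into pandas.
import duckdb

con = duckdb.connect("warehouse.duckdb", read_only=True)
df = con.execute("SELECT * FROM products LIMIT 100").df()
df.head()
```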