# getting-started
d
Hi everyone! just started experimenting with Meltano (driven by the amazing MDS-in-a-box article)! I'm now trying to understand how to integrate python scripts / transformations into the pipeline in the "intended" way for Meltano. Is there such a thing? Should I rely on python integration and adapters only from dbt, or is there another way (e.g. using the SDK to build our own Singer tap/target and plug in the transformations)?
v
for "random" python scripts I'd look at a meltano utility. I do something like
meltano run tap-name target-name dbt:run autoidm-transform tap-name target-name
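(For anyone finding this later: a Meltano utility is basically an arbitrary executable you register in your project and then call by name with meltano invoke or inside meltano run. A minimal sketch of the kind of standalone script such a utility might wrap; the file names and column names are made up, and it assumes pandas plus a parquet engine are installed:)

```python
# transform.py - hypothetical standalone step that a Meltano utility could wrap.
# File paths and column names are placeholders; assumes pandas + pyarrow.
import pandas as pd


def main() -> None:
    # read whatever the EL step just landed
    df = pd.read_parquet("output/raw_orders.parquet")
    # apply an arbitrary Python transformation
    df["amount_usd"] = df["amount_cents"] / 100
    # write the result where the next step in the pipeline expects it
    df.to_parquet("output/orders_clean.parquet", index=False)


if __name__ == "__main__":
    main()
```

You'd then expose that script (or a console-script entry point) as the utility's executable so Meltano can call it by name.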
That run invocation will probably give you what you're after, but there's also the concept of mappers, where a mapper object acts as both a tap and a target, so it'd look something like
meltano run tap-name mapper target-name
https://github.com/MeltanoLabs/meltano-map-transform. Both methods work, I guess it depends on what you're after!
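(To make the mapper idea concrete: a Singer mapper is a process that reads the tap's messages on stdin and writes possibly-modified messages to stdout, which the target then consumes. This is not how meltano-map-transform is implemented, just a toy stdin/stdout filter to show the shape of it:)

```python
# toy_mapper.py - illustrative only: consumes Singer messages on stdin and
# re-emits them on stdout, hashing an "email" field in RECORD messages.
import hashlib
import json
import sys

for line in sys.stdin:
    message = json.loads(line)
    if message.get("type") == "RECORD":
        record = message.get("record", {})
        if "email" in record:
            # e.g. hash PII before it ever reaches the target
            record["email"] = hashlib.md5(record["email"].encode()).hexdigest()
    sys.stdout.write(json.dumps(message) + "\n")
```

If I understand it right, the linked meltano-map-transform plugin gives you this kind of transformation declaratively via stream_maps config, so you usually don't have to write the filter yourself.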
Meltano team can answer better than me 😄 just trying to get you going
a
@dionisio_agourakis - Welcome! Can you say more about the python scripts/transformations you are trying to run? Python scripts often have side effects that are hard to keep isolated to a single environment, but there's probably a way to get what you want done. dbt-fal helps with this, as this article dives into. But also, per that article's title, another option is to run python scripts in your Airflow DAG. As @visch points out, there's also some 'inline map transform' capability if you just want to alter the data from the source (like hashing) before landing it in the target. Hope this starts you in a good direction. Lmk if I am missing your use case.
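(On the "python scripts in your Airflow DAG" option: one common shape is a BashOperator that shells out to meltano run for the EL step, followed by a PythonOperator for the script. Sketch only; the dag_id, paths, tap/target names and the callable are all made up:)

```python
# dags/meltano_plus_python.py - hypothetical Airflow DAG: Meltano EL step, then a Python step.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator


def run_python_step() -> None:
    # arbitrary Python goes here (call an ML API, reshape files, etc.)
    print("post-EL python step")


with DAG(
    dag_id="meltano_plus_python",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    el = BashOperator(
        task_id="meltano_el",
        bash_command="cd /path/to/meltano/project && meltano run tap-name target-name",
    )
    py = PythonOperator(task_id="python_step", python_callable=run_python_step)
    el >> py
```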
e
It's also worth mentioning that dbt >= 1.3 supports Python "models", though I can't find any guide on how to set those up yet
a
Fascinating! Thanks for sharing @edgar_ramirez_mondragon. If I read the referenced discussion properly, it looks like you can basically write any function that returns a dataframe, and with rich access to call 'ref()' to get data from other streams. Very cool.
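(For reference, a dbt Python model is a model(dbt, session) function in a .py file under models/ that returns a dataframe; "stg_orders" below is a hypothetical upstream model, and the concrete dataframe type depends on which adapter you run on:)

```python
# models/orders_enriched.py - minimal dbt Python model sketch (dbt >= 1.3).
def model(dbt, session):
    dbt.config(materialized="table")
    # dbt.ref() returns a dataframe whose concrete type (Snowpark, PySpark,
    # pandas-on-Spark, ...) depends on the adapter in use.
    orders = dbt.ref("stg_orders")
    # any Python transformation goes here; return a dataframe and dbt materializes it
    return orders
```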
j
if that dbt stuff works well enough, you can drop it in a dbt "run-operation". but tbd on how well the python / dbt integration will work. I've actually just added the following make command to the MDS in a box project as an example:
```make
pipeline:
	meltano run tap-spreadsheets-anywhere target-duckdb --full-refresh;\
	meltano invoke dbt-duckdb run-operation elo_rollforward;\
	meltano run dbt-duckdb:build
```
I think you can do something like "meltano run dbt-duckdb:macro" also, but I haven't successfully gotten that to work with the dbt-ext (if you have, DM me, I want to see it)
d
@aaronsteers thanks for the quick reply...I have played a lot with fal, and it is definitely one candidate solution, but now that I know Meltano, I was wondering if there is another way.
@edgar_ramirez_mondragon the issue with the 'official' python support for dbt right now is that all the python code runs in the DW environment, and that currently means only Snowflake (Snowpark), Databricks (workspace API) and GCP (BigQuery + Dataproc)...so no local python support for now, and a lot of adapters are just out of scope (mysql, sqlserver, duckdb)
basically what we are doing (at JAI) today is:
1 - extracting data from different data sources (think legacy DBs and csv files) and converting everything into .parquet - I didn't know Meltano, so we adapted rust/powershell scripts for that 🤫
2 - because the parquet files are now in a remote filesystem, we use duckdb to register/query these files
3 - we then process/sync a subset of these tables (mostly the dimensional ones) with pre-trained ML models (example: use BERT to process the column "product_description") - this step is done using our python SDK, but all the GPU processing and stuff happens on our cloud
4 - the vector embeddings generated from this processing get stored in a vector database (we use milvus for now), and can then be queried for similarity search or used as inputs to other models.
so what I have been thinking is that, in some sense, JAI's parquet filesystem + vector database is possibly a Meltano/Singer target, and maybe a possible tap as well; once it has run at least once, you can apply dbt transformations as well
hahah so what do you people think?
the idea here is to really make machine learning work inside the mds...you know "the real feature engineering is the dimensional tables we make along the way"
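(Side note on step 2 above, for anyone following along: "registering" parquet files in DuckDB so they can be queried like tables can be as small as the snippet below. The database file, view name and glob path are placeholders, and a truly remote object store would additionally need DuckDB's httpfs extension:)

```python
# Sketch: expose a folder of parquet files as a queryable DuckDB view.
# Assumes the duckdb Python package; path and view name are placeholders.
import duckdb

con = duckdb.connect("warehouse.duckdb")
con.execute(
    "CREATE OR REPLACE VIEW products AS "
    "SELECT * FROM read_parquet('datalake/products/*.parquet')"
)
print(con.execute("SELECT count(*) FROM products").fetchone())
```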
j
I guess I have some questions, specifically because I wrote that MDS in a box article.
1. Where do you want to execute the python scripts? (i.e. using which compute?)
2. What are you using for orchestration today?
3. Are you aware that parquet taps and targets already exist, and that duckdb can write to parquet?
d
1 - I want to run python scripts locally, i.e. not at the DW - btw is that what you mean by which compute?
2 - we were trying to make it work with dbt and fal, but it's not ideal, and some processes such as data ingestion were outside of scope
3 - yes, 100%
What I want to orchestrate:
• extraction from different sources to parquet
• applying data model transformations - from parquets to parquets (e.g. to normalize data into a star schema)
• applying ml transformations (jai sdk/api) to some columns (e.g. classifying product categories using their description) - again the origin is parquet files, and the target is parquet files again
j
looks like it's possible, and from what I am seeing there is no "Meltano-preferred" way to do it. https://meltano.slack.com/archives/CFG3C3C66/p1597845357003500?thread_ts=1597831066.001600&cid=CFG3C3C66
d
thanks!!
will try it out for a full pipeline and then will post a write-up on how it went
s
Hey @dionisio_agourakis have you thought about using Jupyter? That might provide a nice wrapper around the lonely script and could integrate with the dev workflow: https://docs.meltano.com/tutorials/jupyter.
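(In that spirit, a notebook cell for poking at a DuckDB warehouse built by the pipeline could be as simple as the following; the database file and table names are placeholders, and it assumes pandas is installed since .df() returns a pandas DataFrame:)

```python
# Hypothetical notebook cell: pull a sample from the DuckDB file into pandas.
import duckdb

con = duckdb.connect("warehouse.duckdb", read_only=True)
df = con.execute("SELECT * FROM products LIMIT 100").df()
df.head()
```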