# best-practices
emcp:
Hi Meltano group, hope you're getting a good vacation break during the holiday season. I'm looking at the following challenge: I have a Python script that takes data from a tap-sourced table and transforms it into a new table. I'd like to schedule it to run right after my tap performs the query that updates the incoming/input table for this transformation.

Question 1: Is the proper move for me to install a standalone Apache Airflow and add a Bash or Python operator, wiring the execution of the script alongside Meltano? Or do I simply add to my current DAG sitting in the sub-directory? Would that also require me to stop using the bundled Apache Airflow? Will I lose the ability to schedule my jobs in Meltano's GUI if I break out to a standalone Apache Airflow, or does it handle that? To experiment, my next step is to install Apache Airflow on a separate node from Meltano and eventually wire them together, but first just get comfortable scheduling Airflow Bash/Python jobs disconnected from my Meltano Airflow. Just checking in, in case I'm starting down the wrong path with Meltano.

Question 2: the GUI dropdown's quickest pipeline frequency is `Hourly`. Does that mean if I need to run every half hour or every 10 minutes I can't use Meltano? Or do I just hardwire the YAML file and ignore the GUI dropdown? Thanks so much!
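By "hardwire the YAML file" I mean something like this sketch of a `meltano.yml` schedule entry with a cron expression instead of a named interval (plugin names and schedule name are hypothetical):

```yaml
# meltano.yml (excerpt) — tap/target names are hypothetical
schedules:
  - name: postgres-every-30-min
    extractor: tap-postgres
    loader: target-postgres
    transform: skip
    interval: '*/30 * * * *'        # standard cron: every 30 minutes
    start_date: 2021-12-01 00:00:00
```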
I've begun a deeper dive into Meltano's Apache Airflow docs, and one sentence in there gave me pause: "Meltano's use of Airflow will be unaffected by other usage of Airflow as long as `orchestrate/dags/meltano.py` remains untouched and pipelines are managed through the dedicated interface."
Do we read this to mean you can open Apache Airflow directly, but shouldn't attempt to fiddle with the DAG(s) managed by Meltano at all? Just leave them alone, and if I need to edit them, go to the Meltano UI, yes?
Based on my reading, I will not install another Apache Airflow for now. Instead I'll attempt to add my own DAGs into the mix and see if I can have them piggyback off of the Meltano DAG(s). If not, that's okay for now and I'll just live with it. I can't be bothered to install and manage a whole separate Apache Airflow just yet.
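The kind of DAG I'd drop into `orchestrate/dags/` next to `meltano.py` would look roughly like this sketch. All names, paths, and the schedule are made up; it uses the 1.10-style operator import paths, which also still work on Airflow 2.x as deprecated shims:

```python
# orchestrate/dags/my_transform_dag.py — a sketch, hypothetical names throughout
from datetime import datetime, timedelta

from airflow import DAG
# 1.10-style import paths (also available on Airflow 2.x as deprecated shims)
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator


def transform():
    """Placeholder for the Python transformation against the tap-loaded table."""
    print("transforming input table into output table")


default_args = {"owner": "airflow", "retries": 1, "retry_delay": timedelta(minutes=5)}

with DAG(
    dag_id="my_python_transform",
    default_args=default_args,
    start_date=datetime(2021, 12, 1),
    schedule_interval="*/30 * * * *",  # every 30 minutes
    catchup=False,
) as dag:
    # Run the EL step through Meltano's CLI first
    # (project path and plugin names are assumptions)
    elt = BashOperator(
        task_id="meltano_elt",
        bash_command="cd /path/to/meltano/project && meltano elt tap-postgres target-postgres",
    )

    # Then the Python transformation, immediately after the load finishes
    py_transform = PythonOperator(task_id="python_transform", python_callable=transform)

    elt >> py_transform
```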
Ooo, I see this is `v1.x` Apache Airflow, and the latest tutorial is leveraging `v2.x`. I wonder: if I start using `v2.x` Apache Airflow, is it going to be incompatible with Meltano as a scheduler?
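For what it's worth, the most visible 1.x → 2.x break in plain DAG files is the operator import paths; a small compatibility shim (my own sketch, not anything Meltano-specific) keeps one DAG file loadable on either version:

```python
# A sketch: keep one DAG file importable on both Airflow 1.10.x and 2.x.
try:
    # Airflow 2.x locations
    from airflow.operators.bash import BashOperator
    from airflow.operators.python import PythonOperator
except ImportError:
    # Airflow 1.10.x locations (still present in 2.x as deprecated shims)
    from airflow.operators.bash_operator import BashOperator
    from airflow.operators.python_operator import PythonOperator
```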
I'm looking again at how I'm proceeding with transformations, and it still feels a bit "wrong" to have to ignore dbt, so I did a brief Reddit google: https://www.reddit.com/r/dataengineering/comments/n98s5b/is_there_a_python_alternative_to_dbt/ It seems I'm in the camp where I'm now doing not just transformations but data science and model building, and some seem to think this should or could be done in the DB itself, as a sort of UDF (User Defined Function). I have no problem with that, I'm just lacking experience here. I think for now I'll try to avoid UDFs and just use Apache Airflow.
After some pondering, I think I will actually have to run a separate Apache Airflow from Meltano, and here's why: Meltano seems to focus solely on the data engineering. That's wonderful, as I've now got a very good system for sourcing data and plopping it into my DB. Where it's NOT working is the second I need to start using Python to transform data, which most people term not data engineering but data science. SO: I will install and manage an Apache Airflow dedicated JUST to data science flows, and leave Meltano's Apache Airflow alone.
The one thing I wish were easier is setting up connections to data sources by reusing my taps/targets when I start writing those routines. Once I go into data science workflows, I basically have to restart from scratch or manage a separate client library for my tap as well as my target. Perhaps I'm missing some way to reuse the tap/target configs, or perhaps it's just not useful when the workflow is slightly different, as in data science, which may have all manner of quirks per enterprise using Meltano?
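What I'm imagining is something like this sketch: pull the loader's settings straight out of `meltano.yml` so the data science code and the pipeline share one source of truth. The config keys shown are hypothetical, and in practice secrets often live in `.env` rather than in `meltano.yml` itself:

```python
# A sketch of reusing target-postgres settings from meltano.yml in a data
# science script. Config key names here are hypothetical — use whatever
# your loader's settings are actually called.
import yaml  # PyYAML
from sqlalchemy import create_engine

with open("meltano.yml") as f:
    project = yaml.safe_load(f)

# Locate the loader's config block under the standard plugins/loaders layout
loaders = project["plugins"]["loaders"]
cfg = next(p for p in loaders if p["name"] == "target-postgres")["config"]

engine = create_engine(
    "postgresql://{user}:{password}@{host}:{port}/{dbname}".format(**cfg)
)
```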
Update: successfully got a basic local install of Apache Airflow version 2.2.3 going. Next to-dos are: 1. schedule a basic Python or Bash operation to execute against data in PostgreSQL, and 2. see if I can have Meltano's Airflow DAGs trigger or talk to my Airflow, so as to trigger any machine learning or Python transformations ASAP.
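For to-do 2, the plan is to hit the standalone instance's Airflow 2.x stable REST API from a task inside Meltano's Airflow. A sketch, with the host, credentials, and DAG id all assumed (and basic auth must be enabled on the target instance):

```python
# A sketch: trigger a DAG run on a separate Airflow 2.x instance via its
# stable REST API. Host, credentials, and dag_id are assumptions.
import requests

AIRFLOW_URL = "http://my-airflow-host:8080"  # hypothetical
DAG_ID = "my_ml_transform"                   # hypothetical

resp = requests.post(
    f"{AIRFLOW_URL}/api/v1/dags/{DAG_ID}/dagRuns",
    json={"conf": {"triggered_by": "meltano"}},
    auth=("airflow_user", "airflow_password"),  # requires the basic_auth backend
)
resp.raise_for_status()
print(resp.json()["dag_run_id"])
```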
For now I won't try to mix Meltano's DAGs with mine, since I can see the version we're given in Meltano is 1.x.
Taylor:
Couple of things here @emcp - the default Airflow version that ships now is 2.1.2. Are you on an older version of Meltano? The intention with the Meltano-managed Airflow is that you can still use Airflow as you normally would. Just throw any DAGs into the appropriate folder and you should be off to the races. There's a DAG that's specific to Meltano schedules, but other than that it's a pretty standard Airflow instance. If that's not the case, then we'd have to assess what's going on! That said, there's a lot we can do to improve the experience of getting Airflow up and running with Meltano, which we plan to do this quarter 🙂
emcp:
Thanks Taylor, hmm, I'm on the latest pip-installable version, which last I checked was 1.9.x. I can dig down again, verify where I got 1.x from, and come back. I chose to go for a separate Airflow installation on its own for now, just so it's handy, and it does seem to work. I merely need to productionize it a tad further (utilize an encryption key, use proper Airflow connections for my DB credentials). Edit: the reason for needing Airflow on its own was the lack of support for Python-based transformations or models in the midst of tap/target runs (I know we have dbt, but it's geared more toward tabular data transformations; that's still great, but I'm doing Python/Julia work beyond what dbt does).
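Concretely, the hardening I have in mind looks roughly like this sketch (all names and credentials are made up, and the env-var approach is just one of Airflow's supported ways to define connections):

```python
# A sketch of the "productionize" items on Airflow 2.x. Names and
# credentials are made up.
#
# 1. Encryption key: set a Fernet key so stored connections are encrypted:
#    export AIRFLOW__CORE__FERNET_KEY="$(python -c \
#      'from cryptography.fernet import Fernet; print(Fernet.generate_key().decode())')"
#
# 2. Credentials: define the Postgres connection outside the DAG, e.g. as an
#    environment variable in URI form, instead of hardcoding it:
#    export AIRFLOW_CONN_MY_POSTGRES="postgres://user:password@host:5432/dbname"
#
# Tasks then reference it by connection id only:
from airflow.providers.postgres.hooks.postgres import PostgresHook  # apache-airflow-providers-postgres

def count_rows():
    hook = PostgresHook(postgres_conn_id="my_postgres")  # resolves AIRFLOW_CONN_MY_POSTGRES
    return hook.get_records("SELECT count(*) FROM my_input_table")  # hypothetical table
```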
Taylor:
Yep - makes total sense 🙂 I think you'll be excited about what we're working on in the coming months. The new `meltano run` syntax can enable more complicated workflows, and I want us to better support non-dbt transformations as well!
emcp:
Very happy to hear that, and I will keep demoing Meltano to everyone I meet. We went ahead and patched up the Juju charm bundle showing it off in conjunction with Apache Superset, and it's started to help some people understand how things work. Most people who ever see the inside of a data enterprise are the rocket-ship smarty-pants people who get hired; hoping to see how to change that and democratize the process.