# getting-started
a
Guys, please tell me about the orchestrator. Problem: I have many partners. They are independent, and each of them can connect or remove many different products that I need to collect data from.
I want to make one Meltano project that contains all the taps for the products we support at the moment. Then, when we get a new partner and they connect their first product, I want to create a Docker container with environment variables for that partner (a token or any other config data) and with a command that lists which jobs to run. When the client connects another product, I update the variables and the list of jobs and recreate the Docker container.
Is there any way I can display all this in an orchestrator (for example Airflow; we like Dagster more, but it seems Dagster does not support this)? What do you think of such a scheme in general? Or should it work differently?
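The per-partner container described above could be sketched roughly as follows. This is only an illustration: the image name `my-meltano-project:latest`, the variable `TAP_PRODUCT_A_TOKEN`, and the job names are hypothetical, and it assumes the image's entrypoint is `meltano`, so that the trailing arguments become `meltano run <jobs>`.

```python
import shlex


def build_docker_command(partner_id, env, jobs, image="my-meltano-project:latest"):
    """Return the `docker run` argv for one partner's Meltano container.

    `image` is a hypothetical image built from the shared Meltano project.
    """
    if not jobs:
        raise ValueError("at least one job is required")
    argv = ["docker", "run", "--rm", "--name", f"meltano-{partner_id}"]
    # One -e flag per partner-specific setting (tokens, config values).
    for key in sorted(env):
        argv += ["-e", f"{key}={env[key]}"]
    # Assumption: the image entrypoint is `meltano`, so this runs `meltano run <jobs>`.
    argv += [image, "run", *jobs]
    return argv


cmd = build_docker_command(
    "acme",
    {"TAP_PRODUCT_A_TOKEN": "example-token"},  # hypothetical variable name
    ["product-a-sync", "product-b-sync"],      # hypothetical Meltano job names
)
print(shlex.join(cmd))
```

When the partner connects another product, the same function can be called again with the extended `env` and `jobs`, and the container recreated from the new command.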
j
Will the containers be running on your premises or on the customer's premises?
a
I didn't quite understand your question. Each client logs in to our application and adds products. After they add a product, we must collect data from the connected products. The client stores nothing; the client just has a link to the site and the functionality that we provide.
j
Are you running containers "standalone" or in Kubernetes?
a
I think I will need to run them in Kubernetes. For now I want to try running something like this project: https://gitlab.com/gitlab-data/gitlab-data-meltano
j
I see. I have no k8s experience with Meltano itself, but in general I would create a simple k8s operator that spins up a new Meltano container for each tenant (customer). I would use the Kopf framework. The operator would deploy a k8s ConfigMap, a Secret, and a Meltano Pod for each tenant as soon as you create the corresponding k8s custom resource. With Kopf, it is very easy to create a custom k8s operator. We built an operator for managing our tenants in k8s. Personally, I implemented a Kopf operator for the Vertica database, which is a far more complex use case (stateful, nodes must be coordinated).
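As a rough sketch of what such an operator would deploy per tenant, the function below builds the ConfigMap, Secret, and Pod manifests as plain dictionaries. The image name, labels, and argument layout are assumptions for illustration. In a Kopf operator, a creation handler for the custom resource (for example one registered with `@kopf.on.create`) could build these and submit them through a Kubernetes client; that wiring is left out here.

```python
import base64


def tenant_manifests(tenant, config, secrets, jobs, image="my-meltano-project:latest"):
    """Build ConfigMap, Secret, and Pod manifests for one tenant.

    `image` is a hypothetical image; it is assumed to use `meltano` as entrypoint.
    """
    name = f"meltano-{tenant}"
    labels = {"app": "meltano", "tenant": tenant}
    metadata = {"name": name, "labels": labels}
    config_map = {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": metadata,
        "data": dict(config),
    }
    secret = {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": metadata,
        "type": "Opaque",
        # Secret values under `data` must be base64-encoded strings.
        "data": {k: base64.b64encode(v.encode()).decode() for k, v in secrets.items()},
    }
    pod = {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": metadata,
        "spec": {
            "restartPolicy": "Never",
            "containers": [{
                "name": "meltano",
                "image": image,
                "args": ["run", *jobs],  # becomes `meltano run <jobs>`
                # Expose every ConfigMap and Secret key as an environment variable.
                "envFrom": [
                    {"configMapRef": {"name": name}},
                    {"secretRef": {"name": name}},
                ],
            }],
        },
    }
    return [config_map, secret, pod]


manifests = tenant_manifests(
    "acme",
    {"MELTANO_ENVIRONMENT": "prod"},
    {"TAP_PRODUCT_A_TOKEN": "example-token"},  # hypothetical variable name
    ["product-a-sync"],                        # hypothetical job name
)
```

Labeling every object with the tenant name makes it possible to list or delete everything belonging to one customer with a single label selector.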
a
Thanks for the direct recommendation. And what about the orchestration, can I somehow keep track of all this?
Here there is one problem: after Meltano has finished, I need to pass the data to our Google Cloud Function, which will add new fields and perform some internal operations based on the received data.
Ideally, it would look like this: tap-mytapname -> custom-mapping -> run google cloud function -> save to db (I hope we can do this with Meltano).
We may need to change this to something else: tap-mytapname -> custom-mapping -> save to db (general bucket with all data, which is cleared after some time) -> run cloud function -> save to db (final Elasticsearch database with the given data type).
j
With my limited knowledge, I would point Meltano to load the crawled data into cloud storage, from where any downstream functionality can pick it up and do any kind of post-processing.
Regarding orchestration of k8s containers, I am not sure. I don't know Airflow or Dagster yet. I checked the Dagster docs. It looks like you can implement any Python logic in Dagster assets and you get all the orchestration and monitoring. So maybe you could combine Dagster with Kopf, but I am not sure whether that is feasible; this is really a new area for me. There is also the Knative project, which provides capabilities for serverless functions running in k8s. I am not sure whether it could be useful.
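A minimal sketch of that post-processing step, assuming Meltano has written the extracted records to storage as JSON Lines. The function name and the added `partner_id` field are hypothetical; the real cloud function would apply its own internal operations.

```python
import json


def enrich_records(jsonl_text, partner_id):
    """Parse JSON Lines text and add a `partner_id` field to every record."""
    enriched = []
    for line in jsonl_text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        record = json.loads(line)
        record["partner_id"] = partner_id  # hypothetical added field
        enriched.append(record)
    return enriched


sample = '{"id": 1, "name": "a"}\n\n{"id": 2, "name": "b"}\n'
records = enrich_records(sample, "acme")
```

The enriched records could then be written to the final database by whatever runs this function, independently of the Meltano container.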
a
Yes, I also thought about that. Perhaps it will be Pub/Sub, which will send a notification when new documents appear, or something else.
Knative seems to be able to do this with Google Cloud too.
We want our DAGs to be informative, i.e. to have multiple operations inside the DAG, not just tap_tap_name_to_target_name, which is what Dagster generates from meltano-hub. And that only seems to work if I have one Meltano instance. More precisely, it seems to me that it will create a new Dagster for each Meltano instance, whereas I want to see all running instances in one orchestrator.
j
That goes beyond my current knowledge. Sooner or later, I will have to add a robust orchestrator to my demo (Meltano + dbt + ...). Maybe then I will get back to you with some advice 😉
a
I will be grateful if you do this for me