# docker
j
Hi. I'm working with PII data, so in order to avoid a lot of paperwork I want to deploy Meltano in an environment we're already cleared for in terms of compliance audits. That means Heroku. Part of the EL workflow is to access our prod database and strip/mask the PII before loading it into another database (a new Heroku Postgres deployment with different credentials, etc.). I got `tap-postgres` and `target-postgres` up and running pretty quickly and I'm ready to try deploying to Heroku, but I'm wondering about the `.meltano` directory: besides the meltano.db SQLite database, does it contain anything that must be persisted? This docs page lists what's in the directory:
• I can live without the log files for prod (or potentially find a way to get the logs themselves extracted and loaded somehow)
• I suppose the `venv`s of the needed Python packages could be created at `docker build` time?
I know Meltano supports pluggable system databases, and I'm planning on just letting Meltano have a schema in my BI database for that. Other than that, what else do I need to know for a stateless Docker deployment (on Heroku, in my case)?
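(For anyone finding this thread later: a stateless image along the lines described above could be sketched roughly like this. The base image tag, file layout, and connection string are placeholder assumptions, not details from this thread.)

```dockerfile
# Hypothetical sketch of a stateless Meltano image.
# Base image and paths are assumptions, not from this thread.
FROM python:3.10-slim

WORKDIR /project
RUN pip install --no-cache-dir meltano

# Copy the project definition and install all plugins at build time.
# This creates the plugin venvs inside .meltano/ in the image layer,
# so nothing in .meltano/ needs a persistent volume at runtime.
COPY meltano.yml .
RUN meltano install

# At runtime, point the system database at Postgres instead of the
# default .meltano/meltano.db SQLite file, e.g. via a Heroku config var:
#   MELTANO_DATABASE_URI=postgresql://...
ENTRYPOINT ["meltano"]
```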
Seems I'm confusing System Database and State Backend somewhat
c
I’ve done something similar using Amazon’s Managed Workflows for Apache Airflow (MWAA), and this solution has been running in production for about a month. Nothing in the `.meltano` directory needs to be persisted if you manage to move the state database somewhere else (mine is hosted by the target PostgreSQL database). The only issue I ran into is that Meltano did not create the destination schema for the target warehouse during initialization in production, though that could have been a me thing, I suppose. (I had created it manually in our staging environment.)

For the virtual environments, you can run `meltano install` and it will create them. I had to do some manual creation of the hosting venv for Meltano itself because I’m running on MWAA, which runs Airflow (also uses Python), so after I create a venv, install Meltano, and copy my directory to the target machine, I then execute `meltano install` using the project directory as the working directory, and everything works fine. This would be, I think, easier and cleaner in a Docker environment. (Mine has to be repeatable for each MWAA worker that AWS brings up, and AWS doesn’t allow you to define a Docker image for MWAA.)

So based on your question in the other channel: basically, if you set `MELTANO_DATABASE_URI` and run `meltano install` during image creation, I think you’ll have a decent start at least. The logs you should be able to configure to go to some agent that sends them to centralized logging. I didn’t have to do any of that for MWAA, as it comes configured to send the logs to CloudWatch by default, and when Meltano runs as part of Airflow, it uses the Airflow logging mechanism.
❤️ 3
j
Thanks for the very detailed response! I appreciate it
v
besides the meltano.db SQLite database, does it contain anything that must be persisted?
No. I don't use incremental syncs for a lot of our stuff, so I don't even keep the db file
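(Side note for later readers: if you don't rely on incremental syncs, you can explicitly ignore any saved state with the standard `--full-refresh` flag, so losing the db file is harmless. Plugin names below are the ones from this thread.)

```shell
# Ignore any stored state and re-sync everything from scratch
meltano run --full-refresh tap-postgres target-postgres
```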
👀 1
e
Seems I'm confusing System Database and State Backend somewhat
We could certainly improve the conceptual docs to make it clear what purpose each serves, and what's their overlap.
j
Yeah, I did find a few places that mentioned the "old" way of using dbt (the one that wasn't called `utility`) and some other things, too. I should fork and send a PR with some docs work, but yesterday I just wanted to get something to work 😅
❤️ 3
dancingpenguin 2