# troubleshooting
Anup N:
Hi everyone, I'm a product manager at a marketing platform that provides solutions like MMM (Marketing Mix Modeling), MTA (Multi-Touch Attribution), and CAPI (Conversions API). Our clients include both marketing agencies (who can onboard their brands) and brands themselves, so we need reliable data extractors. We've been building our connectors in-house using Alloy, but scaling these connectors has become challenging. I recently came across Meltano and, since we run a very lean team with just 2 developers, decided to try it out myself before recommending it to the team. The pipelines I'm starting with:

1. Google Sheets to BigQuery
2. Snowflake to BigQuery

I need your help with the following use cases:

1. With our current Google Sheets integration on Alloy, Alloy dynamically type casts the column data types (the columns aren't constant) before inserting the data into BigQuery. I couldn't figure out a way to do this with Meltano.
2. I'm also not sure how the infrastructure should be set up. I'm doing this as a pet project with the goal of making our connector list much better, and I'd love to hear your opinions on building solid infrastructure that scales.

Our current data flow:

1. The user initiates the connection by hitting the respective connector endpoints.
2. We have a custom transformer for each integration that transforms source data into the target format compatible with our systems.
3. The transformed data is then stored in BigQuery.

The integrations we offer:

1. Ad channels and programmatic ad platforms: Facebook, Google, Twitter, AdRoll, Taboola, Criteo, TTD, etc.
2. Data warehouses: Snowflake, Redshift, Azure Database, MongoDB, Redis, etc.
3. Ecommerce platforms: WooCommerce, BigCommerce, Salesforce Commerce Cloud, Magento, etc.
4. Marketing automation platforms: e.g. Klaviyo, Attentive.
5. Analytics platforms: Google Analytics, etc.

Note: This list is not exhaustive but gives an idea of the type of connectors we need. I'd also love feedback on our current data pipeline and any suggestions for improving efficiency. As someone recently diving into the data side of things, I appreciate any insights you might have. Thank you in advance!
Edgar Ramírez (Arch.dev):
Hi @Anup N!
Alloy dynamically type casts the column data types (the columns aren't constant) before inserting the data into BigQuery. I couldn't figure out a way to do this with Meltano.
Do you mean auto-detecting the column type based on the source metadata? If the gsheets extractor doesn't currently do that (I'm not aware), it shouldn't be too hard to implement actually. I'd take a look at https://github.com/Matatika/tap-google-sheets/ to confirm.
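If the tap doesn't do this yet, a rough sketch of what that kind of per-column type detection could look like before the tap emits its Singer SCHEMA message is below. This is purely illustrative, not actual tap-google-sheets code; the function name and sampling approach are my own assumptions:

```python
# Hypothetical sketch: infer a JSON Schema type for each sheet column from
# sample values, so the SCHEMA message sent to the loader carries real types
# instead of everything-as-string. Not actual tap-google-sheets code.
def infer_column_type(sample_values: list) -> dict:
    def castable(cast) -> bool:
        try:
            for value in sample_values:
                if value not in ("", None):
                    cast(value)
            return True
        except (TypeError, ValueError):
            return False

    if castable(int):
        return {"type": ["integer", "null"]}
    if castable(float):
        return {"type": ["number", "null"]}
    return {"type": ["string", "null"]}


# Example: build a stream schema from the first rows of a sheet.
headers = ["id", "price", "note"]
rows = [["123", "9.99", "hello"], ["456", "0.5", "world"]]
schema = {
    "type": "object",
    "properties": {
        name: infer_column_type([row[i] for row in rows])
        for i, name in enumerate(headers)
    },
}
```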
I'm also not sure how the infrastructure should be set up. I'm doing this as a pet project with the goal of making our connector list much better, and I'd love to hear your opinions on building solid infrastructure that scales.
One advantage of Meltano is you can start rather small (e.g. on GitHub Actions) and incrementally migrate to a cloud/containerized environment using an S3 state backend, for example. The latter is how Arch.dev runs Meltano, so it definitely works. I'm gonna let others comment on the particulars of the connectors in case they've used them.
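For the "start small" route, a scheduled GitHub Actions workflow is often enough. A sketch only (the cron, secret names, bucket, and setting names are placeholders), with incremental state kept in S3 via Meltano's state backend setting:

```yaml
# .github/workflows/meltano.yml -- illustrative only
name: nightly-sync
on:
  schedule:
    - cron: "0 3 * * *"
  workflow_dispatch: {}

jobs:
  run-pipeline:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install meltano
      - run: meltano install
      - run: meltano run tap-google-sheets target-bigquery
        env:
          # Keep state outside the runner so reruns pick up where they left off.
          MELTANO_STATE_BACKEND_URI: s3://my-meltano-state-bucket/state  # hypothetical bucket
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          # Plugin credentials follow Meltano's <PLUGIN>_<SETTING> env var convention;
          # the setting name below is illustrative.
          TAP_GOOGLE_SHEETS_OAUTH_CREDENTIALS_CLIENT_ID: ${{ secrets.GSHEETS_CLIENT_ID }}
```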
Anup N:
Hello @Edgar Ramírez (Arch.dev), thanks for the reply. Yes, I am talking about the auto-detecting schema part. I see that one way to handle this is by writing a custom transformation for this use case, but could you mention some best practices for implementing it? For the second question, we already have some infra set up for our ETL with many events. We run on Google Cloud Platform.
Edgar Ramírez (Arch.dev):
I see that one way to handle this is by writing a custom transformation
Hey, is this a recommendation you saw in the docs or somewhere else? You could always override the extractor schema.
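To make the "override the extractor schema" option concrete: Meltano's `schema` extra lets you pin or correct column types per stream in meltano.yml, without touching the tap's code. The stream and column names below are made up:

```yaml
# meltano.yml (excerpt) -- stream and column names are illustrative
plugins:
  extractors:
    - name: tap-google-sheets
      variant: matatika
      schema:
        orders_sheet:            # stream name as discovered by the tap
          order_date:
            type: ["string", "null"]
            format: date-time
          amount:
            type: ["number", "null"]
          quantity:
            type: ["integer", "null"]
```

The loader then builds its BigQuery table from that schema instead of treating every column as a string.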
We run on Google Cloud Platform
This might be helpful then: https://medium.com/data-manypets/how-to-run-meltano-in-a-container-on-google-cloud-composer-860783d0575c
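If you go the Composer route from that article, the usual pattern is to run your Meltano image as a pod from a DAG. A minimal sketch; the image name, DAG id, schedule, and the exact provider import path depend on your setup and Airflow version:

```python
# Illustrative Cloud Composer / Airflow DAG that launches a containerized
# `meltano run`. Adjust the import path to your Kubernetes provider version.
from datetime import datetime

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator

with DAG(
    dag_id="meltano_gsheets_to_bigquery",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    KubernetesPodOperator(
        task_id="meltano_run",
        name="meltano-run",
        image="gcr.io/your-project/meltano-project:latest",  # hypothetical image
        cmds=["meltano"],
        arguments=["run", "tap-google-sheets", "target-bigquery"],
        # Per-customer credentials can be injected via env_vars or Kubernetes secrets.
    )
```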
Anup N:
Got it. Thank you, @Edgar Ramírez (Arch.dev). I also have the following follow-up questions:

Question 1: How do I run a Meltano pipeline for multiple accounts/credentials (with different configurations) without changing anything in the source configuration (since once the Meltano project is hosted it should ideally be immutable)? The configuration should be dynamic for both extractors and loaders. For example: I want to run the Google Sheets integration (extractor) and dump the data into a BigQuery table (loader) for each customer via a Meltano pipeline. I should be able to run the same pipeline but with different credentials that each customer provides.

Question 2: Can I run concurrent pipelines? For example: I want to run the same Google Sheets pipeline for 2 OAuth accounts at the same time.

Question 3: How can I configure Meltano to run on a trigger basis (say from an endpoint or a webhook)?
Hi @Edgar Ramírez (Arch.dev), would love to hear your input on this.
Edgar Ramírez (Arch.dev):
Question 1: How do I run a Meltano pipeline for multiple accounts/credentials (with different configurations) without changing anything in the source configuration (since once the Meltano project is hosted it should ideally be immutable)? The configuration should be dynamic for both extractors and loaders.
You could use environment variables: https://docs.meltano.com/guide/configuration/#configuring-settings For example, some people create an ECS task definition with a Meltano container, then pass environment variables at runtime.
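Concretely, Meltano maps every plugin setting to an environment variable of the form `<PLUGIN_NAME>_<SETTING_NAME>`, so the project itself stays immutable and only the runtime environment changes per customer. A sketch; the setting names below are illustrative, and `meltano config <plugin> list` shows the real ones:

```bash
# Same project, different customer: only the environment changes.
export TAP_GOOGLE_SHEETS_SHEET_ID="<customer A's spreadsheet id>"   # illustrative setting name
export TARGET_BIGQUERY_DATASET="customer_a"                         # illustrative setting name
meltano run tap-google-sheets target-bigquery

# You'll likely also want separate incremental state per customer, e.g. via
# `meltano run --state-id-suffix customer_a ...` (check `meltano run --help`
# for your Meltano version).
```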
Question 2: Can I run concurrent pipelines?
This is easy to do in the ECS example I mentioned above: a single task definition can be run concurrently any number of times.
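As a sketch of what that looks like with the AWS CLI, each customer becomes another `run-task` invocation of the same task definition, with environment overrides carrying their credentials. Cluster, task, container, and variable names below are placeholders, and you'd add the launch-type/networking options your cluster needs:

```bash
# Launch the same Meltano task definition for customer A; repeat the command
# with customer B's values and the two tasks run in parallel.
aws ecs run-task \
  --cluster meltano-cluster \
  --task-definition meltano-pipeline \
  --overrides '{
    "containerOverrides": [{
      "name": "meltano",
      "command": ["meltano", "run", "tap-google-sheets", "target-bigquery"],
      "environment": [
        {"name": "TAP_GOOGLE_SHEETS_SHEET_ID", "value": "customer-a-sheet-id"}
      ]
    }]
  }'
```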
Question 3: How can I configure Meltano to run on a trigger basis (say from an endpoint or a webhook)?
This isn't natively supported by Meltano. In particular, the latter two of these requirements are probably outside the scope of Meltano itself, since it's not a long-running service but a command-line application that runs to completion and then exits. So, on its own, it can't fan out over multiple configurations, react to external events via webhooks, or run multiple pipelines in parallel. Is that helpful?
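The usual workaround is to wrap `meltano run` in a small service of your own that the endpoint or webhook hits. A purely illustrative sketch, not a Meltano feature; the route, payload fields, and setting names are assumptions:

```python
# Tiny illustrative wrapper: an HTTP endpoint that kicks off a Meltano pipeline.
# Not part of Meltano itself; route, payload fields, and setting names are made up.
import os
import subprocess

from flask import Flask, request

app = Flask(__name__)


@app.post("/run/gsheets-to-bigquery")
def run_pipeline():
    payload = request.get_json(force=True)
    env = {
        **os.environ,
        # Per-customer settings via Meltano's env-var convention (illustrative names).
        "TAP_GOOGLE_SHEETS_SHEET_ID": payload["sheet_id"],
        "TARGET_BIGQUERY_DATASET": payload["dataset"],
    }
    # Fire-and-forget for brevity; in production, push to a queue (Pub/Sub, Cloud
    # Tasks, Celery, ...) and run the pipeline from a worker instead.
    subprocess.Popen(["meltano", "run", "tap-google-sheets", "target-bigquery"], env=env)
    return {"status": "started"}, 202
```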
Anup N:
Thanks @Edgar Ramírez (Arch.dev), that answered my questions.