#announcements

shy-article-27410

12/14/2020, 3:48 PM
Hi, I'm new to Meltano and am evaluating it for our data/dataops pipeline. Would Meltano be effective for the following use cases?

Use case 1:
1. On a daily basis, discover CSV files with varying file names (the names include date stamps) in S3.
2. Load the data into a table in Postgres or Snowflake. The first time the data is loaded, auto-create the table. On subsequent loads, append new data.

Use case 2: On a periodic basis (say, every 5 minutes), query a source Oracle database and load data into Postgres or Snowflake tables. Data streaming using CDC is not an option since the database is hosted by a third party; we have limited access to the database. We need to run queries to get a subset of the data because there is a lot of bloat (e.g. 90+%) in the source database.

many-farmer-43903

12/14/2020, 4:01 PM
I'm pretty early into my usage/evaluation as well, so I'm curious how the community would assess your use cases. I just built my first custom extractor (Singer tap), and my experience makes me think that both use cases would likely require custom plugin development.

ripe-musician-59933

12/14/2020, 4:38 PM
@shy-article-27410 Meltano is a great fit for both use cases! Use case 1 would need a combination of https://meltano.com/plugins/extractors/spreadsheets-anywhere.html (or https://meltano.com/plugins/extractors/csv.html) and https://meltano.com/plugins/loaders/postgres.html / https://meltano.com/plugins/loaders/snowflake.html. If you go through the Getting Started guide (https://meltano.com/docs/getting-started.html) with those in mind, it should get you all set up.
For use case 2, you would add https://github.com/transferwise/pipelinewise-tap-oracle (or https://github.com/singer-io/tap-oracle) as a custom plugin and use it with https://meltano.com/plugins/loaders/postgres.html / https://meltano.com/plugins/loaders/snowflake.html, or specifically the transferwise variants of those (https://meltano.com/plugins/loaders/postgres--transferwise.html, https://meltano.com/plugins/loaders/snowflake--transferwise.html), which were made to be used with the transferwise variant of tap-oracle.
It looks like https://github.com/transferwise/pipelinewise-tap-oracle has some requirements of your Oracle server (https://transferwise.github.io/pipelinewise/connectors/taps/oracle.html), so if those cannot be met, you may need to write a custom Oracle tap that doesn't depend on those features (i.e. simply runs queries), or contribute that functionality to the existing tap.
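As a rough illustration of how small such a query-based tap can be, here is a sketch of just the sync path using the singer-python library. The `orders` stream, its columns, the connection string, and the query are all hypothetical placeholders; a real tap would also need config parsing, state/bookmark handling, and error handling:

```python
# Minimal sketch of a query-based Singer sync (not a full tap):
# run one hand-written query and emit SCHEMA + RECORD messages to stdout.
import cx_Oracle  # Oracle driver; requires the Oracle client libraries
import singer

# Hypothetical stream: adjust to the table/query you actually need.
STREAM = "orders"
SCHEMA = {
    "type": "object",
    "properties": {
        "ID": {"type": "integer"},
        "STATUS": {"type": "string"},
        "UPDATED_AT": {"type": "string", "format": "date-time"},
    },
}

def sync(dsn, query):
    # The Singer spec requires a SCHEMA message before any RECORD messages.
    singer.write_schema(STREAM, SCHEMA, key_properties=["ID"])
    connection = cx_Oracle.connect(dsn)
    cursor = connection.cursor()
    try:
        cursor.execute(query)
        columns = [column[0] for column in cursor.description]
        for row in cursor:
            record = dict(zip(columns, row))
            # Oracle DATE columns come back as datetime objects;
            # convert them so the record serializes cleanly as JSON.
            for key, value in record.items():
                if hasattr(value, "isoformat"):
                    record[key] = value.isoformat()
            singer.write_record(STREAM, record)
    finally:
        cursor.close()
        connection.close()

if __name__ == "__main__":
    # Hypothetical connection string and query.
    sync(
        "reporting_user/secret@db.example.com:1521/ORCL",
        "SELECT id, status, updated_at FROM orders WHERE updated_at > SYSDATE - 1",
    )
```

Because it only ever runs plain SELECT queries, a tap like this sidesteps the server-side requirements (v$ views, LogMiner, etc.) that a discovery- or CDC-based tap depends on.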

shy-article-27410

12/14/2020, 4:47 PM
Thanks! I'll have a look. At first glance, I don't think the Oracle tap would work since our limited access excludes Oracle v$ views.

ripe-musician-59933

12/14/2020, 4:59 PM
It's possible that access to those views is only necessary for table/column discovery, not for the actual syncing of the data. Since discovery and sync mode are already implemented as separate code paths, I would recommend adding a setting to the existing tap that disables discovery mode (returning an empty catalog: https://meltano.com/docs/integration.html#extractor-catalog-generation), and then providing the full catalog schema and metadata for each stream (table) at sync time using https://meltano.com/docs/integration.html#setting-metadata and https://meltano.com/docs/integration.html#overriding-schemas.
Syncing may just work without any changes, as long as discovery is replaced by the end user manually providing descriptions of the tables.
I'm sure the PipelineWise folks (some of whom roam this Slack workspace as well) would be happy to advise on and accept such a contribution 🙂
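For reference, describing a stream by hand would be expressed in meltano.yml with the `metadata` and `schema` extras from the docs linked above. A rough sketch, assuming a hypothetical `ORDERS` stream with placeholder columns:

```yaml
# meltano.yml (excerpt): supplying catalog metadata and schema by hand
# instead of relying on the tap's discovery mode.
plugins:
  extractors:
    - name: tap-oracle
      metadata:
        ORDERS:                        # hypothetical stream (table) name
          replication-method: FULL_TABLE
          table-key-properties: [ID]
      schema:
        ORDERS:
          ID:
            type: integer
          STATUS:
            type: string
          UPDATED_AT:
            type: string
            format: date-time
```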

shy-article-27410

12/14/2020, 7:20 PM
Thanks for the feedback. I'll start off with the S3/CSV to Postgres use case.
👍 1
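For anyone following along, here is roughly what that setup can look like in meltano.yml with the tap-spreadsheets-anywhere extractor mentioned above. This is a sketch, not a tested config: the bucket, pattern, dates, and connection details are placeholders, and loader setting names vary by variant (check the plugin docs):

```yaml
# meltano.yml (excerpt): pick up date-stamped CSVs from S3 and load
# them into Postgres. target-postgres auto-creates the table on the
# first run and appends on subsequent runs.
plugins:
  extractors:
    - name: tap-spreadsheets-anywhere
      config:
        tables:
          - path: s3://my-data-bucket          # hypothetical bucket
            name: daily_export                 # destination table name
            pattern: "exports/.*\\.csv"        # regex matching the date-stamped files
            start_date: "2020-12-01T00:00:00Z" # ignore files modified before this
            key_properties: []
            format: csv
  loaders:
    - name: target-postgres
      config:
        host: localhost
        user: warehouse
        dbname: analytics
```

With both plugins installed, the pipeline runs as `meltano elt tap-spreadsheets-anywhere target-postgres`.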

gray-cricket-92960

12/15/2020, 1:57 AM
Curious as to how your evaluation goes. I am looking at basically the same use cases. Hoping to prove this out over the holidays.

shy-article-27410

12/15/2020, 7:54 PM
The initial use case of moving data from a CSV file to Postgres was very straightforward. What would be a recommended workflow for receiving/unzipping/staging the files so that only the most recent file is loaded? Is this something best done outside Meltano, or is it something Meltano can handle?

ripe-musician-59933

12/15/2020, 8:41 PM
> The initial use case of moving data from a CSV file to Postgres was very straightforward.
@shy-article-27410 I'm glad to hear it!
> What would be a recommended workflow for receiving/unzipping/staging the files so that only the most recent file is loaded? Is this something best done outside Meltano, or is it something Meltano can handle?
If that workflow can be expressed in a generic enough way that it could be a setting on the tap, you could consider contributing it; https://github.com/ets/tap-spreadsheets-anywhere supports tons of sources and already does unzipping, so adding some logic to limit it to the most recent file (by creation or modification date) seems like it would be in scope 🙂 Otherwise this would live outside of Meltano, in a pipeline step ahead of `meltano elt`. If you're using Airflow, you could modify the default `orchestrate/dags/meltano.py` to add some operators ahead of the BashOperator that runs `meltano schedule run <schedule>`.
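To illustrate that Airflow approach, here is a simplified sketch rather than the actual generated DAG file: the DAG name, task IDs, and staging logic are placeholders, `<schedule>` stands for your schedule's name, and the imports follow the Airflow 1.10 style Meltano used at the time:

```python
# Sketch: run a staging step ahead of the Meltano ELT task in an Airflow DAG.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator


def stage_latest_file(**context):
    # Hypothetical pre-processing: find the newest file in the bucket,
    # unzip it, and move it where the tap's config points.
    pass


with DAG(
    "meltano_csv_pipeline",              # placeholder DAG name
    start_date=datetime(2020, 12, 1),
    schedule_interval="@daily",
) as dag:
    stage = PythonOperator(
        task_id="stage_latest_file",
        python_callable=stage_latest_file,
    )
    elt = BashOperator(
        task_id="meltano_elt",
        # Replace <schedule> with your schedule's name, as in the generated DAG.
        bash_command="meltano schedule run <schedule>",
    )
    stage >> elt  # the staging step runs before the ELT step
```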

shy-article-27410

12/15/2020, 9:56 PM
I've not done much with Python, so I don't think I'd be a good candidate for contributing. I'll look at plugging Meltano into a (ConcourseCI) pipeline or creating an Athena table directly from the S3 CSV files. Thanks for your help.
👍 1