# getting-started
t
and if so, what if we need to scale up into a big pipeline? How can we do replication on a cluster if the replication job is just on a single computer?
a
Welcome, @tu___n_dao_gia! Generally, Meltano runners doing EL jobs (for instance) will run in a container on ECS or similar, so that we are not constrained to a single node running all workloads. Since Meltano is portable and open source, you can parallelize to any number of runners performing workloads at any time.
t
@aaronsteers but for example, in a case like reading from one single table, but it is a very big table
can it parallelize the task across 2 clusters?
a
In theory, yes, but in most cases this is not needed or recommended. The most efficient method is generally just to run each EL pipeline from a single node.
Do you have a specific source in mind which would require additional sharding?
t
@aaronsteers for example, a source that is an S3 folder where files are getting populated frequently
it may have many small files, so in theory, if the load of each file were parallelized, it should be faster
@aaronsteers but like you said, normally the cluster deployment would be
a cluster of many different instances of Meltano
each doing different EL pipelines?
a
Gotcha - yeah, this is a good example case. Meltano doesn't try to be the "cluster" - in your use case, it's very likely that Snowflake or BigQuery (or similar) will ingest directly from S3 without Meltano needing to add much to the process. Sending "COPY FROM" commands to the target system will let the target data warehouse parallelize the work to the maximum degree that the target system supports.
Do you have a specific target in mind?
t
oh, so it seems Meltano will fit use cases that are specifically incremental?
a
Meltano can really fit at several layers in the data stack, since it can be used to wrap a number of open source tools. But for the EL side, the focus is on efficient, stable, and scalable pipelines - where Meltano facilitates communication between an extractor plugin and a loader plugin. Whereas for map-reduce and transform pipelines you'd often need to scale out to a large cluster in order to facilitate map-reduce-type transformations and lookups, Meltano's EL is generally much leaner and focused on the minimal touch needed to make records parseable and ingestible by the target. To your question though, Meltano supports INCREMENTAL, FULL_TABLE, and LOG_BASED replication modes, configurable per table or source stream.
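To make that concrete, a minimal sketch of per-stream replication settings in meltano.yml might look like the snippet below; the tap, the stream names, and the replication key are placeholders assumed for illustration, not details from this thread:

```yaml
# Illustrative per-stream replication settings in meltano.yml
# (tap-postgres, the stream names, and updated_at are assumed placeholders)
plugins:
  extractors:
    - name: tap-postgres
      metadata:
        public-orders:                   # <schema>-<table> stream identifier
          replication-method: INCREMENTAL
          replication-key: updated_at    # bookmark column tracked between runs
        public-audit_log:
          replication-method: LOG_BASED  # follow the database's change log
        public-country_codes:
          replication-method: FULL_TABLE # small table, re-sync fully each run
```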
For very high volume workloads, we've recently introduced BATCH communication between taps and targets, which allows the tap and target to bulk export and then bulk import using the native, fastest-possible data processing and parallelization techniques that the tap and target are capable of. https://sdk.meltano.com/en/latest/batch.html#the-batch-message
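For SDK-based taps that support it, the batch behavior is typically driven by a batch_config setting; here is a hedged sketch of what that could look like in meltano.yml (the plugin name and storage root are assumptions, and the linked SDK docs are the authoritative reference for the exact shape):

```yaml
# Illustrative batch_config for an SDK-based extractor in meltano.yml
# (tap-example and the storage root are placeholders for this sketch)
plugins:
  extractors:
    - name: tap-example
      config:
        batch_config:
          encoding:
            format: jsonl              # records exported as JSONL batch files
            compression: gzip
          storage:
            root: file:///tmp/batches  # local path or object-store URI
            prefix: orders-batch-
```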
t
@aaronsteers awesome thank you so much
great to know more about how Meltano works; recently my clients have been talking about Meltano, so I guess I have some catching up to do
a
Happy to help and glad you are here!
t
@aaronsteers so the default deployment of Meltano is very lightweight, right?
maybe I will make some connectors to run on my home server to test it out
a
Totally! Let us know if you get stuck or need assistance. Generally, Meltano EL can run on pretty small instances. Around 2 GB of RAM is generally fine, and most RAM usage is just buffering between the tap and target when targets are slow. If running on a home server, you'll eventually want to decide what you want to use for the transformation layer. Most users pick one of the dbt options here on the Hub; the most common choices are Snowflake or Redshift, but DuckDB is a new and lightweight alternative that is fully open source and doesn't require a cloud account.
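If it helps, a tiny single-node project can be sketched roughly like this; the plugin choices (tap-csv, target-duckdb) and the DuckDB file path are assumptions picked for illustration, not recommendations from this thread:

```yaml
# Minimal illustrative meltano.yml for a small home-server project
# (plugins would normally be added with `meltano add`, which fills in pip_url etc.)
version: 1
default_environment: dev
environments:
  - name: dev
plugins:
  extractors:
    - name: tap-csv            # example extractor from Meltano Hub
  loaders:
    - name: target-duckdb      # lightweight local warehouse, no cloud account needed
      config:
        filepath: ./warehouse.duckdb   # assumed setting name for this sketch
```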
Many users have also successfully built pipelines that run entirely in GitHub or GitLab CI runners 😅
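As a rough sketch of that CI-runner pattern, a scheduled GitHub Actions workflow could look something like the following; the plugin names, schedule, and secret are assumptions for illustration:

```yaml
# Hypothetical .github/workflows/elt.yml — nightly EL job on a hosted runner
name: nightly-elt
on:
  schedule:
    - cron: "0 3 * * *"          # once per day at 03:00 UTC
jobs:
  elt:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - name: Install Meltano
        run: pip install meltano
      - name: Install project plugins
        run: meltano install
      - name: Run the EL pipeline
        run: meltano run tap-example target-example   # placeholder plugin names
        env:
          TAP_EXAMPLE_API_KEY: ${{ secrets.TAP_EXAMPLE_API_KEY }}  # example secret
```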
t
I see … I have a dbt project living in a Git CI runner as well
just daily refresh