# getting-started
t
and if so, what if we need to scale up into a big pipeline? How can we do replication on a cluster if the replication job is just on a single computer?
a
Welcome, @tu___n_dao_gia! Generally, Meltano runners doing EL jobs (for instance) will run in a container on ECS or similar, so that we are not constrained to a single node running all workloads. Since Meltano is portable and open source, you can parallelize to any number of runners performing workloads at any time.
t
@aaronsteers but for example, in a case like reading from one single table, but it is a very big table
can it parallelize the task across 2 clusters?
a
In theory, yes, but in most cases this is not needed or recommended. The most efficient method is generally just to run each EL pipeline from a single node.
Do you have a specific source in mind which would require additional sharding?
t
@aaronsteers for example, a source that is an S3 folder where files are getting populated frequently
it may have many small files, so in theory, if the load of each file were parallelized, it should be faster
@aaronsteers but like you said, normally the cluster deployment would be
a cluster of many different instances of Meltano
each doing different EL pipelines?
a
Gotcha - yeah, this is a good example case. Meltano doesn't try to be the "cluster" - in your use case, it's very likely that Snowflake or BigQuery (or similar) will ingest directly from S3 without Meltano needing to add much to the process. Sending "COPY FROM" commands to the target system will let the target data warehouse parallelize the work to the maximum degree that the target system supports.
Do you have a specific target in mind?
t
oh, so it seems Meltano will fit use cases that are specifically incremental?
a
Meltano can really fit at several layers in the data stack, since it can be used to wrap a number of open source tools. But for the EL side, the focus is on efficient, stable, and scalable pipelines - where Meltano facilitates communication between an extractor plugin and a loader plugin. Whereas for map-reduce and transform pipelines you'd often need to scale out to a large cluster in order to facilitate map-reduce-type transformations and lookups, Meltano's EL is generally much leaner and focused on the minimal touch needed to make records parseable and ingestible by the target. To your question though, Meltano supports INCREMENTAL, FULL_TABLE, and LOG_BASED replication modes, configurable per table or source stream.
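To make that concrete, a minimal sketch of per-stream replication settings in meltano.yml might look like the snippet below; the tap, the stream names, and the replication key are placeholders assumed for illustration, not details from this thread:

```yaml
# Illustrative per-stream replication settings in meltano.yml
# (tap-postgres, the stream names, and updated_at are assumed placeholders)
plugins:
  extractors:
    - name: tap-postgres
      metadata:
        public-orders:                   # <schema>-<table> stream identifier
          replication-method: INCREMENTAL
          replication-key: updated_at    # bookmark column tracked between runs
        public-audit_log:
          replication-method: LOG_BASED  # follow the database's change log
        public-country_codes:
          replication-method: FULL_TABLE # small table, re-sync fully each run
```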
For very high volume workloads, we've recently introduced BATCH communication between taps and targets, which allows the tap and target to bulk export and then bulk import using the native, fastest-possible data processing and parallelization techniques that the tap and target are capable of. https://sdk.meltano.com/en/latest/batch.html#the-batch-message
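For SDK-based taps that support it, the batch behavior is typically driven by a batch_config setting; here is a hedged sketch of what that could look like in meltano.yml (the plugin name and storage root are assumptions, and the linked SDK docs are the authoritative reference for the exact shape):

```yaml
# Illustrative batch_config for an SDK-based extractor in meltano.yml
# (tap-example and the storage root are placeholders for this sketch)
plugins:
  extractors:
    - name: tap-example
      config:
        batch_config:
          encoding:
            format: jsonl              # records exported as JSONL batch files
            compression: gzip
          storage:
            root: file:///tmp/batches  # local path or object-store URI
            prefix: orders-batch-
```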
t
@aaronsteers awesome thank you so much
great to know more about how Meltano works; recently my clients have been talking about Meltano, so I guess I have some catching up to do
a
Happy to help and glad you are here!
t
@aaronsteers so the default deployment of Meltano is very lightweight, right?
maybe I will make some connectors to run on my home server to test it out
a
Totally! Let us know if you get stuck or need assistance. Generally, Meltano EL can run on pretty small instances. Around 2 GB of RAM is generally fine, and most RAM usage is just buffering between the tap and target when targets are slow. If running on a home server, you'll eventually want to decide what you want to use for the transformation layer. Most users pick one of the dbt options here on the Hub; the most common choices are Snowflake or Redshift, but DuckDB is a new and lightweight alternative that is fully open source and doesn't require a cloud account.
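If it helps, a tiny single-node project can be sketched roughly like this; the plugin choices (tap-csv, target-duckdb) and the DuckDB file path are assumptions picked for illustration, not recommendations from this thread:

```yaml
# Minimal illustrative meltano.yml for a small home-server project
# (plugins would normally be added with `meltano add`, which fills in pip_url etc.)
version: 1
default_environment: dev
environments:
  - name: dev
plugins:
  extractors:
    - name: tap-csv            # example extractor from Meltano Hub
  loaders:
    - name: target-duckdb      # lightweight local warehouse, no cloud account needed
      config:
        filepath: ./warehouse.duckdb   # assumed setting name for this sketch
```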
Many users have also successfully built pipelines that run entirely in GitHub or GitLab CI runners 😅
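As a rough sketch of that CI-runner pattern, a scheduled GitHub Actions workflow could look something like the following; the plugin names, schedule, and secret are assumptions for illustration:

```yaml
# Hypothetical .github/workflows/elt.yml — nightly EL job on a hosted runner
name: nightly-elt
on:
  schedule:
    - cron: "0 3 * * *"          # once per day at 03:00 UTC
jobs:
  elt:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - name: Install Meltano
        run: pip install meltano
      - name: Install project plugins
        run: meltano install
      - name: Run the EL pipeline
        run: meltano run tap-example target-example   # placeholder plugin names
        env:
          TAP_EXAMPLE_API_KEY: ${{ secrets.TAP_EXAMPLE_API_KEY }}  # example secret
```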
t
I see … I have a dbt project living in a Git CI runner as well
just daily refresh