# troubleshooting
d
I come in search of general advice for performance and tuning 🙏 We are using the transferwise variants of tap-postgres and target-snowflake. Our team adds many new tables weekly and our runs are taking too long. The node running Meltano is under-utilized, so we are quite certain the bottleneck is in the application, and would welcome any advice or guidance on where to look.

**Scaling Meltano**: My understanding is that Meltano does not support multi-threading, and that scaling Meltano itself would involve splitting our tables across multiple instances of Meltano. Is there another approach to improving Meltano performance that I am not seeing?

**Scaling the tap/target**: I think the main knobs we can turn here are `itersize`, to increase the size of the batch tap-postgres reads from the source, and `batch_size_rows`, to increase the number of records loaded to Snowflake at a time. Are there other parameters worth tuning here?

Thank you in advance for any advice or feedback!
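For reference, here is roughly where those two knobs would sit in meltano.yml; the values below are illustrative, not recommendations:
```yaml
plugins:
  extractors:
    - name: tap-postgres
      variant: transferwise
      config:
        itersize: 100000          # rows fetched per server-side cursor batch (illustrative)
  loaders:
    - name: target-snowflake
      variant: transferwise
      config:
        batch_size_rows: 500000   # rows buffered before each load to Snowflake (illustrative)
```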
d
I think you might want to look into orchestration with Airflow.
Taps and targets that support the BATCH message would also help.
d
Thanks, this Airflow link is really helpful. I have been moving all my tables in a single `meltano run tap target` command, but it looks like the strategy is more like issuing one `meltano run tap target --select=tablename` per table, defining each table as a pipeline, and scheduling the pipelines. Is that closer to it?
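Concretely, I'm picturing something like the following in meltano.yml, with one inherited copy of the tap per table (narrowed via the select extra) plus a job and schedule per table. A rough sketch with placeholder names:
```yaml
plugins:
  extractors:
    - name: tap-postgres--orders      # hypothetical per-table copy of the tap
      inherit_from: tap-postgres
      select:
        - public-orders.*             # only extract this one table
jobs:
  - name: load-orders
    tasks:
      - tap-postgres--orders target-snowflake
schedules:
  - name: load-orders-hourly
    job: load-orders
    interval: "@hourly"
```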
d
I think so.
What sort of ingestion are you using? full-table, incremental, or log-based?
We're also using the transferwise tap-postgres. One thing we've tried is increasing the MELTANO_ELT_BUFFER_SIZE env var. I'm not sure how much this will help, but it's at least another knob for you to try.
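If you'd rather keep it in project config than an env var, I believe the equivalent lives at the top level of meltano.yml; the value below is just an example, in bytes:
```yaml
elt:
  buffer_size: 104857600   # bytes; the default is ~10 MiB, this example raises it to ~100 MiB
```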
d
🙏 Amazing, you're more than helpful. We're using key-based incremental for most tables, but a few large ones don't have replication keys, so we're doing full_table for those. I definitely want to try that buffer size.
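In case it helps anyone else reading: the key-based incremental streams can be configured through the extractor's metadata extra in meltano.yml, roughly like this (table and column names are made up):
```yaml
plugins:
  extractors:
    - name: tap-postgres
      metadata:
        public-mytable:                    # made-up stream name
          replication-method: INCREMENTAL  # key-based incremental
          replication-key: updated_at      # made-up bookmark column
```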
d
We're also using key-based incremental mostly. What are the large tables like? You really want to be able to do incremental on those, if possible. You can use a mapper to define key properties if the underlying table doesn't have them (or is a view), e.g. we use:
```yaml
mappers:
  - name: meltano-map-transformer
    variant: meltano
    pip_url: git+https://github.com/MeltanoLabs/meltano-map-transform.git@v0.7.0
    mappings:
    - name: mapper-mydb
      config:
        stream_maps:
          public-mytable:
            __key_properties__:
              - team_id
              - user_id
              - created
```
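If I remember right, once a named mapping like that is defined, it can be dropped between the tap and target by name, e.g. `meltano run tap-postgres mapper-mydb target-snowflake` (plugin names per the snippet above).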
On large tables, we have a different problem: reading too many rows in one go causes problems on the source database, so we use the `limit` parameter to cap reads at 250k rows at a time.
It's worth checking, too, whether target-snowflake and the MeltanoLabs variant of tap-postgres support the BATCH message. I've heard that this speeds things up dramatically by bypassing the row-at-a-time JSON serialisation.
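If you do try the SDK-based MeltanoLabs variants, my understanding is that BATCH behaviour is driven by a batch_config block in the tap's config. A rough sketch, with placeholder paths and values:
```yaml
plugins:
  extractors:
    - name: tap-postgres
      variant: meltanolabs            # SDK-based variant; the transferwise one doesn't emit BATCH
      config:
        batch_config:
          encoding:
            format: jsonl             # batch files written as gzipped JSONL
            compression: gzip
          storage:
            root: "file:///tmp/meltano-batches"   # placeholder staging location
            prefix: "batch-"
```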