# troubleshooting
d
I come in search of general advice for performance and tuning 🙏 We are using the transferwise variants of tap-postgres and target-snowflake. Our team adds many new tables weekly and our runs are taking too long. The node running Meltano is under-utilized, so we are quite certain the bottleneck is in the application, and would welcome any advice or guidance on where to look.

**Scaling Meltano**: My understanding is that Meltano does not support multi-threading, and that scaling Meltano itself would involve splitting our tables across multiple instances of Meltano. Is there another approach to improving Meltano performance that I am not seeing?

**Scaling the tap/target**: I think the main knobs we can turn here are `itersize`, to increase the size of the batch tap-postgres reads from the source, and `batch_size_rows`, to increase the number of records loaded to Snowflake at a time. Are there other parameters worth tuning here?

Thank you in advance for any advice or feedback!
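For reference, here is roughly where those two knobs would sit in meltano.yml; the values below are illustrative, not recommendations:
```yaml
plugins:
  extractors:
    - name: tap-postgres
      variant: transferwise
      config:
        itersize: 100000          # rows fetched per server-side cursor batch (illustrative)
  loaders:
    - name: target-snowflake
      variant: transferwise
      config:
        batch_size_rows: 500000   # rows buffered before each load to Snowflake (illustrative)
```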
d
I think you might want to look into orchestration with Airflow.
Taps and targets that support the BATCH message would also help.
d
Thanks, this Airflow link is really helpful. I have been moving all my tables in a single `meltano run tap target` command, but it looks like the strategy is more like issuing one `meltano run tap target --select=tablename` per table, defining each table as a pipeline, and scheduling the pipelines. Is that closer to it?
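Concretely, I'm picturing something like the following in meltano.yml, with one inherited copy of the tap per table (narrowed via the select extra) plus a job and schedule per table. A rough sketch with placeholder names:
```yaml
plugins:
  extractors:
    - name: tap-postgres--orders      # hypothetical per-table copy of the tap
      inherit_from: tap-postgres
      select:
        - public-orders.*             # only extract this one table
jobs:
  - name: load-orders
    tasks:
      - tap-postgres--orders target-snowflake
schedules:
  - name: load-orders-hourly
    job: load-orders
    interval: "@hourly"
```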
d
I think so.
What sort of ingestion are you using? full-table, incremental, or log-based?
We're also using the transferwise tap-postgres. One thing we've tried is increasing the MELTANO_ELT_BUFFER_SIZE env var. I'm not sure how much this will help, but it's at least another knob for you to try.
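If you'd rather keep it in project config than an env var, I believe the equivalent lives at the top level of meltano.yml; the value below is just an example, in bytes:
```yaml
elt:
  buffer_size: 104857600   # bytes; the default is ~10 MiB, this example raises it to ~100 MiB
```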
d
🙏 Amazing, you're more than helpful. We're using key-based incremental for most tables, but a few large ones don't have replication keys, so we're doing full_table for those. I definitely want to try that buffer size.
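In case it helps anyone else reading: the key-based incremental streams can be configured through the extractor's metadata extra in meltano.yml, roughly like this (table and column names are made up):
```yaml
plugins:
  extractors:
    - name: tap-postgres
      metadata:
        public-mytable:                    # made-up stream name
          replication-method: INCREMENTAL  # key-based incremental
          replication-key: updated_at      # made-up bookmark column
```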
d
We're also using key-based incremental mostly. What are the large tables like? You really want to be able to do incremental on those, if possible. You can use a mapper to define key properties if the underlying table doesn't have them (or is a view), e.g. we use:
```yaml
mappers:
  - name: meltano-map-transformer
    variant: meltano
    pip_url: git+https://github.com/MeltanoLabs/meltano-map-transform.git@v0.7.0
    mappings:
    - name: mapper-mydb
      config:
        stream_maps:
          public-mytable:
            __key_properties__:
              - team_id
              - user_id
              - created
```
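If I remember right, once a named mapping like that is defined, it can be dropped between the tap and target by name, e.g. `meltano run tap-postgres mapper-mydb target-snowflake` (plugin names per the snippet above).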
On large tables, we have a different problem: reading too many rows in one go causes problems on the source database, so we use the `limit` parameter to cap reads at 250k rows at a time.
It's worth checking, too, whether target-snowflake and the MeltanoLabs variant of tap-postgres support the BATCH message. I've heard that this speeds things up dramatically by bypassing the row-at-a-time JSON serialisation.
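If you do try the SDK-based MeltanoLabs variants, my understanding is that BATCH behaviour is driven by a batch_config block in the tap's config. A rough sketch, with placeholder paths and values:
```yaml
plugins:
  extractors:
    - name: tap-postgres
      variant: meltanolabs            # SDK-based variant; the transferwise one doesn't emit BATCH
      config:
        batch_config:
          encoding:
            format: jsonl             # batch files written as gzipped JSONL
            compression: gzip
          storage:
            root: "file:///tmp/meltano-batches"   # placeholder staging location
            prefix: "batch-"
```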