# troubleshooting
j
I am currently running tap-mssql (BuzzCutNorman) + target-snowflake (Meltano) for my ingestion, and Dagster for my orchestration. Got a theoretical question regarding parallelization of streams 🧵 I have split up my extractors into individual YMLs based on business logic, and I have Dagster run 4 sets of YMLs at a time (parallelization set to 4). I was wondering if there are settings in Meltano that tell the program to fan out each object under your `select` into its own "process", for lack of a better term
As an example, I broke out my tables into individual YMLs like so
Each YML looks something like this
```yaml
plugins:
  extractors:
  - name: tap-mssql-closures
    inherit_from: tap-mssql
    config:
      stream_maps:
        Admin-Closure:
          ClosureVersionId: __NULL__
    # Any table in the select list below without a metadata entry defaults to FULL_TABLE replication
    select:
    - Admin-Closure.*
    - Admin-ClosureAllocation.*
    - Admin-ClosureReasonCode.* # No incremental 
    - Admin-ClosureReasonCodeLocalization.* # No incremental 
    - Admin-ClosureSettings.* # No incremental 
    - Admin-ClosureVersion.*
    - Admin-ClosureZoneAllocation.*
    - Admin-ClosureZoneEntryAllocation.*
    metadata:
      Admin-Closure:
        replication-method: INCREMENTAL
        replication-key: ClosureId
      Admin-ClosureVersion:
        replication-method: INCREMENTAL
        replication-key: LastEditDate
      Admin-ClosureAllocation:
        replication-method: INCREMENTAL
        replication-key: LastEditDate
      Admin-ClosureZoneEntryAllocation:
        replication-method: INCREMENTAL
        replication-key: LastEditDate
      Admin-ClosureZoneAllocation:
        replication-method: INCREMENTAL
        replication-key: LastEditDate
```
So when Dagster runs, I've told it to pick up each extractor name and run it, so `tap-mssql-admin`, `tap-mssql-closures`, etc.
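For context, the Dagster side is roughly this shape (a simplified sketch, not my exact code: the extractor list, op names, and job config are stand-ins, and it just shells out to the Meltano CLI):
```python
import subprocess

from dagster import job, op

# One inherited extractor per YML; names here are illustrative.
EXTRACTORS = ["tap-mssql-admin", "tap-mssql-closures"]


def make_meltano_op(extractor: str):
    @op(name=f"run_{extractor.replace('-', '_')}")
    def _run() -> None:
        # Shell out to the Meltano CLI; Dagster's multiprocess executor
        # supplies the parallelism across ops.
        subprocess.run(["meltano", "run", extractor, "target-snowflake"], check=True)

    return _run


@job(
    config={
        # Cap the default multiprocess executor at 4 concurrent ops,
        # matching the "4 sets of YMLs at a time" setup.
        "execution": {"config": {"multiprocess": {"max_concurrent": 4}}}
    }
)
def meltano_ingestion():
    for extractor in EXTRACTORS:
        make_meltano_op(extractor)()
```
Each op is independent of the others, so the multiprocess executor runs up to 4 of them side by side.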
But sometimes some of the tables are MUCH bigger than the others, and even if they start right away, they take significantly longer than the other streams and would benefit from their own parallelization
I can break out those big tables into standalone streams, but then it's kind of annoying having tons of one-off YMLs for single tables
So I was wondering if there's any functionality or setting that tells Meltano to fan out the names under the `select` and do all of them in parallel (or up to a limit of how many you can do at once)
So that Dagster still runs my "4" extractors at a time, but each extractor then also fans out and runs with some kind of parallelization so that it's not stuck doing 1 table at a time
If that's not possible then I'll just stick with problematic tables having their own names/streams 😅
v
https://github.com/meltano/meltano/issues/2677 and there are some chats in the GitLab thread that go pretty far. To do this today I use the `select_filter` option in Meltano via env variables. I have a process that runs these, and then we schedule N number of jobs (300 in my case) that can all run independently of one another.
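Roughly, the shape of it is something like this (a simplified sketch, not my actual setup: the stream list, plugin name, and pool size are stand-ins, and it assumes Meltano's `<PLUGIN_NAME>__SELECT_FILTER` env var convention for plugin extras):
```python
import json
import os
import subprocess
from concurrent.futures import ThreadPoolExecutor

# Streams copied from the YML above; the list and pool size are illustrative.
STREAMS = [
    "Admin-Closure",
    "Admin-ClosureAllocation",
    "Admin-ClosureVersion",
]


def run_stream(stream: str) -> None:
    env = os.environ.copy()
    # Meltano reads plugin extras from env vars named <PLUGIN_NAME>__SELECT_FILTER
    # (a JSON array of stream names), which narrows this run to a single table.
    env["TAP_MSSQL_CLOSURES__SELECT_FILTER"] = json.dumps([stream])
    subprocess.run(
        ["meltano", "run", "tap-mssql-closures", "target-snowflake"],
        env=env,
        check=True,
    )


# Threads are fine here: each worker just waits on its own `meltano run` process.
with ThreadPoolExecutor(max_workers=4) as pool:
    for future in [pool.submit(run_stream, s) for s in STREAMS]:
        future.result()  # re-raise any failures
```
One caveat when running the same extractor concurrently: the runs share a state ID by default, so you'd want distinct state per stream (e.g. `meltano run --state-id-suffix`, if your Meltano version has it) to keep incremental bookmarks from clobbering each other.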
j
Melturbo
😂 ❤️
😂 1
Thanks Derek, I'll keep an eye on this
v
I don't know that anyone's working on it right now; I just know that people do it (including me), but it's really specific to our orchestrators and sometimes even the tap/target we're using
💯 1
j
Yeah, I am assuming it's not being looked at right now, and that's OK, I have workarounds.
➕ 1
But maybe one day 🎄 🎅
💯 1