hey all! I have a custom utiliy like so: ```plugin...
# troubleshooting
k
hey all! I have a custom utiliy like so:
Copy code
plugins:
  utilities:  
  - name: transform-google-sheets
      namespace: transform-google-sheets
      commands:
        run_script:
          executable: python
          args: extract/integrations/google_sheet_transform_helper.py
when i reference this in my meltano.yml
Copy code
- name: ingest_googlesheets_meltano
  tasks:
    - tap-google-sheets
    - transform-google-sheets:run_script
    - target-snowflake-googleSheets
i get some block violations -
block violates set requirements: Unknown command type or bad block sequence at index 1, starting block 'tap-google-sheets'
. Anyone know the proper syntax here?
βœ… 1
r
A valid
run
block is either
[<tap>, <target>]
or
[<tap>, <mapping>, target]
, not
[<tap>, <utility>, <target>]
. As far as using
python
as the
executable
, I don't know if what you're trying to do is possible currently (i.e. run a script as a mapping between tap stdout and target stdin). Maybe try moving the definition under
mappers
(as opposed to
utilities
)? I suspect you will run into another block violation error since you would be referencing a mapper plugin, rather than a
mapping
(which I don't believe supports
commands
). https://github.com/meltano/meltano/pull/9105 might make it work as a mapper though...
πŸ‘ 1
k
got it, thanks! Ill give it a try πŸ™‚ for context, im essentially flattening the data and just making new a couple new columns so nothing crazy (so a simple data manipulation). I initially had a mapper but i think that only renames columns or perhaps cast data types not make new columns explicitly. ie if i have in the gsheet
Copy code
STATUSA STATUSB STATUSC
user1   user2   user3
the result would be below before landing in snowflake
Copy code
user  status
user1 STATUSA
user2 STATUSB
user3 STATUSC
I could do this using other transformation tools after ingest but in this use case it would have to be transformed before it lands in snowflake which in this case is easiest to do with a siumple python script @Reuben (Matatika)
r
What new columns did you want to add? That is definitely possible with stream maps, which along with flattening
tap-google-sheets
supports out-the-box:
Copy code
- name: tap-google-sheets
    variant: matatika
    pip_url: <https://github.com/Matatika/tap-google-sheets.git|https://github.com/Matatika/tap-google-sheets.git>
    config:
      flattening_enabled: true
      flattening_max_depth: 1
      stream_maps:
        <sheet name>:
          new_property: '"test"'
πŸ‘ 1
k
Oh sweet , also new column would be a status column that maps each user to a status rather then each status being its own column of users
r
How are you determining what status to assign to a user?
k
It’s actually already there in the sheet to begin with. If status is current , that column already has a list of user ids. Thing is , because we went to track historical changes to the user, it makes sense to have the user id be the primary key and a status column explicitly stating the status. So it’s just a re manipulation of the data. Not determining anything new
Similar to one hot encoding
r
Right, so for a single row you want to extract each user out along with the rest of the row as a record i.e one row in the sheet with three user IDs should output three records?
k
Yes!
Actually well close, the three status columns would be a value in a general status column but nevertheless I feel this is possible so thanks for guiding
πŸ‘ 1
hmm so running into a small block and wondering if you knew how to get around this. so test google shee ti have is included in the pic below. when i add this stream map to the tap:
Copy code
stream_maps:
        testsheetkevin:  # Name of your Google Sheet tab
          __key_properties__: ["user_id", "status"]
          user_id: >
            record["Current"] if "Current" in record and record["Current"] else
            record["demoted"] if "demoted" in record and record["demoted"] else
            record["Watchlist"] if "Watchlist" in record and record["Watchlist"] else None
          status: >
            "Current" if "Current" in record and record["Current"] else
            "demoted" if "demoted" in record and record["demoted"] else
            "Watchlist" if "Watchlist" in record and record["Watchlist"] else None
          Current:  __NULL__
          demoted: __NULL__
          Watchlist: __NULL__
I get the output json
Copy code
{"type":"RECORD","stream":"testsheetkevin","record":{"user_id":"US_111","status":"Current"},"time_extracted":"2025-03-10T14:07:17.933276+00:00"}
{"type":"RECORD","stream":"testsheetkevin","record":{"user_id":"US_222","status":"Current"},"time_extracted":"2025-03-10T14:07:17.933402+00:00"}
{"type":"RECORD","stream":"testsheetkevin","record":{"user_id":"US_333","status":"Current"},"time_extracted":"2025-03-10T14:07:17.933467+00:00"}
{"type":"RECORD","stream":"testsheetkevin","record":{"user_id":"US_444","status":"Current"},"time_extracted":"2025-03-10T14:07:17.933523+00:00"}
which is what i want but it seems to be only doing it for the
current
column. Any idea how i can get the same for demoted and watchlist as well? Thought id ask before trying something janky
ideally i would like to have had
Copy code
{"type":"RECORD","stream":"testsheetkevin","record":{"user_id":"US_111","status":"Current"},"time_extracted":"2025-03-10T14:07:17.933276+00:00"}
{"type":"RECORD","stream":"testsheetkevin","record":{"user_id":"US_222","status":"Current"},"time_extracted":"2025-03-10T14:07:17.933402+00:00"}
{"type":"RECORD","stream":"testsheetkevin","record":{"user_id":"US_333","status":"Current"},"time_extracted":"2025-03-10T14:07:17.933467+00:00"}
{"type":"RECORD","stream":"testsheetkevin","record":{"user_id":"US_444","status":"Current"},"time_extracted":"2025-03-10T14:07:17.933523+00:00"}

{"type":"RECORD","stream":"testsheetkevin","record":{"user_id":"US_555","status":"demoted"},"time_extracted":"2025-03-10T14:07:17.933276+00:00"}
{"type":"RECORD","stream":"testsheetkevin","record":{"user_id":"US_666","status":"demoted"},"time_extracted":"2025-03-10T14:07:17.933402+00:00"}
{"type":"RECORD","stream":"testsheetkevin","record":{"user_id":"US_777","status":"demoted"},"time_extracted":"2025-03-10T14:07:17.933467+00:00"}
{"type":"RECORD","stream":"testsheetkevin","record":{"user_id":"US_888","status":"demoted"},"time_extracted":"2025-03-10T14:07:17.933523+00:00"}

{"type":"RECORD","stream":"testsheetkevin","record":{"user_id":"US_999","status":"Watchlist"},"time_extracted":"2025-03-10T14:07:17.933276+00:00"}
{"type":"RECORD","stream":"testsheetkevin","record":{"user_id":"US_100","status":"Watchlist"},"time_extracted":"2025-03-10T14:07:17.933402+00:00"}
{"type":"RECORD","stream":"testsheetkevin","record":{"user_id":"US_101","status":"Watchlist"},"time_extracted":"2025-03-10T14:07:17.933467+00:00"}
{"type":"RECORD","stream":"testsheetkevin","record":{"user_id":"US_102","status":"Watchlist"},"time_extracted":"2025-03-10T14:07:17.933523+00:00"}
r
IIRC the tap handles one row as one record. I assume restructuring the sheet is not an option? You might be able to get something referencing the same sheet 3x in
sheets
config with a different
range
set for each. Might look something like this:
Copy code
config:
      sheets:
      - sheet_id: <sheet ID>
        child_sheet_name: testsheetkevin
        output_name: Current
        range: <Current range>
      - sheet_id: <sheet ID>
        child_sheet_name: testsheetkevin
        output_name: demoted
        range: <demoted range>
      - sheet_id: <sheet ID>
        child_sheet_name: testsheetkevin
        output_name: Watchlist
        range: <Watchlist range>
      stream_maps:
        "*":
          __key_properties__:
          - user_id
          - status
          user_id: record[__stream_name__]
          status: __stream_name__
πŸ‘€ 1
k
thanks for the help! I ended up creating a custom extract and load and ran it like so:
Copy code
- name: extract_google_sheets
      namespace: extract_google_sheets
      env:
        SHEET_ID: "1test"
      commands:
        extract:
          executable: python
          args: extract/integrations/extract_google_sheets.py
          description: >
            Extracts Google Sheets data and stores in a JSONL file.
Copy code
- name: load_google_sheets_snowflake
      namespace: load_google_sheets_snowflake
      commands:
        load:
          executable: python
          args: extract/integrations/load_google_sheets_to_snowflake.py
          description: >
            Loads extracted Google Sheets data from JSONL into Snowflake.
Copy code
- name: ingest_google_sheets_pc
  tasks:
    - extract_google_sheets:extract
    - load_google_sheets_snowflake:load
This gave me autonomy over what i wanted to happen since the destination table in snowflake was going to be SCD type 2 along with a whole other set of requirements πŸ‘
πŸ‘ 1