# troubleshooting
a
Hello everyone, I'm having some trouble with a Meltano extraction. I did not write it; I just joined the company a few weeks ago:
```
2024-07-15T23:37:18.301932Z [error    ] Cannot start plugin tap-postgres: [Errno 7] Argument list too long: '/project/.meltano/extractors/tap-postgres/venv/bin/tap-postgres'
2024-07-15T23:37:18.302143Z [error    ] Block run completed.           block_type=ExtractLoadBlocks err=RunnerError("Cannot start plugin tap-postgres: [Errno 7] Argument list too long: '/project/.meltano/extractors/tap-postgres/venv/bin/tap-postgres'") exit_codes={} set_number=0 success=False
```
I've been researching this error on Google and I see it's a common error (not specifically Meltano-related, though). Is there any way I could troubleshoot this further to see what causes it? For context, I work for a B2B SaaS. We have Meltano deployed in our customers' instances to fetch production data back to us, and it was working fine until we added more tables to what we fetch every night.
āœ… 1
r
Never seen that error before when running Meltano... Assuming you are on Linux, and going by https://unix.stackexchange.com/a/45584, I would run
```shell
ulimit -s
```
and see what the value is. For context, mine appears to be the default of `8192`. If that is not the case for you, try increasing it with
```shell
ulimit -s 8192
```
and re-run the Meltano command.
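For what it's worth, the same check can be done from Python via the standard `resource` module (POSIX-only); a minimal sketch, not from the thread:

```python
import resource

def stack_limit_kib():
    """Return the soft stack-size limit in KiB (what `ulimit -s` reports),
    or None if the limit is unlimited."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_STACK)
    return None if soft == resource.RLIM_INFINITY else soft // 1024
```

This is handy when you want a pipeline to log its own limits at startup instead of shelling out to `ulimit`.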
a
I've seen the `ulimit` suggestion, but it didn't seem very satisfying to me, since it would only fix the symptom and not the root cause. I've just had a talk with my boss; apparently they have hit this problem before, and it's caused by our way of doing things:
• We execute a query on the database to fetch all tables and their columns
• For each table, we generate a .yaml file to specify the schema
• We then concatenate all the .yaml files into a giant one that we give to Meltano
So the fix would be either to reduce the list of tables we fetch, or to play with the ulimit parameter
šŸ‘€ 1
r
So you are generating a `meltano.yml`? What does that look like?
> it was working fine until we added more tables to what we fetch every night.
Sorry, didn't actually read this before my first reply. šŸ˜… If you run
```shell
meltano --log-level debug invoke tap-postgres
```
you can see the commands Meltano is using to invoke the plugin, which looks like where the error is coming from. Maybe that helps you a bit?
```shell
$ meltano --log-level debug invoke tap-spotify 2>&1 | grep Invoking
2024-07-17T09:06:18.490896Z [debug    ] Invoking: ['/home/reuben/Documents/taps/tap-spotify/.meltano/extractors/tap-spotify/venv/bin/tap-spotify', '--config', '/home/reuben/Documents/taps/tap-spotify/.meltano/run/tap-spotify/tap.4bc939b2-c3a0-4b07-8ea9-a880ff54d4e2.config.json', '--discover']
2024-07-17T09:06:19.165393Z [debug    ] Invoking: ['/home/reuben/Documents/taps/tap-spotify/.meltano/extractors/tap-spotify/venv/bin/tap-spotify', '--config', '/home/reuben/Documents/taps/tap-spotify/.meltano/run/tap-spotify/tap.4bc939b2-c3a0-4b07-8ea9-a880ff54d4e2.config.json', '--catalog', '/home/reuben/Documents/taps/tap-spotify/.meltano/run/tap-spotify/tap.properties.json']
```
a
This is the `meltano.yml`:
```yaml
version: 1
default_environment: dev
project_id: 6f5c2fd5-2ec4-4ade-bbf0-5e03fc7931fd
environments:
- name: dev
- name: staging
- name: prod
plugins:
  extractors:
  - name: tap-postgres
    variant: meltanolabs
    pip_url: git+https://github.com/MeltanoLabs/tap-postgres.git
    config:
      database: aciso
      user: aciso
      host: db
      password: ${POSTGRES_PASSWORD}
      filter_schemas:
      - public
      port: 5432
      stream_maps: {}
    metadata: {}
    select: {}
  loaders:
  - name: target-postgres
    variant: transferwise
    pip_url: pipelinewise-target-postgres
    config:
      user: meltano
      host: *********
      password: ${DW_MELTANO_PASSWORD}
      dbname: **********
      port: 5888
      default_target_schema: meltano
      batch_size_rows: 100000
      parallelism: 0
      parallelism_max: 16
      add_metadata_columns: false
      hard_delete: false
send_anonymous_usage_stats: false
```
I don't have access to the server where Meltano is run, but I asked our DevOps team to check with the command you gave. Thanks!
šŸ‘ 1
r
Is that the one that is being auto-generated? Doesn't look like anything out of the ordinary. Probably unrelated, but `select` should be an array rather than an object (not sure if that matters when it's empty).
a
I have no idea, I'm still in the "figuring out how everything is orchestrated" phase at my company, haha.
r
Fair enough, interested to hear how this is all working though šŸ˜„
a
Yup, I'll update this thread when I get it working !
šŸ‘ 1
v
super curious what you all are putting in the args that makes it so long, I haven't seen this one!
a
`meltano --log-level debug invoke tap-postgres` gave the same error:
```
[Errno 7] Argument list too long: '/project/.meltano/extractors/tap-postgres/venv/bin/tap-postgres'
```
Our DevOps engineer tried setting `ulimit -s` to unlimited and the error stayed; he is now trying `ulimit -n 16000` and it's currently running.
r
> `meltano --log-level debug invoke tap-postgres` gave the same error
No doubt that it errored in the same way, but did they see what the command was? There should be a `debug` log before that error message.
a
Here is the log file
r
When Meltano invokes a plugin, it makes the configuration available to the subprocess via environment variables. You mentioned a giant `meltano.yml` earlier, so I wonder if Meltano is setting so many (or such large) environment variables that it exceeds the stack size limit when trying to kick off the process? The command in the logs looks normal, so that is my next best guess. https://stackoverflow.com/a/28865503 Maybe worth checking the server's `ARG_MAX` before/after the `ulimit` changes:
```shell
getconf ARG_MAX
```
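To gauge how much the environment alone contributes toward `ARG_MAX`, here is a rough estimator (a sketch only; the kernel also counts argv strings and per-entry pointers, so this undercounts slightly):

```python
import os

def environ_bytes():
    """Approximate bytes the current environment occupies in execve's
    argument space. Each entry is passed as "KEY=VALUE\\0", hence the +2.
    """
    return sum(len(k) + len(v) + 2 for k, v in os.environ.items())

arg_max = os.sysconf("SC_ARG_MAX")  # same value `getconf ARG_MAX` prints
headroom = arg_max - environ_bytes()
```

Running this inside the Meltano container just before the run would show how close the generated config pushes things to the limit.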
a
Thanks for your answer! I'll check back with you when the platform team answers me. I loved the StackOverflow post you shared; it's crazy the problems people find themselves in! A 1-million-character env var, wow.
šŸ˜‚ 1
Oh, I just got an answer! `getconf ARG_MAX` gives `2097152`
r
> `getconf ARG_MAX` gives `2097152`
Yeah, that's the value I have locally as well. It would be good to have some more info on what exactly is being generated to have a better idea of what might be happening, and then possibly what a better approach might be for the problem you are trying to solve if there isn't a way around this error.
a
I'll have to figure out a way to launch everything locally in order to troubleshoot myself and not annoy the platform team. It might take a while, so in the meantime I will take some tables out of the extract as a quick and dirty fix; as soon as I get some bandwidth I'll try to understand what causes the error šŸ™‚
šŸ‘ 2
r
Sounds like a baptism of fire, dealing with this as a new employee 😨
šŸ˜… 1
🌊 1
šŸ”„ 2
a
Haha, exactly! Thanks for your help anyway šŸ˜‰
np 1
Hello everyone! I have managed to run Meltano locally and to understand the structure of the project better.
1. There are 260 .yml files, one for each table we want to fetch
2. The tap_postgres_config is created from those files and dumped into the giant Meltano file (this is done in a file called `build.py`):
```python
for filename in os.listdir(tables_config_dir):
    if filename.endswith(".yaml") or filename.endswith(".yml"):
        filepath = os.path.join(tables_config_dir, filename)
        with open(filepath, "r") as stream:
            try:
                table_config = yaml.safe_load(stream)
                # Assuming each file contains configuration for a single table
                for key, value in table_config["config"]["stream_maps"].items():
                    tap_postgres_config["config"]["stream_maps"][key] = value
                    tap_postgres_config["metadata"][key] = {"replication-method": "FULL_TABLE"}

                if "select" in table_config:
                    if "select" not in tap_postgres_config:
                        tap_postgres_config["select"] = []
                    select_set = set(tap_postgres_config["select"])
                    select_set.update(table_config["select"])
                    tap_postgres_config["select"] = list(select_set)
            except yaml.YAMLError as exc:
                print(f"Error processing (unknown): {exc}")

with open(meltano_config_path, "w") as outfile:
    yaml.dump(meltano_config, outfile, default_flow_style=False, sort_keys=False)
```
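The merge step above can be exercised in isolation on plain dicts (YAML parsing aside); a simplified sketch with hypothetical table fragments, not the actual `build.py`:

```python
def merge_stream_maps(table_configs):
    """Fold per-table config dicts into one tap config,
    the way build.py's loop accumulates stream_maps."""
    merged = {"config": {"stream_maps": {}}}
    for table_config in table_configs:
        for key, value in table_config["config"]["stream_maps"].items():
            merged["config"]["stream_maps"][key] = value
    return merged

# Hypothetical per-table fragments, mirroring the generated .yml files:
fragments = [
    {"config": {"stream_maps": {"public-users": {"email": "sha3(email) if email else ''"}}}},
    {"config": {"stream_maps": {"public-orders": {"id": "id"}}}},
]
merged = merge_stream_maps(fragments)
```

With 260 such fragments, each listing every column of its table, it is easy to see how the merged config balloons.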
3. In the Dockerfile, I guess this is where we make use of build.py to generate the Meltano extractor:
```dockerfile
# registry.gitlab.com/meltano/meltano:latest is also available in GitLab Registry
ARG MELTANO_IMAGE=meltano/meltano:v3.3.1-python3.10
FROM $MELTANO_IMAGE

WORKDIR /project

# Install any additional requirements
COPY . .
RUN pip install --upgrade pip
RUN pip install -r requirements.txt
RUN python build.py
# Copy over Meltano project directory
RUN meltano lock --update --all
RUN meltano install
COPY mapper.py /project/.meltano/extractors/tap-postgres/venv/lib/python3.10/site-packages/singer_sdk
# Don't allow changes to containerized project files
# ENV MELTANO_PROJECT_READONLY 1

ENTRYPOINT [""]
```
4. Finally, we launch this command to extract and load the data into our warehouse:
```shell
docker-compose run --build --rm meltano /bin/bash -c "meltano run tap-postgres target-postgres"
```
With this in the docker-compose file:
```yaml
#regenerated by ansible
version: "3"


networks:
    backend_aciso_dev:
        external: true

services:
  meltano:
    build: .
    image: tenacy_meltano
    container_name: meltano_dev
    volumes:
      - /srv/docker/meltano/logs:/project/.meltano/logs/elt
    networks:
      - backend_aciso_dev
    command: meltano run tap-postgres target-postgres
    environment:
      - instance_id=123456
      - POSTGRES_PASSWORD=***
      - DW_MELTANO_PASSWORD=***
```
Right now I have updated all of my .yml files to try to reproduce the original error locally in my environment.
And I get the same error, yay!
Here is the giant meltano.yml file
r
Well, I can sort of see why you are getting the `Argument list too long` error! šŸ˜… A couple of things: I think this logic
```python
tap_postgres_config["metadata"][key] = {"replication-method": "FULL_TABLE"}
```
is redundant, since the default replication method is FULL_TABLE anyway, so you might save yourself some lines by removing it (the behaviour should be the same). You can remove
```python
if "select" in table_config:
    if "select" not in tap_postgres_config:
        tap_postgres_config["select"] = []
    select_set = set(tap_postgres_config["select"])
    select_set.update(table_config["select"])
    tap_postgres_config["select"] = list(select_set)
```
in favour of
```yaml
select:
- public-*.*
```
in your base `meltano.yml`. Then you are just left with your generated `stream_maps` config. Maybe see how you go with the previous two changes, and if there's still a problem then we can come back to it?

I still think there is the bigger (and more loaded) question of why this whole process exists. I think you can categorise the output of this generative build step as your data transformation logic, and when things get this complicated, you would be better off using a dedicated data transformation tool like dbt to apply transformations from models that live in source control. You must already have your common transformations defined somewhere that are referenced when building each of the table `.yml` files. Meltano supports dbt as a plugin already: https://hub.meltano.com/utilities/dbt-postgres
a
By whole process, you mean the one where we modify some columns instead of just doing a `select *` on everything? I work for a cybersecurity company and some of our customers' data is sensitive, so my boss wants the data to be hashed before it enters our data warehouse. That way, we don't have the unhashed data available, and don't run the risk of a leak.
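The hashing itself is the simple part; a sketch of what a stream-map expression like `sha3(email) if email else ''` evaluates to, assuming `sha3` here means SHA3-256 (the variant isn't stated in the thread):

```python
import hashlib

def hash_column(value):
    """Hash a column value as the stream maps do, passing empty values through."""
    return hashlib.sha3_256(value.encode("utf-8")).hexdigest() if value else ""
```

So each sensitive value becomes a fixed-length hex digest before loading, empty values stay empty, and only digests ever reach the warehouse.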
I'll try the modifications you suggested, thanks for the follow up !
```yaml
select:
- public-*.*
```
But this will not take into account the fact that we want to transform some columns, right?
r
> I work for a cybersecurity company and some of our customers' data is sensitive, so my boss wants the data to be hashed before it enters our data warehouse. That way, we don't have the unhashed data available, and don't run the risk of a leak.
OK, that makes more sense why you are using stream maps. šŸ‘
> But this will not take into account the fact that we want to transform some columns, right?
It should work fine. All streams/properties matching that pattern will be selected at runtime, before stream maps are applied. If you run
```shell
meltano select tap-postgres --list --all
```
you should be able to see all `public-` tables/columns selected.
šŸ‘ 1
I pasted that giant `meltano.yml` into a test project and Meltano just gets stuck when trying to run anything. šŸ˜…
šŸ˜† 1
a
What's your machine configuration? Here's mine:
```shell
meltano select tap-postgres --list --all
```
This gave me the same `argument list too long` error! I think the problem comes from the number of columns we fetch. It's a shame that Postgres does not have an `EXCLUDE` clause, because otherwise we could do something like
```sql
SELECT * (EXCLUDE column_to_hash), hash(column_to_hash) as column_to_hash
FROM table
```
And the error is still there even after making this modification:
```yaml
select:
    - public-*.*
```
And the one about the replication method
r
> What's your machine configuration? Here's mine
I am using the latest Meltano installed with Python 3.8, so it's most likely down to the Python version.
🤩 1
> This gave me the same `argument list too long` error!
I replaced the `select` as above and removed the `metadata` section entirely, and it did successfully invoke the plugin, which I think is further than you are getting:
```shell
$ meltano invoke tap-postgres
2024-07-23T10:18:06.342241Z [info     ] Environment 'dev' is active
Need help fixing this problem? Visit http://melta.no/ for troubleshooting steps, or to
join our friendly Slack community.

Catalog discovery failed: command ['/tmp/p/.meltano/extractors/tap-postgres/venv/bin/tap-postgres', '--config', '/tmp/p/.meltano/run/tap-postgres/tap.5e93857e-b10f-43f2-83aa-f4d6f07b6539.config.json', '--discover'] returned 1 with stderr:

...
```
(don't mind the error, I don't have any config set - just wanted to see if it would invoke successfully at all)
āž• 1
a
Hey @Reuben (Matatika)! I finally managed to make it work šŸ™‚ In `build.py` I changed this:
```python
for key, value in table_config["config"]["stream_maps"].items():
    tap_postgres_config["config"]["stream_maps"][key] = value
    tap_postgres_config["metadata"][key] = {"replication-method": "FULL_TABLE"}

if "select" in table_config:
    if "select" not in tap_postgres_config:
        tap_postgres_config["select"] = []
    select_set = set(tap_postgres_config["select"])
    select_set.update(table_config["select"])
    tap_postgres_config["select"] = list(select_set)
```
into this:
```python
for key, value in table_config["config"]["stream_maps"].items():
    tap_postgres_config["config"]["stream_maps"][key] = value
```
which allowed me to reduce the size of my `meltano.yml` file by a couple hundred lines already, because I can use this:
```yaml
select:
    - public-*.*
```
instead of the list of all the tables in the public schema. I also modified the file that generates a `.yml` for every table, from this:
```python
for c in columns:
    #     stream_maps[model_name][c[0]] = "'notnull' if " + c[0] + "!= '' else ''"
    elif c[0] in discarded_columns:
        stream_maps[model_name][c[0]] = "sha3(" + c[0] + ")" + " if " + c[0] + " else ''"
    else:
        stream_maps[model_name][c[0]] = c[0]
```
into this:
```python
for c in columns:
    #     stream_maps[model_name][c[0]] = "'notnull' if " + c[0] + "!= '' else ''"
    elif c[0] in discarded_columns:
        stream_maps[model_name][c[0]] = "sha3(" + c[0] + ")" + " if " + c[0] + " else ''"
```
(so I only specify the columns I want to hash, and no longer have a huge number of entries in my stream_maps). Thanks a lot for your help!
šŸ™Œ 1
r
Nice! I wonder if it was the size of `select`, `metadata` or `stream_maps`, or all of them together, that was causing the issue... Glad you got it working either way. šŸ™‚
a
Let's just say I hope the backend team does not add too many tables, because the problem would probably arise again in the future šŸ˜›
😬 1
And the closing word on this matter: before I put my hands into the Meltano repo, it took more than an hour every day to fetch the data from all of our customers' instances. Now it's down to 7 minutes, so that's another win, hehe.
šŸ™Œ 2
dancingpenguin 1
šŸ”„ 2