# troubleshooting
a
Hello everyone, I'm having some trouble with a Meltano extraction. I did not write it; I just joined the company a few weeks ago:
```
2024-07-15T23:37:18.301932Z [error    ] Cannot start plugin tap-postgres: [Errno 7] Argument list too long: '/project/.meltano/extractors/tap-postgres/venv/bin/tap-postgres'
2024-07-15T23:37:18.302143Z [error    ] Block run completed.           block_type=ExtractLoadBlocks err=RunnerError("Cannot start plugin tap-postgres: [Errno 7] Argument list too long: '/project/.meltano/extractors/tap-postgres/venv/bin/tap-postgres'") exit_codes={} set_number=0 success=False
```
I've been researching this error on Google and I see it's a common error (not specifically Meltano-related, though). Is there any way I could troubleshoot this further to see what causes it? For context, I work for a B2B SaaS. We have Meltano deployed in our customers' instances to fetch production data back to us, and it was working fine until we added more tables to what we fetch every night.
āœ… 1
r
Never seen that error before when running Meltano... Assuming you are on Linux, and going by https://unix.stackexchange.com/a/45584, I would run
```shell
ulimit -s
```
and see what the value is. For context, mine appears to be the default of `8192`. If that is not the case for you, try increasing it with
```shell
ulimit -s 8192
```
and re-run the Meltano command.
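For what it's worth, the same check can be done from Python via the standard `resource` module (POSIX-only); a minimal sketch, not from the thread:

```python
import resource

def stack_limit_kib():
    """Return the soft stack-size limit in KiB (what `ulimit -s` reports),
    or None if the limit is unlimited."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_STACK)
    return None if soft == resource.RLIM_INFINITY else soft // 1024
```

This is handy when you want a pipeline to log its own limits at startup instead of shelling out to `ulimit`.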
a
I've seen the `ulimit` suggestion, but it didn't seem very satisfying to me, since it would only fix the symptom and not the root cause. I've just had a talk with my boss; apparently they have hit this problem before, and it's caused by our way of doing things:
• We execute a query on the database to fetch all tables and their columns
• For each table, we generate a .yaml file to specify the schema
• We then concatenate all the .yaml files into a giant one that we give to Meltano
So the fix would be either to reduce the list of tables we fetch, or to play with the ulimit parameter
šŸ‘€ 1
r
So you are generating a `meltano.yml`? What does that look like?
> it was working fine until we added more tables to what we fetch every night.
Sorry, didn't actually read this before my first reply. šŸ˜… If you run
```shell
meltano --log-level debug invoke tap-postgres
```
you can see the commands Meltano is using to invoke the plugin, which looks like where the error is coming from. Maybe that helps you a bit?
```shell
$ meltano --log-level debug invoke tap-spotify 2>&1 | grep Invoking
2024-07-17T09:06:18.490896Z [debug    ] Invoking: ['/home/reuben/Documents/taps/tap-spotify/.meltano/extractors/tap-spotify/venv/bin/tap-spotify', '--config', '/home/reuben/Documents/taps/tap-spotify/.meltano/run/tap-spotify/tap.4bc939b2-c3a0-4b07-8ea9-a880ff54d4e2.config.json', '--discover']
2024-07-17T09:06:19.165393Z [debug    ] Invoking: ['/home/reuben/Documents/taps/tap-spotify/.meltano/extractors/tap-spotify/venv/bin/tap-spotify', '--config', '/home/reuben/Documents/taps/tap-spotify/.meltano/run/tap-spotify/tap.4bc939b2-c3a0-4b07-8ea9-a880ff54d4e2.config.json', '--catalog', '/home/reuben/Documents/taps/tap-spotify/.meltano/run/tap-spotify/tap.properties.json']
```
a
This is the `meltano.yml`:
```yaml
version: 1
default_environment: dev
project_id: 6f5c2fd5-2ec4-4ade-bbf0-5e03fc7931fd
environments:
- name: dev
- name: staging
- name: prod
plugins:
  extractors:
  - name: tap-postgres
    variant: meltanolabs
    pip_url: git+https://github.com/MeltanoLabs/tap-postgres.git
    config:
      database: aciso
      user: aciso
      host: db
      password: ${POSTGRES_PASSWORD}
      filter_schemas:
      - public
      port: 5432
      stream_maps: {}
    metadata: {}
    select: {}
  loaders:
  - name: target-postgres
    variant: transferwise
    pip_url: pipelinewise-target-postgres
    config:
      user: meltano
      host: *********
      password: ${DW_MELTANO_PASSWORD}
      dbname: **********
      port: 5888
      default_target_schema: meltano
      batch_size_rows: 100000
      parallelism: 0
      parallelism_max: 16
      add_metadata_columns: false
      hard_delete: false
send_anonymous_usage_stats: false
```
I don't have access to the server where Meltano is run, but I asked our DevOps team to check with the command you gave. Thanks!
šŸ‘ 1
r
Is that the one that is being auto-generated? Doesn't look like anything out of the ordinary. Probably unrelated, but `select` should be an array rather than an object (not sure if that matters when it's empty).
a
I have no idea, I'm still in the "figuring out how everything is orchestrated" phase at my company, haha.
r
Fair enough, interested to hear how this is all working though šŸ˜„
a
Yup, I'll update this thread when I get it working !
šŸ‘ 1
v
super curious what you all are putting in the args that makes it so long, I haven't seen this one!
a
`meltano --log-level debug invoke tap-postgres` gave the same error:
```
[Errno 7] Argument list too long: '/project/.meltano/extractors/tap-postgres/venv/bin/tap-postgres'
```
Our DevOps engineer tried setting `ulimit -s` to unlimited and the error stayed; he is now trying `ulimit -n 16000` and it's currently running.
r
> `meltano --log-level debug invoke tap-postgres` gave the same error
No doubt that it errored in the same way, but did they see what the command was? There should be a `debug` log before that error message.
a
Here is the log file
r
When Meltano invokes a plugin, it makes the configuration available to the subprocess via environment variables. You mentioned a giant `meltano.yml` earlier, so I wonder if Meltano is setting so many (or such large) environment variables that it exceeds the stack size limit when trying to kick off the process? The command in the logs looks normal, so that is my next best guess. https://stackoverflow.com/a/28865503 Maybe worth checking the server's `ARG_MAX` before/after the `ulimit` changes:
```shell
getconf ARG_MAX
```
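To gauge how much the environment alone contributes toward `ARG_MAX`, here is a rough estimator (a sketch only; the kernel also counts argv strings and per-entry pointers, so this undercounts slightly):

```python
import os

def environ_bytes():
    """Approximate bytes the current environment occupies in execve's
    argument space. Each entry is passed as "KEY=VALUE\\0", hence the +2.
    """
    return sum(len(k) + len(v) + 2 for k, v in os.environ.items())

arg_max = os.sysconf("SC_ARG_MAX")  # same value `getconf ARG_MAX` prints
headroom = arg_max - environ_bytes()
```

Running this inside the Meltano container just before the run would show how close the generated config pushes things to the limit.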
a
Thanks for your answer! I'll check back with you when the platform team answers me. I loved the StackOverflow post you shared; it's crazy the problems people find themselves in! A 1-million-character env var, wow.
šŸ˜‚ 1
Oh, I just got an answer! `getconf ARG_MAX` gives `2097152`
r
> `getconf ARG_MAX` gives `2097152`
Yeah, that's the value I have locally as well. It would be good to have some more info on what exactly is being generated to have a better idea of what might be happening, and then possibly what a better approach might be for the problem you are trying to solve if there isn't a way around this error.
a
I'll have to figure out a way to launch everything locally in order to troubleshoot myself and not annoy the platform team. It might take a while, so in the meantime I will take some tables out of the extract as a quick and dirty fix; as soon as I get some bandwidth I'll try to understand what causes the error šŸ™‚
šŸ‘ 2
r
Sounds like a baptism of fire, dealing with this as a new employee 😨
šŸ˜… 1
🌊 1
šŸ”„ 2
a
Haha, exactly! Thanks for your help anyway šŸ˜‰
np 1
Hello everyone! I have managed to run Meltano locally and to understand the structure of the project better.
1. There are 260 .yml files, one for each table we want to fetch
2. The tap_postgres_config is created from those files and dumped into the giant Meltano file (this is done in a file called `build.py`):
```python
for filename in os.listdir(tables_config_dir):
    if filename.endswith(".yaml") or filename.endswith(".yml"):
        filepath = os.path.join(tables_config_dir, filename)
        with open(filepath, "r") as stream:
            try:
                table_config = yaml.safe_load(stream)
                # Assuming each file contains configuration for a single table
                for key, value in table_config["config"]["stream_maps"].items():
                    tap_postgres_config["config"]["stream_maps"][key] = value
                    tap_postgres_config["metadata"][key] = {"replication-method": "FULL_TABLE"}

                if "select" in table_config:
                    if "select" not in tap_postgres_config:
                        tap_postgres_config["select"] = []
                    select_set = set(tap_postgres_config["select"])
                    select_set.update(table_config["select"])
                    tap_postgres_config["select"] = list(select_set)
            except yaml.YAMLError as exc:
                print(f"Error processing (unknown): {exc}")

with open(meltano_config_path, "w") as outfile:
    yaml.dump(meltano_config, outfile, default_flow_style=False, sort_keys=False)
```
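The merge step above can be exercised in isolation on plain dicts (YAML parsing aside); a simplified sketch with hypothetical table fragments, not the actual `build.py`:

```python
def merge_stream_maps(table_configs):
    """Fold per-table config dicts into one tap config,
    the way build.py's loop accumulates stream_maps."""
    merged = {"config": {"stream_maps": {}}}
    for table_config in table_configs:
        for key, value in table_config["config"]["stream_maps"].items():
            merged["config"]["stream_maps"][key] = value
    return merged

# Hypothetical per-table fragments, mirroring the generated .yml files:
fragments = [
    {"config": {"stream_maps": {"public-users": {"email": "sha3(email) if email else ''"}}}},
    {"config": {"stream_maps": {"public-orders": {"id": "id"}}}},
]
merged = merge_stream_maps(fragments)
```

With 260 such fragments, each listing every column of its table, it is easy to see how the merged config balloons.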
3. In the Dockerfile, I guess this is where we make use of build.py to generate the Meltano extractor:
```dockerfile
# registry.gitlab.com/meltano/meltano:latest is also available in GitLab Registry
ARG MELTANO_IMAGE=meltano/meltano:v3.3.1-python3.10
FROM $MELTANO_IMAGE

WORKDIR /project

# Install any additional requirements
COPY . .
RUN pip install --upgrade pip
RUN pip install -r requirements.txt
RUN python build.py
# Copy over Meltano project directory
RUN meltano lock --update --all
RUN meltano install
COPY mapper.py /project/.meltano/extractors/tap-postgres/venv/lib/python3.10/site-packages/singer_sdk
# Don't allow changes to containerized project files
# ENV MELTANO_PROJECT_READONLY 1

ENTRYPOINT [""]
```
4. Finally, we launch this command to extract and load the data into our warehouse:
```shell
docker-compose run --build --rm meltano /bin/bash -c "meltano run tap-postgres target-postgres"
```
With this in the docker-compose file:
```yaml
#regenerated by ansible
version: "3"


networks:
    backend_aciso_dev:
        external: true

services:
  meltano:
    build: .
    image: tenacy_meltano
    container_name: meltano_dev
    volumes:
      - /srv/docker/meltano/logs:/project/.meltano/logs/elt
    networks:
      - backend_aciso_dev
    command: meltano run tap-postgres target-postgres
    environment:
      - instance_id=123456
      - POSTGRES_PASSWORD=***
      - DW_MELTANO_PASSWORD=***
```
Right now I have updated all of my .yml files to try to reproduce the original error locally in my environment.
And I get the same error, yay!
Here is the giant meltano.yml file
r
Well, I can sort of see why you are getting the `Argument list too long` error! šŸ˜… A couple of things: I think this logic
```python
tap_postgres_config["metadata"][key] = {"replication-method": "FULL_TABLE"}
```
is redundant, since the default replication method is FULL_TABLE anyway, so you might save yourself some lines by removing it (the behaviour should be the same). You can remove
```python
if "select" in table_config:
    if "select" not in tap_postgres_config:
        tap_postgres_config["select"] = []
    select_set = set(tap_postgres_config["select"])
    select_set.update(table_config["select"])
    tap_postgres_config["select"] = list(select_set)
```
in favour of
```yaml
select:
- public-*.*
```
in your base `meltano.yml`. Then you are just left with your generated `stream_maps` config. Maybe see how you go with the previous two changes, and if there's still a problem then we can come back to it?

I still think there is the bigger (and more loaded) question of why this whole process exists. I think you can categorise the output of this generative build step as your data transformation logic, and when things get this complicated, you would be better off using a dedicated data transformation tool like dbt to apply transformations from models that live in source control. You must already have your common transformations defined somewhere that are referenced when building each of the table `.yml` files. Meltano supports dbt as a plugin already: https://hub.meltano.com/utilities/dbt-postgres
a
By whole process, you mean the one where we modify some columns instead of just doing a `select *` on everything? I work for a cybersecurity company and some of our customers' data is sensitive, so my boss wants the data to be hashed before it enters our data warehouse. That way, we don't have the unhashed data available, and don't run the risk of a leak.
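The hashing itself is the simple part; a sketch of what a stream-map expression like `sha3(email) if email else ''` evaluates to, assuming `sha3` here means SHA3-256 (the variant isn't stated in the thread):

```python
import hashlib

def hash_column(value):
    """Hash a column value as the stream maps do, passing empty values through."""
    return hashlib.sha3_256(value.encode("utf-8")).hexdigest() if value else ""
```

So each sensitive value becomes a fixed-length hex digest before loading, empty values stay empty, and only digests ever reach the warehouse.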
I'll try the modifications you suggested, thanks for the follow up !
```yaml
select:
- public-*.*
```
But this will not take into account the fact that we want to transform some columns, right?
r
> I work for a cybersecurity company and some of our customers' data is sensitive, so my boss wants the data to be hashed before it enters our data warehouse. That way, we don't have the unhashed data available, and don't run the risk of a leak.
OK, that makes more sense why you are using stream maps. šŸ‘
> But this will not take into account the fact that we want to transform some columns, right?
It should work fine. All streams/properties matching that pattern will be selected at runtime, before stream maps are applied. If you run
```shell
meltano select tap-postgres --list --all
```
you should be able to see all `public-` tables/columns selected.
šŸ‘ 1
I pasted that giant `meltano.yml` into a test project and Meltano just gets stuck when trying to run anything. šŸ˜…
šŸ˜† 1
a
What's your machine configuration? Here's mine:
```shell
meltano select tap-postgres --list --all
```
This gave me the same `argument list too long` error! I think the problem comes from the number of columns we fetch. It's a shame that Postgres does not have an `EXCLUDE` clause, because otherwise we could do something like
```sql
SELECT * (EXCLUDE column_to_hash), hash(column_to_hash) as column_to_hash
FROM table
```
And the error is still there even after making this modification:
```yaml
select:
    - public-*.*
```
And the one about the replication method
r
> What's your machine configuration? Here's mine
I am using the latest Meltano installed with Python 3.8, so it's most likely down to the Python version.
🤩 1
> This gave me the same `argument list too long` error!
I replaced the `select` as above and removed the `metadata` section entirely, and it did successfully invoke the plugin, which I think is further than you are getting:
```shell
$ meltano invoke tap-postgres
2024-07-23T10:18:06.342241Z [info     ] Environment 'dev' is active
Need help fixing this problem? Visit http://melta.no/ for troubleshooting steps, or to
join our friendly Slack community.

Catalog discovery failed: command ['/tmp/p/.meltano/extractors/tap-postgres/venv/bin/tap-postgres', '--config', '/tmp/p/.meltano/run/tap-postgres/tap.5e93857e-b10f-43f2-83aa-f4d6f07b6539.config.json', '--discover'] returned 1 with stderr:

...
```
(don't mind the error, I don't have any config set - just wanted to see if it would invoke successfully at all)
āž• 1
a
Hey @Reuben (Matatika)! I finally managed to make it work šŸ™‚ In `build.py` I changed this:
```python
for key, value in table_config["config"]["stream_maps"].items():
    tap_postgres_config["config"]["stream_maps"][key] = value
    tap_postgres_config["metadata"][key] = {"replication-method": "FULL_TABLE"}

if "select" in table_config:
    if "select" not in tap_postgres_config:
        tap_postgres_config["select"] = []
    select_set = set(tap_postgres_config["select"])
    select_set.update(table_config["select"])
    tap_postgres_config["select"] = list(select_set)
```
into this:
```python
for key, value in table_config["config"]["stream_maps"].items():
    tap_postgres_config["config"]["stream_maps"][key] = value
```
which allowed me to reduce the size of my `meltano.yml` file by a couple hundred lines already, because I can use this:
```yaml
select:
    - public-*.*
```
instead of the list of all the tables in the public schema. I also modified the file that generates a `.yml` for every table, from this:
```python
for c in columns:
    #     stream_maps[model_name][c[0]] = "'notnull' if " + c[0] + "!= '' else ''"
    elif c[0] in discarded_columns:
        stream_maps[model_name][c[0]] = "sha3(" + c[0] + ")" + " if " + c[0] + " else ''"
    else:
        stream_maps[model_name][c[0]] = c[0]
```
into this:
```python
for c in columns:
    #     stream_maps[model_name][c[0]] = "'notnull' if " + c[0] + "!= '' else ''"
    elif c[0] in discarded_columns:
        stream_maps[model_name][c[0]] = "sha3(" + c[0] + ")" + " if " + c[0] + " else ''"
```
(so I only specify the columns I want to hash, and no longer have a huge number of entries in my stream_maps). Thanks a lot for your help!
šŸ™Œ 1
r
Nice! I wonder if it was the size of `select`, `metadata` or `stream_maps`, or all of them together, that was causing the issue... Glad you got it working either way. šŸ™‚
a
Let's just say I hope the backend team does not add too many tables, because the problem would probably arise again in the future šŸ˜›
😬 1
And the closing word on this matter: before I put my hands into the Meltano repo, it took more than an hour every day to fetch the data from all of our customers' instances. Now it's down to 7 minutes, so that's another win, hehe.
šŸ™Œ 2
dancingpenguin 1
šŸ”„ 2