Arnaud Stephan
07/17/2024, 8:00 AM
2024-07-15T23:37:18.301932Z [error ] Cannot start plugin tap-postgres: [Errno 7] Argument list too long: '/project/.meltano/extractors/tap-postgres/venv/bin/tap-postgres'
2024-07-15T23:37:18.302143Z [error ] Block run completed. block_type=ExtractLoadBlocks err=RunnerError("Cannot start plugin tap-postgres: [Errno 7] Argument list too long: '/project/.meltano/extractors/tap-postgres/venv/bin/tap-postgres'") exit_codes={} set_number=0 success=False
I've been researching this error on Google and I see it is a common error (not specifically Meltano-related, though). Is there any way I could troubleshoot this further to see what causes it?
For context, I'm working for a B2B SaaS. We have Meltano deployed in our customers' instances to fetch the production data for us, and it was working fine until we added more tables to what we fetch every night.
Reuben (Matatika)
07/17/2024, 8:38 AM
Run ulimit -s and see what the value is. For context, mine appears to be the "default" of 8192 - if that is not the case for you, try increasing it with ulimit -s 8192 and re-run the Meltano command.
Arnaud Stephan
07/17/2024, 8:47 AM
I had seen the ulimit thing, but it didn't seem very satisfying to me, since it would only fix the consequence and not the root cause.
I've just had a talk with my boss; apparently they had this problem before and it's caused by our way of doing things:
• We execute a query on the database to fetch all tables and their columns
• For each table, we generate a .yaml file to specify the schema
• We then concatenate all the .yaml files into a giant one that we give to Meltano
So the fix would be to either reduce the list of tables we fetch, or play with the ulimit parameter.
Reuben (Matatika)
07/17/2024, 9:03 AM
meltano.yml? What does that look like?
> it was working fine until we added more tables to what we fetch every night.
Sorry, didn't actually read this before my first reply.
If you run
meltano --log-level debug invoke tap-postgres
you can see the commands Meltano is using to invoke the plugin, which is where the error looks to be coming from. Maybe that helps you a bit?
$ meltano --log-level debug invoke tap-spotify 2>&1 | grep Invoking
2024-07-17T09:06:18.490896Z [debug ] Invoking: ['/home/reuben/Documents/taps/tap-spotify/.meltano/extractors/tap-spotify/venv/bin/tap-spotify', '--config', '/home/reuben/Documents/taps/tap-spotify/.meltano/run/tap-spotify/tap.4bc939b2-c3a0-4b07-8ea9-a880ff54d4e2.config.json', '--discover']
2024-07-17T09:06:19.165393Z [debug ] Invoking: ['/home/reuben/Documents/taps/tap-spotify/.meltano/extractors/tap-spotify/venv/bin/tap-spotify', '--config', '/home/reuben/Documents/taps/tap-spotify/.meltano/run/tap-spotify/tap.4bc939b2-c3a0-4b07-8ea9-a880ff54d4e2.config.json', '--catalog', '/home/reuben/Documents/taps/tap-spotify/.meltano/run/tap-spotify/tap.properties.json']
Arnaud Stephan
07/17/2024, 9:12 AM
version: 1
default_environment: dev
project_id: 6f5c2fd5-2ec4-4ade-bbf0-5e03fc7931fd
environments:
- name: dev
- name: staging
- name: prod
plugins:
  extractors:
  - name: tap-postgres
    variant: meltanolabs
    pip_url: git+https://github.com/MeltanoLabs/tap-postgres.git
    config:
      database: aciso
      user: aciso
      host: db
      password: ${POSTGRES_PASSWORD}
      filter_schemas:
      - public
      port: 5432
      stream_maps: {}
    metadata: {}
    select: {}
  loaders:
  - name: target-postgres
    variant: transferwise
    pip_url: pipelinewise-target-postgres
    config:
      user: meltano
      host: *********
      password: ${DW_MELTANO_PASSWORD}
      dbname: **********
      port: 5888
      default_target_schema: meltano
      batch_size_rows: 100000
      parallelism: 0
      parallelism_max: 16
      add_metadata_columns: false
      hard_delete: false
send_anonymous_usage_stats: false
Arnaud Stephan
07/17/2024, 9:13 AM
Reuben (Matatika)
07/17/2024, 9:18 AM
select should be an array rather than an object (not sure if that matters when it's empty).
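i.e. something like this (the stream/property patterns here are just an example, not your actual config):
select:
- public-some_table.*
- public-other_table.id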
Arnaud Stephan
07/17/2024, 9:19 AM
Reuben (Matatika)
07/17/2024, 9:20 AM
Arnaud Stephan
07/17/2024, 9:21 AM
visch
07/17/2024, 12:55 PM
Arnaud Stephan
07/17/2024, 12:59 PM
meltano --log-level debug invoke tap-postgres
gave the same error:
[Errno 7] Argument list too long: '/project/.meltano/extractors/tap-postgres/venv/bin/tap-postgres'
The DevOps engineer tried setting ulimit -s to unlimited and the error stayed; he is now trying with ulimit -n 16000 and it's currently running.
Reuben (Matatika)
07/17/2024, 1:00 PM
> meltano --log-level debug invoke tap-postgres
> gave the same error
No doubt that it errored in the same way, but did they see what the command was? There should be a debug log before that error message.
Arnaud Stephan
07/17/2024, 2:40 PM
Reuben (Matatika)
07/17/2024, 2:47 PM
I saw your meltano.yml earlier, so I wonder if so many/particularly large environment variables are being set by Meltano that, as a result, it is exceeding the stack size when trying to kick off the process? The command in the logs looks normal, so that is my next best guess.
https://stackoverflow.com/a/28865503
Maybe worth checking the server ARG_MAX before/after ulimit changes:
getconf ARG_MAX
Arnaud Stephan
07/17/2024, 2:54 PM
Arnaud Stephan
07/17/2024, 2:55 PM
Reuben (Matatika)
07/17/2024, 3:03 PM
> getconf ARG_MAX
> 2097152
Yeah, that's the value I have locally as well. It would be good to have some more info on what exactly is being generated, to have a better idea of what might be happening, and then possibly what a better approach might be for the problem you are trying to solve if there isn't a way around this error.
Arnaud Stephan
07/17/2024, 3:04 PM
Reuben (Matatika)
07/17/2024, 3:10 PM
Arnaud Stephan
07/17/2024, 3:44 PM
Arnaud Stephan
07/23/2024, 7:31 AM
build.py merges the per-table files into meltano.yml (a rough sketch of one such file is at the end of this message):
for filename in os.listdir(tables_config_dir):
    if filename.endswith(".yaml") or filename.endswith(".yml"):
        filepath = os.path.join(tables_config_dir, filename)
        with open(filepath, "r") as stream:
            try:
                table_config = yaml.safe_load(stream)
                # Assuming each file contains configuration for a single table
                for key, value in table_config["config"]["stream_maps"].items():
                    tap_postgres_config["config"]["stream_maps"][key] = value
                    tap_postgres_config["metadata"][key] = {"replication-method": "FULL_TABLE"}
                if "select" in table_config:
                    if "select" not in tap_postgres_config:
                        tap_postgres_config["select"] = []
                    select_set = set(tap_postgres_config["select"])
                    select_set.update(table_config["select"])
                    tap_postgres_config["select"] = list(select_set)
            except yaml.YAMLError as exc:
                print(f"Error processing {filename}: {exc}")
with open(meltano_config_path, "w") as outfile:
    yaml.dump(meltano_config, outfile, default_flow_style=False, sort_keys=False)
3. In the Dockerfile, I guess this is where we make use of build.py to generate the Meltano extractor config
# registry.gitlab.com/meltano/meltano:latest is also available in GitLab Registry
ARG MELTANO_IMAGE=meltano/meltano:v3.3.1-python3.10
FROM $MELTANO_IMAGE
WORKDIR /project
# Install any additional requirements
COPY . .
RUN pip install --upgrade pip
RUN pip install -r requirements.txt
RUN python build.py
# Copy over Meltano project directory
RUN meltano lock --update --all
RUN meltano install
COPY mapper.py /project/.meltano/extractors/tap-postgres/venv/lib/python3.10/site-packages/singer_sdk
# Don't allow changes to containerized project files
# ENV MELTANO_PROJECT_READONLY 1
ENTRYPOINT [""]
4. Finally, we launch this command to extract and load the data into our warehouse
docker-compose run --build --rm meltano /bin/bash -c "meltano run tap-postgres target-postgres"
With this being in the docker-compose file
# regenerated by ansible
version: "3"
networks:
  backend_aciso_dev:
    external: true
services:
  meltano:
    build: .
    image: tenacy_meltano
    container_name: meltano_dev
    volumes:
    - /srv/docker/meltano/logs:/project/.meltano/logs/elt
    networks:
    - backend_aciso_dev
    command: meltano run tap-postgres target-postgres
    environment:
    - instance_id=123456
    - POSTGRES_PASSWORD=***
    - DW_MELTANO_PASSWORD=***
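For reference, each of the per-table files that build.py merges looks roughly like this (table/column names are made up, not our real schema):
config:
  stream_maps:
    public-customers:
      id: id
      email: "sha3(email) if email else ''"
select:
- public-customers.*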
Arnaud Stephan
07/23/2024, 7:32 AM
Arnaud Stephan
07/23/2024, 7:52 AM
Arnaud Stephan
07/23/2024, 8:37 AM
Reuben (Matatika)
07/23/2024, 9:13 AM
The Argument list too long error!
A couple of things:
I think this logic
tap_postgres_config["metadata"][key] = {"replication-method": "FULL_TABLE"}
is redundant, since the default is FULL_TABLE anyway, so you might save yourself some lines there by removing it (should be the same behaviour).
You can remove
if "select" in table_config:
    if "select" not in tap_postgres_config:
        tap_postgres_config["select"] = []
    select_set = set(tap_postgres_config["select"])
    select_set.update(table_config["select"])
    tap_postgres_config["select"] = list(select_set)
in favour of
select:
- public-*.*
in your base meltano.yml.
Then you are just left with your generated stream_maps config. Maybe see how you go with the previous two changes, and if there's still a problem then we can come back to it?
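So the extractor block would end up looking something like this (stream/column names made up; stream_maps stays whatever your build step generates):
plugins:
  extractors:
  - name: tap-postgres
    config:
      # ...existing connection settings...
      stream_maps:
        public-some_table:
          some_column: "<your generated expression>"
    select:
    - public-*.*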
I still think there is the bigger (and more loaded) question of why this whole process exists at all. I think you can categorise the output of this generative build step as your data transformation logic, and when things are getting this complicated, you would be better off using a dedicated data transformation tool like dbt to apply transformations from models that live in source control. You must already have your common transformations defined somewhere that are referenced when building each of the table .yml files. Meltano supports dbt as a plugin already: https://hub.meltano.com/utilities/dbt-postgres
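Adding it would be roughly this in meltano.yml (check the Hub page above for the exact recommended pip_url and version pins), or just meltano add utility dbt-postgres:
  utilities:
  - name: dbt-postgres
    variant: dbt-labs
    pip_url: dbt-core dbt-postgres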
Arnaud Stephan
07/23/2024, 9:21 AM
You mean, why we don't just select * on everything? I work for a cybersecurity company and some of our customers' data is sensitive, so my boss wants the data to be hashed before it enters our data warehouse. That way, we don't have the unhashed data available, and don't run the risk of a leak.
Arnaud Stephan
07/23/2024, 9:21 AM
Arnaud Stephan
07/23/2024, 9:23 AM
select:
- public-*.*
But this will not take into account the fact that we want to transform some columns, right?
07/23/2024, 9:32 AM
> I work for a cybersecurity company and some of our customers' data is sensitive, so my boss wants the data to be hashed before it enters our data warehouse. That way, we don't have the unhashed data available, and don't run the risk of a leak.
OK, that makes more sense why you are using stream maps.
Reuben (Matatika)
07/23/2024, 9:35 AM
With
meltano select tap-postgres --list --all
you should be able to see all public- tables/columns selected.
Reuben (Matatika)
07/23/2024, 9:58 AM
I copied your meltano.yml into a test project and Meltano just gets stuck when trying to run anything.
Arnaud Stephan
07/23/2024, 10:03 AM
What's your machine configuration? Here's mine
Arnaud Stephan
07/23/2024, 10:06 AM
meltano select tap-postgres --list --all
This gave me the same argument list too long error!
I think the problem comes from the number of columns we fetch. It's sad that Postgres does not have an EXCLUDE clause, because otherwise we could do something like
SELECT * (EXCLUDE column_to_hash), hash(column_to_hash) AS column_to_hash
FROM table
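Right now the generated stream_maps list every single column of every table, roughly like this (made-up names):
stream_maps:
  public-customers:
    id: id
    name: name
    email: "sha3(email) if email else ''"
    # ...one entry per column, for every table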
Arnaud Stephan
07/23/2024, 10:07 AM
I'm going to try the
select:
- public-*.*
suggestion, and the one about the replication method.
Reuben (Matatika)
07/23/2024, 10:16 AM
> What's your machine configuration? Here's mine
I am using the latest Meltano installed with Python 3.8, so it's most likely down to the Python version.
Reuben (Matatika)
07/23/2024, 10:24 AM
> This gave me the same argument list too long error!
I replaced the select as above and removed the metadata section entirely, and it did successfully invoke the plugin, which I think is further than you are getting:
$ meltano invoke tap-postgres
2024-07-23T10:18:06.342241Z [info ] Environment 'dev' is active
Need help fixing this problem? Visit http://melta.no/ for troubleshooting steps, or to
join our friendly Slack community.
Catalog discovery failed: command ['/tmp/p/.meltano/extractors/tap-postgres/venv/bin/tap-postgres', '--config', '/tmp/p/.meltano/run/tap-postgres/tap.5e93857e-b10f-43f2-83aa-f4d6f07b6539.config.json', '--discover'] returned 1 with stderr:
...
(don't mind the error, I don't have any config set - just wanted to see if it would invoke successfully at all)
Arnaud Stephan
07/24/2024, 1:14 PM
In build.py I changed this:
for key, value in table_config["config"]["stream_maps"].items():
    tap_postgres_config["config"]["stream_maps"][key] = value
    tap_postgres_config["metadata"][key] = {"replication-method": "FULL_TABLE"}
if "select" in table_config:
    if "select" not in tap_postgres_config:
        tap_postgres_config["select"] = []
    select_set = set(tap_postgres_config["select"])
    select_set.update(table_config["select"])
    tap_postgres_config["select"] = list(select_set)
Into this:
for key, value in table_config["config"]["stream_maps"].items():
    tap_postgres_config["config"]["stream_maps"][key] = value
Which allowed me to reduce the size of my meltano.yml file by a couple hundred lines already, because I can use this:
select:
- public-*.*
Instead of the list of all the tables in the public schema.
I also modified the file that generates a .yml for every table, from this:
for c in columns:
    # stream_maps[model_name][c[0]] = "'notnull' if " + c[0] + "!= '' else ''"
    elif c[0] in discarded_columns:
        stream_maps[model_name][c[0]] = "sha3(" + c[0] + ")" + " if " + c[0] + " else ''"
    else:
        stream_maps[model_name][c[0]] = c[0]
Into this:
for c in columns:
    # stream_maps[model_name][c[0]] = "'notnull' if " + c[0] + "!= '' else ''"
    elif c[0] in discarded_columns:
        stream_maps[model_name][c[0]] = "sha3(" + c[0] + ")" + " if " + c[0] + " else ''"
(so I just specify the columns I want to hash and don't have a lot of arguments in my stream_maps).
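The generated extractor config now ends up roughly like this (made-up table/column names), with only the hashed columns listed:
plugins:
  extractors:
  - name: tap-postgres
    config:
      stream_maps:
        public-customers:
          email: "sha3(email) if email else ''"
    select:
    - public-*.*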
Thanks a lot for your help!
Reuben (Matatika)
07/24/2024, 3:04 PM
Would be interesting to know whether it was the select, metadata or stream_maps, or all of them together, that was causing the issue... Glad you got it working either way.
Arnaud Stephan
07/24/2024, 3:20 PM
Arnaud Stephan
07/25/2024, 12:59 PM