# troubleshooting
k
Does anyone have a complete guide or suggestions on how I can set up the AWS Redshift loader? I am just using target-redshift, configured everything, and when I run Meltano it seems to be connecting to Redshift properly, as it creates the tables, but no data is being loaded, and the run fails with BrokenPipeError. When I load into JSON (using target-jsonl), it seems to work fine, so the extractor does not seem to be the issue. Thank you.
e
k
Hello, yes. Increased it to 50MB and still fails eventually. I think the concern is that no data is inserted into redshift, even though it created the tables.
c
I haven't tried Meltano with Redshift specifically, but I have years of experience using Redshift in production. My guess is that target-redshift relies on an S3 bucket to place data prior to loading it into Redshift - does Redshift have read access permissions to S3? It'd make sense that the tables were created, since that happens over a regular JDBC connection; the load command is invoked over JDBC as well, but the actual data is usually (at least, in my experience) copied from S3.
k
oh interesting. I have configured it to have read access, but there's no write access. The target did not ask for an ARN or any credentials with S3 write access though. I just set:
• aws_redshift_copy_role_arn
• cluster_identifier
• dbname
• host
• password
• port
• user
Wouldn't that mean that it needs to put data from the extractor into S3, then load it somehow? But this loader (https://hub.meltano.com/loaders/target-redshift) does not seem to require anything that has S3 write access?
or maybe I need to already have aws creds on the system. let me try that too.
c
aws_redshift_copy_role_arn: that's an ARN that should allow Redshift to have read access to S3, just my guess, similar to how most other things interface with Redshift
k
it doesn't seem to put anything in the S3 bucket after setting creds.
Yes, that arn has read access to s3, but not write access.
c
where is Meltano executing? Your local machine, or an EC2 instance?
k
doesn't that mean something should put files in s3 first after reading from the extractor?
ec2 instance
c
yeah I would imagine that it will write files to S3
can the EC2 instance write files to the s3_bucket/s3_key_prefix?
(if it's Amazon Linux I think the AWS CLI tools are installed by default, I could be wrong, it's been a while since I've set up new EC2 instances)
assuming you are ssh-ing in you can aws s3 cp /tmp/some-file.txt s3://bucket/prefix/some-file.txt and verify quickly that way
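or, if you'd rather check it from Python, a boto3 sketch like this should tell you whether the instance can write to that location (bucket/key are placeholders, swap in your s3_bucket / s3_key_prefix values):
Copy code
import boto3

# Placeholders -- use the bucket/prefix from your target-redshift config.
bucket = "my-staging-bucket"
key = "my/prefix/write-test.txt"

s3 = boto3.client("s3")
# Uploads a tiny test object; raises an AccessDenied-style ClientError if the
# credentials in effect can't write to that bucket/prefix.
s3.put_object(Bucket=bucket, Key=key, Body=b"write test")
print(f"wrote s3://{bucket}/{key}")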
k
yes, that succeeded. I set creds that have admin access.
c
and the credentials are associated with the IAM role attached to the EC2 instance, or you've set them in the shell with something like the AWS_PROFILE environment variable?
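(one quick way to see which identity is actually in effect is a standard STS call via boto3, nothing project-specific:)
Copy code
import boto3

# Asks AWS "who am I?" using whatever credentials boto3 resolves
# (instance role, AWS_PROFILE, or ~/.aws/credentials from aws configure).
identity = boto3.client("sts").get_caller_identity()
print(identity["Account"], identity["Arn"])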
k
and I see that S3 file in the S3 bucket.
I set them using aws configure.
c
looks like target-redshift has the configuration option remove_s3_files - is that set to false? It will help in debugging this problem.
okay - you are running meltano as a command in the same shell/same user where you've run aws configure, correct?
k
yes
c
and you’ve verified with another target that your tap is functional so you know you’re reading data
do you have any additional log output?
k
yes to remove_s3_files. Default is False. I just set it to False as well to make sure.
Yes, it outputs data if I use the jsonl target to write to JSON files.
c
certain there are no misspellings in either s3_bucket or s3_key_prefix?
k
Copy code
/home/ec2-user/.local/share/pipx/venvs/meltano/lib64/python3.9/site-packages/structlog/stdlib.py:1098: UserWarning: Remove `format_exc_info` from your processor chain if you want pretty exceptions.
  ed = p(logger, meth_name, cast(EventDict, ed))
2024-08-06T19:24:16.044211Z [debug    ] Need help fixing this problem? Visit <http://melta.no/> for troubleshooting steps, or to
join our friendly Slack community.

Traceback (most recent call last):
  File "/home/ec2-user/.local/share/pipx/venvs/meltano/lib64/python3.9/site-packages/meltano/cli/__init__.py", line 103, in _run_cli
    cli(obj={"project": None})
  File "/home/ec2-user/.local/share/pipx/venvs/meltano/lib64/python3.9/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/home/ec2-user/.local/share/pipx/venvs/meltano/lib64/python3.9/site-packages/meltano/cli/cli.py", line 46, in main
    return super().main(*args, windows_expand_args=False, **kwargs)
  File "/home/ec2-user/.local/share/pipx/venvs/meltano/lib64/python3.9/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/home/ec2-user/.local/share/pipx/venvs/meltano/lib64/python3.9/site-packages/meltano/cli/utils.py", line 662, in invoke
    super().invoke(ctx)  # type: ignore[misc]
  File "/home/ec2-user/.local/share/pipx/venvs/meltano/lib64/python3.9/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/ec2-user/.local/share/pipx/venvs/meltano/lib64/python3.9/site-packages/meltano/cli/utils.py", line 711, in invoke
    super().invoke(ctx)
  File "/home/ec2-user/.local/share/pipx/venvs/meltano/lib64/python3.9/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/ec2-user/.local/share/pipx/venvs/meltano/lib64/python3.9/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/home/ec2-user/.local/share/pipx/venvs/meltano/lib64/python3.9/site-packages/meltano/cli/params.py", line 74, in decorate
    return func(*args, **kwargs)
  File "/home/ec2-user/.local/share/pipx/venvs/meltano/lib64/python3.9/site-packages/meltano/cli/params.py", line 164, in decorate
    func(project, *args, **kwargs)
  File "/home/ec2-user/.local/share/pipx/venvs/meltano/lib64/python3.9/site-packages/click/decorators.py", line 33, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/home/ec2-user/.local/share/pipx/venvs/meltano/lib64/python3.9/site-packages/meltano/core/utils/__init__.py", line 73, in wrapper
    return asyncio.run(func(*args, **kwargs))
  File "/usr/lib64/python3.9/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/usr/lib64/python3.9/asyncio/base_events.py", line 647, in run_until_complete
    return future.result()
  File "/home/ec2-user/.local/share/pipx/venvs/meltano/lib64/python3.9/site-packages/meltano/cli/run.py", line 191, in run
    raise err
  File "/home/ec2-user/.local/share/pipx/venvs/meltano/lib64/python3.9/site-packages/meltano/cli/run.py", line 188, in run
    await _run_blocks(tracker, parsed_blocks, dry_run=dry_run)
  File "/home/ec2-user/.local/share/pipx/venvs/meltano/lib64/python3.9/site-packages/meltano/cli/run.py", line 241, in _run_blocks
    raise bare_err
  File "/home/ec2-user/.local/share/pipx/venvs/meltano/lib64/python3.9/site-packages/meltano/cli/run.py", line 222, in _run_blocks
    await blk.run()
  File "/home/ec2-user/.local/share/pipx/venvs/meltano/lib64/python3.9/site-packages/meltano/core/block/extract_load.py", line 505, in run
    await self.run_with_job()
  File "/home/ec2-user/.local/share/pipx/venvs/meltano/lib64/python3.9/site-packages/meltano/core/block/extract_load.py", line 537, in run_with_job
    await self.execute()
  File "/home/ec2-user/.local/share/pipx/venvs/meltano/lib64/python3.9/site-packages/meltano/core/block/extract_load.py", line 497, in execute
    await manager.run()
  File "/home/ec2-user/.local/share/pipx/venvs/meltano/lib64/python3.9/site-packages/meltano/core/block/extract_load.py", line 704, in run
    await self._wait_for_process_completion(self.elb.head)
  File "/home/ec2-user/.local/share/pipx/venvs/meltano/lib64/python3.9/site-packages/meltano/core/block/extract_load.py", line 777, in _wait_for_process_completion
    raise output_futures_failed.exception()  # noqa: RSE102
  File "/home/ec2-user/.local/share/pipx/venvs/meltano/lib64/python3.9/site-packages/meltano/core/logging/utils.py", line 230, in capture_subprocess_output
    if not await _write_line_writer(writer, line):
  File "/home/ec2-user/.local/share/pipx/venvs/meltano/lib64/python3.9/site-packages/meltano/core/logging/utils.py", line 198, in _write_line_writer
    await writer.wait_closed()
  File "/usr/lib64/python3.9/asyncio/streams.py", line 359, in wait_closed
    await self._protocol._get_close_waiter(self)
  File "/home/ec2-user/.local/share/pipx/venvs/meltano/lib64/python3.9/site-packages/meltano/core/logging/utils.py", line 196, in _write_line_writer
    await writer.drain()
  File "/usr/lib64/python3.9/asyncio/streams.py", line 387, in drain
    await self._protocol._drain_helper()
  File "/usr/lib64/python3.9/asyncio/streams.py", line 197, in _drain_helper
    await waiter
BrokenPipeError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/ec2-user/.local/share/pipx/venvs/meltano/lib64/python3.9/site-packages/meltano/cli/__init__.py", line 113, in _run_cli
    raise CliError(f"{troubleshooting_message}\n{err}") from err  # noqa: EM102
meltano.cli.utils.CliError: Need help fixing this problem? Visit <http://melta.no/> for troubleshooting steps, or to
join our friendly Slack community.


Need help fixing this problem? Visit <http://melta.no/> for troubleshooting steps, or to
join our friendly Slack community.
I didn't have anything for the prefix as it wasn't required, but let me set it and re-run and see.
c
yeah I don’t imagine it is required since it is just part of the key name prepended to whatever meltano is generating for the rest of the key
k
thank you for your help btw, first day trying to run meltano, but running into some issues. 😞
setting prefix doesn't seem to do anything.
c
I'd expect to see files being created on S3 at the very least, assuming the permissions are correct - unfortunately the logs and stack trace above are just an async trace, which isn't really useful. (I mean, I'm here… in this Slack… because of those error messages. Heh.)
if we can get a trace output that actually points to something in the target-redshift source code that’s at least something that can be diagnosed
k
I am running the run command with debug log level, and it doesn't seem to give an error related to S3.
c
what do you get if you run free -h?
k
image.png
I assume it should be enough as I'm testing a small load.
c
could try upping the buffer size 🙂 which Edgar suggested
k
bigger than 50m?
c
I’d expect the error message to be a bit more descriptive though if it was memory constrained, but who knows
so 50m is 25/25, I mean that should be fine
you were able to write to a different target already so really this should not be the issue
ahh
hmm
where is your state database?
are you using sqlite/the default?
k
us-east-1
no sqlite
what's state database? you mean redshift?
c
because we’re not seeing anything like attempting to retrieve the data from the tap which I believe occurs after reading from state
state database is where Meltano stores information like bookmarks from previous invocations
by default it's SQLite and lives in the .meltano directory
k
I see. I didn't set it separately
c
if you haven’t, then it should still be the default state - something you’ll want to look into changing at some point in the future since EC2 instances are ephemeral - but probably not the issue right now
k
I see meltano.db under the .meltano/ directory.
c
yeah that should be the default
if you echo MELTANO_DATABASE_URI it comes up as an empty string, right?
err, echo $MELTANO_DATABASE_URI (don't paste results in Slack)
empty string = using default Meltano state database - non-empty string = tries to connect to that database (or S3) as the state backend
k
yes, empty string
c
alternatively in your meltano.yml make sure you don't have a state_backend.uri defined to some value as well
k
right, no state_backend setting in meltano.yml
c
hmm how are we seeing BROKEN_PIPE before even printing anything from the tap logs
that’s the full logs from the run you pasted above right?
k
I have 15MB of output with debug log enabled.
but that may contain some sensitive data.
the portion of log I pasted above is the one at the end, when it errors out.
c
yeah don't paste that, ultimately we're looking for anything like 'The config profile (aws_profile) could not be found'
you’re seeing the tables get created so connectivity to Redshift is functional, but not seeing the files get created on S3
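if you want to double-check from the instance itself, a quick listing like this would show whether the target staged anything at all under that prefix (just a sketch, bucket/prefix are placeholders for your settings):
Copy code
import boto3

bucket = "my-staging-bucket"   # placeholder: your s3_bucket setting
prefix = "my/prefix/"          # placeholder: your s3_key_prefix setting

s3 = boto3.client("s3")
resp = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)
# If target-redshift got as far as staging files, they'd show up here.
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"])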
k
Right, no line like that.
I think I can upload this. The line before that is data it probably extracted from the source. After that, it seems like it's trying to flush data (it has lines like that before this too), then it eventually fails.
c
do you see any lines like 'writing # records to s3://…'?
Copy code
self.write_csv(records)
self.logger.info(f'writing {len(records)} records to s3://{self.config["s3_bucket"]}/{self.object}')
self.copy_to_s3()
self.copy_to_redshift(table, cursor)
this is from target_redshift’s sinks.py
k
right. no line like that.
interesting. so it does try to write to s3 first? weird.
Oh, I guess their guide does say the s3 bucket is for 'S3 bucket to save staging files before using COPY command'. So.. I should see files in S3 if it's set up correctly.
c
yeah
the typical workflow for loading data into Redshift is: 1. write data to S3, 2. issue a COPY command from a connected Redshift client pointing at the S3 key where the data was written in step 1
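in rough code terms the whole flow is something like this, just a sketch with placeholder names (the target uses its own client internally; this only shows the shape of step 1 and step 2 with boto3 and psycopg2):
Copy code
import boto3
import psycopg2  # any Postgres-compatible client works against Redshift

# 1. stage the batch file on S3 (placeholder file/bucket/key names)
boto3.client("s3").upload_file("/tmp/batch.csv", "my-staging-bucket", "my/prefix/batch.csv")

# 2. over a normal SQL connection, tell Redshift to COPY it in
conn = psycopg2.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",  # placeholder endpoint
    port=5439, dbname="dev", user="me", password="...",
)
with conn, conn.cursor() as cur:
    cur.execute(
        """
        COPY my_schema.my_table
        FROM 's3://my-staging-bucket/my/prefix/batch.csv'
        IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy-role'
        CSV;
        """
    )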
k
ah, there's a csv file in the temp folder. I assume that's what it should upload.
c
yeah probably
I wonder
Copy code
2024-08-06T19:24:15.414824Z [info     ]     raise TypeError(f'Object of type {o.__class__.__name__} ' cmd_type=elb consumer=True job_name=staging:tap-hubspot-to-target-redshift name=target-redshift producer=False run_id=d7ca55e2-84c2-4ad7-b724-211055c09ea6 stdio=stderr string_id=target-redshift
2024-08-06T19:24:15.415020Z [info     ] TypeError: Object of type Decimal is not JSON serializable cmd_type=elb consumer=True job_name=staging:tap-hubspot-to-target-redshift name=target-redshift producer=False run_id=d7ca55e2-84c2-4ad7-b724-211055c09ea6 stdio=stderr string_id=target-redshift
does your CSV file appear truncated or does it contain the number of rows you might expect?
the log stack trace includes this:
Copy code
2024-08-06T19:24:15.413077Z [info     ]   File "/home/ec2-user/eval/.meltano/loaders/target-redshift/venv/lib/python3.9/site-packages/target_redshift/sinks.py", line 215, in <dictcomp> cmd_type=elb consumer=True job_name=staging:tap-hubspot-to-target-redshift name=target-redshift producer=False run_id=d7ca55e2-84c2-4ad7-b724-211055c09ea6 stdio=stderr string_id=target-redshift
k
hmm.. I can't tell, there's only one file in there. I assume it probably failed somewhere along the way.
It should actually have a lot more files.
c
line 215 in sinks.py is this:
Copy code
{key: (json.dumps(value).replace("None", "") if key in object_keys else value) for key, value in record.items()}
so I'm thinking it's not completing the CSV dump
k
😞 hmm.. is that file just using boto3, using creds in .aws/credentials file?
https://github.com/TicketSwap/target-redshift/blob/main/target_redshift/sinks.py#L35 although s3 is supposed to be region free, so it may not matter.
c
yeah I don’t even think you’re getting to the S3 delivery part
it’s dying on line 215 due to a type error of some sort
my best guess is that it is related to this [for whatever reason tagged info-level] line in the log:
TypeError: Object of type Decimal is not JSON serializable
even in Python 3.12 Decimal is not JSON serializable:
Copy code
>>> import json
>>> from decimal import Decimal
>>> num = Decimal('3.14')
>>> json.dumps({'xyz': num})
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/Cellar/python@3.12/3.12.4/Frameworks/Python.framework/Versions/3.12/lib/python3.12/json/__init__.py", line 231, in dumps
    return _default_encoder.encode(obj)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/Cellar/python@3.12/3.12.4/Frameworks/Python.framework/Versions/3.12/lib/python3.12/json/encoder.py", line 200, in encode
    chunks = self.iterencode(o, _one_shot=True)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/Cellar/python@3.12/3.12.4/Frameworks/Python.framework/Versions/3.12/lib/python3.12/json/encoder.py", line 258, in iterencode
    return _iterencode(o, 0)
           ^^^^^^^^^^^^^^^^^
  File "/usr/local/Cellar/python@3.12/3.12.4/Frameworks/Python.framework/Versions/3.12/lib/python3.12/json/encoder.py", line 180, in default
    raise TypeError(f'Object of type {o.__class__.__name__} '
TypeError: Object of type Decimal is not JSON serializable
however, if default=str is specified it works:
Copy code
json.dumps({'xyz':num}, default=str)
'{"xyz": "3.14"}'
how… this exists is beyond me, I mean I have to assume people have used target-redshift in production and encountered this issue, decimals aren't exactly rare
k
lol, odd.. so.. it's possible that this particular loader isn't working. 😞
c
I don't know enough about Singer and Python to know if there are downstream issues with just adding default=str - there must be a reason why it's not present (and that reason could be… someone just forgot that Decimal is a type)
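hypothetically the patch would be as small as this (untested sketch, just showing where default=str would go in that line-215 dict comprehension):
Copy code
# Hypothetical tweak to the dict comprehension quoted above: default=str makes
# json.dumps fall back to str() for Decimal (and similar) values instead of raising.
{
    key: (json.dumps(value, default=str).replace("None", "") if key in object_keys else value)
    for key, value in record.items()
}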
k
there's only one other redshift loader, but the last commit was in 2021, so I didn't want to use it. Maybe I should try that one. It's not like Redshift has changed that much in the last three years.
c
yeah the redshift copy/unload semantics have remained mostly the same since I started using it back in like 2015 or so
k
let me try the other one and see if that works. 😞
c
I'd recommend creating an issue on the repo for this target; it seems like they are pretty responsive
k
let me do that too. thank you.
c
if you create the issue link it here and I can put in additional info if necessary
Ended up switching to the other redshift variant and was able to make it work, though with that one the primary key setting needed to be set to false (not sure what kind of issues this will cause in the future, as it does seem to be updating the correct data).