# troubleshooting
a
Hi all! Trying to read a UTF-8 CSV file with "tap-csv". Unfortunately I always get the following error message:
'utf-8' codec can't decode byte 0xfc in position 61: invalid start byte
Command:
meltano invoke tap-csv > singer.jsonl
Can someone help me and give me a tip or even a solution? Thanks a lot!
c
Maybe your file has a BOM?
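A quick way to check for a BOM is to look at the file's first three bytes. A small sketch (the "bom-demo.csv" filename and sample content are made up for illustration):

```python
import codecs

def has_utf8_bom(path):
    # The UTF-8 BOM is the byte sequence EF BB BF at the very start of the file.
    with open(path, "rb") as f:
        return f.read(3) == codecs.BOM_UTF8

# Demo: write a file the way Excel's "CSV UTF-8" export does (with a BOM).
with open("bom-demo.csv", "wb") as f:
    f.write(codecs.BOM_UTF8 + "id,row1\n0,aäaa\n".encode("utf-8"))

print(has_utf8_bom("bom-demo.csv"))  # True
```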
s
did you check the encoding? tap-csv has an encoding setting, maybe yours isn't the default one: https://github.com/MeltanoLabs/tap-csv.
a
Thanks for your quick response! @Sven Balnojan I set the encoding to utf-8 and now it runs, but the result is not correct.
Important info: I'm working on a Windows machine, with Meltano installed directly (no Docker)...
Example data: meltano.yaml
version: 1
default_environment: dev
environments:
  - name: dev
  - name: staging
  - name: prod
project_id: maziolab-743a98d7-8954-4569-95a6-715c286b547d
plugins:
  extractors:
    - name: tap-csv
      variant: meltanolabs
      pip_url: git+https://github.com/MeltanoLabs/tap-csv.git
      config:
        files:
          - entity: values
            path: test-csv.csv
            keys:
              - id
            encoding: utf-8
  loaders:
    - name: target-csv
      variant: hotgluexyz
      pip_url: git+https://github.com/hotgluexyz/target-csv.git
      disable_collection: true
    - name: target-jsonl
      variant: andyh1203
      pip_url: target-jsonl
Command:
meltano run tap-csv target-jsonl
"test-csv.csv" content:
id, row1, row2
0, aaaa, aqwree
1, aäaa, aüwree
2, aöaa, aqwree
3, aàaa, aqwree
4, aaaa, aèwree
5, aéaa, aqwree
Result JSONL:
{"id": "0", " row1": " aaaa", " row2": " aqwree"}
{"id": "1", " row1": " a\u00e4aa", " row2": " a\u00fcwree"}
{"id": "2", " row1": " a\u00f6aa", " row2": " aqwree"}
{"id": "3", " row1": " a\u00e0aa", " row2": " aqwree"}
{"id": "4", " row1": " aaaa", " row2": " a\u00e8wree"}
{"id": "5", " row1": " a\u00e9aa", " row2": " aqwree"}
See "a\u00e4aa" at id:1, it should be "aäaa".
.env file:
TARGET_CSV_DESTINATION_PATH='output'
TARGET_CSV_DELIMITER=';'

TARGET_JSONL_DESTINATION_PATH='output'
Did I forget something or is there a global setting for UTF-8 that I don't know? Is this a Windows problem? I will try it on Linux. Thanks!
s
Well, the Unicode is fine, that's the Unicode escape for "ä". I guess you need to check your default encoding so that it also displays as "ä". And now I'm out of my depth 😄 (Fwiw, can't let this one go: my responses are slow compared to your data pipelines, once you get them to run with Meltano!)
e
I think the issue is target-jsonl doesn’t use
ensure_ascii=False
when dumping the records:
If _ensure_ascii_ is true (the default), the output is guaranteed to have all incoming non-ASCII characters escaped. If _ensure_ascii_ is false, these characters will be output as-is.
>>> import simplejson as json
>>> json.dumps({"x": "aäaa"})
'{"x": "a\\u00e4aa"}'
>>> json.dumps({"x": "aäaa"}, ensure_ascii=False)
'{"x": "aäaa"}'
s
Hi, I have had a similar problem with tap-s3-csv. The CSV file was saved from Excel. When a file is saved from Excel it is encoded as UTF-8 BOM, meaning it starts with a leading Byte Order Mark. A simple fix is to re-save the file as plain UTF-8, without the BOM. I'm not sure if this is the exact issue, but it does sound similar. There is an overall fix for this in the singer-encodings package; however, the version used by tap-s3-csv does not have the fix that allows it to read both UTF-8 and UTF-8 BOM. I can't comment on tap-csv, but I wonder if it requires the same fix as this: https://github.com/singer-io/singer-encodings/commit/e4164372b9788ac1590415d89cfaafb348209a39
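To illustrate the difference (a small sketch, not the tap's actual code): decoding BOM-prefixed bytes as plain utf-8 leaves a stray '\ufeff' character glued to the first header field, while the utf-8-sig codec strips it:

```python
import codecs

# Bytes as Excel writes them when exporting "CSV UTF-8" (with a BOM):
raw = codecs.BOM_UTF8 + b"id,row1\n"

plain = raw.decode("utf-8")    # keeps the BOM as '\ufeff': header becomes '\ufeffid'
sig = raw.decode("utf-8-sig")  # strips the BOM: header is a clean 'id'

print(repr(plain))
print(repr(sig))
```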
c
Reproducing the tests on Linux. I suppose the output I get is the correct expected output ...
/tmp/encoding » file -bi test-csv.csv
text/csv; charset=utf-8
/tmp/encoding » cat test-csv.csv
id, row1, row2
0, aaaa, aqwree
1, aäaa, aüwree
2, aöaa, aqwree
3, aàaa, aqwree
4, aaaa, aèwree
5, aéaa, aqwree
/tmp/encoding » meltano invoke tap-csv
2022-09-26T21:16:14.154190Z [info     ] Environment 'dev' is active
time=2022-09-27 07:16:17 name=tap-csv level=INFO message=tap-csv v0.0.6, Meltano SDK v0.8.0
time=2022-09-27 07:16:17 name=tap-csv level=INFO message=Skipping parse of env var settings...
time=2022-09-27 07:16:17 name=tap-csv level=INFO message=Config validation passed with 0 warnings.
time=2022-09-27 07:16:17 name=tap-csv level=INFO message=Beginning full_table sync of 'values'...
time=2022-09-27 07:16:17 name=tap-csv level=INFO message=Tap has custom mapper. Using 1 provided map(s).
{"type": "SCHEMA", "stream": "values", "schema": {"properties": {"id": {"type": ["string", "null"]}, " row1": {"type": ["string", "null"]}, " row2": {"type": ["string", "null"]}}, "type": "object"}, "key_properties": ["id"]}
time=2022-09-27 07:16:17 name=tap-csv level=INFO message=Properties () were present in the 'values' stream but not found in catalog schema. Ignoring.
{"type": "RECORD", "stream": "values", "record": {"id": "0", " row1": " aaaa", " row2": " aqwree"}, "time_extracted": "2022-09-26T21:16:17.521045Z"}
{"type": "STATE", "value": {"bookmarks": {"values": {"starting_replication_value": null}}}}
{"type": "RECORD", "stream": "values", "record": {"id": "1", " row1": " a\u00e4aa", " row2": " a\u00fcwree"}, "time_extracted": "2022-09-26T21:16:17.521222Z"}
{"type": "RECORD", "stream": "values", "record": {"id": "2", " row1": " a\u00f6aa", " row2": " aqwree"}, "time_extracted": "2022-09-26T21:16:17.521335Z"}
{"type": "RECORD", "stream": "values", "record": {"id": "3", " row1": " a\u00e0aa", " row2": " aqwree"}, "time_extracted": "2022-09-26T21:16:17.521445Z"}
{"type": "RECORD", "stream": "values", "record": {"id": "4", " row1": " aaaa", " row2": " a\u00e8wree"}, "time_extracted": "2022-09-26T21:16:17.521551Z"}
{"type": "RECORD", "stream": "values", "record": {"id": "5", " row1": " a\u00e9aa", " row2": " aqwree"}, "time_extracted": "2022-09-26T21:16:17.521661Z"}
time=2022-09-27 07:16:17 name=tap-csv level=INFO message=INFO METRIC: {"type": "counter", "metric": "record_count", "value": 6, "tags": {"stream": "values"}}
{"type": "STATE", "value": {"bookmarks": {"values": {}}}}
/tmp/encoding »
Testing with `target-postgres`:
testdb=> \l testdb
                             List of databases
  Name  | Owner | Encoding |   Collate   |    Ctype    | Access privileges
--------+-------+----------+-------------+-------------+-------------------
 testdb | crm   | UTF8     | en_AU.UTF-8 | en_AU.UTF-8 |
(1 row)

testdb=> select * from tap_csv."values";
  row1 |   row2  | id
-------+---------+----
  aaaa |  aqwree | 0
  aäaa |  aüwree | 1
  aöaa |  aqwree | 2
  aàaa |  aqwree | 3
  aaaa |  aèwree | 4
  aéaa |  aqwree | 5
(6 rows)

testdb=>
Sounds like the solution is to use a target that encodes UTF-8 by default?
a
Thank you for all the feedback! The hint about the BOM is certainly helpful and something to keep in mind with files, thanks @steve_clarke! @christoph When I try with target-postgres I get an error:
│ C:\Users\user\.local\pipx\venvs\meltano\lib\site-packages\meltano\core\logging\utils.py:196 in    │
│ _write_line_writer                                                                               │
│                                                                                                  │
│   193 │   │   │                                                                                  │
│   194 │   │   │   return False                                                                   │
│   195 │   else:                                                                                  │
│ > 196 │   │   writer.writeline(line.decode())                                                    │
│   197 │                                                                                          │
│   198 │   return True                                                                            │
│   199                                                                                            │
│                                                                                                  │
│ ┌─────────────────────────────────────────── locals ───────────────────────────────────────────┐ │
│ │   line = b'  File                                                                            │ │
│ │          "path....… │ │
│ │ writer = <meltano.core.logging.output_logger.Out object at 0x0000028DB70521D0>               │ │
│ └──────────────────────────────────────────────────────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────────────────────────────────────────────┘
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfc in position 61: invalid start byte

Need help fixing this problem? Visit http://melta.no/ for troubleshooting steps, or to
join our friendly Slack community.

'utf-8' codec can't decode byte 0xfc in position 61: invalid start byte
PS C:\Users\path....... > meltano run tap-csv target-postgres
I don't understand it right now... The CSV file is UTF-8 encoded and the Postgres DB anyway.
c
@christoph When I try with target-postgres I get an error:
I suspect it may be a problem specific to Python and Meltano on Windows. I wouldn't really know where to start, though, other than the fact that Windows obviously has a long history of supporting codepages (conhost.exe is the first subsystem that comes to mind in relation to codepage settings on Windows, and it would probably be a very dodgy way to run Unicode code) .... But I have no idea how that would relate to Python and Meltano ...
a
@christoph I'll try it once on the WSL and an Ubuntu machine, see where it works correctly and I find something out. Thank you!
s
I wondered about the codepage as well @christoph. I ran into a similar issue extracting data with some database taps against old databases (Sybase): the tap failed on invalid characters. Do we know what the original encoding is? We changed the tap to allow a setting for the character set of the original encoding in the database. By setting the character set to cp1252, the characters were correctly returned and loaded into target-snowflake. As a matter of interest, the tap-sybase where we implemented the character set support is here: https://github.com/s7clarke10/tap-sybase .
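The byte from the original traceback fits this theory: 0xfc is never valid as a UTF-8 start byte, but in cp1252 (and latin-1) it is 'ü', which matches the umlauts in the sample data. A quick illustration of the error (not the tap's code):

```python
bad = b"\xfc"  # the byte from the 'invalid start byte' error

print(bad.decode("cp1252"))  # ü  (same result in latin-1)

try:
    bad.decode("utf-8")
except UnicodeDecodeError as e:
    # 'utf-8' codec can't decode byte 0xfc in position 0: invalid start byte
    print(e)
```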
c
Yeah. Codepages on Windows can catch people out. What I was actually thinking is how the fact that conhost.exe used to be troublesome in regards to Unicode (since conhost.exe is a bit of a legacy beast) could relate to stdin and stdout in Python on Windows?! At least in Windows 7 (and probably also in Windows 10), conhost.exe is non-Unicode, and if Python uses conhost.exe APIs for stdin and stdout ... there would be a chance of encoding issues ... ?! I'm really just spitballing at the moment, since I have no idea how Python on Windows handles Unicode ....
Python definitely has platform-specific Unicode handling on Windows ... so chances are that Meltano will be impacted: https://stackoverflow.com/questions/5419/python-unicode-and-the-windows-console
a
Tested the whole thing on Linux (Ubuntu 20, Python 3.9) and everything works fine... Windows and the little things that get in the way....
c
That's what I thought. Since Meltano uses stdin and stdout, it will encounter the platform-specific issues of Python Unicode handling on Windows.
I wonder if @visch has come across this type of situation before, considering he is probably one of the main users of Meltano on Windows.
v
There's a bunch of different directions this thread has gone, so I'll try to break down the questions I see.
Question 1: UTF-8 CSV file with "tap-csv" fails.
Answer: add the utf-8 encoding setting.
Question 2: target-jsonl doesn't output utf-8 characters as expected.
Answer: target-jsonl's output is only "wrong" if you want it to output them unescaped. https://meltano.slack.com/archives/C01TCRBBJD7/p1664207558904659?thread_ts=1664120890.041059&cid=C01TCRBBJD7
Question 3: tap-s3-csv doesn't handle utf-8 with a BOM. https://meltano.slack.com/archives/C01TCRBBJD7/p1664219685373569?thread_ts=1664120890.041059&cid=C01TCRBBJD7
Answer: you have to tell the tap that there's a BOM.
Question 4: something in Meltano fails when something is run. https://meltano.slack.com/archives/C01TCRBBJD7/p1664385759184169?thread_ts=1664120890.041059&cid=C01TCRBBJD7
@ammann_michael is this Q/A setup right? If so, to answer question 4, can you show me what command you're running exactly, and what the input is? It looks like "cat file | meltano invoke target-postgres", but it's impossible to know.
Question 5: does Meltano use stdin / stdout on Windows? (Not the exact question, but I'll say it is :D) https://meltano.slack.com/archives/C01TCRBBJD7/p1665130044553279?thread_ts=1664120890.041059&cid=C01TCRBBJD7
Meltano does use stdin and stdout from procs, but asyncio tends to just make things work for you here, which means you don't have much to worry about. Now, it's very probable I haven't hit this exact situation yet, so there's probably something off! Not clear to me what exactly the problem is, though 😕
Not sure if that helps, but I needed it for me. @ammann_michael if you could let me know / give me a test case, I'll happily test it / look into it.
Since we are all kind of spitballing, here are some things I've learned about Python along the way. The good news on Windows and Linux is that things tend to just work; https://docs.python.org/3/howto/unicode.html is a pretty good read and goes over why a lot of the encoding problems are kind of just handled automagically. Since Python 3.6 even Windows has defaulted to UTF-8 for file names (this isn't for everything, unfortunately): https://docs.python.org/3/howto/unicode.html#:~:text=Python%20on%20MacOS%20has%20used%20UTF%2D8[…]%20to%20using%20UTF%2D8%20on%20Windows%20as%20well making a lot of this stuff "just work".
Started a separate thread about tap-csv and encoding; going to add some stuff there.
Reading more into this, I think I see the main discrepancy, and yeah, I've hit this a few times now that my brain is back in those moments: https://docs.python.org/3/library/functions.html#open:~:text=encoding%20is%20the,of%20supported%20encodings. The default encoding on Linux systems tends to be utf-8; the default encoding on Windows is cp1252. To change that, you can set the environment variable PYTHONIOENCODING to utf-8, which will make Windows start with UTF-8 as well.
That's probably our issue right there, similar to @steve_clarke's case. https://github.com/hotgluexyz/target-csv/pull/16/files is what I did.
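The platform difference is easy to observe from Python itself (a sketch; the "utf8-demo.txt" filename is made up):

```python
import locale
import sys

# open() without encoding= falls back to the locale's preferred encoding:
# typically 'UTF-8' on Linux, historically 'cp1252' on Windows.
print(locale.getpreferredencoding(False))
print(sys.stdout.encoding)

# Passing encoding explicitly sidesteps the platform default entirely:
with open("utf8-demo.txt", "w", encoding="utf-8") as f:
    f.write("aäaa\n")
```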
s
If you want to be able to read both utf-8 and utf-8 with BOM, you can read the file like this: file_stream = codecs.iterdecode(iterable, encoding='utf-8-sig'). From the codecs docs:
Before any of the Unicode characters is written to the file, a UTF-8 encoded BOM (which looks like this as a byte sequence: 0xef, 0xbb, 0xbf) is written ... On decoding utf-8-sig will skip those three bytes if they appear as the first three bytes in the file.
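A runnable sketch of that approach (the filename is illustrative); utf-8-sig reads both plain UTF-8 and UTF-8-with-BOM files transparently:

```python
import codecs

# A file that starts with a UTF-8 BOM (a plain UTF-8 file would work the same way):
with open("maybe-bom.csv", "wb") as f:
    f.write(codecs.BOM_UTF8 + "id,row1\n0,aäaa\n".encode("utf-8"))

with open("maybe-bom.csv", "rb") as f:
    lines = list(codecs.iterdecode(f, encoding="utf-8-sig"))

print(repr(lines[0]))  # 'id,row1\n' with no stray '\ufeff' in the first field
```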
c
@visch: Out of curiosity, do you normally use the "old" conhost.exe or the new Windows Terminal for interactive meltano on Windows?
For anybody interested in the history of the Window command line ... https://devblogs.microsoft.com/commandline/windows-command-line-backgrounder/
v
I try to stick to pwsh (new one) , I've had success with PowerShell as well though
Windows terminal is a wrapper like conemu, not really anything too new. Pwsh is the real "game changer"
c
Windows terminal is a wrapper like conemu, not really anything too new.
I disagree. Windows Terminal is one of the main consumers of the "new" ConPTY in Windows 10 (which is what my question was aimed at). I believe that conhost on Windows 10 should, in theory, behave in a similar way to native ConPTY applications like Windows Terminal when it comes to Unicode. https://devblogs.microsoft.com/commandline/windows-command-line-introducing-the-windows-pseudo-console-conpty/#welcome-to-the-windows-pseudo-console-conpty
https://docs.python.org/3/library/functions.html#open:~:text=encoding%20is%20the,of%20supported%20encodings. Default encoding for linux systems tends to be utf-8, default encoding for Windows is cp1252
To change that you can also set the environment variable PYTHONIOENCODING to utf-8. Which will default Windows to start with UTF-8 as well.
PYTHONIOENCODING would seem to be the key enabler to fix Michael's problem on Windows.
v
Ahh, yes, you're right. I thought ConEmu switched to ConPTY. I also thought all of the ConPTY stuff was more important for Meltano earlier on, and now I'm not as convinced, but I haven't dived deep enough. What I know is that pwsh works well, as does PowerShell.
c
So, Python switched to supporting the new conhost/conpty infrastructure in Python 3.6 by the looks of the documentation: https://docs.python.org/3/using/cmdline.html#envvar-PYTHONIOENCODING https://docs.python.org/3/using/cmdline.html#envvar-PYTHONLEGACYWINDOWSSTDIO
Great deep dive into Meltano on Windows. Thanks Derek!
v
@christoph ahhh
a
@visch Sorry for my late feedback, things were just busy on my end (whose aren't...). The command was "meltano run tap-csv target-postgres" with the simple CSV I posted above (https://meltano.slack.com/archives/C01TCRBBJD7/p1664197519571499?thread_ts=1664120890.041059&cid=C01TCRBBJD7). Now that I'm working on Linux and running Meltano in a Linux Docker container, everything is OK for me. When I tried Meltano on Windows, it was a bit frustrating with the whole UTF-8 business. For me everything is clear so far, and your hints will surely help me the next time I try it on Windows, thank you! @christoph A "tap-postgres" fix to make this work on Python 3.10 would be great, but you are already involved and working on it. Thanks a lot!