# troubleshooting
a
Hi all! Trying to read a UTF-8 CSV file with "tap-csv". Unfortunately I always get the following error message:
'utf-8' codec can't decode byte 0xfc in position 61: invalid start byte
Command:
meltano invoke tap-csv > singer.jsonl
Can someone help me and give me a tip or even a solution? Thanks a lot!
c
Maybe your file has a BOM?
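A quick way to check for a BOM is to look at the file's first three bytes. A small sketch (the "bom-demo.csv" filename and sample content are made up for illustration):

```python
import codecs

def has_utf8_bom(path):
    # The UTF-8 BOM is the byte sequence EF BB BF at the very start of the file.
    with open(path, "rb") as f:
        return f.read(3) == codecs.BOM_UTF8

# Demo: write a file the way Excel's "CSV UTF-8" export does (with a BOM).
with open("bom-demo.csv", "wb") as f:
    f.write(codecs.BOM_UTF8 + "id,row1\n0,aäaa\n".encode("utf-8"))

print(has_utf8_bom("bom-demo.csv"))  # True
```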
s
did you check the encoding? tap-csv has an encoding setting, maybe yours isn't the default one: https://github.com/MeltanoLabs/tap-csv.
a
Thanks for your quick response! @Sven Balnojan I set the encoding to utf-8 and now it runs, but the result is not correct.
Important info: I'm working on a Windows machine, with Meltano installed directly (no Docker)...
Example data: meltano.yaml
version: 1
default_environment: dev
environments:
  - name: dev
  - name: staging
  - name: prod
project_id: maziolab-743a98d7-8954-4569-95a6-715c286b547d
plugins:
  extractors:
    - name: tap-csv
      variant: meltanolabs
      pip_url: git+https://github.com/MeltanoLabs/tap-csv.git
      config:
        files:
          - entity: values
            path: test-csv.csv
            keys:
              - id
            encoding: utf-8
  loaders:
    - name: target-csv
      variant: hotgluexyz
      pip_url: git+https://github.com/hotgluexyz/target-csv.git
      disable_collection: true
    - name: target-jsonl
      variant: andyh1203
      pip_url: target-jsonl
Command:
meltano run tap-csv target-jsonl
"test-csv.csv" content:
id, row1, row2
0, aaaa, aqwree
1, aäaa, aüwree
2, aöaa, aqwree
3, aàaa, aqwree
4, aaaa, aèwree
5, aéaa, aqwree
Result JSONL:
{"id": "0", " row1": " aaaa", " row2": " aqwree"}
{"id": "1", " row1": " a\u00e4aa", " row2": " a\u00fcwree"}
{"id": "2", " row1": " a\u00f6aa", " row2": " aqwree"}
{"id": "3", " row1": " a\u00e0aa", " row2": " aqwree"}
{"id": "4", " row1": " aaaa", " row2": " a\u00e8wree"}
{"id": "5", " row1": " a\u00e9aa", " row2": " aqwree"}
See "a\u00e4aa" at id:1, it should be "aäaa".
.env file:
TARGET_CSV_DESTINATION_PATH='output'
TARGET_CSV_DELIMITER=';'

TARGET_JSONL_DESTINATION_PATH='output'
Did I forget something or is there a global setting for UTF-8 that I don't know? Is this a Windows problem? I will try it on Linux. Thanks!
s
Well, the Unicode is fine, that's the Unicode escape for "ä". I guess you need to check your default encoding so that it also displays as "ä". And now I'm out of my depth 😄 (Fwiw, can't let this one go: my responses are slow compared to your data pipelines, once you get them to run with Meltano!)
e
I think the issue is target-jsonl doesn’t use
ensure_ascii=False
when dumping the records:
If _ensure_ascii_ is true (the default), the output is guaranteed to have all incoming non-ASCII characters escaped. If _ensure_ascii_ is false, these characters will be output as-is.
>>> import simplejson as json
>>> json.dumps({"x": "aäaa"})
'{"x": "a\\u00e4aa"}'
>>> json.dumps({"x": "aäaa"}, ensure_ascii=False)
'{"x": "aäaa"}'
s
Hi, I have had a similar problem with tap-s3-csv. The CSV file was saved from Excel. When a file is saved from Excel it is encoded as UTF-8 BOM, meaning it starts with a leading Byte Order Mark. A simple fix is to re-save the file as plain UTF-8, without the BOM. I'm not sure if this is the exact issue, but it does sound similar. There is an overall fix for this in the singer-encodings package; however, the version used by tap-s3-csv does not have the fix that allows it to read both UTF-8 and UTF-8 BOM. I can't comment on tap-csv, but I wonder if it requires the same fix as this: https://github.com/singer-io/singer-encodings/commit/e4164372b9788ac1590415d89cfaafb348209a39
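To illustrate the difference (a small sketch, not the tap's actual code): decoding BOM-prefixed bytes as plain utf-8 leaves a stray '\ufeff' character glued to the first header field, while the utf-8-sig codec strips it:

```python
import codecs

# Bytes as Excel writes them when exporting "CSV UTF-8" (with a BOM):
raw = codecs.BOM_UTF8 + b"id,row1\n"

plain = raw.decode("utf-8")    # keeps the BOM as '\ufeff': header becomes '\ufeffid'
sig = raw.decode("utf-8-sig")  # strips the BOM: header is a clean 'id'

print(repr(plain))
print(repr(sig))
```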
c
Reproducing the tests on Linux. I suppose the output I get is the correct expected output ...
/tmp/encoding » file -bi test-csv.csv
text/csv; charset=utf-8
/tmp/encoding » cat test-csv.csv
id, row1, row2
0, aaaa, aqwree
1, aäaa, aüwree
2, aöaa, aqwree
3, aàaa, aqwree
4, aaaa, aèwree
5, aéaa, aqwree
/tmp/encoding » meltano invoke tap-csv
2022-09-26T21:16:14.154190Z [info     ] Environment 'dev' is active
time=2022-09-27 07:16:17 name=tap-csv level=INFO message=tap-csv v0.0.6, Meltano SDK v0.8.0
time=2022-09-27 07:16:17 name=tap-csv level=INFO message=Skipping parse of env var settings...
time=2022-09-27 07:16:17 name=tap-csv level=INFO message=Config validation passed with 0 warnings.
time=2022-09-27 07:16:17 name=tap-csv level=INFO message=Beginning full_table sync of 'values'...
time=2022-09-27 07:16:17 name=tap-csv level=INFO message=Tap has custom mapper. Using 1 provided map(s).
{"type": "SCHEMA", "stream": "values", "schema": {"properties": {"id": {"type": ["string", "null"]}, " row1": {"type": ["string", "null"]}, " row2": {"type": ["string", "null"]}}, "type": "object"}, "key_properties": ["id"]}
time=2022-09-27 07:16:17 name=tap-csv level=INFO message=Properties () were present in the 'values' stream but not found in catalog schema. Ignoring.
{"type": "RECORD", "stream": "values", "record": {"id": "0", " row1": " aaaa", " row2": " aqwree"}, "time_extracted": "2022-09-26T21:16:17.521045Z"}
{"type": "STATE", "value": {"bookmarks": {"values": {"starting_replication_value": null}}}}
{"type": "RECORD", "stream": "values", "record": {"id": "1", " row1": " a\u00e4aa", " row2": " a\u00fcwree"}, "time_extracted": "2022-09-26T21:16:17.521222Z"}
{"type": "RECORD", "stream": "values", "record": {"id": "2", " row1": " a\u00f6aa", " row2": " aqwree"}, "time_extracted": "2022-09-26T21:16:17.521335Z"}
{"type": "RECORD", "stream": "values", "record": {"id": "3", " row1": " a\u00e0aa", " row2": " aqwree"}, "time_extracted": "2022-09-26T21:16:17.521445Z"}
{"type": "RECORD", "stream": "values", "record": {"id": "4", " row1": " aaaa", " row2": " a\u00e8wree"}, "time_extracted": "2022-09-26T21:16:17.521551Z"}
{"type": "RECORD", "stream": "values", "record": {"id": "5", " row1": " a\u00e9aa", " row2": " aqwree"}, "time_extracted": "2022-09-26T21:16:17.521661Z"}
time=2022-09-27 07:16:17 name=tap-csv level=INFO message=INFO METRIC: {"type": "counter", "metric": "record_count", "value": 6, "tags": {"stream": "values"}}
{"type": "STATE", "value": {"bookmarks": {"values": {}}}}
/tmp/encoding »
Testing with `target-postgres`:
testdb=> \l testdb
                             List of databases
  Name  | Owner | Encoding |   Collate   |    Ctype    | Access privileges
--------+-------+----------+-------------+-------------+-------------------
 testdb | crm   | UTF8     | en_AU.UTF-8 | en_AU.UTF-8 |
(1 row)

testdb=> select * from tap_csv."values";
  row1 |   row2  | id
-------+---------+----
  aaaa |  aqwree | 0
  aäaa |  aüwree | 1
  aöaa |  aqwree | 2
  aàaa |  aqwree | 3
  aaaa |  aèwree | 4
  aéaa |  aqwree | 5
(6 rows)

testdb=>
Sounds like the solution is to use a target that encodes UTF-8 by default?
a
Thank you for all the feedback! The hint about the BOM is certainly helpful and something to keep in mind with files, thanks @steve_clarke! @christoph When I try with target-postgres I get an error:
│ C:\Users\user\.local\pipx\venvs\meltano\lib\site-packages\meltano\core\logging\utils.py:196 in    │
│ _write_line_writer                                                                               │
│                                                                                                  │
│   193 │   │   │                                                                                  │
│   194 │   │   │   return False                                                                   │
│   195 │   else:                                                                                  │
│ > 196 │   │   writer.writeline(line.decode())                                                    │
│   197 │                                                                                          │
│   198 │   return True                                                                            │
│   199                                                                                            │
│                                                                                                  │
│ ┌─────────────────────────────────────────── locals ───────────────────────────────────────────┐ │
│ │   line = b'  File                                                                            │ │
│ │          "path....… │ │
│ │ writer = <meltano.core.logging.output_logger.Out object at 0x0000028DB70521D0>               │ │
│ └──────────────────────────────────────────────────────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────────────────────────────────────────────┘
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfc in position 61: invalid start byte

Need help fixing this problem? Visit http://melta.no/ for troubleshooting steps, or to
join our friendly Slack community.

'utf-8' codec can't decode byte 0xfc in position 61: invalid start byte
PS C:\Users\path....... > meltano run tap-csv target-postgres
I don't understand it right now... The CSV file is UTF-8 encoded and the Postgres DB anyway.
c
@christoph When I try with target-postgres I get an error:
I suspect it may be a problem specific to Python and Meltano on Windows. I wouldn't really know where to start, though, other than the fact that Windows obviously has a long history of supporting codepages (conhost.exe is the first subsystem that comes to mind in relation to codepage settings on Windows, and it would probably be a very dodgy way to run Unicode code) .... But I have no idea how that would relate to Python and Meltano ...
a
@christoph I'll try it once on the WSL and an Ubuntu machine, see where it works correctly and I find something out. Thank you!
s
I wondered about the codepage as well @christoph. I ran into a similar issue extracting data with some database taps against old databases (Sybase): the tap failed on invalid characters. Do we know what the original encoding is? We changed the tap to allow a setting for the character set of the original encoding in the database. By setting the character set to cp1252, the characters were correctly returned and loaded into target-snowflake. As a matter of interest, the tap-sybase where we implemented the character set support is here: https://github.com/s7clarke10/tap-sybase .
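The byte from the original traceback fits this theory: 0xfc is never valid as a UTF-8 start byte, but in cp1252 (and latin-1) it is 'ü', which matches the umlauts in the sample data. A quick illustration of the error (not the tap's code):

```python
bad = b"\xfc"  # the byte from the 'invalid start byte' error

print(bad.decode("cp1252"))  # ü  (same result in latin-1)

try:
    bad.decode("utf-8")
except UnicodeDecodeError as e:
    # 'utf-8' codec can't decode byte 0xfc in position 0: invalid start byte
    print(e)
```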
c
Yeah. Codepages on Windows can catch people out. What I was actually thinking is how the fact that conhost.exe used to be troublesome in regards to Unicode (since conhost.exe is a bit of a legacy beast) could relate to stdin and stdout in Python on Windows?! At least in Windows 7 (and probably also in Windows 10), conhost.exe is non-Unicode, and if Python uses conhost.exe APIs for stdin and stdout ... there would be a chance of encoding issues ... ?! I'm really just spitballing at the moment, since I have no idea how Python on Windows handles Unicode ....
Python definitely has platform-specific Unicode handling on Windows ... so chances are that Meltano will be impacted: https://stackoverflow.com/questions/5419/python-unicode-and-the-windows-console
a
Tested the whole thing on Linux (Ubuntu 20, Python 3.9) and everything works fine... Windows and the little things that get in the way....
c
That's what I thought. Since Meltano uses stdin and stdout, it will encounter the platform-specific issues of Python Unicode handling on Windows.
I wonder if @visch has come across this type of situation before, considering he is probably one of the main users of Meltano on Windows.
v
There's a bunch of different directions this thread has gone, so I'll try to break down the questions I see.
Question 1: UTF-8 CSV file with "tap-csv" fails.
Answer: add the utf-8 encoding setting.
Question 2: target-jsonl doesn't output utf-8 characters as expected.
Answer: target-jsonl's output is only "wrong" if you want it to output them unescaped. https://meltano.slack.com/archives/C01TCRBBJD7/p1664207558904659?thread_ts=1664120890.041059&cid=C01TCRBBJD7
Question 3: tap-s3-csv doesn't handle utf-8 with a BOM. https://meltano.slack.com/archives/C01TCRBBJD7/p1664219685373569?thread_ts=1664120890.041059&cid=C01TCRBBJD7
Answer: you have to tell the tap that there's a BOM.
Question 4: something in Meltano fails when something is run. https://meltano.slack.com/archives/C01TCRBBJD7/p1664385759184169?thread_ts=1664120890.041059&cid=C01TCRBBJD7
@ammann_michael is this Q/A setup right? If so, to answer question 4, can you show me what command you're running exactly, and what the input is? It looks like "cat file | meltano invoke target-postgres", but it's impossible to know.
Question 5: does Meltano use stdin / stdout on Windows? (Not the exact question, but I'll say it is :D) https://meltano.slack.com/archives/C01TCRBBJD7/p1665130044553279?thread_ts=1664120890.041059&cid=C01TCRBBJD7
Meltano does use stdin and stdout from procs, but asyncio tends to just make things work for you here, which means you don't have much to worry about. Now, it's very probable I haven't hit this exact situation yet, so there's probably something off! Not clear to me what exactly the problem is, though 😕
Not sure if that helps, but I needed it for me. @ammann_michael if you could let me know / give me a test case, I'll happily test it / look into it.
Since we are all kind of spitballing, here are some things I've learned about Python along the way. The good news on Windows and Linux is that things tend to just work; https://docs.python.org/3/howto/unicode.html is a pretty good read and goes over why a lot of the encoding problems are kind of just handled automagically. Since Python 3.6 even Windows has defaulted to UTF-8 for file names (this isn't for everything, unfortunately): https://docs.python.org/3/howto/unicode.html#:~:text=Python%20on%20MacOS%20has%20used%20UTF%2D8[…]%20to%20using%20UTF%2D8%20on%20Windows%20as%20well making a lot of this stuff "just work".
Started a separate thread about tap-csv and encoding; going to add some stuff there.
Reading more into this, I think I see the main discrepancy, and yeah, I've hit this a few times now that my brain is back in those moments: https://docs.python.org/3/library/functions.html#open:~:text=encoding%20is%20the,of%20supported%20encodings. The default encoding on Linux systems tends to be utf-8; the default encoding on Windows is cp1252. To change that, you can set the environment variable PYTHONIOENCODING to utf-8, which will make Windows start with UTF-8 as well.
That's probably our issue right there, similar to @steve_clarke's case. https://github.com/hotgluexyz/target-csv/pull/16/files is what I did.
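The platform difference is easy to observe from Python itself (a sketch; the "utf8-demo.txt" filename is made up):

```python
import locale
import sys

# open() without encoding= falls back to the locale's preferred encoding:
# typically 'UTF-8' on Linux, historically 'cp1252' on Windows.
print(locale.getpreferredencoding(False))
print(sys.stdout.encoding)

# Passing encoding explicitly sidesteps the platform default entirely:
with open("utf8-demo.txt", "w", encoding="utf-8") as f:
    f.write("aäaa\n")
```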
s
If you want to be able to read both utf-8 and utf-8 with BOM, you can read the file like this: file_stream = codecs.iterdecode(iterable, encoding='utf-8-sig'). From the codecs docs:
Before any of the Unicode characters is written to the file, a UTF-8 encoded BOM (which looks like this as a byte sequence: 0xef, 0xbb, 0xbf) is written ... On decoding utf-8-sig will skip those three bytes if they appear as the first three bytes in the file.
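A runnable sketch of that approach (the filename is illustrative); utf-8-sig reads both plain UTF-8 and UTF-8-with-BOM files transparently:

```python
import codecs

# A file that starts with a UTF-8 BOM (a plain UTF-8 file would work the same way):
with open("maybe-bom.csv", "wb") as f:
    f.write(codecs.BOM_UTF8 + "id,row1\n0,aäaa\n".encode("utf-8"))

with open("maybe-bom.csv", "rb") as f:
    lines = list(codecs.iterdecode(f, encoding="utf-8-sig"))

print(repr(lines[0]))  # 'id,row1\n' with no stray '\ufeff' in the first field
```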
c
@visch: Out of curiosity, do you normally use the "old" conhost.exe or the new Windows Terminal for interactive meltano on Windows?
For anybody interested in the history of the Window command line ... https://devblogs.microsoft.com/commandline/windows-command-line-backgrounder/
v
I try to stick to pwsh (new one) , I've had success with PowerShell as well though
Windows terminal is a wrapper like conemu, not really anything too new. Pwsh is the real "game changer"
c
Windows terminal is a wrapper like conemu, not really anything too new.
I disagree. Windows Terminal is one of the main consumers of the "new" ConPTY in Windows 10 (which is what my question was aimed at). I believe that conhost on Windows 10 should, in theory, behave in a similar way to native ConPTY applications like Windows Terminal when it comes to Unicode. https://devblogs.microsoft.com/commandline/windows-command-line-introducing-the-windows-pseudo-console-conpty/#welcome-to-the-windows-pseudo-console-conpty
https://docs.python.org/3/library/functions.html#open:~:text=encoding%20is%20the,of%20supported%20encodings. Default encoding for linux systems tends to be utf-8, default encoding for Windows is cp1252
To change that you can also set the environment variable PYTHONIOENCODING to utf-8. Which will default Windows to start with UTF-8 as well.
PYTHONIOENCODING would seem to be the key enabler to fix Michael's problem on Windows.
v
Ahh, yes, you're right. I thought ConEmu switched to ConPTY. I also thought all of the ConPTY stuff was more important for Meltano earlier on, and now I'm not as convinced, but I haven't dived deep enough. What I know is that pwsh works well, as does PowerShell.
c
So, Python switched to supporting the new conhost/conpty infrastructure in Python 3.6 by the looks of the documentation: https://docs.python.org/3/using/cmdline.html#envvar-PYTHONIOENCODING https://docs.python.org/3/using/cmdline.html#envvar-PYTHONLEGACYWINDOWSSTDIO
Great deep dive into Meltano on Windows. Thanks Derek!
v
@christoph ahhh
a
@visch Sorry for my late feedback, things were just busy on my end (whose aren't...). The command was "meltano run tap-csv target-postgres" with the simple CSV I posted above (https://meltano.slack.com/archives/C01TCRBBJD7/p1664197519571499?thread_ts=1664120890.041059&cid=C01TCRBBJD7). Now that I'm working on Linux and running Meltano in a Linux Docker container, everything is OK for me. When I tried Meltano on Windows, it was a bit frustrating with the whole UTF-8 business. For me everything is clear so far, and your hints will surely help me the next time I try it on Windows, thank you! @christoph A "tap-postgres" fix to make this work on Python 3.10 would be great, but you are already involved and working on it. Thanks a lot!