ammann_michael
09/25/2022, 3:48 PM
'utf-8' codec can't decode byte 0xfc in position 61: invalid start byte
Command: meltano invoke tap-csv > singer.jsonl
Can someone help me and give me a tip or even a solution?
Thanks a lot!
christoph
09/25/2022, 9:51 PM
Sven Balnojan
09/26/2022, 7:39 AM
ammann_michael
09/26/2022, 1:05 PM
As important info: I work on a Windows computer, with Meltano installed directly, without Docker.
Example data: meltano.yaml
version: 1
default_environment: dev
environments:
- name: dev
- name: staging
- name: prod
project_id: maziolab-743a98d7-8954-4569-95a6-715c286b547d
plugins:
  extractors:
  - name: tap-csv
    variant: meltanolabs
    pip_url: git+https://github.com/MeltanoLabs/tap-csv.git
    config:
      files:
      - entity: values
        path: test-csv.csv
        keys:
        - id
        encoding: utf-8
  loaders:
  - name: target-csv
    variant: hotgluexyz
    pip_url: git+https://github.com/hotgluexyz/target-csv.git
    disable_collection: true
  - name: target-jsonl
    variant: andyh1203
    pip_url: target-jsonl
Command:
meltano run tap-csv target-jsonl
"test-csv.csv" content:
id, row1, row2
0, aaaa, aqwree
1, aäaa, aüwree
2, aöaa, aqwree
3, aàaa, aqwree
4, aaaa, aèwree
5, aéaa, aqwree
Result JSONL:
{"id": "0", " row1": " aaaa", " row2": " aqwree"}
{"id": "1", " row1": " a\u00e4aa", " row2": " a\u00fcwree"}
{"id": "2", " row1": " a\u00f6aa", " row2": " aqwree"}
{"id": "3", " row1": " a\u00e0aa", " row2": " aqwree"}
{"id": "4", " row1": " aaaa", " row2": " a\u00e8wree"}
{"id": "5", " row1": " a\u00e9aa", " row2": " aqwree"}
Note that "a\u00e4aa" at id 1 should be "aäaa".
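A small check, using only Python's standard `json` module, suggests that the escaped form still carries the same data: `\u00e4` is a JSON escape for `ä`, so a JSON parser returns the original string. The sample line is copied from the result above; nothing here is specific to target-jsonl.

```python
import json

# One line copied from the JSONL result above (raw string, so the
# backslashes reach the JSON parser unchanged).
line = r'{"id": "1", " row1": " a\u00e4aa", " row2": " a\u00fcwree"}'

record = json.loads(line)

# The parser turns the \uXXXX escapes back into the original characters.
row1 = record[" row1"]
row2 = record[" row2"]

print(row1 == " aäaa", row2 == " aüwree")
```

So the file looks different from the CSV in a text editor, but a consumer that parses the JSON should see the umlauts again.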
.env file:
TARGET_CSV_DESTINATION_PATH='output'
TARGET_CSV_DELIMITER=';'
TARGET_JSONL_DESTINATION_PATH='output'
Did I forget something, or is there a global setting for UTF-8 that I don't know about?
Is this a Windows problem?
I will try it on Linux.
Thanks!
Sven Balnojan
09/26/2022, 1:29 PM
edgar_ramirez_mondragon
09/26/2022, 3:52 PM
ensure_ascii=False when dumping the records:
If ensure_ascii is true (the default), the output is guaranteed to have all incoming non-ASCII characters escaped. If ensure_ascii is false, these characters will be output as-is.
>>> import simplejson as json
>>> json.dumps({"x": "aäaa"})
'{"x": "a\\u00e4aa"}'
>>> json.dumps({"x": "aäaa"}, ensure_ascii=False)
'{"x": "aäaa"}'
steve_clarke
09/26/2022, 7:14 PM
christoph
09/26/2022, 9:17 PM
/tmp/encoding » file -bi test-csv.csv
text/csv; charset=utf-8
/tmp/encoding » cat test-csv.csv
id, row1, row2
0, aaaa, aqwree
1, aäaa, aüwree
2, aöaa, aqwree
3, aàaa, aqwree
4, aaaa, aèwree
5, aéaa, aqwree
/tmp/encoding » meltano invoke tap-csv
2022-09-26T21:16:14.154190Z [info ] Environment 'dev' is active
time=2022-09-27 07:16:17 name=tap-csv level=INFO message=tap-csv v0.0.6, Meltano SDK v0.8.0
time=2022-09-27 07:16:17 name=tap-csv level=INFO message=Skipping parse of env var settings...
time=2022-09-27 07:16:17 name=tap-csv level=INFO message=Config validation passed with 0 warnings.
time=2022-09-27 07:16:17 name=tap-csv level=INFO message=Beginning full_table sync of 'values'...
time=2022-09-27 07:16:17 name=tap-csv level=INFO message=Tap has custom mapper. Using 1 provided map(s).
{"type": "SCHEMA", "stream": "values", "schema": {"properties": {"id": {"type": ["string", "null"]}, " row1": {"type": ["string", "null"]}, " row2": {"type": ["string", "null"]}}, "type": "object"}, "key_properties": ["id"]}
time=2022-09-27 07:16:17 name=tap-csv level=INFO message=Properties () were present in the 'values' stream but not found in catalog schema. Ignoring.
{"type": "RECORD", "stream": "values", "record": {"id": "0", " row1": " aaaa", " row2": " aqwree"}, "time_extracted": "2022-09-26T21:16:17.521045Z"}
{"type": "STATE", "value": {"bookmarks": {"values": {"starting_replication_value": null}}}}
{"type": "RECORD", "stream": "values", "record": {"id": "1", " row1": " a\u00e4aa", " row2": " a\u00fcwree"}, "time_extracted": "2022-09-26T21:16:17.521222Z"}
{"type": "RECORD", "stream": "values", "record": {"id": "2", " row1": " a\u00f6aa", " row2": " aqwree"}, "time_extracted": "2022-09-26T21:16:17.521335Z"}
{"type": "RECORD", "stream": "values", "record": {"id": "3", " row1": " a\u00e0aa", " row2": " aqwree"}, "time_extracted": "2022-09-26T21:16:17.521445Z"}
{"type": "RECORD", "stream": "values", "record": {"id": "4", " row1": " aaaa", " row2": " a\u00e8wree"}, "time_extracted": "2022-09-26T21:16:17.521551Z"}
{"type": "RECORD", "stream": "values", "record": {"id": "5", " row1": " a\u00e9aa", " row2": " aqwree"}, "time_extracted": "2022-09-26T21:16:17.521661Z"}
time=2022-09-27 07:16:17 name=tap-csv level=INFO message=INFO METRIC: {"type": "counter", "metric": "record_count", "value": 6, "tags": {"stream": "values"}}
{"type": "STATE", "value": {"bookmarks": {"values": {}}}}
/tmp/encoding »
christoph
09/26/2022, 9:28 PM
testdb=> \l testdb
                             List of databases
  Name  | Owner | Encoding |   Collate   |    Ctype    | Access privileges
--------+-------+----------+-------------+-------------+-------------------
 testdb | crm   | UTF8     | en_AU.UTF-8 | en_AU.UTF-8 |
(1 row)
testdb=> select * from tap_csv."values";
 row1  |  row2   | id
-------+---------+----
  aaaa |  aqwree | 0
  aäaa |  aüwree | 1
  aöaa |  aqwree | 2
  aàaa |  aqwree | 3
  aaaa |  aèwree | 4
  aéaa |  aqwree | 5
(6 rows)
testdb=>
Sounds like the solution is to use a target that encodes UTF-8 by default?
ammann_michael
09/28/2022, 5:22 PM
@christoph When I try with target-postgres I get an error:
│ C:\Users\user\.local\pipx\venvs\meltano\lib\site-packages\meltano\core\logging\utils.py:196 in │
│ _write_line_writer │
│ │
│ 193 │ │ │ │
│ 194 │ │ │ return False │
│ 195 │ else: │
│ > 196 │ │ writer.writeline(line.decode()) │
│ 197 │ │
│ 198 │ return True │
│ 199 │
│ │
│ ┌─────────────────────────────────────────── locals ───────────────────────────────────────────┐ │
│ │ line = b' File │ │
│ │ "path....… │ │
│ │ writer = <meltano.core.logging.output_logger.Out object at 0x0000028DB70521D0> │ │
│ └──────────────────────────────────────────────────────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────────────────────────────────────────────┘
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfc in position 61: invalid start byte
Need help fixing this problem? Visit http://melta.no/ for troubleshooting steps, or to
join our friendly Slack community.
'utf-8' codec can't decode byte 0xfc in position 61: invalid start byte
PS C:\Users\path....... > meltano run tap-csv target-postgres
I don't understand it right now...
The CSV file is UTF-8 encoded, and so is the Postgres DB.
ammann_michael
09/28/2022, 5:25 PM
christoph
09/29/2022, 1:56 AM
@christoph When I try with target-postgres I get an error:
I suspect it may be a problem that is specific to Python and Meltano on Windows only? I wouldn't really know where to start, though, other than the fact that Windows has a long history of supporting codepages (e.g. conhost.exe would probably be a very dodgy way to run Unicode code, since that is the first subsystem that comes to mind in relation to codepage settings on Windows) ... But I have no idea how that would relate to Python and Meltano ...
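One observation that fits the codepage suspicion: byte 0xfc is `ü` in cp1252 (and Latin-1), but it can never start a UTF-8 sequence. The sketch below reproduces the same kind of error message with a made-up byte string. It assumes, without confirmation from the traceback, that the line Meltano tried to decode was written by a subprocess using the Windows codepage.

```python
# Hypothetical subprocess output, encoded with the Windows codepage.
raw = "für".encode("cp1252")  # b'f\xfcr'

try:
    raw.decode()  # bytes.decode() defaults to UTF-8
    error = None
except UnicodeDecodeError as exc:
    error = str(exc)

print(error)

# Decoding with the codepage that produced the bytes works.
text = raw.decode("cp1252")
```

If that assumption holds, the input CSV and the Postgres database could both be UTF-8 and the error would still appear, because the offending bytes would come from log output rather than from the data.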
ammann_michael
09/29/2022, 4:51 PM
steve_clarke
09/29/2022, 11:37 PM
christoph
09/30/2022, 12:13 AM
Could the fact that conhost.exe used to be troublesome with regard to Unicode (since conhost.exe is a bit of a legacy beast) relate to stdin and stdout in Python on Windows?!
At least in Windows 7 (and probably also in Windows 10), conhost.exe is non-Unicode and if Python uses conhost.exe APIs for stdin and stdout ... there would be a chance of encoding issues ... ?!
I'm really just spitballing at the moment, since I have no idea how Python on Windows handles Unicode ....
christoph
09/30/2022, 12:39 AM
ammann_michael
10/06/2022, 4:10 PM
christoph
10/07/2022, 8:07 AM
christoph
10/07/2022, 8:09 AM
visch
10/07/2022, 1:00 PM
Question 1
UTF-8 CSV file with "tap-csv" fails
answer: add utf-8 encoding
Question 2
target-jsonl doesn't have utf-8 characters as expected
answer: target-jsonl output is "wrong" if you want it to output that way https://meltano.slack.com/archives/C01TCRBBJD7/p1664207558904659?thread_ts=1664120890.041059&cid=C01TCRBBJD7
Question 3
https://meltano.slack.com/archives/C01TCRBBJD7/p1664219685373569?thread_ts=1664120890.041059&cid=C01TCRBBJD7
tap-s3-csv doesn't handle utf-8 with a BOM
answer: you have to tell the tap that there's a BOM
Question 4
https://meltano.slack.com/archives/C01TCRBBJD7/p1664385759184169?thread_ts=1664120890.041059&cid=C01TCRBBJD7
something in meltano fails when something is run
@ammann_michael is this q/a setup right? If so, to answer question 4, can you show me what command you're running exactly, and what the input is? It looks like cat file | meltano invoke target-postgres but it's impossible to know
Question 5
https://meltano.slack.com/archives/C01TCRBBJD7/p1665130044553279?thread_ts=1664120890.041059&cid=C01TCRBBJD7
Does meltano use stdin / stdout on Windows (Not the exact question but I'll say it is :D)
Meltano does use stdin and stdout from procs, but asyncio tends to just make things work for you here, which means you don't have much to worry about. Now it's very probable I haven't hit this exact situation yet, so there's probably something off! Not clear to me what exactly the problem is though 😕
visch
10/07/2022, 1:01 PM
visch
10/07/2022, 1:25 PM
visch
10/07/2022, 1:32 PM
steve_clarke
10/08/2022, 3:09 AM
Before any of the Unicode characters is written to the file, a UTF-8 encoded BOM (which looks like this as a byte sequence: 0xef, 0xbb, 0xbf) is written ... On decoding utf-8-sig will skip those three bytes if they appear as the first three bytes in the file.
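A minimal illustration of the quoted behavior, using in-memory bytes instead of a real file. The CSV header is borrowed from the example earlier in the thread; how tap-s3-csv itself is told about the BOM is not shown here.

```python
import codecs

# A UTF-8 file with a BOM starts with the bytes 0xef 0xbb 0xbf.
data = codecs.BOM_UTF8 + "id, row1, row2\n".encode("utf-8")

# Plain utf-8 keeps the BOM as the character U+FEFF at the start of the text.
with_bom = data.decode("utf-8")

# utf-8-sig skips the BOM if it is present ...
without_bom = data.decode("utf-8-sig")

# ... and also accepts input that has no BOM at all.
no_bom_input = "id, row1, row2\n".encode("utf-8").decode("utf-8-sig")

print(repr(with_bom[0]), without_bom.startswith("id"))
```

With the BOM left in place, the first column name would start with U+FEFF instead of `id`, which is the kind of mismatch a key lookup on `id` would trip over.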
christoph
10/09/2022, 9:26 PM
christoph
10/09/2022, 9:26 PM
visch
10/09/2022, 9:30 PM
visch
10/09/2022, 9:31 PM
christoph
10/09/2022, 9:36 PM
Windows terminal is a wrapper like conemu, not really anything too new.
I disagree. Windows Terminal is one of the main consumers of the "new" conPTY in Windows 10 (which is what my question was aimed at). I believe that, in theory, conhost on Windows 10 should behave in a similar way to native conpty applications like Windows Terminal when it comes to Unicode.
https://devblogs.microsoft.com/commandline/windows-command-line-introducing-the-windows-pseudo-console-conpty/#welcome-to-the-windows-pseudo-console-conpty
christoph
10/09/2022, 9:43 PM
https://docs.python.org/3/library/functions.html#open:~:text=encoding%20is%20the,of%20supported%20encodings. Default encoding for Linux systems tends to be utf-8, default encoding for Windows is cp1252
To change that you can also set the environment variable PYTHONIOENCODING to utf-8, which will make Windows default to UTF-8 as well.
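To see which encodings a given Python installation actually uses, a quick check like the following can help. One caveat, based on the Python documentation rather than on this thread: `PYTHONIOENCODING` only applies to stdin, stdout and stderr, while the default for `open()` follows the locale unless UTF-8 mode is enabled (for example with `PYTHONUTF8=1`, available since Python 3.7). Whether either setting resolves the Meltano error discussed here is untested.

```python
import locale
import os
import sys

report = {
    # Encoding of the standard output stream (affected by PYTHONIOENCODING).
    "stdout": getattr(sys.stdout, "encoding", None),
    # Default used by open() when no encoding= argument is given.
    "open_default": locale.getpreferredencoding(False),
    # True when Python runs in UTF-8 mode (PYTHONUTF8=1 or -X utf8).
    "utf8_mode": bool(sys.flags.utf8_mode),
    # The raw environment variables, or None if they are not set.
    "PYTHONIOENCODING": os.environ.get("PYTHONIOENCODING"),
    "PYTHONUTF8": os.environ.get("PYTHONUTF8"),
}

for name, value in report.items():
    print(f"{name}: {value}")
```

Running it once in the same shell that starts Meltano, before and after setting the variables, should show which of the values actually change.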
PYTHONIOENCODING would seem to be the key enabler to fix Michael's problem on Windows.
visch
10/09/2022, 9:43 PM
christoph
10/09/2022, 9:46 PM
christoph
10/09/2022, 9:48 PM
visch
10/09/2022, 10:01 PM
ammann_michael
10/13/2022, 1:09 PM