jeremy_joly
08/28/2023, 12:39 PM
`tap-spreadsheets-anywhere`. And it's giving me a hard time.
I created an S3 folder with subfolders where I drop files based on their format: `acme-ingestion-test/in/(csv, json, jsonl, xls)`.
The `csv` extractor works perfectly:
- name: jeremy_csv
  inherit_from: tap-spreadsheets-anywhere
  variant: ets
  pip_url: git+https://github.com/ets/tap-spreadsheets-anywhere.git
  label: client_name
  config:
    tables:
      - name: csv-stream
        path: s3://acme-ingestion-test
        pattern: "in/csv/"
        key_properties: []
        start_date: '2017-05-01T00:00:00Z'
        format: csv
        delimiter: ","
Same with `jsonl`:
- name: jeremy_jsonl
  inherit_from: tap-spreadsheets-anywhere
  variant: ets
  pip_url: git+https://github.com/ets/tap-spreadsheets-anywhere.git
  label: client_name
  config:
    tables:
      - name: jsonl-stream
        path: s3://acme-ingestion-test
        pattern: "in/jsonl/"
        key_properties: []
        start_date: '2017-05-01T00:00:00Z'
        format: jsonl
but following the same logic for `json`,
- name: jeremy_json
  inherit_from: tap-spreadsheets-anywhere
  variant: ets
  pip_url: git+https://github.com/ets/tap-spreadsheets-anywhere.git
  label: client_name
  config:
    tables:
      - name: json-stream
        path: s3://acme-ingestion-test
        pattern: "in/json/"
        key_properties: []
        start_date: '2017-05-01T00:00:00Z'
        format: json
gives me an error - and I'm following this format:
INFO Found credentials in shared credentials file: ~/.aws/credentials
INFO Found 68 files.
INFO Checking 68 resolved objects for any that match regular expression "in/json/" and were modified since 2017-05-01 00:00:00+00:00
INFO Processing 2 resolved objects that met our criteria. Enable debug verbosity logging for more details.
INFO Sampling in/json/ (1000 records, every 5th record).
ERROR Unable to write Catalog entry for 'json-stream' - it will be skipped due to error s3://acme-ingestion-test/in/json/ could not be parsed: Expecting value: line 1 column 1 (char 0)
...
INFO Processing 0 selected streams from Catalog
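A possible reading of the log above, sketched in Python: the object being sampled is `in/json/` itself. An S3 "folder" created in the console is a zero-byte object, so the pattern `in/json/` can also match that placeholder, and parsing empty content as JSON produces this exact message. The key names and the stricter pattern below are hypothetical, and the sketch assumes the tap applies the pattern with search-style matching (`re.search`), which is not confirmed in this thread.

```python
import json
import re

# Hypothetical S3 keys: the zero-byte "folder" placeholder plus one real file.
keys = ["in/json/", "in/json/apartments.json"]

# The pattern from the config versus a stricter, illustrative alternative.
loose = re.compile(r"in/json/")
strict = re.compile(r"in/json/.+\.json$")

loose_matches = [k for k in keys if loose.search(k)]
strict_matches = [k for k in keys if strict.search(k)]


def parse_error(payload: str) -> str:
    """Return the JSON parser's error message for payload, or '' if it parses."""
    try:
        json.loads(payload)
    except json.JSONDecodeError as exc:
        return str(exc)
    return ""


# An empty object body is what a folder placeholder would hand to the parser.
empty_error = parse_error("")

print("loose pattern matches: ", loose_matches)
print("strict pattern matches:", strict_matches)
print("error on empty content:", empty_error)
```

If the placeholder turns out to be the culprit, tightening `pattern` so that it only matches real file names is one way to skip it.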
and `xls` doesn't work either:
- name: jeremy_xls
  inherit_from: tap-spreadsheets-anywhere
  variant: ets
  pip_url: git+https://github.com/ets/tap-spreadsheets-anywhere.git
  label: client_name
  config:
    tables:
      - name: xls-stream
        path: s3://acme-ingestion-test
        pattern: "in/xls/"
        key_properties: []
        start_date: '2017-05-01T00:00:00Z'
        format: excel
        worksheet_name: "Apartments"
it gives me the following error:
INFO Found credentials in shared credentials file: ~/.aws/credentials
INFO Found 68 files.
INFO Checking 68 resolved objects for any that match regular expression "in/xls/" and were modified since 2017-05-01 00:00:00+00:00
INFO Processing 2 resolved objects that met our criteria. Enable debug verbosity logging for more details.
INFO Sampling in/xls/ (1000 records, every 5th record).
ERROR Unable to write Catalog entry for 'xls-stream' - it will be skipped due to error File is not a zip file
...
INFO Processing 0 selected streams from Catalog
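Here too the sampled object is `in/xls/` rather than a spreadsheet file. As a hedged illustration (this is not the tap's actual code path), an `.xlsx` workbook is a zip archive, and opening zero bytes as a zip archive with Python's standard `zipfile` module yields the same message. The in-memory "workbook" below is only a stand-in zip, not a real spreadsheet.

```python
import io
import zipfile


def zip_error(payload: bytes) -> str:
    """Return the zipfile error message for payload, or '' if it opens as a zip."""
    try:
        with zipfile.ZipFile(io.BytesIO(payload)):
            pass
    except zipfile.BadZipFile as exc:
        return str(exc)
    return ""


# Zero bytes, as a folder placeholder object would contain.
placeholder_error = zip_error(b"")

# A minimal valid zip archive built in memory (stand-in for an .xlsx container).
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("xl/workbook.xml", "<workbook/>")
valid_error = zip_error(buf.getvalue())

print("empty payload:", placeholder_error)
print("valid zip:    ", valid_error or "opened fine")
```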
Any ideas to troubleshoot this further?
Andy Carter
08/28/2023, 7:09 PM
Andy Carter
08/28/2023, 7:09 PM
`meltano invoke tap-spreadsheets-anywhere --dev` to give a bit more info on what files it is trying to read/sample. If you find some unexpected files, then either move them out of the folder or adjust your regex to exclude them.
Andy Carter
08/28/2023, 7:10 PM
Andy Carter
08/28/2023, 7:11 PM
jeremy_joly
08/29/2023, 10:46 AM
`json` now. However, I'm still getting an error when trying to read xls and xlsx.
I've stumbled upon this error (54) that seems to be dealing with the issue.
- name: jeremy_local_xls
  inherit_from: tap-spreadsheets-anywhere
  config:
    tables:
      - name: "xls_stream"
        path: "file:///Users/jeremyjoly/Downloads/xls"
        format: "detect"
        worksheet_name: "Sheet1"
        pattern: ".*.(xlsx|xls)"
        key_properties: []
        start_date: '2020-01-01T00:00:00Z'
➜ test-tap-jeremy meltano --log-level error invoke jeremy_local_xls --dev
INFO Generating catalog through sampling.
INFO Walking /Users/jeremyjoly/Downloads/xls.
INFO Found 2 files.
INFO Checking 2 resolved objects for any that match regular expression ".*.(xlsx|xls)" and were modified since 2020-01-01 00:00:00+00:00
INFO Processing 1 resolved objects that met our criteria. Enable debug verbosity logging for more details.
INFO Sampling file_example_XLSX_10.xlsx (1000 records, every 5th record).
ERROR Unable to write Catalog entry for 'xls_stream' - it will be skipped due to error File is not a zip file
INFO Processing 0 selected streams from Catalog
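One way to narrow this down is to check what the sampled file actually contains instead of trusting its extension. The sketch below uses only the Python standard library; `sniff_spreadsheet` is a hypothetical helper name, not part of the tap. It relies on two well-known facts: `.xlsx` files are zip archives, and legacy `.xls` files start with the OLE2 signature `D0 CF 11 E0 A1 B1 1A E1`.

```python
import zipfile
from pathlib import Path

# Signature of the OLE2 compound file format used by legacy .xls workbooks.
OLE2_MAGIC = bytes.fromhex("d0cf11e0a1b11ae1")


def sniff_spreadsheet(path: str) -> str:
    """Classify a file by its content rather than by its extension."""
    p = Path(path)
    if p.stat().st_size == 0:
        return "empty"
    if zipfile.is_zipfile(p):
        return "zip"  # the container an .xlsx reader expects
    with p.open("rb") as fh:
        head = fh.read(len(OLE2_MAGIC))
    if head == OLE2_MAGIC:
        return "ole2"  # legacy .xls, which is not a zip archive
    return "unknown"
```

For example, `sniff_spreadsheet("/Users/jeremyjoly/Downloads/xls/file_example_XLSX_10.xlsx")` could be run against the file named in the log. If it reports `"zip"`, the file on disk is probably fine and the error more likely comes from how the file is opened or streamed than from the file itself.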
What do you suggest I try next?
Andy Carter
08/29/2023, 10:58 AM
jeremy_joly
08/29/2023, 1:42 PM
Andy Carter
08/29/2023, 2:03 PM
jeremy_joly
08/29/2023, 2:06 PM
Andy Carter
08/29/2023, 2:07 PM
Henning Holgersen
09/19/2023, 6:55 AM
`Unable to write Catalog entry for 'testfile' - it will be skipped due to error File is not a zip file`) too, so I'm jumping in: From what I can tell, it comes from passing the file stream into openpyexcel. I have made an ugly workaround using temp files for this here: https://github.com/radbrt/tap-spreadsheets-anywhere/blob/xlback/tap_spreadsheets_anywhere/excel_handler.py#L66. If this seems like a sensible direction, I suggest we clean up the fix and open a PR to the original repo.
Henning Holgersen
09/21/2023, 6:49 AM