Hi team, Working with the `tap-spreadsheets-anywhe...
# troubleshooting
j
Hi team, Working with the
tap-spreadsheets-anywhere
. And it's giving me a hard time. I created s3 folder and subfolders where I drop files based of their format:
acme-ingestion-test/in/(csv, json, jsonl, xls)
. The
csv
extractor works perfectly:
Copy code
- name: jeremy_csv
    inherit_from: tap-spreadsheets-anywhere
    variant: ets
    pip_url: git+<https://github.com/ets/tap-spreadsheets-anywhere.git>
    label: client_name
    config:
      tables:
      - name: csv-stream
        path: <s3://acme-ingestion-test>
        pattern: "in/csv/"
        key_properties: []
        start_date: '2017-05-01T00:00:00Z'
        format: csv
        delimiter: ","
Same with `jsonl`:
Copy code
- name: jeremy_jsonl
    inherit_from: tap-spreadsheets-anywhere
    variant: ets
    pip_url: git+<https://github.com/ets/tap-spreadsheets-anywhere.git>
    label: client_name
    config:
      tables:
      - name: jsonl-stream
        path: <s3://acme-ingestion-test>
        pattern: "in/jsonl/"
        key_properties: []
        start_date: '2017-05-01T00:00:00Z'
        format: jsonl
but following the same logic for the
json
,
Copy code
- name: jeremy_json
    inherit_from: tap-spreadsheets-anywhere
    variant: ets
    pip_url: git+<https://github.com/ets/tap-spreadsheets-anywhere.git>
    label: client_name
    config:
      tables:
      - name: json-stream
        path: <s3://acme-ingestion-test>
        pattern: "in/json/"
        key_properties: []
        start_date: '2017-05-01T00:00:00Z'
        format: json
gives me an error - and I'm following this format:
Copy code
INFO Found credentials in shared credentials file: ~/.aws/credentials
INFO Found 68 files.
INFO Checking 68 resolved objects for any that match regular expression "in/json/" and were modified since 2017-05-01 00:00:00+00:00
INFO Processing 2 resolved objects that met our criteria. Enable debug verbosity logging for more details.
INFO Sampling in/json/ (1000 records, every 5th record).
ERROR Unable to write Catalog entry for 'json-stream' - it will be skipped due to error <s3://acme-ingestion-test/in/json/> could not be parsed: Expecting value: line 1 column 1 (char 0)
...
INFO Processing 0 selected streams from Catalog
and the
xls
doesn't work either:
Copy code
- name: jeremy_xls
    inherit_from: tap-spreadsheets-anywhere
    variant: ets
    pip_url: git+<https://github.com/ets/tap-spreadsheets-anywhere.git>
    label: client_name
    config:
      tables:
      - name: xls-stream
        path: <s3://acme-ingestion-test>
        pattern: "in/xls/"
        key_properties: []
        start_date: '2017-05-01T00:00:00Z'
        format: excel
        worksheet_name: "Apartments"
it gives me the following error:
Copy code
INFO Found credentials in shared credentials file: ~/.aws/credentials
INFO Found 68 files.
INFO Checking 68 resolved objects for any that match regular expression "in/xls/" and were modified since 2017-05-01 00:00:00+00:00
INFO Processing 2 resolved objects that met our criteria. Enable debug verbosity logging for more details.
INFO Sampling in/xls/ (1000 records, every 5th record).
ERROR Unable to write Catalog entry for 'xls-stream' - it will be skipped due to error File is not a zip file
...
INFO Processing 0 selected streams from Catalog
Any ideas to troubleshoot this further?
a
This tap will try to sample all the files that match your regex to determine column types before beginning the sync. From both your error messages it sounds like there are other files matching the regex that it is trying to read as json or xls but failing.
if you can, invoke your tap with
meltano invoke tap-spreadsheets-anywhere --dev
to give a bit more info on what files it it trying to read/sample. If you find some unexpected files then either move them from the folder or adjust your regex appropriately to exclude them
This is a very clever tap, having forked it myself to remove a lot of the smart functionality because it didn't lend itself to my exact use.
j
Hey Andy, Thanks a lot for your answer. I've improved my regex and it works for
json
now. However, I'm still getting an error when trying to read xls and xlsx. I've stumble upon this error (54) that seems to be dealing with the issue.
Copy code
- name: jeremy_local_xls
    inherit_from: tap-spreadsheets-anywhere
    config:
      tables:
      - name: "xls_stream"
        path: "file:///Users/jeremyjoly/Downloads/xls"
        format: "detect"
        worksheet_name: "Sheet1"
        pattern: ".*.(xlsx|xls)"
        key_properties: []
        start_date: '2020-01-01T00:00:00Z'
Copy code
āžœ  test-tap-jeremy meltano --log-level error invoke jeremy_local_xls --dev
INFO Generating catalog through sampling.
INFO Walking /Users/jeremyjoly/Downloads/xls.
INFO Found 2 files.
INFO Checking 2 resolved objects for any that match regular expression ".*.(xlsx|xls)" and were modified since 2020-01-01 00:00:00+00:00
INFO Processing 1 resolved objects that met our criteria. Enable debug verbosity logging for more details.
INFO Sampling file_example_XLSX_10.xlsx (1000 records, every 5th record).
ERROR Unable to write Catalog entry for 'xls_stream' - it will be skipped due to error File is not a zip file
INFO Processing 0 selected streams from Catalog
What do you suggest I try next?
a
Have you tried saving your file as xls instead, maybe that might work temporarily. Otherwise it looks like you are waiting for that merge to be approved. šŸ˜ž
j
I've tried with both xls and xlsx files, same error. I'll be waiting then šŸ˜€ Thanks for you help, appreciated.
a
j
Interesting, I've tried to use xls files of my own creation, others from templates downloaded from the web, no luck. I'll investigate some more. Thanks again.
a
If the xlsx is not sensitive you could share it here
h
Hi. I’m seeing this error (
Unable to write Catalog entry for 'testfile' - it will be skipped due to error File is not a zip file
) too, so I’m jumping in: From what I can tell, it comes from passing the file stream into openpyexcel. I have made an ugly workaround using temp files for this here: https://github.com/radbrt/tap-spreadsheets-anywhere/blob/xlback/tap_spreadsheets_anywhere/excel_handler.py#L66. If this seems like a sensible direction, I suggest we clean up the fix and open a PR to original repo.
Of course, this was already an issue, with a proposed fix way simpler than mine, in the repo: https://github.com/ets/tap-spreadsheets-anywhere/issues/54