# getting-started
n
Hi all! I'm very new to Meltano and trying to implement what seems like a very simple solution, but I'm facing various roadblocks. Effectively, I have an FTP server that I want to connect to, download a specific file from, unzip, and load into my SQL db. I tried using `tap-spreadsheets-anywhere`, but I'm hitting an error that I suspect is related to the file compression:
```
WARNING unable to transparently decompress <_io.BufferedReader name=7> because it seems to lack a string-like .name
ERROR Unable to write Catalog entry for 'myfeed' - it will be skipped due to error underlying stream is not seekable
```
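For context on the second error: discovery apparently needs to rewind the stream to sample it, which a raw FTP data stream can't do. A minimal stdlib illustration of a readable-but-not-seekable stream (the `ForwardOnly` class is just a stand-in, not the tap's actual code):

```python
import io

class ForwardOnly(io.RawIOBase):
    """Stand-in for an FTP data stream: readable, but cannot seek."""
    def __init__(self, data):
        self._buf = io.BytesIO(data)
    def readable(self):
        return True
    def readinto(self, b):
        return self._buf.readinto(b)

stream = io.BufferedReader(ForwardOnly(b"id|title\n1|widget\n"))
print(stream.seekable())   # False
print(stream.read())       # reading forward works fine
try:
    stream.seek(0)         # re-sampling the file would need this
except OSError as exc:
    print(f"seek failed: {exc}")
```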
Ideally I would expect that we could achieve this with an extractor to just download the file, a mapper to extract it, and then a loader to actually load the data into my db.
a
Can you share your `meltano.yml` config for the extractor?
The `tap-spreadsheets-anywhere` extractor should handle both downloading and extracting the file; you won't need a mapper for this.
e
The error message suggests that setting `max_sampling_read: 0` and overriding the schema would fix it. The problem seems to be that the file can't be inspected to infer a schema because the stream isn't seekable. I would hope the tap or the FS library would fall back to reopening the stream, but maybe that's not possible.
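A hedged sketch of what that override could look like, using Meltano's `schema` extra to pin column types instead of relying on discovery (the stream and column names follow the config in this thread; the types are illustrative guesses):

```yaml
- name: tap-spreadsheets-anywhere
  schema:
    myfeed:
      id: { type: ["string", "null"] }
      title: { type: ["string", "null"] }
      sku: { type: ["string", "null"] }
      category: { type: ["string", "null"] }
```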
n
Here's my config for `tap-spreadsheets-anywhere`, including the `max_sampling_read` change. I'm still unclear on overriding the schema, and I'm still facing an error:
```yaml
- name: tap-spreadsheets-anywhere
  variant: ets
  pip_url: git+https://github.com/ets/tap-spreadsheets-anywhere.git
  config:
    tables:
    - path: ftp://myserver.com/
      max_sampling_read: 0
      name: myfeed
      pattern: myfile.txt.gz
      start_date: '2024-10-14T00:00:00Z'
      key_properties: []
      format: csv
      delimiter: "|"
      field_names:
      - id
      - title
      - sku
      - category
```
Here's an extended snippet of the error received:
```
2024-10-16 11:59:00 INFO Checking 354 resolved objects for any that match regular expression "myfile.txt.*" and were modified since 2024-10-14 00:00:00+00:00
2024-10-16 11:59:00 INFO Processing 1 resolved objects that met our criteria. Enable debug verbosity logging for more details.
2024-10-16 11:59:00 INFO Sampling myfile.txt.gz (0 records, every 5th record).
2024-10-16 11:59:00 WARNING unable to transparently decompress <_io.BufferedReader name=4> because it seems to lack a string-like .name
2024-10-16 11:59:00 ERROR Unable to write Catalog entry for 'myfeed' - it will be skipped due to error line contains NUL
2024-10-16 11:59:00 CRITICAL line contains NUL
2024-10-16 11:59:00 Traceback (most recent call last):
2024-10-16 11:59:00   File "/projects/.meltano/extractors/tap-spreadsheets-anywhere/venv/bin/tap-spreadsheets-anywhere", line 8, in <module>
2024-10-16 11:59:00     sys.exit(main())
2024-10-16 11:59:00   File "/projects/.meltano/extractors/tap-spreadsheets-anywhere/venv/lib/python3.9/site-packages/singer/utils.py", line 235, in wrapped
2024-10-16 11:59:00     return fnc(*args, **kwargs)
2024-10-16 11:59:00   File "/projects/.meltano/extractors/tap-spreadsheets-anywhere/venv/lib/python3.9/site-packages/tap_spreadsheets_anywhere/__init__.py", line 151, in main
2024-10-16 11:59:00     catalog = discover(tables_config)
2024-10-16 11:59:00   File "/projects/.meltano/extractors/tap-spreadsheets-anywhere/venv/lib/python3.9/site-packages/tap_spreadsheets_anywhere/__init__.py", line 92, in discover
2024-10-16 11:59:00     raise err
2024-10-16 11:59:00   File "/projects/.meltano/extractors/tap-spreadsheets-anywhere/venv/lib/python3.9/site-packages/tap_spreadsheets_anywhere/__init__.py", line 68, in discover
2024-10-16 11:59:00     samples = file_utils.sample_files(table_spec, target_files, sample_rate=sample_rate,
2024-10-16 11:59:00   File "/projects/.meltano/extractors/tap-spreadsheets-anywhere/venv/lib/python3.9/site-packages/tap_spreadsheets_anywhere/file_utils.py", line 111, in sample_files
2024-10-16 11:59:00     to_return += sample_file(table_spec, target_file['key'], sample_rate, max_records)
2024-10-16 11:59:00   File "/projects/.meltano/extractors/tap-spreadsheets-anywhere/venv/lib/python3.9/site-packages/tap_spreadsheets_anywhere/file_utils.py", line 87, in sample_file
2024-10-16 11:59:00     for row in iterator:
2024-10-16 11:59:00   File "/projects/.meltano/extractors/tap-spreadsheets-anywhere/venv/lib/python3.9/site-packages/tap_spreadsheets_anywhere/csv_handler.py", line 8, in generator_wrapper
2024-10-16 11:59:00     for row in reader:
2024-10-16 11:59:00   File "/usr/local/lib/python3.9/csv.py", line 111, in __next__
2024-10-16 11:59:00     row = next(self.reader)
2024-10-16 11:59:00 _csv.Error: line contains NUL
```
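The `line contains NUL` error is consistent with the CSV reader being handed still-compressed gzip bytes, since gzip headers contain NUL bytes. A small stdlib-only sketch of what the reader sees when decompression is skipped:

```python
import gzip
import io

# Build a small gzipped "CSV" in memory, standing in for myfile.txt.gz.
buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode="wb") as gz:
    gz.write(b"id|title\n1|widget\n")
raw = buf.getvalue()

# If decompression is skipped, the csv module reads the raw gzip bytes.
# The gzip header includes NUL bytes, which csv rejects ("line contains NUL").
print(b"\x00" in raw)          # True
print(gzip.decompress(raw))    # b'id|title\n1|widget\n' once actually decompressed
```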
e
Oh, maybe the CSV isn't valid?
n
I believe it should be valid? I can download and extract the file manually with no issues. This is why I was asking for separate stages that I can debug independently.
From what I can tell, the `smart_open` library seems to be failing to read the file name, so it never decompresses the file.
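That matches the warning text: the decompression codec is picked from the stream's `.name` (i.e. a filename extension), and the FTP reader exposes an integer file descriptor instead of a string. A simplified sketch of that kind of dispatch logic (illustrative only, not `smart_open`'s actual code):

```python
import gzip
import io

def infer_codec(fileobj):
    # Simplified extension-based dispatch: decompression is chosen
    # from the stream's *name*, never from its contents.
    name = getattr(fileobj, "name", None)
    if not isinstance(name, str):
        return None  # e.g. name=7 (an int fd): no transparent decompression
    if name.endswith(".gz"):
        return gzip.open
    return None

class Named:
    name = "myfile.txt.gz"

print(infer_codec(Named()))                          # gzip.open
anonymous = io.BufferedReader(io.BytesIO(b""))       # no usable .name, like the FTP stream
print(infer_codec(anonymous))                        # None -> data stays compressed
```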
I've confirmed `tap-spreadsheets-anywhere` works with the local file, so it seems to be a bug in how either it or `smart_open` handles FTP files.
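Until that's fixed upstream, one possible workaround (an untested sketch; the server URL follows the thread, and the `/tmp` staging path is an assumption) is to stage the file locally before the tap sees it, so the stream has a string filename ending in `.txt`:

```shell
# Fetch and decompress outside the tap, then point tap-spreadsheets-anywhere
# at the local copy (path: file:///tmp, pattern: myfile.txt in meltano.yml).
# curl -sS "ftp://myserver.com/myfile.txt.gz" -o /tmp/myfile.txt.gz
printf 'id|title|sku|category\n1|widget|W-1|tools\n' | gzip -c > /tmp/myfile.txt.gz  # stand-in for the download
gunzip -f /tmp/myfile.txt.gz     # leaves /tmp/myfile.txt
head -n 1 /tmp/myfile.txt        # id|title|sku|category
```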