Hi! I load multiple CSV files (same schema, differ...
# troubleshooting
d
Hi! I load multiple CSV files (same schema, different content) to one table using
tap-csv
→`target-postgres`. Each file has it’s own unique name which is related to unique source, but rows are not unique between the files. Is there a way to get the source file name from
tap-csv
?
I’ve tried
tap-spreadsheets-anywhere
, but it doesn’t have an option to set encoding and requires addition plugin to do inline mappings
p
It doesnt look like its supported in the meltanolabs tap-csv right now but it wouldnt be a hard thing to implement. It would mean: • add a new config option for something like
append_file_metadata
• add the file name as a header and add to each record if that config is set
d
Makes sense! I’ve tested a dummy implementation and it works fine
Ah, I see you are the maintainer of this tap. What would be the best way to implement this option for production use?
p
Yep! I've seen
_sdc_source_file
used in a couple places like s3-csv and sftp. I think defaulting to current behavior with a config option to include those _sdc columns would be a good implementation. Does that makes sense? Was that what you were asking?
d
It looks like
append_file_metadata
should be on the tap options level, not on the files one. Right?
I could try to implement it if you don’t mind
p
yeah I think at the tap top level makes sense to me. Also I'm open to names other than
append_file_metadata
, that was just the first thing that came to mind. I wonder if theres precedence elsewhere for a name.
I could try to implement it if you don’t mind
100% that would be awesome and much appreciated! 🚀
d
I’ve also spotted the issue https://github.com/MeltanoLabs/tap-csv/issues/13, the file’s modification date meta could be used as a feature for state as implemented in
tap-spreadsheets-anywhere
p
yeah I good point, I think I see it implemented here
Some people argued that file timestamps arent reliable, I cant remember the context of that though
d
Agree, for me it’s a good option to skip reloading the same file again every time the pipeline starts. So probably it should be an optional way to track the state
I’ve created a pull request https://github.com/MeltanoLabs/tap-csv/pull/126 If option
add_metadata_columns
is enabled, adds 3 new columns to the stream:
_sdc_source_file
_sdc_source_lineno
_sdc_source_file_mtime
p
@Denis I. I left a few minor comments and suggestions but overall it looks awesome. Thanks again for contributing this 🙏
d
@pat_nadolny I pushed the updates
p
@Denis I. its merged, release is published, and the hub is updated with the new setting. You can run
meltano lock tap-csv --update
to pull in that hub update
d
@pat_nadolny great, thank you!
s
I did a lot of enhancements to the tap-s3-csv tap here : https://github.com/s7clarke10/pipelinewise-tap-s3-csv . I use the _sdc_source_file as it is useful to get a unique name as the source system writes out a timestamp as part of the file name. It was super useful. We did quite a few other enhancements like: 1. supporting getting the s3 file via private link i.e. ignoring a proxy would would go over the public internet 2. removing certain characters from the file 3. override the target stream name to add a suffix to make tables unique if you have multiple providers. 4. treating the absence of a value as NULL for database targets 5. overriding any detected datatype and treating the values as strings regardless of whether they are dates or numbers. 6. Support for UTF-8-BOM csv's (one's created by Microsoft Excel) It would be kind of nice if tap-csv is the way forward to push some of this logic as enhancements to this tap. I'm not sure however if these tap supports s3, azure blob storage etc as a data source so that would be the first step?
d
Hi @steve_clarke! During my csv extractors research I found multiple solutions published on Meltano Hub: 1. https://hub.meltano.com/extractors/tap-csv a. Local files only b. Meltano Singer SDK c. Only CSV format, standard parsing settings, strings only output d. Support encodings 2. https://hub.meltano.com/extractors/tap-s3-csv/ a. S3 files only b. Singer.io c. Only CSV format, limited parsing settings, strings only output with manual date format override d. Doesn’t support encodings 3. https://hub.meltano.com/extractors/tap-gmail-csv/ a. Gmail attachments only b. Singer.io c. CSV/Excel formats, standard parsing settings, schema override, decompression 4. https://hub.meltano.com/extractors/tap-spreadsheets-anywhere/ a. Advanced file sources list (local, s3, http(s), ssh/scp/sftp, webhdfs, gcs, azure) with pattern-based files filtering b. Singer Tap c. Multiple formats, including CSV, extended parsing settings and additional bad format fixes, schema override, decompression d. Doesn’t support encodings (I’ve recently made a pull request with this feature https://github.com/ets/tap-spreadsheets-anywhere/pull/43) Each of the implementations has it’s own benefits. The last one is the most advanced in terms of supported sources and formats, the first one is the simplest. IMO the cleanest way to keep the sources/formats zoo manageable is to separate file sources layer from file formats layer and keep extending each of them independently. So it might be a good idea to take the transfer layer from
tap-spreadsheets-anywhere
and use it as a base SDK for taps like
tap-csv
,
tap-json
, etc. @pat_nadolny @Matt Menzenski @edgar_ramirez_mondragon @aaronsteers what do you think guys?
m
I’m not opposed to this idea
e
IMO the cleanest way to keep the sources/formats zoo manageable is to separate file sources layer from file formats layer and keep extending each of them independently.
I agree! The SDK feels like the right place for the transfer layer. File tap and stream helpers would only need to be implemented once.
d
Or as suggested there it could be solved by introducing something like a plugin wrapper for rclone
@pat_nadolny I pushed minor changes related to meta data columns here: https://github.com/MeltanoLabs/tap-csv/pull/132 Please, take a look