Hi I load multiple CSV files same schema different content t Meltano #troubleshooting

Hi! I load multiple CSV files (same schema, differ...

Denis I.

03/21/2023, 3:00 PM

Hi! I load multiple CSV files (same schema, different content) to one table using

tap-csv

→`target-postgres`. Each file has it’s own unique name which is related to unique source, but rows are not unique between the files. Is there a way to get the source file name from

tap-csv

Denis I.

03/21/2023, 3:01 PM

I’ve tried

tap-spreadsheets-anywhere

, but it doesn’t have an option to set encoding and requires addition plugin to do inline mappings

pat_nadolny

03/21/2023, 3:15 PM

It doesnt look like its supported in the meltanolabs tap-csv right now but it wouldnt be a hard thing to implement. It would mean: • add a new config option for something like

append_file_metadata

• add the file name as a header and add to each record if that config is set

Denis I.

03/21/2023, 3:22 PM

Makes sense! I’ve tested a dummy implementation and it works fine

Denis I.

03/21/2023, 3:26 PM

Ah, I see you are the maintainer of this tap. What would be the best way to implement this option for production use?

pat_nadolny

03/21/2023, 3:32 PM

Yep! I've seen

_sdc_source_file

used in a couple places like s3-csv and sftp. I think defaulting to current behavior with a config option to include those _sdc columns would be a good implementation. Does that makes sense? Was that what you were asking?

pat_nadolny

03/21/2023, 3:36 PM

I opened an issue to track it https://github.com/MeltanoLabs/tap-csv/issues/125

Denis I.

03/21/2023, 3:44 PM

It looks like

append_file_metadata

should be on the tap options level, not on the files one. Right?

Denis I.

03/21/2023, 3:45 PM

I could try to implement it if you don’t mind

pat_nadolny

03/21/2023, 3:47 PM

yeah I think at the tap top level makes sense to me. Also I'm open to names other than

append_file_metadata

, that was just the first thing that came to mind. I wonder if theres precedence elsewhere for a name.

pat_nadolny

03/21/2023, 3:47 PM

I could try to implement it if you don’t mind

100% that would be awesome and much appreciated! 🚀

Denis I.

03/21/2023, 3:53 PM

I’ve also spotted the issue https://github.com/MeltanoLabs/tap-csv/issues/13, the file’s modification date meta could be used as a feature for state as implemented in

tap-spreadsheets-anywhere

pat_nadolny

03/21/2023, 4:02 PM

yeah I good point, I think I see it implemented here

pat_nadolny

03/21/2023, 4:04 PM

Some people argued that file timestamps arent reliable, I cant remember the context of that though

Denis I.

03/21/2023, 4:13 PM

Agree, for me it’s a good option to skip reloading the same file again every time the pipeline starts. So probably it should be an optional way to track the state

Denis I.

03/21/2023, 7:01 PM

I’ve created a pull request https://github.com/MeltanoLabs/tap-csv/pull/126 If option

add_metadata_columns

is enabled, adds 3 new columns to the stream:

_sdc_source_file

_sdc_source_lineno

_sdc_source_file_mtime

pat_nadolny

03/21/2023, 7:46 PM

@Denis I. I left a few minor comments and suggestions but overall it looks awesome. Thanks again for contributing this 🙏

Denis I.

03/21/2023, 8:31 PM

@pat_nadolny I pushed the updates

pat_nadolny

03/21/2023, 10:07 PM

@Denis I. its merged, release is published, and the hub is updated with the new setting. You can run

meltano lock tap-csv --update

to pull in that hub update

Denis I.

03/22/2023, 6:56 AM

@pat_nadolny great, thank you!

steve_clarke

03/27/2023, 12:28 AM

I did a lot of enhancements to the tap-s3-csv tap here : https://github.com/s7clarke10/pipelinewise-tap-s3-csv . I use the _sdc_source_file as it is useful to get a unique name as the source system writes out a timestamp as part of the file name. It was super useful. We did quite a few other enhancements like: 1. supporting getting the s3 file via private link i.e. ignoring a proxy would would go over the public internet 2. removing certain characters from the file 3. override the target stream name to add a suffix to make tables unique if you have multiple providers. 4. treating the absence of a value as NULL for database targets 5. overriding any detected datatype and treating the values as strings regardless of whether they are dates or numbers. 6. Support for UTF-8-BOM csv's (one's created by Microsoft Excel) It would be kind of nice if tap-csv is the way forward to push some of this logic as enhancements to this tap. I'm not sure however if these tap supports s3, azure blob storage etc as a data source so that would be the first step?

Denis I.

03/27/2023, 11:41 AM

Hi @steve_clarke! During my csv extractors research I found multiple solutions published on Meltano Hub: 1. https://hub.meltano.com/extractors/tap-csv a. Local files only b. Meltano Singer SDK c. Only CSV format, standard parsing settings, strings only output d. Support encodings 2. https://hub.meltano.com/extractors/tap-s3-csv/ a. S3 files only b. Singer.io c. Only CSV format, limited parsing settings, strings only output with manual date format override d. Doesn’t support encodings 3. https://hub.meltano.com/extractors/tap-gmail-csv/ a. Gmail attachments only b. Singer.io c. CSV/Excel formats, standard parsing settings, schema override, decompression 4. https://hub.meltano.com/extractors/tap-spreadsheets-anywhere/ a. Advanced file sources list (local, s3, http(s), ssh/scp/sftp, webhdfs, gcs, azure) with pattern-based files filtering b. Singer Tap c. Multiple formats, including CSV, extended parsing settings and additional bad format fixes, schema override, decompression d. Doesn’t support encodings (I’ve recently made a pull request with this feature https://github.com/ets/tap-spreadsheets-anywhere/pull/43) Each of the implementations has it’s own benefits. The last one is the most advanced in terms of supported sources and formats, the first one is the simplest. IMO the cleanest way to keep the sources/formats zoo manageable is to separate file sources layer from file formats layer and keep extending each of them independently. So it might be a good idea to take the transfer layer from

tap-spreadsheets-anywhere

and use it as a base SDK for taps like

tap-csv

tap-json

, etc. @pat_nadolny @Matt Menzenski @edgar_ramirez_mondragon @aaronsteers what do you think guys?

Matt Menzenski

03/27/2023, 6:15 PM

I’m not opposed to this idea

edgar_ramirez_mondragon

03/27/2023, 6:28 PM

IMO the cleanest way to keep the sources/formats zoo manageable is to separate file sources layer from file formats layer and keep extending each of them independently.

I agree! The SDK feels like the right place for the transfer layer. File tap and stream helpers would only need to be implemented once.

Denis I.

03/29/2023, 2:59 PM

Or as suggested there it could be solved by introducing something like a plugin wrapper for rclone

Denis I.

03/31/2023, 3:10 PM

@pat_nadolny I pushed minor changes related to meta data columns here: https://github.com/MeltanoLabs/tap-csv/pull/132 Please, take a look

Open in Slack

Previous Next