# getting-started
a
Hi, I am trying to figure out what to do with file taps. How does one add a new id column (autoincrement) if the base file has no key column? For instance, using the `iris` dataset as a CSV file:
```
sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,setosa
4.9,3,1.4,0.2,setosa
4.7,3.2,1.3,0.2,setosa
4.6,3.1,1.5,0.2,setosa
```
I have been tinkering with the `tap-csv` extractor to no avail. Should I define my own custom extractor? Many thanks!
m
I think you should do it after loading the data. tap -> target -> transform
v
I agree with mert, you could also look at mappers @alex_b https://sdk.meltano.com/en/latest/stream_maps.html
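Untested, but with an SDK-based tap like the MeltanoLabs `tap-csv` you can put `stream_maps` straight into its config, something along these lines (assumes the file is configured with entity `iris`):
```yaml
# meltano.yml sketch (untested): derive an extra column via a stream map
plugins:
  extractors:
    - name: tap-csv
      config:
        stream_maps:
          iris:
            # md5 over the concatenated row values; only unique if the rows themselves are
            row_hash: md5(sepal_length + sepal_width + petal_length + petal_width + species)
```
Whether that is unique enough depends on your data, though.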
a
Thank you both. I might be missing something, but I do not see any way to create a new, unique value for each row with the mapper's built-in functions (md5, datetime, random, float, int, str). I would need something like `uuid.uuid4()` or an infinite int generator.
I wish I had found a way to do this with Meltano, but this awk command works just fine:
```
awk '{printf "%s,%s\n", NR==1 ? "id" : NR-1, $0}' iris.csv > iris_with_id.csv
```
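That prepends an `id` header to the first line and a 1-based row number to every data row. A sketch of how the output could then be fed back into the pipeline (entity name and path are just this example's):
```yaml
# meltano.yml sketch: point tap-csv at the awk output and use the new column as the key
plugins:
  extractors:
    - name: tap-csv
      config:
        files:
          - entity: iris
            path: ./iris_with_id.csv
            keys: [id]
```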
u
@alex_b I'm curious; the recommended way to do this would be as @mert_bakir and @visch suggested, to first load your raw data and then manipulate it. Why do you want to manipulate it before loading it?
a
The `tap_csv` extractor requires passing the `keys` parameter for each file. If I use the `target-postgres` sink to complete the EL pipeline, this parameter is then used as the primary key for the table. This is an issue since this example's values are not unique, so I cannot load the raw data without losing rows. I tried the mapper approach, but I could not figure out how to generate a unique value per row with the built-in functions. I also thought I could use one of the metadata columns by overriding `__key_properties__` in the sink config, but `_sdc_extracted_at` and the others are not unique either. I also tried to define the table schema before running the EL pipeline with an autoincrement id column, but then the loader would try to insert null values into it. At this point, I concluded I would be better off using a bash script.
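For reference, this is roughly the files config I mean; the keys have to name existing columns, and no combination of them is guaranteed unique in this dataset:
```yaml
# meltano.yml sketch of the problematic config: every listed key column already
# exists in the file, but duplicate rows make the combination non-unique
plugins:
  extractors:
    - name: tap-csv
      config:
        files:
          - entity: iris
            path: ./iris.csv
            keys: [sepal_length, sepal_width, petal_length, petal_width, species]
```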