# getting-started
a
Hi, I am trying to figure out what to do with file taps. How does one add a new id column (autoincrement) if the base file has no key column? For instance, using the `iris` dataset as a CSV file:
```
sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,setosa
4.9,3,1.4,0.2,setosa
4.7,3.2,1.3,0.2,setosa
4.6,3.1,1.5,0.2,setosa
```
I have been tinkering with the `tap-csv` extractor to no avail. Should I define my own custom extractor? Many thanks!
m
I think you should do it after loading the data. tap -> target -> transform
v
I agree with mert, you could also look at mappers @alex_b https://sdk.meltano.com/en/latest/stream_maps.html
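Untested, but with an SDK-based tap like the MeltanoLabs `tap-csv` you can put `stream_maps` straight into its config, something along these lines (assumes the file is configured with entity `iris`):
```yaml
# meltano.yml sketch (untested): derive an extra column via a stream map
plugins:
  extractors:
    - name: tap-csv
      config:
        stream_maps:
          iris:
            # md5 over the concatenated row values; only unique if the rows themselves are
            row_hash: md5(sepal_length + sepal_width + petal_length + petal_width + species)
```
Whether that is unique enough depends on your data, though.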
a
Thank you both. I might be missing something, but I do not see any way to create a new, unique value for each row with the mapper's built-in functions (md5, datetime, random, float, int, str). I would need something like `uuid.uuid4()` or an infinite int generator.
I wish I had found a way to do this with Meltano, but this awk command works just fine:
```
awk '{printf "%s,%s\n", NR==1 ? "id" : NR-1, $0}' iris.csv > iris_with_id.csv
```
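That prepends an `id` header to the first line and a 1-based row number to every data row. A sketch of how the output could then be fed back into the pipeline (entity name and path are just this example's):
```yaml
# meltano.yml sketch: point tap-csv at the awk output and use the new column as the key
plugins:
  extractors:
    - name: tap-csv
      config:
        files:
          - entity: iris
            path: ./iris_with_id.csv
            keys: [id]
```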
u
@alex_b I'm curious; the recommended way to do this would be as @mert_bakir and @visch suggested, to first load your raw data and then manipulate it. Why do you want to manipulate it before loading it?
a
The `tap_csv` extractor requires passing the `keys` parameter for each file. If I use the `target-postgres` sink to complete the EL pipeline, this parameter is then used as the primary key for the table. This is an issue since this example's values are not unique, so I cannot load the raw data without losing rows. I tried the mapper approach, but I could not figure out how to generate a unique value per row with the built-in functions. I also thought I could use one of the metadata columns by overriding `__key_properties__` in the sink config, but `_sdc_extracted_at` and the others are not unique either. I also tried to define the table schema before running the EL pipeline with an autoincrement id column, but then the loader would try to insert null values into it. At this point, I concluded I would be better off using a bash script.
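For reference, this is roughly the files config I mean; the keys have to name existing columns, and no combination of them is guaranteed unique in this dataset:
```yaml
# meltano.yml sketch of the problematic config: every listed key column already
# exists in the file, but duplicate rows make the combination non-unique
plugins:
  extractors:
    - name: tap-csv
      config:
        files:
          - entity: iris
            path: ./iris.csv
            keys: [sepal_length, sepal_width, petal_length, petal_width, species]
```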