Alex Maras
02/15/2024, 9:11 AM
I'm using the airbyte/source-azure-blob-storage image to clone data from Azure Blob Storage to S3. This may not be a particularly intended use case, but if it helps us keep our data pipeline orchestration and auth centralised, it'd be pretty handy.
I'm successfully pulling data out of Azure Blob using the Airbyte wrapper. At the moment it's only got a single stream, with file type jsonl. With target-jsonl, it outputs all data into a single file with one line per file from the source system, each of which has an _ab_source_file_url field. I'd like to use that field to specify an output filename with target-s3, so that the one stream can be split and essentially mirror the storage from Azure Blob to S3.
Can anyone point me in the right direction? I imagine stream maps and flattening might have some bearing, but I'm struggling to map the relationship between the jsonl output from the Airbyte connector and the target-s3 output.

Edgar Ramírez (Arch.dev)
02/15/2024, 7:45 PM
target-s3 only seems to use the stream name (https://github.com/crowemi/target-s3/blob/24d451a9c1b38910b247fdaf478960f5a8084b27/target_s3/formats/format_base.py#L109) and the batch timestamp (https://github.com/crowemi/target-s3/blob/24d451a9c1b38910b247fdaf478960f5a8084b27/target_s3/formats/format_base.py#L121) to determine the file path.
That means you'd need to map _ab_source_file_url to the stream name, which is not a use case currently supported by stream maps. The good news is you can write your own mapper script that does exactly what you need. See https://github.com/edgarrmondragon/singer-playground/blob/main/merge_streams/map.py for an example of a very simple mapper.
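A rough sketch of what such a standalone mapper could look like is below. It assumes the Airbyte-provided _ab_source_file_url field, and the rule for turning a URL into a stream name is only a placeholder, not anything prescribed by target-s3 or the linked example:

```python
#!/usr/bin/env python
# Sketch of a standalone Singer mapper: read messages on stdin, rewrite the
# stream name of each RECORD from its _ab_source_file_url, and emit a SCHEMA
# for every derived stream so the target treats them as separate streams.
import json
import sys


def main() -> None:
    base_schema = None   # the SCHEMA message of the original (single) stream
    seen = set()         # derived stream names whose SCHEMA was already emitted

    for line in sys.stdin:
        if not line.strip():
            continue
        message = json.loads(line)

        if message.get("type") == "SCHEMA":
            base_schema = message          # hold it; re-emit per derived stream
            continue

        if message.get("type") == "RECORD":
            url = str(message["record"].get("_ab_source_file_url", "unknown"))
            # Placeholder naming rule: flatten the URL path into a stream name.
            new_stream = url.strip("/").replace("/", "__")

            if new_stream not in seen and base_schema is not None:
                print(json.dumps(dict(base_schema, stream=new_stream)))
                seen.add(new_stream)

            print(json.dumps(dict(message, stream=new_stream)))
            continue

        print(json.dumps(message))         # pass STATE etc. through unchanged


if __name__ == "__main__":
    main()
```

Run in the middle of the pipeline (tap | mapper | target), this would make the target see one stream per source file; how well that maps to the output paths still depends on the target's own file-naming logic.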
Alex Maras
02/15/2024, 11:52 PM

Edgar Ramírez (Arch.dev)
02/15/2024, 11:57 PM

Alex Maras
02/16/2024, 7:24 AM
target-s3 one too, so that I could force it to use the stream name directly instead of trying to append .json and always gzip stuff.
Thanks again for the help! If I manage to clean up the mapping plugin to be generic enough to work - i.e. so that you could split a table by a field within that table into multiple streams - then I'll look at putting it up as a public plugin. It should be relatively easy; I just need to handle multiple streams initially, as I'm just dealing with a single stream here and my config won't account for multiple streams with multiple schemas.
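A hedged sketch of that generalisation, assuming a hypothetical SPLIT_FIELDS config that maps each incoming stream to the field it should be split on (names and naming rule are illustrative only):

```python
# Generalised variant of the mapper above: remember one SCHEMA per incoming
# stream and split each configured stream on its own field, so several streams
# with different schemas can pass through the same mapper.
import json
import sys

# Assumed config: original stream name -> field to split that stream on.
SPLIT_FIELDS = {"azure_blobs": "_ab_source_file_url"}

base_schemas = {}   # original stream name -> its SCHEMA message
emitted = set()     # derived stream names whose SCHEMA was already sent

for line in sys.stdin:
    if not line.strip():
        continue
    message = json.loads(line)

    if message.get("type") == "SCHEMA":
        base_schemas[message["stream"]] = message
        if message["stream"] not in SPLIT_FIELDS:
            print(json.dumps(message))     # unsplit streams pass through as-is
        continue

    if message.get("type") == "RECORD" and message["stream"] in SPLIT_FIELDS:
        field = SPLIT_FIELDS[message["stream"]]
        value = str(message["record"].get(field, "unknown"))
        new_stream = value.strip("/").replace("/", "__")   # placeholder rule
        if new_stream not in emitted:
            print(json.dumps(dict(base_schemas[message["stream"]], stream=new_stream)))
            emitted.add(new_stream)
        print(json.dumps(dict(message, stream=new_stream)))
        continue

    print(json.dumps(message))             # STATE and unsplit RECORDs pass through
```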