Alex Maras
02/15/2024, 9:11 AM
I'm using the airbyte/source-azure-blob-storage image to clone data from Azure Blob Storage to S3. This may not be a particularly intended use case, but if it helps us keep our data pipeline orchestration and auth centralised, it'd be pretty handy.
I'm successfully pulling data out of Azure Blob using the Airbyte wrapper. At the moment it's only got a single stream, with file type jsonl. With target-jsonl, it outputs all data into a single file with one line per file from the source system, each of which has an _ab_source_file_url field. I'd like to use that field to specify an output filename with target-s3, so that the one stream can be split and essentially mirror the storage from Azure Blob to S3.
Can anyone point me in the right direction? I imagine stream maps and flattening might have some bearing, but I'm struggling to map the relationship between the jsonl output from the Airbyte connector and the target-s3 output.

Edgar Ramírez (Arch.dev)
02/15/2024, 7:45 PM
target-s3 only seems to use the stream name (https://github.com/crowemi/target-s3/blob/24d451a9c1b38910b247fdaf478960f5a8084b27/target_s3/formats/format_base.py#L109) and the batch timestamp (https://github.com/crowemi/target-s3/blob/24d451a9c1b38910b247fdaf478960f5a8084b27/target_s3/formats/format_base.py#L121) to determine the file path.
That means you'd need to map _ab_source_file_url to the stream name, which is not a use case currently supported by stream maps. The good news is you can write your own mapper script that does exactly what you need. See https://github.com/edgarrmondragon/singer-playground/blob/main/merge_streams/map.py for an example of a very simple mapper.
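A rough sketch of what such a standalone mapper could look like is below. It assumes the Airbyte-provided _ab_source_file_url field, and the rule for turning a URL into a stream name is only a placeholder, not anything prescribed by target-s3 or the linked example:

```python
#!/usr/bin/env python
# Sketch of a standalone Singer mapper: read messages on stdin, rewrite the
# stream name of each RECORD from its _ab_source_file_url, and emit a SCHEMA
# for every derived stream so the target treats them as separate streams.
import json
import sys


def main() -> None:
    base_schema = None   # the SCHEMA message of the original (single) stream
    seen = set()         # derived stream names whose SCHEMA was already emitted

    for line in sys.stdin:
        if not line.strip():
            continue
        message = json.loads(line)

        if message.get("type") == "SCHEMA":
            base_schema = message          # hold it; re-emit per derived stream
            continue

        if message.get("type") == "RECORD":
            url = str(message["record"].get("_ab_source_file_url", "unknown"))
            # Placeholder naming rule: flatten the URL path into a stream name.
            new_stream = url.strip("/").replace("/", "__")

            if new_stream not in seen and base_schema is not None:
                print(json.dumps(dict(base_schema, stream=new_stream)))
                seen.add(new_stream)

            print(json.dumps(dict(message, stream=new_stream)))
            continue

        print(json.dumps(message))         # pass STATE etc. through unchanged


if __name__ == "__main__":
    main()
```

Run in the middle of the pipeline (tap | mapper | target), this would make the target see one stream per source file; how well that maps to the output paths still depends on the target's own file-naming logic.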
Alex Maras
02/15/2024, 11:52 PM

Edgar Ramírez (Arch.dev)
02/15/2024, 11:57 PM

Alex Maras
02/16/2024, 7:24 AM
target-s3 one too, so that I could force it to use the stream name directly instead of trying to append .json and always gzip stuff.
Thanks again for the help! If I manage to clean up the mapping plugin to be generic enough to work - i.e. so that you could split a table by a field within that table into multiple streams - then I'll look at putting it up as a public plugin. It should be relatively easy; I just need to handle multiple streams initially, as I'm just dealing with a single stream here and my config won't account for multiple streams with multiple schemas.
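A hedged sketch of that generalisation, assuming a hypothetical SPLIT_FIELDS config that maps each incoming stream to the field it should be split on (names and naming rule are illustrative only):

```python
# Generalised variant of the mapper above: remember one SCHEMA per incoming
# stream and split each configured stream on its own field, so several streams
# with different schemas can pass through the same mapper.
import json
import sys

# Assumed config: original stream name -> field to split that stream on.
SPLIT_FIELDS = {"azure_blobs": "_ab_source_file_url"}

base_schemas = {}   # original stream name -> its SCHEMA message
emitted = set()     # derived stream names whose SCHEMA was already sent

for line in sys.stdin:
    if not line.strip():
        continue
    message = json.loads(line)

    if message.get("type") == "SCHEMA":
        base_schemas[message["stream"]] = message
        if message["stream"] not in SPLIT_FIELDS:
            print(json.dumps(message))     # unsplit streams pass through as-is
        continue

    if message.get("type") == "RECORD" and message["stream"] in SPLIT_FIELDS:
        field = SPLIT_FIELDS[message["stream"]]
        value = str(message["record"].get(field, "unknown"))
        new_stream = value.strip("/").replace("/", "__")   # placeholder rule
        if new_stream not in emitted:
            print(json.dumps(dict(base_schemas[message["stream"]], stream=new_stream)))
            emitted.add(new_stream)
        print(json.dumps(dict(message, stream=new_stream)))
        continue

    print(json.dumps(message))             # STATE and unsplit RECORDs pass through
```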