# getting-started
t
I'm running Meltano inside Dagster. For a CSV file with 99k rows to be loaded into Iceberg, it's taking a long time. I'm using `tap-csv` and a custom loader, `target-iceberg`. Is there any way to enable parallel processing to speed up the integration process? I read in some thread that loaders that deal with large volumes of data usually slow down the process. It was also suggested to change the batch size, but I don't know how I can change the batch size in my custom target. Currently, `max_size` is 10000. Will it help if I increase it?
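For context, a minimal sketch of how `max_size` and `process_batch` typically fit together in a singer-sdk `BatchSink`; the class name and `batch_size` config key below are illustrative, not target-iceberg's actual code:

```python
# Illustrative only: assumes the target's sink follows the standard
# singer-sdk BatchSink pattern; names are hypothetical.
from singer_sdk.sinks import BatchSink


class IcebergSink(BatchSink):
    """Buffers records per stream and writes them to Iceberg in batches."""

    @property
    def max_size(self) -> int:
        # Number of buffered records that triggers a flush via process_batch().
        # Reading it from config makes it tunable per run; the SDK default is 10000.
        return int(self.config.get("batch_size", 10_000))

    def process_batch(self, context: dict) -> None:
        # BatchSink's default process_record() accumulates rows in context["records"];
        # write the whole buffer in one call rather than one write per record.
        records = context["records"]
        self.logger.info("Flushing %d records to Iceberg", len(records))
        # ... append `records` to the Iceberg table here ...
```

Raising `max_size` means fewer, larger flushes, so it mostly helps when the overhead is per batch rather than per record.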
e
If you bump your custom target to singer-sdk 0.36.0, you should get a new `batch_size_rows` setting for free.
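Once the target is on that SDK version, the setting could then be tuned from Meltano config; a hypothetical `meltano.yml` excerpt (the value is just an example):

```yaml
plugins:
  loaders:
    - name: target-iceberg
      config:
        batch_size_rows: 50000  # example value; tune for your workload
```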
t
Hi @Edgar Ramírez (Arch.dev), I have to somehow lock my version of Meltano to an older one, unfortunately. Apart from that, I just realized that changing the batch size will not actually make much of a difference, because no matter how many records I write per batch, I'm seeing a delay on each record. Maybe enabling parallel processing will help. Does Meltano support parallel processing for `process_batch`? Just curious.
e
The SDK already drains records from streams in parallel: https://github.com/meltano/sdk/blob/6070c58c1393a76923bf9e59c475bd66f259cf14/singer_sdk/target_base.py#L515-L525. I'm also happy to fix a bug or review a PR if we're missing something there.
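Roughly, the linked code drains each stream's sink concurrently; a simplified sketch of the idea (not the SDK's actual implementation):

```python
# Conceptual sketch only; the real logic lives in singer_sdk/target_base.py (link above).
from concurrent.futures import ThreadPoolExecutor


def drain_all(sinks, drain_one, parallelism: int = 8) -> None:
    """Flush every stream's sink, several at a time.

    `sinks` holds one sink per incoming stream; `drain_one` is whatever
    flushes a single sink's buffered records (the SDK has its own helper).
    """
    if parallelism <= 1:
        for sink in sinks:
            drain_one(sink)
        return
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        list(pool.map(drain_one, sinks))
```

In this sketch the parallelism is across sinks (one per stream), so a single input stream would still flush its batches sequentially.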
t
Great, thanks! I'll log the time taken to process each record locally and check a few other things. I'll give you an update if the increased time is caused by Meltano.
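One way that timing could be captured in the custom sink (again hypothetical; `IcebergSink` stands in for whatever target-iceberg's sink class is actually called):

```python
import time

from singer_sdk.sinks import BatchSink


class IcebergSink(BatchSink):
    """Hypothetical sink instrumented to see where the per-record delay comes from."""

    def process_record(self, record: dict, context: dict) -> None:
        start = time.perf_counter()
        super().process_record(record, context)  # default implementation just buffers the row
        self.logger.debug("process_record: %.6fs", time.perf_counter() - start)

    def process_batch(self, context: dict) -> None:
        start = time.perf_counter()
        # ... the actual Iceberg write for context["records"] goes here ...
        self.logger.info(
            "process_batch: %d records in %.3fs",
            len(context.get("records", [])),
            time.perf_counter() - start,
        )
```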
e
Thanks!