# contributing
j
This has been on my mind for a long time: the performance of tapping/loading huge datasets is not great. If the source of the tap is fast enough (e.g. a database with fast export) and the target can load huge files quickly (e.g. Snowflake/Vertica), then, if I understand it correctly, the bottleneck is in the middle, in Singer itself. I am wondering whether it would be technically feasible to implement an alternative backend for Singer using technologies like Arrow (Flight)? Any thoughts?
p
@jan_soubusta there's already a Singer concept called BATCH messages (or "fast sync") that lets you skip most of the slow stdout piping: the tap writes directly to a file backend like S3 or even local files, and the target then receives a small number of BATCH messages over stdout with pointers to those files. This allows the pipeline to take advantage of the very efficient import/export features many platforms provide, for example Snowflake's COPY command. The feature is relatively new to Meltano, though, so it's not yet supported by a lot of connectors. Hopefully soon!
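For context, a BATCH message is just a small JSON envelope pointing at the pre-written files. A rough sketch of what a tap emits (the stream name and manifest paths here are made up):

```python
import json
import sys

# Rough sketch of a tap emitting a Singer BATCH message (shape per the
# Meltano SDK batch spec); stream name and manifest paths are hypothetical.
batch_message = {
    "type": "BATCH",
    "stream": "users",
    "encoding": {"format": "jsonl", "compression": "gzip"},
    "manifest": [
        "s3://my-bucket/users/batch-0001.jsonl.gz",
        "s3://my-bucket/users/batch-0002.jsonl.gz",
    ],
}
# One small message over stdout instead of millions of RECORD messages.
sys.stdout.write(json.dumps(batch_message) + "\n")
```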
There was also a conversation in the last office hours about Apache Arrow as an interchange format:

https://youtu.be/JlZpdUUsquA?t=155

with discussion in https://github.com/meltano/sdk/issues/1684
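
To illustrate the Arrow idea (a minimal sketch, not anything the SDK supports today): a tap could stream columnar record batches over stdout using the Arrow IPC streaming format instead of one JSON document per record, and a target could read them straight back:

```python
import sys
import pyarrow as pa

# Minimal sketch of the Arrow idea: stream columnar record batches over
# stdout via Arrow IPC. The schema and data here are made up; nothing
# like this exists in the Singer SDK yet.
schema = pa.schema([("id", pa.int64()), ("name", pa.string())])
batch = pa.record_batch(
    [pa.array([1, 2, 3]), pa.array(["a", "b", "c"])], schema=schema
)

with pa.ipc.new_stream(sys.stdout.buffer, schema) as writer:
    writer.write_batch(batch)

# A target process would consume the same stream with:
#   reader = pa.ipc.open_stream(sys.stdin.buffer)
#   for batch in reader:
#       ...
```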
p
It'd be great to have a faster interchange format. I know Meltano is experimenting with the new BATCH messages feature, but in my testing batch messages merely cut the load time in half, which is still quite slow for large datasets. The root problem seems to be the use of JSON as the interchange format.
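
A quick (unscientific) way to see the JSON overhead yourself: a hypothetical micro-benchmark comparing per-record JSON serialization against a single Arrow IPC stream for the same made-up rows. Exact numbers will vary by machine and data shape:

```python
import json
import time
import pyarrow as pa

# Hypothetical micro-benchmark (made-up data): serialize the same rows as
# newline-delimited JSON vs. a single Arrow IPC stream.
rows = [{"id": i, "name": f"user_{i}"} for i in range(1_000_000)]

t0 = time.perf_counter()
json_blob = ("\n".join(json.dumps(r) for r in rows)).encode()
t1 = time.perf_counter()

table = pa.Table.from_pylist(rows)
sink = pa.BufferOutputStream()
with pa.ipc.new_stream(sink, table.schema) as writer:
    writer.write_table(table)
arrow_blob = sink.getvalue()
t2 = time.perf_counter()

print(f"JSON:  {t1 - t0:.2f}s, {len(json_blob) / 1e6:.1f} MB")
print(f"Arrow: {t2 - t1:.2f}s, {arrow_blob.size / 1e6:.1f} MB")
```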