Hi All, I have a query regarding the processing ti...
# troubleshooting
s
Hi All, I have a question about the processing time of Meltano versus a custom Python script I wrote that does the same job. I built a custom tap to read Splunk Windows logs, and for the target I'm using target-parquet (available out of the box on Meltano Hub). The job simply parses the logs and converts them to Parquet.

Here are the timing stats (each is an average of 3 runs):

For 1 GB of Splunk Windows logs:
Meltano pipeline (without batch capability): 8 min 7 secs
Meltano pipeline (with batch capability): 3 min 52 secs
My custom Python script: 1 min 40 secs

For 5 GB of Splunk Windows logs:
Meltano pipeline (without batch capability): 40 min 55 secs
Meltano pipeline (with batch capability): 15 min 48 secs
My custom Python script: 8 min 25 secs

Now, I have some questions.
1) Is the time required by Meltano within the expected range?
2) Is there any way I can further decrease the processing time for Meltano? I only found one article on this, 6X YOUR SPEED USING BATCHING (https://meltano.com/blog/6x-more-speed-for-your-data-pipelines-with-batch-messages/). Although it helped me reduce the processing time as shown in the stats above, I still need better performance for my use case.
3) Is there any parameter, config, etc. in Meltano that can help me boost performance?
4) Why is my custom Python script so much faster than Meltano? Where is Meltano spending its time, and can that be changed?

Thanks in advance!
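To put the reported timings on a common scale, here is a small sketch that converts them to seconds and computes throughput and the slowdown relative to the custom script. The durations are the ones quoted in the post; the assumption that 1 GB = 1024 MB and the helper names are illustrative only.

```python
# Timings quoted in the post, as (minutes, seconds), keyed by input size in GB.
TIMINGS = {
    1: {"meltano_no_batch": (8, 7), "meltano_batch": (3, 52), "custom_script": (1, 40)},
    5: {"meltano_no_batch": (40, 55), "meltano_batch": (15, 48), "custom_script": (8, 25)},
}

MB_PER_GB = 1024  # assumption: the post does not say whether GB means 1000 or 1024 MB


def to_seconds(minutes: int, seconds: int) -> int:
    """Convert a (minutes, seconds) duration to total seconds."""
    return minutes * 60 + seconds


def summarize(size_gb: int) -> dict:
    """Return seconds, MB/s, and slowdown versus the custom script for each run type."""
    runs = {name: to_seconds(*t) for name, t in TIMINGS[size_gb].items()}
    baseline = runs["custom_script"]
    return {
        name: {
            "seconds": secs,
            "mb_per_s": size_gb * MB_PER_GB / secs,
            "slowdown_vs_script": secs / baseline,
        }
        for name, secs in runs.items()
    }


for size in TIMINGS:
    print(f"{size} GB")
    for name, stats in summarize(size).items():
        print(
            f"  {name}: {stats['seconds']} s, "
            f"{stats['mb_per_s']:.2f} MB/s, "
            f"{stats['slowdown_vs_script']:.2f}x the script's time"
        )
```

Comparing the slowdown factors at 1 GB and 5 GB shows whether the gap to the script stays roughly constant or changes with input size.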
v
Here's a nice writeup on how to profile the code to analyze what's taking so long: https://sdk.meltano.com/en/latest/dev_guide.html#testing-performance. It being that much slower even with batch is surprising to me, but we'd have to dive into the differences in implementation.
Very curious about your results!
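As a rough illustration of the profiling approach, here is a minimal sketch using Python's standard `cProfile` and `pstats` modules. The functions `parse_line` and `emit_records` are hypothetical stand-ins for a tap's parsing and record emission, not code from the actual tap or the SDK.

```python
import cProfile
import io
import json
import pstats


def parse_line(line: str) -> dict:
    """Hypothetical stand-in for the tap's per-record log parsing."""
    key, _, value = line.partition("=")
    return {"key": key, "value": value}


def emit_records(lines: list) -> list:
    """Parse each line and serialize it as one JSON message per record."""
    return [
        json.dumps({"type": "RECORD", "stream": "logs", "record": parse_line(line)})
        for line in lines
    ]


def profile_top(func, *args, limit: int = 20) -> str:
    """Run func under cProfile and return the top entries by cumulative time."""
    profiler = cProfile.Profile()
    profiler.enable()
    func(*args)
    profiler.disable()
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(limit)
    return buffer.getvalue()


sample = [f"EventCode={i}" for i in range(50_000)]
report = profile_top(emit_records, sample)
print(report)
```

The functions with the largest cumulative time in the report are the first places to look. The same modules can also be applied to a whole process with `python -m cProfile -o <output file> <script>`.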
a
Which version of meltano are you running?
meltano --version
e
I'm curious what the numbers would be when running the tap and target outside of Meltano. My guess is there's very little overhead these days, but it'd be interesting to know if we're leaving some performance on the table. That said, it's hard to compare without looking at your code, but Meltano/Singer relies heavily on JSON serialization and deserialization (SerDe), which can become a bottleneck. There's also the fact that the SDK tries to be smart and clean up every record before the tap emits it, so that could be another bottleneck.
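To get a feel for the JSON SerDe cost mentioned above, here is a small sketch that compares handing records over in-process with round-tripping each one through a JSON line, which is roughly what happens between a Singer tap and target. The record shape and function names are made up for illustration, and the numbers will vary by machine and record size.

```python
import json
import time


def make_records(n: int) -> list:
    """Build n synthetic log records (hypothetical shape, not real Splunk data)."""
    return [
        {
            "id": i,
            "EventCode": 4624,
            "host": f"host-{i % 10}",
            "message": "An account was successfully logged on",
        }
        for i in range(n)
    ]


def direct_handoff(records: list) -> list:
    """In-process handoff: copy each dict, no serialization."""
    return [dict(record) for record in records]


def json_round_trip(records: list) -> list:
    """Serialize each record to a JSON line and parse it back, one message per record."""
    out = []
    for record in records:
        line = json.dumps({"type": "RECORD", "stream": "logs", "record": record})
        out.append(json.loads(line)["record"])
    return out


def timed(func, records: list):
    """Return (result, elapsed seconds) for one call of func."""
    start = time.perf_counter()
    result = func(records)
    return result, time.perf_counter() - start


records = make_records(100_000)
direct_out, direct_secs = timed(direct_handoff, records)
json_out, json_secs = timed(json_round_trip, records)

print(f"direct handoff:  {direct_secs:.3f} s")
print(f"JSON round trip: {json_secs:.3f} s")
```

A custom script that parses logs and writes Parquet in one process skips this per-record round trip entirely, which is one plausible source of the gap; profiling the real tap and target is the way to confirm it.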
s
I am using version 3.3.2.