Suggestions on improvement for memory performance regarding Regex matching #2091
Comments
Great findings @asdfghjkxd! Perhaps you could create a separate issue for investigating
@gok99 Thank you! I will work with the team to create another issue to investigate Regarding the link, it should be fixed. Apologies if it didn't work in the first place; the draft PR had not been created when this issue was opened, so I couldn't link to it initially!
Improve memory usage by refactoring Regex compilation
Currently, Regex checking is used in conjunction with iteration. This pattern of coding is discouraged because the Regex pattern is recompiled on every iteration, causing the program to run slower and consume more memory.
By moving the Regex pattern compilation outside of the iteration, and by using `Matcher` objects to check if the strings match the Regex pattern, we can potentially remove this performance bottleneck.
Let's refactor the code and remove such instances of Regex use in iterative loops.
* Enhance existing Regex code
* Consolidate typical Regex patterns
---------
Co-authored-by: Charisma Kausar
Co-authored-by: Gokul Rajiv
Co-authored-by: Marcus Tang
What feature(s) would you like to see in RepoSense
Currently, some areas of RepoSense use Regex checks within iterations.
For example, consider `StringsUtil::filterText`.
As observed, the method `String::matches` is called as many times as there are lines in the input text, which is not performant because `String::matches` creates and compiles a new Regex pattern every time it is called. This behaviour is evident in the source code for the `String::matches` method, which makes a call to `Pattern.matches(regex, string)`.
The documentation for the `Pattern` class, as well as some external sources, recommends avoiding this pattern of code, since it causes many unnecessary Regex compilations, which take more time and consume more memory.
After implementing some fixes for this behaviour, and from some preliminary profiling using IntelliJ's built-in Java Profiler, I have observed that the improvements lead to a small decrease in runtime and a significant decrease in memory usage when Regex patterns are precompiled before being used in a `for` loop.
Here are some of my findings:
Command used to test
Runtime
Some key points to note:
Memory Consumption
Some key points to note:
`StreamGobbler` consumes a significant chunk of memory; perhaps we could also look into reducing the memory usage of that class?
Limitations
Is the feature request related to a problem?
This feature is not tied to one particular problem, but it is linked to the overall goal of improving RepoSense's runtime performance and reducing its memory usage.
If possible, describe the solution
The solution involves pre-compiling the offending Regex pattern into a `Pattern` object, and using a `Matcher` object created from that `Pattern` to test if the string matches the Regex pattern.
The fixes are found in a draft PR here.
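To illustrate the idea, here is a minimal sketch contrasting the two approaches. It is not the code from the draft PR: the class `RegexFilterSketch` and both method names are hypothetical, and the filtering logic is a simplified stand-in for a line-by-line filter such as `StringsUtil::filterText`, whose actual implementation may differ.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; names and logic are assumptions, not RepoSense code.
final class RegexFilterSketch {

    /** Naive variant: String::matches compiles the regex again for every line. */
    static String filterTextNaive(String text, String regex) {
        StringBuilder sb = new StringBuilder();
        for (String line : text.split("\n")) {
            if (line.matches(regex)) {
                sb.append(line).append("\n");
            }
        }
        return sb.toString();
    }

    /** Precompiled variant: compile once, then reuse a single Matcher. */
    static String filterTextPrecompiled(String text, String regex) {
        Pattern pattern = Pattern.compile(regex); // compiled once, outside the loop
        Matcher matcher = pattern.matcher("");    // reused for every line via reset()
        StringBuilder sb = new StringBuilder();
        for (String line : text.split("\n")) {
            if (matcher.reset(line).matches()) {
                sb.append(line).append("\n");
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "commit abc\nauthor: x\ncommit def\n";
        String regex = "commit .*";
        // Both variants keep the same lines; only the compilation cost differs.
        System.out.println(
                filterTextNaive(text, regex).equals(filterTextPrecompiled(text, regex)));
    }
}
```

Reusing one `Matcher` through `reset` is optional; calling `pattern.matcher(line)` inside the loop also avoids recompiling the pattern, at the cost of allocating a new `Matcher` per line. Note that a `Matcher` is not thread-safe, so the reuse shown here assumes single-threaded access.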
If applicable, describe alternatives you've considered
No other viable alternatives have been explored or tried at this time. It seems on preliminary inspection that these Regex operations are crucial and cannot be removed or replaced with other built-in String methods.
Additional context
N/A