Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Suggestions on improvement for memory performance regarding Regex matching #2091

Closed
georgetayqy opened this issue Jan 22, 2024 · 2 comments · Fixed by #2092
Closed

Suggestions on improvement for memory performance regarding Regex matching #2091

georgetayqy opened this issue Jan 22, 2024 · 2 comments · Fixed by #2092

Comments

@georgetayqy
Copy link
Contributor

georgetayqy commented Jan 22, 2024

What feature(s) would you like to see in RepoSense

Currently, some areas of RepoSense use Regex checks within iterations.

For example, consider StringsUtil::filterText.

public static String filterText(String text, String regex) {
    String[] split = text.split("\n");
    StringBuilder sb = new StringBuilder();
    for (String line: split) {
        if (line.matches(regex)) {
            sb.append(line + "\n");
        }
    }

    return sb.toString();
}

As observed, the method String::matches is called for as many times as there are lines in the input text, which is not performant as String::matches creates and compiles a new Regex pattern every time it is called. This behaviour is evident in the source code for the String::matches method, which makes a call to Pattern.matches(regex, string).

It is recommended by the documentation for the Pattern class as well as some external sources to avoid such a pattern of code since it causes many unnecessary Regex compilations, which takes more time and consumes more memory.

After implementing some fixes for this behaviour, and from some preliminary profiling using Intellij's built-in Java Profiler, I have observed that the improvements lead to a small decrease in runtime and a significant decrease in memory usage when Regex patterns are precompiled first before being used in a for loop.

Here are some of my findings:

Command used to test

./gradlew run -Dargs="--repos https://github.com/reposense/RepoSense.git https://github.com/CATcher-org/CATcher.git https://github.com/MarkBind/markbind.git --output ./report_folder --since 31/1/2017 --formats java adoc xml --view --ignore-standalone-config --last-modified-date --timezone UTC+08 --find-previous-authors"

Runtime

Before Improvement After Improvement
Screenshot 2024-01-22 at 10 00 30 PM Screenshot 2024-01-22 at 10 00 21 PM

Some key points to note:

  • The runtime being inspected is the CPU time (time that the code is executed, ignoring gradle builds)
  • 11.69% decrease in runtime

Memory Consumption

Before Improvement After Improvement
Screenshot 2024-01-22 at 10 07 12 PM Screenshot 2024-01-22 at 10 07 01 PM

Some key points to note:

  • 43.32% (!) decrease in memory usage
  • It also appears that StreamGobbler consumes a significant chunk of memory, perhaps we could also look into reducing the memory usage of that class?

Limitations

  • These preliminary profiling tasks are done on a M1 Mac with MacOS. These performance values may vary across different OS and processors
  • Only 3 profiling samples were taken, more samples might have to be taken, across multiple platforms and processors, to decide if these enhancements have a positive impact on the performance of RepoSense

Is the feature request related to a problem?

This feature is not exactly related to a particular problem in general, but it is linked to the overall goal of increasing RepoSense's performance and memory usage.

If possible, describe the solution

The solution involves the pre-compilation of the offending Regex pattern into a Pattern object, and using a Matcher object created from Pattern to test if the string matches the Regex pattern.

The fixes are found in a draft PR here.

If applicable, describe alternatives you've considered

No other viable alternatives have been explored or tried at this time. It seems on preliminary inspection that these Regex operations are crucial and cannot be removed or replaced with other built-in String methods.

Additional context

N/A

@gok99
Copy link
Contributor

gok99 commented Jan 22, 2024

Great findings @asdfghjkxd! Perhaps you could create a separate issue for investigating StreamGobbler.

@georgetayqy
Copy link
Contributor Author

@gok99 Thank you! I will work with the team to create another issue to investigate StreamGobbler and find out why it consumes a significant amount of memory.

Regarding the link, it should be fixed. Apologies if it didn't work in the first place, the draft PR was not created before this issue and hence I couldn't link to it initially!

ckcherry23 pushed a commit that referenced this issue Jan 26, 2024
Improve memory usage by refactoring Regex compilation

Currently, Regex checking is used in conjunction with iteration. This
pattern of coding is frowned upon due to the excessive Regex pattern
compilation, causing the program to run slower and consume more
memory.

By moving the Regex pattern compilation outside of the iteration, and
by using `Matcher` objects to check if the strings match the Regex
performance, we can potentially remove this performance bottleneck.

Let's move to refactor the code and remove such instances of Regex use
in iterative loops.
@github-project-automation github-project-automation bot moved this to Closed/Completed in RepoSense Roadmap Jan 26, 2024
MarcusTXK added a commit that referenced this issue Mar 26, 2024
* Enhance existing Regex code

* Consolidate typical Regex patterns

---------

Co-authored-by: Charisma Kausar <[email protected]>
Co-authored-by: Gokul Rajiv <[email protected]>
Co-authored-by: Marcus Tang <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: Closed/Completed
Development

Successfully merging a pull request may close this issue.

2 participants