What feature(s) would you like to see in RepoSense
As detailed in issue #2091, the StreamGobbler class consumes a large amount of memory when in use, upwards of 500 MB per run.
After some digging through the codebase, and through the source code of the String and StringBuilder classes, it appears that there may be some performance bottlenecks in the way the code is currently written.
Currently, the code is implemented as follows:
```java
// buffer is an 8 KB ByteBuffer; sb is the StringBuilder accumulating the output
ReadableByteChannel ch = Channels.newChannel(is);
int len;
while ((len = ch.read(buffer)) > 0) {
    // Decodes each chunk into a new String before appending it
    sb.append(new String(buffer.array(), 0, len));
    buffer.rewind();
}
value = sb.toString();
```
We can observe that a new String is created for every 8 KB of data read into the buffer, and that each string is then appended to the StringBuilder before the buffer is rewound and overwritten by the next read.
After reading through the String API, I noticed that creating a new string from the buffer array likely makes a fresh copy of the array via Arrays::copyOf or Arrays::copyOfRange within the StringCoding class, which handles the decoding of String objects.
Moreover, appending a String to a StringBuilder can call the AbstractStringBuilder::getBytes method, which in turn calls System::arraycopy.
Taken together, the two calls mean the same data may be copied twice: first from the byte buffer into the byte array held by the String, and then from that String into the internal byte array of the StringBuilder.
This repeated work, together with the creation of many short-lived String objects (each holding at most 8 KB of file data, so a 100 MB file produces over 12,000 of them), could significantly degrade runtime performance (partly through garbage collection pressure) and increase heap memory usage.
We could look into finding new ways to read all data in an input stream and avoid repeated work to improve both runtime and memory performance.
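As a minimal sketch of one such direction (illustrative only, not the actual RepoSense code), we could accumulate the raw bytes first and decode them into a String exactly once, so that no intermediate String is created per 8 KB chunk:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public final class StreamReadSketch {
    // Accumulates all bytes from the stream, then decodes them in one pass.
    static String readAll(InputStream is) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream(8192);
        byte[] buffer = new byte[8192];
        int len;
        while ((len = is.read(buffer)) != -1) {
            out.write(buffer, 0, len);
        }
        // Single decode: no per-chunk String objects are created.
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }
}
```

Note that ByteArrayOutputStream itself grows by copying its internal array, and toByteArray() makes one more copy, so this trades per-chunk String churn for a small number of larger array copies; whether that is a net win would need profiling.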
Is the feature request related to a problem?
This issue is not related to a problem, but it is related to the overall goal of making RepoSense more performant.
If possible, describe the solution
Currently, I am unable to find a solution that works sufficiently well; in the approaches tried so far, improving memory performance has come at the cost of runtime performance, and vice versa.
Some resources that we might wish to take a look at would be this. I have tried out using BufferedReader from the guide; it seems to reduce the overall runtime and memory usage, but it occasionally fails test cases and system test cases.
Here is the result of one of the profiling runs:
The overall runtime and memory usage were lower compared to the improvements made in #2091.
The code tested is as follows:
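The exact snippet tested is not reproduced here; a representative sketch of the BufferedReader approach from the guide might look like the following (buffer size and charset are assumptions). Reading with read(char[]) rather than readLine() preserves the original line terminators, which could be relevant to the failing test cases:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public final class BufferedReadSketch {
    // Reads the stream through a BufferedReader using a char buffer.
    // Unlike readLine(), read(char[]) keeps line terminators intact.
    static String readAll(InputStream is) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(is, StandardCharsets.UTF_8))) {
            char[] buffer = new char[8192];
            int len;
            while ((len = reader.read(buffer)) != -1) {
                sb.append(buffer, 0, len);
            }
        }
        return sb.toString();
    }
}
```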
If applicable, describe alternatives you've considered
Currently, no other alternatives have been considered.
Additional context
N/A