perf(misconf): Improve cause performance #6586
Conversation
We only need to get the offending cause if the result is a failure.

Signed-off-by: Simar <[email protected]>
There's probably still room for improvement here, as ultimately it doesn't address the fact that we will still end up parsing the files that are responsible for the failures. We also do quite a bit of string manipulation with raw file content, for example, which could be further improved. To some extent this is necessary, as it is part of the PostAnalyze step. My benchmark so far to evaluate this as an improvement has been scanning the minikube repo. It's a fairly large repo with a lot of things to scan for. Previously the scan didn't finish (in a reasonable time), as shown in the issue this PR resolves, but after this change, on my setup the scan takes on average 2min to finish and uses roughly 35-40MB of memory.
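To illustrate the gating idea this PR describes, here is a minimal sketch in Go; the `Result` type, `Status` values, and `findCause` helper are hypothetical stand-ins for illustration, not Trivy's actual API:

```go
package main

import "fmt"

type Status int

const (
	StatusPassed Status = iota
	StatusFailed
	StatusIgnored
)

// Result is a hypothetical stand-in for a misconfiguration scan result.
type Result struct {
	Status Status
	Path   string
}

// findCause is a hypothetical stand-in for the expensive step that
// re-reads the source file and extracts the offending code excerpt.
func findCause(r Result) string {
	return fmt.Sprintf("excerpt from %s", r.Path)
}

func causes(results []Result) map[string]string {
	out := make(map[string]string)
	for _, r := range results {
		// Only failures need an offending cause; skipping passed and
		// ignored results avoids re-parsing files with no finding.
		if r.Status != StatusFailed {
			continue
		}
		out[r.Path] = findCause(r)
	}
	return out
}

func main() {
	rs := []Result{{StatusPassed, "ok.yaml"}, {StatusFailed, "bad.yaml"}}
	fmt.Println(causes(rs))
}
```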
Great! I have one more question. Even if successful files were processed, would parsing a plaintext file use 9.5 GB of memory? JSON and YAML files are usually a few megabytes at most, so it is doubtful they would consume that much memory unless there is a memory leak. Are there any huge plaintext files in Minikube?
Yes, see here, it's around 7MB. Without this file, the entire scan finishes in 30s. As far as processing is concerned, we do quite a bit of it internally. There are a couple of places we can improve on, for instance:
As for point 2 above, I would say such situations with big files can occur in other repos as well, and users might not realize it. Therefore, adding the option to disable causes completely (in addition to only running them for failures, which is this PR) can help until we have a better way to solve point 2.
@nikpivkin since you are back, I'd welcome any ideas you might have as well.
I think the file is not huge; less than 10 MB is small. Does parsing 7 MB of YAML consume several GB of memory? Even if there were a large number of 7 MB YAML files in a repository, if they were processed linearly, the memory consumption would not be that high. Is there another factor in memory consumption?
I took a quick look.
I also agree with you. I will keep looking.
Here's the profile data if anyone's interested in taking a look:
@simar7 This solves the memory usage problem.

```go
// rawLines := strings.Split(string(content), "\n")
var rawLines []string
bs := bufio.NewScanner(bytes.NewReader(content))
for bs.Scan() {
	rawLines = append(rawLines, bs.Text())
}
if err := bs.Err(); err != nil {
	return nil, fmt.Errorf("failed to scan file: %w", err)
}
```
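A likely reason this helps: strings.Split keeps every line as a substring sharing one large backing string, so the entire file content stays pinned in memory for as long as any single line is referenced, whereas bufio.Scanner's Text() copies each line into its own allocation, letting the original buffer and unreferenced lines be collected independently.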
@nikpivkin that does help! The heap does grow over time, but not at the same rate as before; it settles around ~800MB (before/after heap profiles attached). The scan (with your patch but without this PR's changes) took around 20min on my machine to finish. With my changes + your patch, it takes around 2.5min, along with using less memory. I'll update this PR to add your changes in, as I think doing both 1) getting the cause only for failures and 2) using a bufio.Scanner is worthwhile.
I tried not to load the entire content into memory, but only the necessary lines. It reduces memory consumption (before: 3483943, after: knqyf263@f365780). It will be effective when processing 50MB, 100MB, or even bigger files, but it is unlikely that there will be JSON or YAML that large 😆. This may be premature optimisation.
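As a rough sketch of that idea (not the actual patch), assuming the caller knows the 1-based line range it needs; readLines is a hypothetical helper:

```go
package main

import (
	"bufio"
	"fmt"
	"os"
)

// readLines streams the file and keeps only the lines in [start, end]
// (1-based, inclusive), so memory use is bounded by the excerpt size
// rather than the file size.
func readLines(path string, start, end int) ([]string, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()

	var lines []string
	sc := bufio.NewScanner(f)
	for n := 1; n <= end && sc.Scan(); n++ {
		if n >= start {
			lines = append(lines, sc.Text())
		}
	}
	if err := sc.Err(); err != nil {
		return nil, fmt.Errorf("scanning %s: %w", path, err)
	}
	return lines, nil
}

func main() {
	lines, err := readLines("example.yaml", 3, 7)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	for _, l := range lines {
		fmt.Println(l)
	}
}
```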
Impressive! Do you think we should add this change in as well? I don't have a strong opinion either way.
If we see a memory issue again, we can come back to my patch.
@simar7 Can we merge the PR now so it will be included in v0.51.0?
Okay, sounds good!
Signed-off-by: Simar <[email protected]>
Description
We only need to get the offending cause if the result is a failure.
Today we end up getting a cause for every single result type, which results in unnecessary compute being done.
As a side note, I also verified that the JSON output (when including the --include-non-failures flag) does not contain any cause info (for code excerpts) for results that are a PASS.

Related issues
Related PRs
--disable-causes flag #6585

Checklist