Performance degradation in recent commits #44
Comments
One of the reasons for those changes was to avoid problems with the reader getting stuck in an infinite loop and/or going round in circles before throwing an OutOfMemoryException, which doesn't seem like something that should be left to callers? Possibly it can be done in a better way though (it seems a shame that HashSet doesn't have a means of reserving capacity until the latest .NET runtime versions). |
In essence, we are facing the standard FAT filesystem consistency check. I don't remember it by heart, but I am confident that we can come up with a solution ( :) this is a standard interview question: find a loop in a linked list). |
Benchmark project committed. @michaelp do you have the time to try something different from the HashSet container to check for loops AND keep performance at optimal levels? |
@ironfede yes I will try it this weekend. I will try Floyd's algorithm for loop detection |
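For reference, Floyd's tortoise-and-hare detection over a sector chain fits in a few lines and needs no extra allocations. A minimal sketch, assuming a getNextId delegate that stands in for "read the next sector id from the FAT" and the ENDOFCHAIN marker from the CFB spec (neither is OpenMcdf's actual API):

using System;

static class ChainChecks
{
    public static bool ChainHasLoop(int startId, Func<int, int> getNextId)
    {
        const int EndOfChain = unchecked((int)0xFFFFFFFE); // ENDOFCHAIN marker from the CFB spec
        int slow = startId, fast = startId;
        while (fast != EndOfChain && getNextId(fast) != EndOfChain)
        {
            slow = getNextId(slow);            // tortoise: one step
            fast = getNextId(getNextId(fast)); // hare: two steps
            if (slow == fast)
                return true;                   // pointers met: the chain loops
        }
        return false;                          // reached ENDOFCHAIN: no loop
    }
}

This runs in O(n) time with O(1) extra memory, so it sidesteps the HashSet growth problem entirely.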
@ironfede As you're looking in this area, and #40 is related to the checks anyway: openmcdf/sources/OpenMcdf/CompoundFile.cs, Line 1276 in a62d584. With header.DIFATSectorsNumber set to 268435456, as in #40, it was running out of memory before the existing 'validationCount' check failed. |
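For illustration, a sanity check along these lines would bound the claimed DIFAT sector count by what the file can physically contain before anything is allocated. sourceStream, GetSectorSize() and the exception type are assumptions here, not the exact members around line 1276:

long maxSectorCount = sourceStream.Length / GetSectorSize(); // sectors that actually fit in the file
if (header.DIFATSectorsNumber > maxSectorCount)
    throw new CFCorruptedFileException("Header declares " + header.DIFATSectorsNumber +
        " DIFAT sectors, but the file can hold at most " + maxSectorCount + " sectors.");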
@michaelp sorry for the delay in responding, but I'm currently really busy with my main work activities. I'm really interested in your points about improving performance and architecture, especially the last one. Do you think a separate branch could work to explore some ideas? |
@ironfede I am writing some tests and I need a few more evenings to come up with a solution to the current task (just keeping you in the loop). I want to deliver what I've started and then we can talk. |
I do realize it has been 5 years, but did anything ever come out of this? I have used this in the OpenMcdf.PerfTest project:
Would you accept a PR for an additional Configuration flag in the |
I wanted to have a look at the earlier suggested algorithm for detecting loops at some point, but then fell over a number of bugs that needed to be fixed first, and never got to it :-( |
The approach in the v3 proof of concept (#194) essentially avoids any issues with loops in the chains, since the chain is enumerated on the fly when reading/writing, so it won't attempt to exceed the length of the stream stated in its corresponding directory entry. The enumerator approach is also much faster and dramatically reduces memory requirements. For belt and braces, I've also added detection of loops by checking for an overflow past the maximum allowable number of sectors in a chain. An exception is also thrown if the chain is not sufficiently long for the stream's stated length. In its current form, consolidation on commit will also result in any loops being removed. |
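To make the enumerator idea concrete, a heavily simplified sketch (hypothetical names, not the actual v3 code) could look like this: the chain is walked lazily and an exception is thrown as soon as it exceeds the maximum number of sectors the file could hold, which also catches loops.

using System;
using System.Collections.Generic;
using System.IO;

static class ChainEnumerator
{
    // Lazily walk a FAT chain; getNextId reads the next sector id from the FAT.
    public static IEnumerable<uint> EnumerateChain(uint startId, Func<uint, uint> getNextId, long maxSectorCount)
    {
        const uint EndOfChain = 0xFFFFFFFE; // ENDOFCHAIN marker from the CFB spec
        uint id = startId;
        long count = 0;
        while (id != EndOfChain)
        {
            if (++count > maxSectorCount)
                throw new IOException("Sector chain exceeds the file's capacity (possible loop).");
            yield return id;
            id = getNextId(id);
        }
    }
}

Because callers stop enumerating once the stream's declared length has been read, a looping FAT chain can never drag a read past the size stated in the directory entry.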
I am trying to take a look at the performance degradation in the last few commits.
Regardless of the results, I would like to share the initial benchmark that I created.
Introduction
There are several test projects in the openmcdf solution whose names/intent suggest they target performance. I might be missing something, but they seem to be more of an internal tool used by the core maintenance team than an attempt at a benchmark test suite of any kind. In addition, the lack of a standard format for benchmark results makes it difficult to reproduce results on different machines and to communicate changes/findings/improvements.
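For example, a small BenchmarkDotNet harness (the de facto standard for .NET benchmarking, suggested here as an assumption rather than something already in the solution) reports machine details, statistics and exportable results. The file and stream names below are placeholders:

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using OpenMcdf;

public class ReadStreamBenchmark
{
    [Benchmark]
    public byte[] ReadStream()
    {
        var cf = new CompoundFile("report.xls"); // placeholder test file
        try
        {
            return cf.RootStorage.GetStream("Workbook").GetData(); // placeholder stream name
        }
        finally
        {
            cf.Close();
        }
    }
}

public static class Program
{
    public static void Main() => BenchmarkRunner.Run<ReadStreamBenchmark>();
}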
Assumptions:
Goals:
In addition, my initial scenario is very simple.
Here are the starting numbers, taken straight off the master branch.
Notice that we read the same amount of information and the results are sorted from fastest to slowest. The last lines are the problematic ones.
A quick investigation led to the hypothesis that
private static void EnsureUniqueSectorIndex(int nextSecID, HashSet<int> processedSectors)
in the file CompoundFile.cs is one of the causes of the performance slowdown. A quick review of the area shows that a HashSet without a proper capacity set will constantly grow and reallocate its internal arrays, which in turn causes GC pressure and can slow things down. To validate my hypothesis I made a single change.
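A change of roughly this shape, pre-sizing the HashSet before the chain is walked, illustrates the idea; the capacity bound and the surrounding names are assumptions, not the exact diff:

// Size the set up front so it never grows and rehashes while long chains are walked.
// Worst-case bound: one entry per sector that fits in the file (assumed names).
int maxSectorCount = (int)Math.Min(int.MaxValue, sourceStream.Length / GetSectorSize());
var processedSectors = new HashSet<int>(maxSectorCount); // capacity constructor: .NET Framework 4.7.2+ / .NET Core 2.0+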
And the results are way better.
Conclusion: