
Optimize item Remove operations #6716

Merged
merged 29 commits into dotnet:main on Aug 23, 2021

Conversation

Member

@ladipro ladipro commented Jul 28, 2021

Fixes #6696

Context

Removing items by referencing other item types or via many single-item Remove operations has bad performance characteristics because it leads to N*M comparisons, each involving path normalization of both sides. For projects with as few as a couple thousand items, evaluation becomes a bottleneck with a direct impact on build time and VS responsiveness.
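To make the cost concrete, here is a minimal sketch of the pre-fix pattern (illustrative only, not the actual MSBuild code; NaiveRemove and Normalize are stand-ins): with N items and M remove values, every pair is compared and each comparison normalizes both paths.

    using System;
    using System.Collections.Generic;
    using System.IO;

    // Illustrative sketch of the pre-fix behavior, not the actual MSBuild code.
    static class NaiveRemove
    {
        // Stand-in for MSBuild's path normalization.
        private static string Normalize(string path) => Path.GetFullPath(path);

        // Each of the M remove values scans all N items, and every comparison
        // re-normalizes both sides: O(N * M) comparisons and normalizations.
        public static void RemoveItems(List<string> items, IEnumerable<string> removeValues)
        {
            foreach (string removeValue in removeValues)
            {
                items.RemoveAll(item =>
                    string.Equals(Normalize(item), Normalize(removeValue),
                                  StringComparison.OrdinalIgnoreCase));
            }
        }
    }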

Changes Made

Replaced ImmutableList<ItemData> with a new collection type, OrderedItemDataCollection, used to represent the intermediate list of items. In addition to being a list, the new collection lazily creates a dictionary of items keyed by normalized paths, which allows for fast matching (see the sketch below).
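A minimal sketch of the idea, with names and details simplified from the real OrderedItemDataCollection (OrderedItemCollectionSketch and its members are hypothetical):

    using System;
    using System.Collections.Generic;
    using System.Collections.Immutable;
    using System.IO;

    // Simplified sketch: an insertion-ordered list plus a lazily built lookup
    // keyed by normalized path, so repeated Removes avoid re-normalizing every item.
    sealed class OrderedItemCollectionSketch
    {
        private ImmutableList<string> _list = ImmutableList<string>.Empty;
        private Dictionary<string, string>? _byNormalizedPath;   // built on first Remove

        public IEnumerable<string> Items => _list;               // insertion order preserved

        public void Add(string item)
        {
            _list = _list.Add(item);
            if (_byNormalizedPath != null)
            {
                _byNormalizedPath[Normalize(item)] = item;       // keep the lookup in sync
            }
        }

        public void Remove(string path)
        {
            // The first Remove pays O(N) normalizations to build the lookup;
            // subsequent Removes are a single normalization plus an O(1) probe.
            _byNormalizedPath ??= BuildLookup();
            if (_byNormalizedPath.Remove(Normalize(path), out string? item))
            {
                _list = _list.Remove(item);                      // list removal is still linear
            }
        }

        private Dictionary<string, string> BuildLookup()
        {
            var lookup = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
            foreach (string item in _list)
            {
                lookup[Normalize(item)] = item;                  // last duplicate wins in this sketch
            }
            return lookup;
        }

        private static string Normalize(string path) => Path.GetFullPath(path);
    }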

Testing

Functional:

  • Existing unit tests, one new unit test for a previously uncovered scenario.
  • Experimental VS insertion.

Performance:
Used a benchmark project with 2000 files in a subdirectory.

Evaluation time of:

    <A Include="**/*.*" />
    <B Include="**/*.*" />
    <B Remove="@(A)" />

went from 7300 ms down to 180 ms.

Evaluation time of:

    <A Include="**/*.*" />
    <A Remove="subdir\file1" />
    <A Remove="subdir\file2" />
    ...
    <A Remove="subdir\file2000" />

went from 7900 ms down to 330 ms.

Evaluation time of a Hello World .NET Core project got slightly faster as well (520 ms -> 515 ms; note that this was measured without NGEN, so most of the time is spent JITting).

Notes

I opted for a rather simple implementation of OrderedItemDataCollection based on ImmutableList and Dictionary. It has limitations but seems to work well for the scenarios of interest. It should be possible to implement a completely custom data structure for even better results (requirements: immutable, maintains insertion order for enumeration).

    /// as long as practical. If an operation would result in too much of such work, the dictionary is simply dropped and recreated
    /// later if/when needed.
    /// </remarks>
    public void RemoveMatchingItems(ItemSpec<P, I> itemSpec)
Contributor

@cdmihai cdmihai commented Jul 29, 2021
Wouldn't it be better design-wise to ask the ItemSpec to compute its intersection with a given OrderedItemDataCollection via a new IntersectWith(OrderedItemDataCollection) instead of passing the ItemSpec to RemoveMatchingItems?

That way you:

  • prevent leaking the implementation detail that some itemspec fragments do not support enumeration
  • can neatly cache the itemspec's normalized values without leaking that implementation detail. This would be useful if itemspec values get repeatedly normalized (for example by repeatedly calling any of the IsMatch, MatchCount, or IntersectWith methods on the same ItemSpec instance)
  • can reuse the IntersectWith operation to remove the same O(n^2) issue with the Update operation.

To support this, OrderedItemDataCollection (or the Builder) could provide a string indexer which would be the normalized key. Ideally you could even hide OrderedItemDataCollection behind some collection interface but maybe that's too much for now.
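A rough sketch of the shape this suggestion describes; all names below are hypothetical stand-ins, not actual MSBuild types:

    using System.Collections.Generic;

    // Hypothetical shape of the suggested refactoring. The item spec owns the
    // matching (and can cache its normalized values internally), while the
    // collection only exposes a lookup keyed by normalized path.
    interface IOrderedItemCollection<TItem> where TItem : class
    {
        // The string indexer suggested in the review; the key is the normalized path.
        TItem? this[string normalizedPath] { get; }

        IEnumerable<TItem> Items { get; }
    }

    interface IItemSpec<TItem> where TItem : class
    {
        // Returns the items of 'collection' matched by this spec, without exposing
        // which of the spec's fragments are enumerable.
        IEnumerable<TItem> IntersectWith(IOrderedItemCollection<TItem> collection);
    }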

@ladipro (Member Author)
This is a great suggestion. I think I had the code structured differently but hit a technical issue which had me settle for this mess. I'll wait for CI to start working again to see where I stand with unit tests and will then tackle this. Thank you!

@ladipro (Member Author)
I have refactored it a bit as suggested and it is indeed nicer. For the Update operation, it is possible that we can now revert the previous optimization (#5853) and achieve the same result with the new data structure, but I'd like to try that separately later.

Member

@Forgind Forgind left a comment
I don't see anything wrong with this, but it's late for me, so I don't want to approve.

Review threads (resolved): src/Build/Evaluation/LazyItemEvaluator.cs (3), src/Build/Utilities/FileSpecMatchTester.cs, src/Build/Evaluation/ItemSpec.cs (2)
@ladipro (Member Author) commented Aug 3, 2021

/azp run

@azure-pipelines
Azure Pipelines successfully started running 1 pipeline(s).

@ladipro (Member Author) commented Aug 3, 2021

/azp run

@azure-pipelines
Azure Pipelines successfully started running 1 pipeline(s).

Member

@Forgind Forgind left a comment

I'm a little curious how few things you would have to remove before building this ordered collection stops being worth it. Probably not important, since it would be fast either way. Thanks!


    public void RemoveAll(ICollection<I> itemsToRemove)
    {
        _listBuilder.RemoveAll(item => itemsToRemove.Contains(item.Item));
Member
Just for my own benefit—Except would look pretty here, but it would both remove all the items in itemsToRemove and remove duplicates, possibly reordering the list, so it is not ideal.
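A standalone illustration of that difference (not from the PR): Enumerable.Except is a set operation, so duplicates in the source list collapse, while RemoveAll merely filters and keeps duplicates in their original positions.

    using System;
    using System.Collections.Generic;
    using System.Linq;

    var items = new List<string> { "a.cs", "b.cs", "a.cs", "c.cs" };
    var toRemove = new List<string> { "c.cs" };

    // Except performs a set difference: the duplicate "a.cs" collapses to one entry.
    Console.WriteLine(string.Join(", ", items.Except(toRemove)));   // a.cs, b.cs

    // RemoveAll only filters, preserving duplicates and insertion order.
    var filtered = new List<string>(items);
    filtered.RemoveAll(item => toRemove.Contains(item));
    Console.WriteLine(string.Join(", ", filtered));                 // a.cs, b.cs, a.cs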

Review thread (resolved): src/Build/Evaluation/LazyItemEvaluator.RemoveOperation.cs
@ladipro (Member Author) commented Aug 6, 2021

I'm a little curious how few things you would have to remove before building this ordered collection stops being worth it. Probably not important, since it would be fast either way.

The remove operation is dominated by normalization of item values. This will run on the first remove either way. The extra overhead of building the dictionary should be relatively small and I'd be surprised if two removes were not already faster. I will measure and confirm, it's a fair question.

@ladipro (Member Author) commented Aug 11, 2021

There is a measurable regression in "small scale" removes, just as you suspected, @Forgind. It slows down evaluation of a Hello World .NET Core project by a few percent. I will profile and make changes to address this.

@ladipro (Member Author) commented Aug 16, 2021

I've made the following changes:

  • No longer maintaining the dictionary as immutable, so it is forgotten when the list is added to the cache in MemoizedOperation. As @Therzok pointed out earlier, ImmutableDictionary has a cost, and it turns out that we rarely take advantage of this memoization in real-world projects (when building OC I see <300 cache hits where we further modify the list, as opposed to 20k hits with no modification).
  • Enabled the new code path only for larger item lists (configurable via an environment variable, >= 100 items by default) because the dictionary set-up work still showed in profiles for small ones. A sketch of this gating follows below.

I have updated the description with fresh measurements. There is a positive impact on evaluation time even if the dictionary is not used, due to other smaller optimizations, primarily caching the normalized value on ItemData and refactoring FileSpecMatchTester to prevent repeated normalizations.
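A minimal sketch of the threshold gating described above; the environment variable name here is hypothetical, not the one the PR actually reads:

    using System;

    static class RemoveThresholdSketch
    {
        // Hypothetical variable name; the PR makes the threshold configurable
        // via an environment variable, defaulting to 100 items.
        private static readonly int MinItemsForDictionary =
            int.TryParse(Environment.GetEnvironmentVariable("MSBUILD_REMOVE_DICTIONARY_THRESHOLD"),
                         out int value) ? value : 100;

        // Below the threshold, building the dictionary costs more than it saves,
        // so small lists keep the plain linear scan.
        public static bool UseDictionary(int itemCount) => itemCount >= MinItemsForDictionary;
    }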

Review thread (resolved): src/Build/Evaluation/ItemDataCollectionValue.cs
Comment on lines +58 to +59:

    // Historically we've used slightly different normalization logic depending on the type of matching
    // performed in IsMatchNormalized. We have to keep doing it for compat.
Member
Can you give a bit more color on the types of scenarios where the difference is observable?

@ladipro (Member Author)
Trying to unify the code to use the same path normalization function for pattern matching and for single path matching resulted in unit test failures for me. I can revisit if you want but generally I treat this as something that should not be changed just for code cleanliness, even if it affects only corner cases.

Member
I agree.

Review thread (resolved): src/Build/Definition/Project.cs
@Forgind Forgind added the merge-when-branch-open label Aug 23, 2021
@Forgind Forgind merged commit 6806583 into dotnet:main Aug 23, 2021
Labels
merge-when-branch-open, performance
Development

Successfully merging this pull request may close these issues.

Item Removes lack caching and lead to O(N^2) time
5 participants