
Optimize item Remove operations #6716

Merged
merged 29 commits into dotnet:main on Aug 23, 2021

Conversation

Member

@ladipro ladipro commented Jul 28, 2021

Fixes #6696

Context

Removing items by referencing other item types or via many single-item Remove operations has bad performance characteristics because it leads to N*M comparisons, each involving path normalization of both sides. For projects with as few as a couple thousand items, evaluation becomes a bottleneck with a direct impact on build time and VS responsiveness.
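To make the cost concrete, here is a minimal sketch of the pre-fix pattern (illustrative only, not the actual MSBuild code; NaiveRemove and Normalize are stand-ins): with N items and M remove values, every pair is compared and each comparison normalizes both paths.

    using System;
    using System.Collections.Generic;
    using System.IO;

    // Illustrative sketch of the pre-fix behavior, not the actual MSBuild code.
    static class NaiveRemove
    {
        // Stand-in for MSBuild's path normalization.
        private static string Normalize(string path) => Path.GetFullPath(path);

        // Each of the M remove values scans all N items, and every comparison
        // re-normalizes both sides: O(N * M) comparisons and normalizations.
        public static void RemoveItems(List<string> items, IEnumerable<string> removeValues)
        {
            foreach (string removeValue in removeValues)
            {
                items.RemoveAll(item =>
                    string.Equals(Normalize(item), Normalize(removeValue),
                                  StringComparison.OrdinalIgnoreCase));
            }
        }
    }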

Changes Made

Replaced ImmutableList<ItemData> with a new collection type, OrderedItemDataCollection, used to represent the intermediate list of items. In addition to being a list, the new collection lazily creates a dictionary of items keyed by normalized paths, which allows for fast matching (see the sketch below).
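A minimal sketch of the idea, with names and details simplified from the real OrderedItemDataCollection (OrderedItemCollectionSketch and its members are hypothetical):

    using System;
    using System.Collections.Generic;
    using System.Collections.Immutable;
    using System.IO;

    // Simplified sketch: an insertion-ordered list plus a lazily built lookup
    // keyed by normalized path, so repeated Removes avoid re-normalizing every item.
    sealed class OrderedItemCollectionSketch
    {
        private ImmutableList<string> _list = ImmutableList<string>.Empty;
        private Dictionary<string, string>? _byNormalizedPath;   // built on first Remove

        public IEnumerable<string> Items => _list;               // insertion order preserved

        public void Add(string item)
        {
            _list = _list.Add(item);
            if (_byNormalizedPath != null)
            {
                _byNormalizedPath[Normalize(item)] = item;       // keep the lookup in sync
            }
        }

        public void Remove(string path)
        {
            // The first Remove pays O(N) normalizations to build the lookup;
            // subsequent Removes are a single normalization plus an O(1) probe.
            _byNormalizedPath ??= BuildLookup();
            if (_byNormalizedPath.Remove(Normalize(path), out string? item))
            {
                _list = _list.Remove(item);                      // list removal is still linear
            }
        }

        private Dictionary<string, string> BuildLookup()
        {
            var lookup = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
            foreach (string item in _list)
            {
                lookup[Normalize(item)] = item;                  // last duplicate wins in this sketch
            }
            return lookup;
        }

        private static string Normalize(string path) => Path.GetFullPath(path);
    }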

Testing

Functional:

  • Existing unit tests, one new unit test for a previously uncovered scenario.
  • Experimental VS insertion.

Performance:
Used a benchmark project with 2000 files in a subdirectory.

Evaluation time of:

    <A Include="**/*.*" />
    <B Include="**/*.*" />
    <B Remove="@(A)" />

went from 7300 ms down to 180 ms.

Evaluation time of:

    <A Include="**/*.*" />
    <A Remove="subdir\file1" />
    <A Remove="subdir\file2" />
    ...
    <A Remove="subdir\file2000" />

went from 7900 ms down to 330 ms.

Evaluation time of a Hello World .NET Core project got slightly faster as well (520 ms -> 515 ms; note that this was measured without NGEN, so most of the time is spent JITting).

Notes

I opted for a rather simple implementation of OrderedItemDataCollection based on ImmutableList and Dictionary. It has limitations but seems to work well for the scenarios of interest. It should be possible to implement a completely custom data structure for even better results (requirements: immutable, maintains insertion order for enumeration).

    /// as long as practical. If an operation would result in too much of such work, the dictionary is simply dropped and recreated
    /// later if/when needed.
    /// </remarks>
    public void RemoveMatchingItems(ItemSpec<P, I> itemSpec)
Contributor

@cdmihai cdmihai commented Jul 29, 2021
Wouldn't it be better design-wise to ask the ItemSpec to compute its intersection with a given OrderedItemDataCollection via a new IntersectWith(OrderedItemDataCollection) instead of passing the ItemSpec to RemoveMatchingItems?

That way you:

  • prevent leaking the implementation detail that some itemspec fragments do not support enumeration
  • can neatly cache the itemspec's normalized values without leaking that implementation detail. This would be useful if itemspec values get repeatedly normalized (for example by repeatedly calling any of the IsMatch, MatchCount, or IntersectWith methods on the same ItemSpec instance)
  • can reuse the IntersectWith operation to remove the same O(n^2) issue with the Update operation.

To support this, OrderedItemDataCollection (or the Builder) could provide a string indexer which would be the normalized key. Ideally you could even hide OrderedItemDataCollection behind some collection interface but maybe that's too much for now.
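A rough sketch of the shape this suggestion describes; all names below are hypothetical stand-ins, not actual MSBuild types:

    using System.Collections.Generic;

    // Hypothetical shape of the suggested refactoring. The item spec owns the
    // matching (and can cache its normalized values internally), while the
    // collection only exposes a lookup keyed by normalized path.
    interface IOrderedItemCollection<TItem> where TItem : class
    {
        // The string indexer suggested in the review; the key is the normalized path.
        TItem? this[string normalizedPath] { get; }

        IEnumerable<TItem> Items { get; }
    }

    interface IItemSpec<TItem> where TItem : class
    {
        // Returns the items of 'collection' matched by this spec, without exposing
        // which of the spec's fragments are enumerable.
        IEnumerable<TItem> IntersectWith(IOrderedItemCollection<TItem> collection);
    }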

@ladipro (Member Author)
This is a great suggestion. I think I had the code structured differently but hit a technical issue which had me settle for this mess. I'll wait for CI to start working again to see where I stand with unit tests and will then tackle this. Thank you!

@ladipro (Member Author)
I have refactored it a bit as suggested and it is indeed nicer. For the Update operation, it is possible that we can now revert the previous optimization (#5853) and achieve the same result with the new data structure, but I'd like to try that separately later.

Member

@Forgind Forgind left a comment
I don't see anything wrong with this, but it's late for me, so I don't want to approve.

Review threads (resolved): src/Build/Evaluation/LazyItemEvaluator.cs (3), src/Build/Utilities/FileSpecMatchTester.cs, src/Build/Evaluation/ItemSpec.cs (2)
@ladipro (Member Author) commented Aug 3, 2021

/azp run

@azure-pipelines
Azure Pipelines successfully started running 1 pipeline(s).

@ladipro (Member Author) commented Aug 3, 2021

/azp run

@azure-pipelines
Azure Pipelines successfully started running 1 pipeline(s).

Member

@Forgind Forgind left a comment

I'm a little curious how few things you would have to remove before building this ordered collection stops being worth it. Probably not important, since it would be fast either way. Thanks!


    public void RemoveAll(ICollection<I> itemsToRemove)
    {
        _listBuilder.RemoveAll(item => itemsToRemove.Contains(item.Item));
Member
Just for my own benefit—Except would look pretty here, but it would both remove all the items in itemsToRemove and remove duplicates, possibly reordering the list, so it is not ideal.
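A standalone illustration of that difference (not from the PR): Enumerable.Except is a set operation, so duplicates in the source list collapse, while RemoveAll merely filters and keeps duplicates in their original positions.

    using System;
    using System.Collections.Generic;
    using System.Linq;

    var items = new List<string> { "a.cs", "b.cs", "a.cs", "c.cs" };
    var toRemove = new List<string> { "c.cs" };

    // Except performs a set difference: the duplicate "a.cs" collapses to one entry.
    Console.WriteLine(string.Join(", ", items.Except(toRemove)));   // a.cs, b.cs

    // RemoveAll only filters, preserving duplicates and insertion order.
    var filtered = new List<string>(items);
    filtered.RemoveAll(item => toRemove.Contains(item));
    Console.WriteLine(string.Join(", ", filtered));                 // a.cs, b.cs, a.cs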

Review thread (resolved): src/Build/Evaluation/LazyItemEvaluator.RemoveOperation.cs
@ladipro (Member Author) commented Aug 6, 2021

I'm a little curious how few things you would have to remove before building this ordered collection stops being worth it. Probably not important, since it would be fast either way.

The remove operation is dominated by normalization of item values. This will run on the first remove either way. The extra overhead of building the dictionary should be relatively small and I'd be surprised if two removes were not already faster. I will measure and confirm, it's a fair question.

@ladipro (Member Author) commented Aug 11, 2021

There is a measurable regression in "small scale" removes, just as you suspected, @Forgind. It slows down evaluation of a Hello World .NET Core project by a few percent. I will profile and make changes to address this.

@ladipro (Member Author) commented Aug 16, 2021

I've made the following changes:

  • No longer maintaining the dictionary as immutable, so it is forgotten when the list is added to the cache in MemoizedOperation. As @Therzok pointed out earlier, ImmutableDictionary has a cost, and it turns out that we rarely take advantage of this memoization in real-world projects (when building OC I see <300 cache hits where we further modify the list, as opposed to 20k hits with no modification).
  • Enabled the new code path only for larger item lists (configurable via an environment variable, >= 100 items by default) because the dictionary set-up work still showed in profiles for small ones. A sketch of this gating follows below.

I have updated the description with fresh measurements. There is a positive impact on evaluation time even if the dictionary is not used, due to other smaller optimizations, primarily caching the normalized value on ItemData and refactoring FileSpecMatchTester to prevent repeated normalizations.
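A minimal sketch of the threshold gating described above; the environment variable name here is hypothetical, not the one the PR actually reads:

    using System;

    static class RemoveThresholdSketch
    {
        // Hypothetical variable name; the PR makes the threshold configurable
        // via an environment variable, defaulting to 100 items.
        private static readonly int MinItemsForDictionary =
            int.TryParse(Environment.GetEnvironmentVariable("MSBUILD_REMOVE_DICTIONARY_THRESHOLD"),
                         out int value) ? value : 100;

        // Below the threshold, building the dictionary costs more than it saves,
        // so small lists keep the plain linear scan.
        public static bool UseDictionary(int itemCount) => itemCount >= MinItemsForDictionary;
    }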

Review thread (resolved): src/Build/Evaluation/ItemDataCollectionValue.cs
Comment on lines +58 to +59:

    // Historically we've used slightly different normalization logic depending on the type of matching
    // performed in IsMatchNormalized. We have to keep doing it for compat.
Member
Can you give a bit more color on the types of scenarios where the difference is observable?

@ladipro (Member Author)
Trying to unify the code to use the same path normalization function for pattern matching and for single path matching resulted in unit test failures for me. I can revisit if you want but generally I treat this as something that should not be changed just for code cleanliness, even if it affects only corner cases.

Member
I agree.

Review thread (resolved): src/Build/Definition/Project.cs
@Forgind Forgind added the merge-when-branch-open label Aug 23, 2021
@Forgind Forgind merged commit 6806583 into dotnet:main Aug 23, 2021
Labels
merge-when-branch-open, performance
Development

Successfully merging this pull request may close these issues.

Item Removes lack caching and lead to O(N^2) time
5 participants