-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add PriorityQueue<T> to Collections #14032
Comments
I think this should be Θ(n). You need to at least iterate the input. |
+1 |
Rx has a highly production-tested priority queue class: |
I'd lean towards using the struct enumerator to be consistent with |
Also perhaps a method for batch inserts? Can always then sort and continue from previous insertion point rather than starting at the beginning if that would help?: public void Enqueue(List<T> items) {
items.Sort(_comparer);
... insertions ...
} |
I've tossed a copy of the initial implementation here. Test coverage is far from complete, but if anyone is curious please take a look and let me know what you think. I've tried to follow the existing coding conventions from the System.Collections classes as much as possible. |
Cool. Some initial feedback:
|
Also...
|
Also...
|
@justinvp Great feedback, thanks!
|
Not anymore: https://github.com/dotnet/coreclr/blob/master/src/mscorlib/src/System/Array.cs#L1060-L1069 |
I've migrated the code over to ebickle/corefx in the issue-574 branch. Implemented the Array.Empty() change and plugged everything into the regular build pipeline. One slight temporary kludge I had to introduce was having the System.Collections project depend on the System.Collections nuget package, as Comparer isn't open source yet. Will be fixed once issue dotnet/corefx#966 is complete. One key area I'm looking for feedback is how ToArray, CopyTo, and the enumerator should be handled. Currently they're optimized for performance, which means the target array is a heap and not sorted by the comparer. There are three options:
Another thing I'd like feedback on is whether the queue should be stable for two items with the same priority (compare result of 0). Generally priority queues make no guarantees that two items with the same priority will be dequeued in the order they were enqueued, but I noticed that internal priority queue implementation used in System.Reactive.Core took some additional overhead to ensure this property. My preference would be to not do this, but I'm not completely sure which option is better in terms developers' expectations. |
I happened upon this PR because I too was interested in adding a priority queue to .NET. Glad to see someone went to the effort of making this proposal :). After reviewing the code, I noticed the following:
|
|
@mbeidler I'd considered adding some form of automatic array shrinking in dequeue, but as @svick pointed out it doesn't exist in the reference implementations of similar data structures. I'd be curious to hear from anyone on the .NET Core/BCL team whether there's any particular reason they chose that style of implementation. Update: I checked List, Queue, Queue, and ArrayList - none of them shrink the size of the internal array on a remove/dequeue. Enqueue should support nulls, and is documented as allowing them. Did you run into a bug? I can't remember how robust the unit tests area in the area yet. |
Interesting, I noticed in the reference source linked by @svick that Concerning the use of |
@mbeidler Good point. According to MSDN, IComparer / IComparable only guarantees that a value of zero has the same sort order. However, it looks as though the same issue exists in the other collection classes. If I modify the code to operate on SortedList<Patient, Patient>, the test case still fails on ContainsKey. The implementation of SortedList<TKey, TValue>.ContainsKey calls Array.BinarySearch, which relies on IComparer to check for equality. The same holds true for SortedSet. Arguably it's a bug in the existing collection classes as well. I'll dug through the rest of the collections classes and see if there are any other data structures that accept an IComparer but test for equality separately. You're right though, for a priority queue you would expect custom ordering behavior that's completely independent from equality. |
Committed a fix and the test case into my fork branch. The new implementation of Contains is based directly on the behavior of List.Contains. Since List doesn't accept an IEqualityComparer, the behavior is functionally equivalent. When I get some time later today, I'll likely submit bug reports for the other built-in collections. The probably can't be fixed due to regression behavior, but at very least the documentation needs to be updated. |
I think it makes sense for It appears that in the MSDN documentation for |
@terrajobst How do you feel about the API proposal so far? Do you feel this a good fit for CoreFX? |
👍 |
Thanks for filing this. I believe we've enough data to make a formal review of this proposal, hence I labelled it as 'ready for api review' |
As |
@andrewgmorris related https://github.com/dotnet/corefx/issues/4316 "Add TryDequeue to Queue" |
We had a basic review of this and we agree that we want a ProrityQueue in the framework. However, we need to get someone to help drive the design and implementation for it. Whoever grabs the issue can work @terrajobst on finalizing the API. |
So what work is left on the API? |
This is missing from the API proposal above: |
I don't know that array-based priority queues are best. Memory allocation in .NET is really fast. We don't have the same search-small-blocks issue that the old malloc dealt with. You are welcome to use my priority queue code from here: https://github.com/BrannonKing/Kts.Astar/tree/master/Kts.AStar |
It would be helpful if you could elaborate more, or just say what you're trying to say. I'd need to disambiguate, there would be ping-pong, and that would turn into a lengthy discussion. Alternatively, we could arrange a call. |
I'm saying we want to avoid any Enqueue operation requiring an allocation, whether on the part of the caller or the part of the implementation (ammortized internal allocation is fine, e.g. to expand an array used in the implementation). I'm trying to understand how that's possible with a node-based heap (e.g. if those node objects are exposed to the caller, that prohibits pooling by the implementation due to concerns around inappropriate reuse / aliasing). I want to be able to write: pq.Enqueue(42, 84); and have that not allocate. How do the implementations you refer to achieve that?
I thought I was. |
Where does this desire come from? It's a nice to have side effect of a solution, not a requirement that 99.9% customers need to have satisfied. I don't see why you would choose this low-impact dimension to make design choices between solutions. We're not making design choices based on optimizations for the 0.1% of customers if these negatively impact 12% of the customers in another dimension. "caring about no allocations" + "dealing with two value types" is an edge case. I find the dimension of supported behavior/functionality far more important, especially when designing a general-purpose versatile data structure for a wide audience and a variety of use cases. |
From wanting the core collection types to be usable in scenarios that care about performance. You say the node-based solution would support 100% of use cases: that is not the case if every enqueue allocates, just as Why do you believe it's only "0.1%" that care about the allocation overheads of these methods? From where is that data coming?
It is not. It's also not just "two value types". As I understand it, the proposed solution would require either a) an allocation on every enqueue regardless of the Ts involved, or b) would require the element type to derive from some known base type which in turn prohibits a large number of possible uses to avoid extra allocation. |
@eiriktsarpalis We might still decide that because there's extra performance available to the 88% or 12% from an implementation which doesn't need an updatable data structure, or is optimized for one in the first place, its better to provide options 2 and 3, than option 4. But I thought we should not forget another option exists. [Or I suppose you could just see this as a better option 1 and update the description of 1 to say that bookkeeping is not forced, but lazy, and correct equatable behavior is only required when updates are used...] |
@stephentoub That's exactly what I had in mind about saying simply what you want to say, thank you :)
From intuition, i.e. the same source based on which you believe it's more important to prioritize "no additional allocations" over "ability to conduct updates". At least for the update mechanism, we have the data that 11-12% customers do need to have this behavior supported. I don't think remotely close customers would care about "no additional allocations". In either case, you are for some reason choosing to fixate on the memory dimension, forgetting about other dimensions, e.g. raw speed, which is another trade-off for your preferred approach. An array-based implementation providing "no additional allocations" would be slower than a node-based implementation. Again, I think it's arbitrary here to prioritize memory over speed. Let's take a step back and focus on what the customers want. We have a design choice that may or may not make the data structure unusable for 12% of the customers. I think we would need to be very careful with providing reasons for why we would choose not to support these. |
Please share the two C# implementations you're using to perform that comparison and the benchmarks used to come to that conclusion. Theoretical papers are certainly valuable but they are only one small piece of the puzzle. The more important thing is when the rubber meets the road, factoring in the details of the given platform and given implementations, and you're able to validate on the specific platform with the specific implementation and typical/expected data sets / usage patterns. It could very well be that your assertion is correct. It also may not be. I'd like to see the implementations / data to understand better. |
This is a valid point, the paper I quote only compares and benchmarks implementations in C++. It conducts multiple benchmarks with different data sets and usage patterns. I am quite confident this would be transferable to C#, but if you believe this is something we need to double down, I think the best course of action would be for you to ask a colleague to conduct such a study. |
@pgolebiowski I would be interested in better understanding the nature of your objection. The proposal advocates for two separate types, would that not cover your requirements? |
I would probably classify that as a performance optimization for option 1, however I do see a couple of issues with that particular approach:
|
@eiriktsarpalis Its only O(n) once, and O(1) afterwards, which is O(1) amortized. And you can postpone validating uniqueness until the first update. But maybe this is overly clever. Two classes is easier to explain. |
I've spent the last few days prototyping two PriorityQueue implementations: a basic implementation without update support and an implementation that supports updates using element equality. I've named the former The implementations can be found in this repo. Both classes are implemented using array-based quad heaps. The updatable implementation also uses a dictionary that maps elements to internal heap indices. Basic PriorityQueuenamespace System.Collections.Generic
{
public class PriorityQueue<TElement, TPriority> : IReadOnlyCollection<(TElement Element, TPriority Priority)>
{
// Constructors
public PriorityQueue();
public PriorityQueue(int initialCapacity);
public PriorityQueue(IComparer<TPriority>? comparer);
public PriorityQueue(int initialCapacity, IComparer<TPriority>? comparer);
public PriorityQueue(IEnumerable<(TElement Element, TPriority Priority)> values);
public PriorityQueue(IEnumerable<(TElement Element, TPriority Priority)> values, IComparer<TPriority>? comparer);
// Properties
public int Count { get; }
public IComparer<TPriority> Comparer { get; }
// O(log(n)) push operation
public void Enqueue(TElement element, TPriority priority);
// O(1) peek operations
public TElement Peek();
public bool TryPeek(out TElement element, out TPriority priority);
// O(log(n)) pop operations
public TElement Dequeue();
public bool TryDequeue(out TElement element, out TPriority priority);
// Combined push/pop, generally more efficient than sequential Enqueue();Dequeue() calls.
public TElement EnqueueDequeue(TElement element, TPriority priority);
public void Clear();
public Enumerator GetEnumerator();
public struct Enumerator : IEnumerator<(TElement Element, TPriority Priority)>, IEnumerator;
}
} Here's a basic sample using the type var queue = new PriorityQueue<string, int>();
queue.Enqueue("John", 1940);
queue.Enqueue("Paul", 1942);
queue.Enqueue("George", 1943);
queue.Enqueue("Ringo", 1940);
Assert.Equal("John", queue.Dequeue());
Assert.Equal("Ringo", queue.Dequeue());
Assert.Equal("Paul", queue.Dequeue());
Assert.Equal("George", queue.Dequeue()); Updatable PriorityQueuenamespace System.Collections.Generic
{
public class PrioritySet<TElement, TPriority> : IReadOnlyCollection<(TElement Element, TPriority Priority)> where TElement : notnull
{
// Constructors
public PrioritySet();
public PrioritySet(int initialCapacity);
public PrioritySet(IComparer<TPriority> comparer);
public PrioritySet(int initialCapacity, IComparer<TPriority>? priorityComparer, IEqualityComparer<TElement>? elementComparer);
public PrioritySet(IEnumerable<(TElement Element, TPriority Priority)> values);
public PrioritySet(IEnumerable<(TElement Element, TPriority Priority)> values, IComparer<TPriority>? comparer, IEqualityComparer<TElement>? elementComparer);
// Members shared with baseline PriorityQueue implementation
public int Count { get; }
public IComparer<TPriority> Comparer { get; }
public void Enqueue(TElement element, TPriority priority);
public TElement Peek();
public bool TryPeek(out TElement element, out TPriority priority);
public TElement Dequeue();
public bool TryDequeue(out TElement element, out TPriority priority);
public TElement EnqueueDequeue(TElement element, TPriority priority);
// Update methods and friends
public bool Contains(TElement element); // O(1)
public bool TryRemove(TElement element); // O(log(n))
public bool TryUpdate(TElement element, TPriority priority); // O(log(n))
public void EnqueueOrUpdate(TElement element, TPriority priority); // O(log(n))
public void Clear();
public Enumerator GetEnumerator();
public struct Enumerator : IEnumerator<(TElement Element, TPriority Priority)>, IEnumerator;
}
} Performance ComparisonI wrote a simple heapsort benchmark that compares the two implementations in their most basic application. I also included a sort benchmark that uses Linq for comparison:
As can be expected, the overhead of tracking element locations adds a significant performance hit, roughly 40-50% slower compared to the baseline implementation. |
I appreciate all the effort, I see it took a lot of time and energy.
|
tl;dr:
I work in high-performance search (motion-planning) & computational geometry code (e.g. sweepline algorithms) that is relevant to both robotics and games, I use a lot of custom hand-rolled priority queues. The other common use-case I have is a Top-K query where updatable priority is not useful. Some feedback regarding the two-implementations (yes vs no update support) debate. Naming:
Performance:
APIs:
That being said, for a significant number of my algorithms, my priority queue items contain their priorities, and storing that twice (once in the PQ, once in the items) sounds inefficient. This is especially the case if I am ordering by numerous keys (similar use-case to OrderBy.ThenBy.ThenBy). This API also cleans a lot of the inconsistencies where Insert takes a priority but Peek doesn't return it. Finally, it is worth noting that I often insert indices of an array into a Priority Queue, rather than array elements themselves. This is supported by all APIs discussed so far, though. As an example, if I am processing the beginning/ends of intervals on a x-axis line, I might have priority queue events |
@pgolebiowski I took the liberty of including your proposed pairing heap implementation to the benchmarks, just so we can compare instances of all three approaches directly. Here are the results:
Key takeaways
EDIT: updated the results after a bugfix in my benchmarks, thanks @VisualMelon |
@eiriktsarpalis Your benchmark for the (I did exactly the same thing when I first implemented it) Note that this doesn't mean the pairing heap is faster or slower, rather it seems to depend heavily on the distribution/order of the data supplied. |
@eiriktsarpalis re: the usefulness of PrioritySet... The litmus test for seeing whether PrioritySet is useful should be benchmarking algorithms which use priority updates, versus a non-updating implementation of same algorithm, enqueueing the duplicate values and ignoring duplicates when dequeueing. |
Thanks @VisualMelon, I updated my results and comments after your suggested fix.
I believe it might have benefited from the fact that the enqueued priorities were monotonic.
@TimLovellSmith my goal here was to measure performance for the most common PriorityQueue application: rather than measure the performance of updates, I wanted to see the impact for the case where updates are not needed at all. It might however make sense to produce a separate benchmark that compares pairing heap with "PrioritySet" updates. @miyu thanks for your detailed feedback, it is very much appreciated! |
@TimLovellSmith I wrote a simple benchmark that uses updates:
|
On a separate note, has their been discussion/feedback on the lack of stability being an issue (or non-issue) for people's usecases? |
None of the implementations guarantee stability, however it should be pretty straightforward for users to obtain stability by augmenting the ordinal with a timestamp: var pq = new PriorityQueue<string, (int priority, DateTime timestamp)>();
foreach (string element in elements)
{
int priority = 42;
pq.Enqueue(element, (priority, DateTime.Now));
} |
To summarize some of my previous posts, I've been trying to identify what a popular .NET priority queue would look like, so I went through the following data:
Which yielded the following takeaways:
Next StepsGoing forward, I propose we take the following actions for .NET 6:
At this moment, I would like to thank the contributors to this thread, in particular @pgolebiowski and @TimLovellSmith. Your feedback has played a huge role in guiding our design process. I hope to keep receiving your input as we iron out the design for the updatable priority queue. |
Sounds good :)
If we have the decision by the codebase owners that this direction is approved and desired, can I continue leading the API design for that bit and provide the final implementation?
It was quite a journey :D |
The API for I'm going to close this issue, thank you all for your contributions! |
Maybe someone can do a write up on this journey! A whole 6 years for one API. :) Any chance to win a Guinness? |
See LATEST Proposal in corefxlab repo.
Second Proposal options
Proposal from https://github.com/dotnet/corefx/issues/574#issuecomment-307971397
Assumptions
Elements in priority queue are unique. If they are not, we would have to introduce 'handles' of items to enable their update/remove. Or the update/remove semantics would have to apply to first/all, which is weird.
Modeled after
Queue<T>
(MSDN link)API
Open questions:
PriorityQueue
vs.Heap
IHeap
and constructor overload? (Should we wait for later?)IPriorityQueue
? (Should we wait for later -IDictionary
example)(TElement element, TPriority priority)
vs.KeyValuePair<TPriority, TElement>
Peek
andDequeue
rather haveout
argument instead of tuple?Peek
andDequeue
throwing useful at all?Original Proposal
Issue https://github.com/dotnet/corefx/issues/163 requested the addition of a priority queue to the core .NET collection data structures.
This post, while a duplicate, is intended to act the formal submission to the corefx API Review Process. The issue contents are the speclet for a new System.Collections.Generic.PriorityQueue type.
I will be contributing the PR, if approved.
Rationale and Usage
The .NET Base Class Libraries (BCL) currently lacks support for ordered producer-consumer collections. A common requirement of many software applications is the ability generate a list of items over time and process them in an order different from the order they were received in.
There are three generic data structures within the System.Collections hierarchy of namespaces that supported a sorted collection of items; System.Collections.Generic.SortedList, System.Collections.Generic.SortedSet, and System.Collections.Generic.SortedDictionary.
Of these, SortedSet and SortedDictionary are not appropriate for producer-consumer patterns that generate duplicate values. The complexity of SortedList is Θ(n) worst case for both Add and Remove.
A much more memory and time efficient data structure for ordered collections with producer-consumer usage patterns is a priority queue. Other than when capacity resizing is necessary, worse case insertion (enqueue) and remove top (dequeue) performance is Θ(log n) - far better than the existing options that exist in the BCL.
Priority queues have a wide degree of applicability across different classes of applications. The Wikipedia page on Priority Queues offers a list of many different well understand use cases. While highly specialized implementations may still require custom priority queue implementations, a standard implementation would cover a broad range of usage scenarios. Priority queues are particularly useful in scheduling the output of multiple producers, which is an important pattern in highly parallelized software.
It's worth noting that both the C++ standard library and Java offer priority queue functionality as part of their basic APIs.
Proposed API
Details
Open Questions
Updates
The text was updated successfully, but these errors were encountered: