Make it easier to store other types of link disambiguating information in the path hierarchy #828
In the local benchmark I ran just now, this change shows no measurable difference in build time but a possible small improvement in memory usage.
@patshaughnessy, @binamaniar, and I took a look at this today.
I'm leaving a few questions that we were curious about as we started to go through these changes.
    // The 'hash' is more unique than the 'kind', so compare the 'hash' first.
    self.hash == hash && self.kind == kind
}
var isPlaceholderValue: Bool {
When do we ever create a node with a nil kind and hash? It looks like it has something to do with an "unfindable" symbol placeholder, but I'm not sure what that is.
It only happens when the convert service builds documentation for a single page. In this case, DocC is passed a "partial" symbol graph file that only contains a single symbol, which doesn't have to be a top-level symbol.
For example, if the convert service builds documentation for `someFunction()` and it's a member of `SomeClass`, then the symbol graph file will only contain the `someFunction()` symbol and not `SomeClass`, but the function's path components still include "SomeClass". In this case, DocC creates an "unfindable" symbol placeholder named "SomeClass" that `someFunction()` becomes a member of. This ensures that the nodes in the hierarchy are connected and that there's the same number of nodes (with the same names) between the module node and the `someFunction()` node as there would be in the full symbol graph.
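To make the placeholder idea concrete, here is a minimal sketch. The `Node` class and the `insert(symbolPath:under:)` function are hypothetical, simplified stand-ins for illustration and are not DocC's actual types or API.

```swift
// Hypothetical, simplified stand-in for a path hierarchy node; not DocC's actual type.
final class Node {
    let name: String
    let isPlaceholder: Bool
    var children: [String: Node] = [:]

    init(name: String, isPlaceholder: Bool) {
        self.name = name
        self.isPlaceholder = isPlaceholder
    }
}

/// Walks the path components below `root`, creating a placeholder node for
/// every intermediate component that has no node yet, and returns the last node.
func insert(symbolPath: [String], under root: Node) -> Node {
    var current = root
    for (index, component) in symbolPath.enumerated() {
        let isLast = index == symbolPath.count - 1
        if let existing = current.children[component] {
            current = existing
        } else {
            // Assumption: only the last component is a real symbol in the partial graph.
            let node = Node(name: component, isPlaceholder: !isLast)
            current.children[component] = node
            current = node
        }
    }
    return current
}

// A partial symbol graph that only contains `someFunction()`,
// whose path components still include "SomeClass".
let moduleNode = Node(name: "MyModule", isPlaceholder: false)
let function = insert(symbolPath: ["SomeClass", "someFunction()"], under: moduleNode)
```

With this sketch, the hierarchy has the same depth and names as it would with the full symbol graph; the only difference is that the "SomeClass" node is marked as a placeholder.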
Makes sense, thanks for explaining. It might be helpful to have a comment somewhere explaining what a placeholder value is in the context of the PathHierarchy.
I added a couple of paragraphs of internal implementation documentation about this in 50cc145.
// It is possible for articles and other non-symbols to collide with unfindable symbol placeholder nodes.
mutating func add(_ value: PathHierarchy.Node, kind: String?, hash: String?) {
    // When adding new elements to the container, it's sufficient to check if the hash and kind match.
    if let existing = storage.first(where: { $0.matches(kind: kind, hash: hash) }) {
Could there ever be multiple Elements with the same kind and hash? Do we need an assert for that? This looks safe as long as it's the only place we ever add new Elements.
Yes, when processing multiple symbol graphs with different language representations of the same symbol, there could be multiple calls to `add(_:kind:hash:)` with the same `kind` and `hash` values. That's why this implementation checks if there's already an element with that kind and hash and merges the two nodes (combining the different language representations of that symbol).
The `storage` is a private property, so this is the only place that can add new elements.
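As a rough sketch of the merge-on-add behavior described above: the `Node` type below, its `languages` property, and the example kind and hash strings are assumptions made for illustration. Only the `matches(kind:hash:)` check and the shape of `add(_:kind:hash:)` come from the diff.

```swift
// Hypothetical stand-in for PathHierarchy.Node; `languages` is an assumed property.
final class Node {
    let name: String
    var languages: Set<String>

    init(name: String, languages: Set<String>) {
        self.name = name
        self.languages = languages
    }

    // Combines the language representations of two nodes for the same symbol.
    func merge(with other: Node) {
        languages.formUnion(other.languages)
    }
}

struct DisambiguationContainer {
    struct Element {
        let node: Node
        let kind: String?
        let hash: String?

        func matches(kind: String?, hash: String?) -> Bool {
            // The 'hash' is more unique than the 'kind', so compare the 'hash' first.
            self.hash == hash && self.kind == kind
        }
    }

    // Only `add(_:kind:hash:)` can modify the storage.
    private(set) var storage = ContiguousArray<Element>()

    mutating func add(_ value: Node, kind: String?, hash: String?) {
        if let existing = storage.first(where: { $0.matches(kind: kind, hash: hash) }) {
            // Same kind and hash: merge instead of storing a second element.
            existing.node.merge(with: value)
        } else {
            storage.append(Element(node: value, kind: kind, hash: hash))
        }
    }
}

// The same symbol arrives twice, once per language representation.
var container = DisambiguationContainer()
container.add(Node(name: "SomeClass", languages: ["swift"]), kind: "class", hash: "abc12")
container.add(Node(name: "SomeClass", languages: ["occ"]), kind: "class", hash: "abc12")
```

After both calls the container still holds a single element, whose node carries both language representations.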
- // Given this expected amount of data, a nested dictionary implementation performs well.
- private(set) var storage: [String: [String: PathHierarchy.Node]] = [:]
+ // Given this expected amount of data, linear searches through an array performs well.
+ private(set) var storage = ContiguousArray<Element>()
What made you decide to use an array (or a `ContiguousArray<Element>` to be precise) and not a Set? As a list of unique items, if we make `Element` conform to `Equatable`, `Set<Element>` would automatically guarantee that all the elements are unique.
Maybe the `Element`s aren't actually equal if the hash and kind are equal but not the node?
As a follow-up question, after taking a look at `ContiguousArray`'s documentation:
> If the array’s Element type is a struct or enumeration, `Array` and `ContiguousArray` should have similar efficiency.
What is the benefit of using `ContiguousArray` over `Array` here?
My reason for not using a Set here is twofold:
- First, when an element already exists, the `DisambiguationContainer` needs to merge the new value with the existing value. Because of this, it's not sufficient to rely on the Set's uniqueness. Colliding elements also need to be merged.
- Second, the `DisambiguationContainer` also needs to merge new and existing values in cases other than exact kind and hash matches. Currently this only happens when the container has a single placeholder value, but in the future, when there could be more types of disambiguation, the container would need to look for existing values in more ways.
I could implement this with a Set and still use `storage.first(where: { $0.matches(kind: kind, hash: hash) })` to check for existing matches, but the Set doesn't bring any benefit over an Array in that case.
I could implement this with a Set and check for exact kind and hash matches with `Set.remove(_:)`, but it would only work for the exact kind and hash match:
    let element = Element(...)
    if let existing = storage.remove(element) {
        existing.node.merge(with: value)
        storage.insert(existing)
    } else if ...
I could also separate the kind and hash from the node and create a compound key:
    // The key needs to be Hashable to be used as a dictionary key.
    struct Key: Hashable {
        let kind: String?
        let hash: String?
    }
    private(set) var storage = [Key: Node]()
This makes it very easy to look for exact matches:
    let key = Key(kind: kind, hash: hash)
    if let existing = storage[key] {
        existing.merge(with: value)
    } else if ...
but when there are other types of disambiguation this breaks down and the container needs to use `.first(where:)` to inspect the individual values again.
Ultimately, since the purpose of this refactoring is to make it easier to store other types of disambiguation in the future (#643 already has one type of disambiguation that I want to add), I went with the solution that doesn't prioritize exact kind and hash matches.
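To illustrate why a match that isn't an exact kind and hash match doesn't fit a Set or a keyed dictionary, here is a small sketch of the single-placeholder case. The `Node`, `Element`, and `Container` types are simplified, hypothetical stand-ins, and the exact way the placeholder is replaced is an assumption rather than DocC's implementation.

```swift
// Hypothetical, simplified types for illustration; not DocC's actual implementation.
final class Node {
    let name: String
    var children: [String]

    init(name: String, children: [String] = []) {
        self.name = name
        self.children = children
    }

    // Adopts the other node's children.
    func merge(with other: Node) {
        children += other.children
    }
}

struct Element {
    let node: Node
    let kind: String?
    let hash: String?

    // A placeholder has neither a kind nor a hash.
    var isPlaceholderValue: Bool { kind == nil && hash == nil }

    func matches(kind: String?, hash: String?) -> Bool {
        self.hash == hash && self.kind == kind
    }
}

struct Container {
    private(set) var storage: [Element] = []

    mutating func add(_ value: Node, kind: String?, hash: String?) {
        if let existing = storage.first(where: { $0.matches(kind: kind, hash: hash) }) {
            // Exact kind and hash match: a Set or a dictionary key would find this too.
            existing.node.merge(with: value)
        } else if storage.count == 1, let placeholder = storage.first, placeholder.isPlaceholderValue {
            // Not an exact match: the real value replaces the lone placeholder and
            // adopts its children. A keyed lookup can't express this rule.
            value.merge(with: placeholder.node)
            storage = [Element(node: value, kind: kind, hash: hash)]
        } else {
            storage.append(Element(node: value, kind: kind, hash: hash))
        }
    }
}

var container = Container()
// First a placeholder for "SomeClass" is added, then the real node arrives.
container.add(Node(name: "SomeClass", children: ["someFunction()"]), kind: nil, hash: nil)
container.add(Node(name: "SomeClass"), kind: "class", hash: "abc12")
```

Each additional matching rule becomes another branch that inspects the elements, which is the reason given above for not optimizing the storage for exact kind and hash lookups.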
I thought that there would be a bigger difference between `ContiguousArray` and `Array`, but I can't measure any difference except in extreme micro benchmarks. I can change it to a regular array if it makes the code easier to understand. Otherwise I don't mind keeping the `ContiguousArray` since it's a low-level implementation detail.
That makes sense, thank you! I wonder if it would be useful for future developers if you were to add a comment (similar to your bullet points above) briefly explaining the unique requirements for the `DisambiguationContainer`'s storage? I'm trying to think what may be useful for a future developer to know if they wanted to add a new type of disambiguation.
I briefly mentioned why a `Set` wouldn't help in a new comment in 50cc145.
This looks like it simplifies the use of disambiguations in some places too, which is great!
(Reviewed together with @patshaughnessy and @binamaniar)
let component = PathParser.parse(pathComponent: component[...])
let nodeWithoutSymbol = Node(name: String(component.name))
nodeWithoutSymbol.isDisfavoredInCollision = true
// If 'known disambiguated path components' was provided, then
Looks like part of the comment is missing here.
I updated this—and the surrounding internal documentation—in 50cc145
@@ -207,9 +207,17 @@ struct PathHierarchy {
    parent.children[components.first!] == nil,
    "Shouldn't create a new sparse node when symbol node already exist. This is an indication that a symbol is missing a relationship."
)
guard knownDisambiguatedPathComponents != nil else {
Why did we not have to do this before?
We don't need to do this, but it's a small optimization to avoid doing unnecessary work. Before, we always parsed each placeholder node's path component for disambiguation, but now we only do it if we know that the path hierarchy has been provided some custom path components (which could include disambiguation that we need to parse).
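A small sketch of that early exit follows. The `parse(pathComponent:)` helper below is a hypothetical simplification that splits on the last "-" (it is not DocC's `PathParser`), and `placeholderName(for:knownDisambiguatedPathComponents:)` is an assumed name used only for this example.

```swift
// Hypothetical helper: splits "name-disambiguation" at the last dash.
// This is a simplification and not how DocC's PathParser works.
func parse(pathComponent: String) -> (name: String, disambiguation: String?) {
    guard let dash = pathComponent.lastIndex(of: "-") else {
        return (pathComponent, nil)
    }
    let name = String(pathComponent[..<dash])
    let disambiguation = String(pathComponent[pathComponent.index(after: dash)...])
    return (name, disambiguation)
}

// Returns the name to use for a placeholder node.
func placeholderName(
    for component: String,
    knownDisambiguatedPathComponents: [String: [String]]?
) -> String {
    // Without custom path components there is no disambiguation to strip,
    // so skip the parsing work and use the component as-is.
    guard knownDisambiguatedPathComponents != nil else {
        return component
    }
    return parse(pathComponent: component).name
}

let unparsed = placeholderName(for: "SomeClass", knownDisambiguatedPathComponents: nil)
let parsed = placeholderName(
    for: "SomeClass-class",
    knownDisambiguatedPathComponents: ["some-symbol-id": ["SomeClass-class", "someFunction()"]]
)
```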
I added a brief comment that talks about this in 50cc145
// Subtree contains more than one match
throw Error.lookupCollision(kinds.map { ($0.value[hash]!, $0.key) })
switch disambiguation {
case .kindAndHash(let kind, let hash):
Could you avoid a nested switch statement here by using a guard to check whether `disambiguation` is nil?
I could do that now, but when we add another type of disambiguation we'll need both layers of switch statements.
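For context, the two layers could look roughly like the sketch below. The `Disambiguation` enum, the `Element` type, and the `find(_:in:)` function are hypothetical simplifications; only the `.kindAndHash` case name appears in the diff.

```swift
// Hypothetical, simplified types for illustration.
enum Disambiguation {
    case kindAndHash(kind: String?, hash: String?)
    // A future type of disambiguation would become another case here,
    // and another branch in the outer switch below.
}

struct Element {
    let name: String
    let kind: String?
    let hash: String?
}

func find(_ disambiguation: Disambiguation?, in storage: [Element]) -> [Element] {
    // Outer layer: which type of disambiguation, if any, was provided.
    switch disambiguation {
    case .kindAndHash(let kind, let hash):
        // Inner layer: which parts of this type of disambiguation are present.
        switch (kind, hash) {
        case (let kind?, let hash?):
            return storage.filter { $0.kind == kind && $0.hash == hash }
        case (let kind?, nil):
            return storage.filter { $0.kind == kind }
        case (nil, let hash?):
            return storage.filter { $0.hash == hash }
        case (nil, nil):
            return storage
        }
    case nil:
        // No disambiguation: every element is a candidate.
        return storage
    }
}

let storage = [
    Element(name: "a", kind: "class", hash: "abc12"),
    Element(name: "b", kind: "struct", hash: "def34"),
]
let byKind = find(.kindAndHash(kind: "class", hash: nil), in: storage)
let byBoth = find(.kindAndHash(kind: "struct", hash: "def34"), in: storage)
let all = find(nil, in: storage)
```

A guard on `nil` would flatten this today, but once the enum has a second case the outer switch is needed again to tell the cases apart.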
@swift-ci please test
Bug/issue #, if applicable:
Summary
This is a small refactor to change how the path hierarchy internally stores its disambiguated elements, changing the implementation from nested dictionaries per name collision per node to small lists per name collision per node.
The benefit of a small list is that new disambiguating information can be added without increasing the depth of the nested dictionaries at each node.
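The difference between the two storage shapes can be sketched as follows. The types are simplified (node names stand in for nodes), and the `signature` field is a hypothetical example of future disambiguating information, not something this PR adds.

```swift
// Before: one dictionary level per type of disambiguating information.
// Adding a third type would mean a third level of nesting at every node.
typealias NestedStorage = [String: [String: String]]  // [kind: [hash: node name]]

// After: a flat list where each element carries all of its disambiguating information.
struct Element {
    let nodeName: String
    let kind: String?
    let hash: String?
    // Hypothetical future field; adding it doesn't change the shape of the storage.
    var signature: String? = nil
}

var nested: NestedStorage = [:]
nested["class", default: [:]]["abc12"] = "SomeClass"

var list: [Element] = []
list.append(Element(nodeName: "SomeClass", kind: "class", hash: "abc12"))

// Both shapes can answer an exact kind and hash lookup.
let fromNested = nested["class"]?["abc12"]
let fromList = list.first(where: { $0.kind == "class" && $0.hash == "abc12" })?.nodeName
```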
Dependencies
None.
Testing
Nothing in particular.
Checklist
Make sure you check off the following items. If they cannot be completed, provide a reason.
[ ] Added tests
[ ] Ran the ./bin/test script and it succeeded