Use a specialized varint + bitpacking scheme for DepGraph encoding #110050
@bors try
@rust-timer queue
⌛ Trying commit 0e05ccdfbc10f96411b16b0503b6234f701f1ab9 with merge cf71c3fe420cb3197ee017eaebcddfe19bc54720...
💔 Test failed - checks-actions
Force-pushed from 0e05ccd to 58405b6
@bors try
@rust-timer queue
⌛ Trying commit 58405b6d16dcf0f5d865664b70c57e1a2ae240d1 with merge c297bb0502ac94c26a93f91543d3680a54b51ad5...
☀️ Try build successful - checks-actions
Finished benchmarking commit (c297bb0502ac94c26a93f91543d3680a54b51ad5): comparison URL.

Overall result: ❌✅ regressions and improvements - ACTION NEEDED

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf.

Next Steps: If you can justify the regressions found in this try perf run, please indicate this with @bors rollup=never

Instruction count: This is a highly reliable metric that was used to determine the overall result at the top of this comment.

Max RSS (memory usage): This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

Cycles: This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
The icount reductions are nice, but cycles and wall-time don't look that good.
Force-pushed from 58405b6 to 44cb85f
You have a lot of good comments on individual fields and functions, but it would be great to have a top-level comment somewhere containing much of the information from the PR description, which provides motivation and a high-level view.
```diff
-TaskDepsRef::Allow(deps) => edges.extend(deps.lock().reads.iter().copied()),
+TaskDepsRef::Allow(deps) => {
+    for index in deps.lock().reads.iter().copied() {
+        edges.push(index);
```
Would it be hard to impl `Extend`?
Nope! I really did not think of it at the time.
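For illustration, a minimal sketch of what that `Extend` impl could look like, assuming the edge list is a vector wrapper that tracks the maximum index as it grows (the `EdgesVec` name, the `u32` index type, and the `Vec` backing store are stand-ins, not the exact compiler types):

```rust
// Sketch of the `Extend` impl being discussed. The real type wraps a
// SmallVec and a dep-node index type; `u32` and `Vec` stand in here.
#[derive(Default)]
pub struct EdgesVec {
    max: u32,
    edges: Vec<u32>,
}

impl EdgesVec {
    pub fn push(&mut self, edge: u32) {
        // Track the max so the encoder can pick a byte width per node.
        self.max = self.max.max(edge);
        self.edges.push(edge);
    }
}

impl Extend<u32> for EdgesVec {
    fn extend<T: IntoIterator<Item = u32>>(&mut self, iter: T) {
        for edge in iter {
            self.push(edge);
        }
    }
}

fn main() {
    let mut edges = EdgesVec::default();
    edges.extend([3_u32, 7, 42]);
    assert_eq!(edges.max, 42);
}
```

With an impl like this, the call site can go back to the one-line `edges.extend(deps.lock().reads.iter().copied())` form while `push` still updates the tracked maximum for every edge.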
```rust
/// Amount of padding we need to add to the edge list data so that we can retrieve every
/// SerializedDepNodeIndex with a fixed-size read then mask.
const DEP_NODE_PAD: usize = DEP_NODE_SIZE - 1;

/// Amount of bits we need to store the number of used bytes in a SerializedDepNodeIndex.
```
```diff
-/// Amount of bits we need to store the number of used bytes in a SerializedDepNodeIndex.
+/// Number of bits we need to store the number of used bytes in a SerializedDepNodeIndex.
```
I was trying to avoid saying "number" twice in a sentence, but if that construction reads well to you, I agree that "amount" is a bit of an odd word to use in this context.
```rust
let mut nodes = IndexVec::with_capacity(node_count);
let mut fingerprints = IndexVec::with_capacity(node_count);
let mut edge_list_indices = IndexVec::with_capacity(node_count);
let mut edge_list_data = Vec::with_capacity(edge_count);

// This slightly over-estimates the amount of bytes used for all the edge data but never by
// more than ~6%, because over-estimation only occurs for large nodes.
```
Why does it overestimate? Where does the ~6% come from?
Good question. This comment needed to be updated; it was referring to an encoding scheme from a few iterations ago.
```rust
// Bit fields are
//     0..?    length of the edge
//     ?..?+2  bytes per index
//     ?+2..16 kind
```
Use `M` and `N` instead of `?`, perhaps?
Done!
r=me with the comments addressed, thanks.

@bors delegate=saethlin
✌️ @saethlin, you can now approve this pull request! If @nnethercote told you to "
Encode DepKind as u16

The derived Encodable/Decodable impls serialize/deserialize as a varint, which results in a lot of code size around the encoding/decoding of these types that isn't justified: the full range of values here is rather small but doesn't quite fit into a `u8`. Growing _all_ serialized `DepKind` to 2 bytes costs us on average 1% size in the incr comp dep graph, which I plan to recoup in rust-lang/rust#110050 by taking advantage of the unused bits in all the serialized `DepKind`.

r? `@nnethercote`
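To illustrate the trade-off the commit message describes, here is a sketch of a fixed-width 2-byte encoding (a simplified byte-buffer stand-in for rustc_serialize's actual traits, with a newtype in place of the real `DepKind` enum):

```rust
// Sketch: encode a DepKind discriminant as a fixed-width u16 instead of
// letting a derived impl emit a varint. Always exactly two bytes, so
// neither encode nor decode needs length branches in the hot path.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct DepKind(u16);

impl DepKind {
    fn encode(self, out: &mut Vec<u8>) {
        out.extend_from_slice(&self.0.to_le_bytes());
    }

    fn decode(input: &[u8]) -> (Self, &[u8]) {
        let (bytes, rest) = input.split_at(2);
        (DepKind(u16::from_le_bytes(bytes.try_into().unwrap())), rest)
    }
}

fn main() {
    let mut buf = Vec::new();
    DepKind(300).encode(&mut buf); // 300 doesn't fit in a u8
    assert_eq!(buf.len(), 2);
    assert_eq!(DepKind::decode(&buf).0, DepKind(300));
}
```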
@bors r=nnethercote
☀️ Test successful - checks-actions
Finished benchmarking commit (f00c139): comparison URL.

Overall result: ❌✅ regressions and improvements - ACTION NEEDED

Next Steps: If you can justify the regressions found in this perf run, please indicate this with @rustbot label: +perf-regression

Instruction count: This is a highly reliable metric that was used to determine the overall result at the top of this comment.

Max RSS (memory usage): This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

Cycles: This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

Binary size: This benchmark run did not return any relevant results for this metric.

Bootstrap: 628.991s -> 628.823s (-0.03%)
As seen in the previous perf runs, this is a case where the reductions in icounts don't seem to match cycles and walltime, so investigating the small secondary regressions is probably not needed.
@rustbot label: +perf-regression-triaged
The previous scheme here uses leb128 to encode the edge tables that represent the incr comp dependency graph. The problem with that scheme is that leb128 has overhead for larger values and generally relies on the distribution of encoded values being heavily skewed towards smaller values. That is definitely not the case for a dep node index: indices are handed out sequentially and the whole range is covered, so the distribution is actually biased in the opposite direction, and most dep node indices are large.
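To make that overhead concrete, a back-of-envelope sketch (an illustration, not the compiler's implementation) of how many bytes LEB128 spends per value; with indices handed out sequentially, the bulk of indices in a large graph land in the multi-byte range:

```rust
/// Bytes LEB128 needs for a value: each encoded byte carries 7 payload bits.
fn leb128_len(value: u32) -> usize {
    let bits = (32 - value.leading_zeros()) as usize;
    ((bits + 6) / 7).max(1)
}

fn main() {
    // Small values are cheap, but a graph with millions of sequentially
    // numbered nodes puts most indices at 3-4 bytes each.
    assert_eq!(leb128_len(100), 1);
    assert_eq!(leb128_len(1 << 20), 3);
    assert_eq!(leb128_len(1 << 22), 4);
}
```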
This PR implements a different varint encoding scheme. Instead of applying varint encoding to individual dep node indices (which is extremely branchy), we now apply it per node.

While being built, each node now stores its edges in a `SmallVec` with a bit of extra logic to track the max value of each edge. Then we varint encode the whole batch. This is a gamble: we save on space by only claiming 2 bits per node instead of ~3 bits per edge, which is a nice savings, but it has to balance out against the space overhead when a single large index in a node with a lot of edges forces unnecessary bytes into each of that node's edge indices.

Then, to keep the runtime overhead of this encoding scheme down, we deserialize our indices by loading 4 bytes for each then masking off the bytes that aren't ours. This is much less code and fewer branches than leb128, but it relies on having some readable bytes past the end of each edge list, so we explicitly add such padding to the in-memory data during decoding. We also do this decoding lazily, turning a dense on-disk encoding into a peak memory reduction.
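A minimal sketch of that load-then-mask decode, assuming little-endian 4-byte loads and the padding scheme from the review comments above (the constant names mirror those comments, but the details are illustrative):

```rust
/// Size of the fixed read used to pull one index out of the edge data.
const DEP_NODE_SIZE: usize = std::mem::size_of::<u32>();
/// Padding appended to the edge data so the last index can still be read
/// with a full-width load.
const DEP_NODE_PAD: usize = DEP_NODE_SIZE - 1;

/// Decode the index stored in `bytes_per_index` bytes at `pos`, with one
/// unconditional 4-byte load instead of a per-byte varint loop.
fn read_index(data: &[u8], pos: usize, bytes_per_index: usize) -> u32 {
    let raw = u32::from_le_bytes(data[pos..pos + DEP_NODE_SIZE].try_into().unwrap());
    // Mask off the bytes that belong to the next index.
    raw & (u32::MAX >> (8 * (DEP_NODE_SIZE - bytes_per_index)))
}

fn main() {
    // Two 2-byte indices followed by DEP_NODE_PAD padding bytes.
    let data = [0x34, 0x12, 0x78, 0x56, 0, 0, 0];
    assert_eq!(data.len(), 4 + DEP_NODE_PAD);
    assert_eq!(read_index(&data, 0, 2), 0x1234);
    assert_eq!(read_index(&data, 2, 2), 0x5678);
}
```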
Then we apply a bit-packing scheme: since #115391 we now have unused bits on `DepKind`, so we use those unused bits (currently there are 7!) to store the 2 bits that we need for the byte width of the edges in each node, then use the remaining bits to store the length of the edge list, if it fits.

r? @nnethercote
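For illustration, a sketch of how such a header could pack into 16 bits, using the bit budget the description implies (9 bits of kind leaving 7 unused; the exact widths and layout are assumptions, not the compiler's):

```rust
const KIND_BITS: u32 = 9;   // assumed: enough for every DepKind variant
const WIDTH_BITS: u32 = 2;  // bytes-per-index minus one (1..=4 bytes)
const LEN_BITS: u32 = 16 - KIND_BITS - WIDTH_BITS; // inline length, if it fits

fn pack_header(kind: u16, bytes_per_index: u16, len: usize) -> u16 {
    debug_assert!(kind < (1 << KIND_BITS));
    debug_assert!((1..=4).contains(&bytes_per_index));
    // A length too large for LEN_BITS would be stored out of line;
    // this sketch only handles the inline case.
    debug_assert!(len < (1 << LEN_BITS));
    kind | ((bytes_per_index - 1) << KIND_BITS) | ((len as u16) << (KIND_BITS + WIDTH_BITS))
}

fn unpack_header(header: u16) -> (u16, u16, usize) {
    let kind = header & ((1 << KIND_BITS) - 1);
    let bytes_per_index = ((header >> KIND_BITS) & ((1 << WIDTH_BITS) - 1)) + 1;
    let len = (header >> (KIND_BITS + WIDTH_BITS)) as usize;
    (kind, bytes_per_index, len)
}

fn main() {
    let header = pack_header(42, 3, 17);
    assert_eq!(unpack_header(header), (42, 3, 17));
}
```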