-
Notifications
You must be signed in to change notification settings - Fork 40k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: use named array instead of array in normalizing score #80901
feat: use named array instead of array in normalizing score #80901
Conversation
/retest |
/hold |
/cc @liu-cong |
@alculquicondor: GitHub didn't allow me to request PR reviews from the following users: liu-cong. Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
I sent this PR to start a discussion about normalizing score phase as we met problems when transforming the original priority functions into the scheduling framework. Feel free to leave comments and opinions in this PR. |
/assign |
This does involve an interface change, please document it in Does this PR introduce a user-facing change?. This seems to be a reasonable change. I cannot think of any objections to it besides more memory usage. I will wait for opinions from others. |
It's not clear to me why is the change needed. Normalization should be totally independent from the node information. |
Ideally, normalization should be totally independent of the node information, many of the priority functions now indeed do not use node name in Currently, the reduce phase in priority function is similar to the normalizing phase in the scheduling framework, these reduce function accepts a list of If you checked out the priority functions with reducing phase, some of them use And we need to provide node names in the normalizing phase because of the following reasons:
klog.Infof("%v -> %v: EvenPodsSpreadPriority, Score: (%d)", pod.Name, nodeName, *score)
klog.Infof("%v -> %v: InterPodAffinityPriority, Score: (%d)", pod.Name, node.Name, int(fScore))
"%v -> %v: SelectorSpreadPriority, Score: (%d)", pod.Name, result[i].Host, int(fScore), InterPodAffinity and EvenPodsSpread priorities are using deprecated pattern and quite different from the other priority function, in this WIP PR #80898, I tried to refactor it with the scheduling framework and met this problem. Even if we could find another approach to refactor them without node name, Based on the reasons above, I'd like to add node name to normalizing score phase. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you elaborate why you are recommending a map over a "named array"?
score, status := pl.Score(pc, pod, nodes[index].Name) | ||
if !status.IsSuccess() { | ||
errCh.SendErrorWithCancel(fmt.Errorf(status.Message()), cancel) | ||
return | ||
} | ||
pluginToNodeScoreMap[pl.Name()][index] = score | ||
pluginToNodeScoreMap[pl.Name()][nodeName] = score |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suspect you need to synchronize access when writing to the map (a pre-allocated slice doesn't) since this practically adds a new key-value to it.
I don't have a preference. What do you think of these two options? |
I think it is better to follow the sig-arch recommendation. |
I'll make the change. |
b152920
to
68dc5da
Compare
/retest
…On Aug 8, 2019, 12:19 AM +0800, Kubernetes Prow Robot ***@***.***>, wrote:
@draveness: The following test failed, say /retest to rerun them all:
Test name
Commit
Details
Rerun command
pull-kubernetes-integration
58dda26
link
/test pull-kubernetes-integration
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub, or mute the thread.
|
/lgtm |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@draveness I may miss something, but my impression is that this PR still doesn't resolve the core issue of the deprecated priorities - which is: the nodes
slice here are the ones that passed Predicates. So the Score()
function still doesn't have the handle of all nodes info?
// Run the Score plugins.
scoresMap, scoreStatus := framework.RunScorePlugins(pluginContext, pod, nodes)
(The nodes
come from one of the parameters of PrioritizeNodes()
)
for _, scoreList := range scoresMap { | ||
for i := range nodes { | ||
result[i].Score += scoreList[i] | ||
for _, nodeScoreList := range scoresMap { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this be merged into previous for i := range nodes
loop, after L806?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch, Done
58dda26
to
aa5f9fd
Compare
We could get nodes from the
/hold cancel I'll cancel the hold |
Looks good.
Thanks. Let's see if it works. |
/retest |
@@ -31,10 +31,16 @@ import ( | |||
type Code int | |||
|
|||
// NodeScoreList declares a list of nodes and their scores. | |||
type NodeScoreList []int | |||
type NodeScoreList []NodeScore |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wouldn't it be better to use a map from node names to node score (map[string]int
) instead of NodeScoreList
? With the current data structure, finding the score of a node requires iterating over a NodeScoreList.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wouldn't it be better to use a map from node names to node score (
map[string]int
) instead ofNodeScoreList
? With the current data structure, finding the score of a node requires iterating over a NodeScoreList.
I don't have a preference here
I recalled that in the API convention documents, kubernetes prefer the named array to map. I could go with the second option if it is the convention we must follow here, but I prefer the first for now. What do you think of this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have to think carefully about this question: "Do we need to find score of a particular node in our code?" If the answer is no, we should definitely go with the current List instead of a map. Lists are faster to access compared to maps if we don't need to search for something.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have to think carefully about this question: "Do we need to find score of a particular node in our code?"
+1
If the answer is no, we should definitely go with the current List instead of a map. Lists are faster to access compared to maps if we don't need to search for something.
For the implementation of InterPodAffinity and EvenPodSpread, we don't need to find the score of a particular node in NormalizeScore, and we only use node name in the named array (or map) to find score from the internal data structure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unless we have a strong reason to use map (which I doubt we will), my recommendation is to use a list.
for i := range nodes { | ||
result[i].Score += scoreList[i] | ||
for j := range scoresMap { | ||
result[i].Score += scoresMap[j][i].Score |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suspect that the original code will be faster because it has better cache locality when reading scores from the scoresMap.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need more evidence on this, before that I prefer to keep it this way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In theory sequential (or strided) reads are the fastest form of memory access patterns because of prefetching and cache locality.
The other thing in this nested loop that could make it slower is that it makes n*m map lookups (n=number of nodes, m = number of priority functions/score plugins), while if we make the node loop inside, the number of scoresMap lookups will be zero. So it might be even faster (and easier to read) to flip the whole thing like this:
for i := range nodes {
result = append(result, schedulerapi.HostPriority{Host: nodes[i].Name, Score: 0})
}
for j, priorityList := range priorityConfigs {
for i := range nodes {
result[i].Score += priorityList[i].Score * priorityConfigs[j].Weight
}
}
for _, scoreList := range scoresMap {
for i := range nodes {
result[i].Score += scoreList[i].Score
}
}
We need more evidence on this, before that I prefer to keep it this way.
The theory says this change makes things slower, so if you don't have evidence to the contrary, then we shouldn't make the change, especially that it is not related to the purpose of the PR "using named array in normalize scores".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In theory sequential (or strided) reads are the fastest form of memory access patterns because of prefetching and cache locality.
I know the theory of the cache locality and here are the benchmark numbers:
package main
import "testing"
func locality(result []int, scoresMap map[int][]int, anotherScoresMap map[int][]int) {
for _, scores := range scoresMap {
for i := range scores {
result[i] = result[i] + scores[i]
}
}
for _, scores := range anotherScoresMap {
for i := range scores {
result[i] = result[i] + scores[i]
}
}
}
func nonlocality(result []int, scoresMap map[int][]int, anotherScoresMap map[int][]int) {
for i := range result {
for j := range scoresMap {
result[i] = result[i] + scoresMap[j][i]
}
for j := range anotherScoresMap {
result[i] = result[i] + anotherScoresMap[j][i]
}
}
}
func BenchmarkLocality(b *testing.B) {
result := make([]int, 5000)
scores := make(map[int][]int, 10)
anotherScores := make(map[int][]int, 10)
for i := 0; i < 10; i++ {
scores[i] = make([]int, 5000)
anotherScores[i] = make([]int, 5000)
for j := 0; j < 5000; j++ {
scores[i][j] = j
anotherScores[i][j] = j
}
}
b.ResetTimer()
for i := 0; i < b.N; i++ {
locality(result, scores, anotherScores)
}
}
func BenchmarkNonLocality(b *testing.B) {
result := make([]int, 5000)
scores := make(map[int][]int, 10)
anotherScores := make(map[int][]int, 10)
for i := 0; i < 10; i++ {
scores[i] = make([]int, 5000)
anotherScores[i] = make([]int, 5000)
for j := 0; j < 5000; j++ {
scores[i][j] = j
anotherScores[i][j] = j
}
}
b.ResetTimer()
for i := 0; i < b.N; i++ {
locality(result, scores, anotherScores)
}
}
$ go test -bench=.
goos: darwin
goarch: amd64
pkg: github.com/golang
BenchmarkLocality-8 30000 44607 ns/op
BenchmarkNonLocality-8 30000 44245 ns/op
PASS
According to the benchmarks, I prefer to keep it this way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Strange, it is either golang's compiler is so good, or new architectures have such powerful memory controllers. I would like to do the same test in C++ to see if I get different results.
Do you have a theory why they are the same?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The thing is that the number of priorities (j
values) is very small, and so is the number of nodes. We are only talking about ~200kb, which can fit in the caches pretty easily.
Minor note in the posted benchmark: it should be map[string][]int
. but again, we are only doing lookups among 10 values.
On the other hand, I don't understand the motivation to do the change. I'm not going to hold the review.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I re ran the test and set the sizes to two orders of magnitude bigger, no difference.
The issue about cache is not only about size, but also what you move to the cache (how much of the cache line being moved from memory is actually used), so it is about effective memory bandwidth.
@@ -31,10 +31,16 @@ import ( | |||
type Code int | |||
|
|||
// NodeScoreList declares a list of nodes and their scores. | |||
type NodeScoreList []int | |||
type NodeScoreList []NodeScore |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unless we have a strong reason to use map (which I doubt we will), my recommendation is to use a list.
Hi @bsalamat, do you have any thoughts on this? |
/lgtm |
Hi @ahg-g, can I get approval on this or I have to bother Bobby? |
/approve |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: ahg-g, draveness The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/retest
…On Aug 14, 2019, 2:18 PM +0800, Kubernetes Prow Robot ***@***.***>, wrote:
@draveness: The following test failed, say /retest to rerun them all:
Test name
Commit
Details
Rerun command
pull-kubernetes-integration
aa5f9fd
link
/test pull-kubernetes-integration
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub, or mute the thread.
|
/retest |
/kind feature
/sig scheduling
/priority important-soon
/assign @bsalamat @ahg-g
What this PR does / why we need it:
I was refactoring the interpod-affinity priority with scheduling framework and found out it is quite hard to connect the node which passing to the
Score
plugin to theNodeScoreList
passing to theNormalizeScore
with index.The priority functions are using a HostPriority which contains node name and score, it would be better to provide the node name information in normalizing score phase.
We have two options for optimising this:
nodeName -> score
map, which this PR try to introduceHostPriority
liked struct, ex:I recalled that in the API convention documents, kubernetes prefer the named array to map. I could go with the second option if it is the convention we must follow here, but I prefer the first for now. What do you think of this? @bsalamat @ahg-g
Which issue(s) this PR fixes:
ref: #80667
Does this PR introduce a user-facing change?:
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: