TopologyAwareHints: Take lock in HasPopulatedHints #118189
Conversation
This issue is currently awaiting triage. If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Hi @Miciah. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Was this not fixed by #117249? If not, can we add a test like the one I suggested in #117249 (comment) to cover the regression?
It was not, but thanks for pointing me to that change! I was able to reproduce the race condition by modifying TestTopologyCacheRace:

Patch for TestTopologyCacheRace:

--- a/pkg/controller/endpointslice/topologycache/topologycache_test.go
+++ b/pkg/controller/endpointslice/topologycache/topologycache_test.go
@@ -686,6 +686,9 @@ func TestTopologyCacheRace(t *testing.T) {
 			go func() {
 				cache.AddHints(sliceInfo)
 			}()
+			go func() {
+				cache.HasPopulatedHints(sliceInfo.ServiceKey)
+			}()
 		}
Test output

Is modifying the existing test all right, or should I add a separate test?
Force-pushed from 2e25130 to b322858.
Thanks @Miciah! /ok-to-test
/cc
Maybe it would be good to increase the chance by calling
Modifying it LGTM, duplicating the test too ... it is just to avoid regressions.
I mean, modify it if it covers both races; otherwise just duplicate it.
	return t.hasPopulatedHintsLocked(serviceKey)
}

func (t *TopologyCache) hasPopulatedHintsLocked(serviceKey string) bool {
The naming might be a bit misleading; what about hasPopulatedHintsWithoutLock or hasPopulatedHintsThreadUnsafe?
Are you sure? I see the naming pattern of foo (or Foo) that takes a lock and fooLocked that assumes a lock is held all over the Kubernetes codebase when I grep for func .*Locked(. Here's one example out of dozens:

kubernetes/staging/src/k8s.io/client-go/tools/cache/heap.go, lines 188 to 216 in decf1e1:
// AddIfNotPresent inserts an item, and puts it in the queue. If an item with
// the key is present in the map, no changes is made to the item.
//
// This is useful in a single producer/consumer scenario so that the consumer can
// safely retry items without contending with the producer and potentially enqueueing
// stale items.
func (h *Heap) AddIfNotPresent(obj interface{}) error {
	id, err := h.data.keyFunc(obj)
	if err != nil {
		return KeyError{obj, err}
	}
	h.lock.Lock()
	defer h.lock.Unlock()
	if h.closed {
		return fmt.Errorf(closedMsg)
	}
	h.addIfNotPresentLocked(id, obj)
	h.cond.Broadcast()
	return nil
}

// addIfNotPresentLocked assumes the lock is already held and adds the provided
// item to the queue if it does not already exist.
func (h *Heap) addIfNotPresentLocked(key string, obj interface{}) {
	if _, exists := h.data.items[key]; exists {
		return
	}
	heap.Push(h.data, &itemKeyValue{key, obj})
}
I see one match for func.*WithoutLock, but it actually takes a lock:

kubernetes/vendor/github.com/peterbourgon/diskv/diskv.go, lines 546 to 551 in decf1e1:
// cacheWithoutLock acquires the store's (write) mutex and calls cacheWithLock.
func (d *Diskv) cacheWithoutLock(key string, val []byte) error {
	d.mu.Lock()
	defer d.mu.Unlock()
	return d.cacheWithLock(key, val)
}
I see only a couple of matches for func.*ThreadUnsafe, e.g.:

kubernetes/pkg/kubelet/util/manager/watch_based_manager.go, lines 66 to 82 in decf1e1:
func (i *objectCacheItem) stop() bool {
	i.lock.Lock()
	defer i.lock.Unlock()
	return i.stopThreadUnsafe()
}

func (i *objectCacheItem) stopThreadUnsafe() bool {
	if i.stopped {
		return false
	}
	i.stopped = true
	close(i.stopCh)
	if !i.immutable {
		i.store.unsetInitialized()
	}
	return true
}
In sum:

* fooWithoutLock seems to mean the opposite of what we want here.
* fooThreadUnsafe is unusual but has precedent.
* fooLocked seems to be the prevailing pattern.

However, if you strongly prefer hasPopulatedHintsThreadUnsafe, I'll make the change. Let me know if you indeed want me to do that.
You are right, the prevailing pattern is fooLocked. It just seems odd to me that we describe a precondition of the function in its name instead of the intent of what the function does. We also have the fooUnlocked variant:

kubernetes/pkg/scheduler/internal/queue/scheduling_queue.go, lines 856 to 864 in 2eb4eac:
func (npm *nominator) DeleteNominatedPodIfExists(pod *v1.Pod) {
	npm.lock.Lock()
	npm.deleteNominatedPodIfExistsUnlocked(pod)
	npm.lock.Unlock()
}

func (npm *nominator) deleteNominatedPodIfExistsUnlocked(pod *v1.Pod) {
	npm.delete(pod)
}
So we have no real consistency in the codebase, and it is probably best to always check the implementation. Since fooLocked is used the most, I am fine with doing it the old way, even though semantically fooThreadUnsafe would be better.
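For readers following the thread, here is a minimal sketch of the convention the discussion converges on, applied to the method in question. The lock and hintsPopulatedByService names come from the PR description; the field types shown here are assumptions made only for illustration, not the actual implementation:

```go
package topologycache

import "sync"

// TopologyCache is a minimal stand-in for the real type; only the fields
// relevant to this thread are shown, and their exact types are assumptions.
type TopologyCache struct {
	lock                    sync.Mutex
	hintsPopulatedByService map[string]struct{}
}

// HasPopulatedHints takes the cache's lock and then delegates to the
// lock-assuming helper, following the foo/fooLocked convention.
func (t *TopologyCache) HasPopulatedHints(serviceKey string) bool {
	t.lock.Lock()
	defer t.lock.Unlock()
	return t.hasPopulatedHintsLocked(serviceKey)
}

// hasPopulatedHintsLocked assumes t.lock is already held by the caller.
func (t *TopologyCache) hasPopulatedHintsLocked(serviceKey string) bool {
	_, ok := t.hintsPopulatedByService[serviceKey]
	return ok
}
```

Naming-wise, the Locked suffix describes the precondition (the caller already holds the lock) rather than what the helper does, which is exactly the tension discussed above.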
	t.setHintsLocked(serviceKey, addrType, allocatedHintsByZone)
}

func (t *TopologyCache) setHintsLocked(serviceKey string, addrType discovery.AddressType, allocatedHintsByZone EndpointZoneInfo) {
+here
/priority important-soon
@Miciah, can you please rebase the PR?
Prevent potential concurrent map access by taking a lock before reading the topology cache's hintsPopulatedByService map.

* staging/src/k8s.io/endpointslice/topologycache/topologycache.go (setHintsLocked, hasPopulatedHintsLocked): New helper functions. These are the same as the existing SetHints and HasPopulatedHints methods except that these helpers assume that a lock is already held. (SetHints): Use setHintsLocked. (HasPopulatedHints): Take a lock and use hasPopulatedHintsLocked. (AddHints): Take a lock and use setHintsLocked and hasPopulatedHintsLocked.
* staging/src/k8s.io/endpointslice/topologycache/topologycache_test.go (TestTopologyCacheRace): Add a goroutine that calls HasPopulatedHints.
Force-pushed from b322858 to 43f8ccf.
Rebased for #118953.
Seems this is a flake.
Thanks @Miciah! Would like to get another set of eyes on this for final LGTM in case I'm missing anything here. /cc @aojea @swetharepakula
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Miciah, robscott

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment.
/lgtm
LGTM label has been added. Git tree hash: ed106aa738293b299a1b366e75e2874e9990fc26
…89-origin-release-1.26 Automated cherry pick of #118189: TopologyAwareHints: Take lock in HasPopulatedHints
…89-origin-release-1.28 Automated cherry pick of #118189: TopologyAwareHints: Take lock in HasPopulatedHints
…89-origin-release-1.27 Automated cherry pick of #118189: TopologyAwareHints: Take lock in HasPopulatedHints
What type of PR is this?

/kind bug

What this PR does / why we need it:

Prevent potential concurrent map access by taking a lock before reading the topology cache's hintsPopulatedByService map.

* staging/src/k8s.io/endpointslice/topologycache/topologycache.go (setHintsLocked, hasPopulatedHintsLocked): New helper functions. These are the same as the existing SetHints and HasPopulatedHints methods except that these helpers assume that a lock is already held. (SetHints): Use setHintsLocked. (HasPopulatedHints): Take a lock and use hasPopulatedHintsLocked. (AddHints): Take a lock and use setHintsLocked and hasPopulatedHintsLocked (see the sketch after this list).
* staging/src/k8s.io/endpointslice/topologycache/topologycache_test.go (TestTopologyCacheRace): Add a goroutine that calls HasPopulatedHints.
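The following is a rough, self-contained sketch of how the helpers described above might compose. Field and parameter types are assumptions, and the addrType discovery.AddressType parameter from the real signatures is omitted for brevity; this is not the actual implementation. The point is that AddHints takes the lock exactly once and then calls only the lock-assuming helpers, since calling the public, lock-taking methods from inside AddHints would deadlock on a non-reentrant sync.Mutex:

```go
package topologycache

import "sync"

// EndpointZoneInfo stands in for the real type; its shape here is an
// assumption made only for illustration.
type EndpointZoneInfo map[string]float64

// TopologyCache is a simplified stand-in; the field names follow the PR
// description, but the field types are assumptions.
type TopologyCache struct {
	lock                    sync.Mutex
	hintsByService          map[string]EndpointZoneInfo
	hintsPopulatedByService map[string]struct{}
}

// NewTopologyCache returns a cache with its maps initialized.
func NewTopologyCache() *TopologyCache {
	return &TopologyCache{
		hintsByService:          map[string]EndpointZoneInfo{},
		hintsPopulatedByService: map[string]struct{}{},
	}
}

// setHintsLocked assumes t.lock is already held by the caller.
func (t *TopologyCache) setHintsLocked(serviceKey string, hints EndpointZoneInfo) {
	t.hintsByService[serviceKey] = hints
	t.hintsPopulatedByService[serviceKey] = struct{}{}
}

// hasPopulatedHintsLocked assumes t.lock is already held by the caller.
func (t *TopologyCache) hasPopulatedHintsLocked(serviceKey string) bool {
	_, ok := t.hintsPopulatedByService[serviceKey]
	return ok
}

// SetHints is the public, lock-taking entry point.
func (t *TopologyCache) SetHints(serviceKey string, hints EndpointZoneInfo) {
	t.lock.Lock()
	defer t.lock.Unlock()
	t.setHintsLocked(serviceKey, hints)
}

// AddHints takes the lock exactly once and then uses only the *Locked
// helpers; calling the public SetHints or HasPopulatedHints here instead
// would deadlock, because sync.Mutex is not reentrant.
func (t *TopologyCache) AddHints(serviceKey string, hints EndpointZoneInfo) bool {
	t.lock.Lock()
	defer t.lock.Unlock()
	t.setHintsLocked(serviceKey, hints)
	return t.hasPopulatedHintsLocked(serviceKey)
}
```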
Which issue(s) this PR fixes:
Fixes #118188.
Special notes for your reviewer:
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: