Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

lint: add check to ensure mutex Unlock() calls are deferred #105366

Closed
abarganier opened this issue Jun 22, 2023 · 6 comments · Fixed by #107577
Closed

lint: add check to ensure mutex Unlock() calls are deferred #105366

abarganier opened this issue Jun 22, 2023 · 6 comments · Fixed by #107577
Assignees
Labels
A-linters A-observability-inf C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) db-cy-23 O-postmortem Originated from a Postmortem action item. quality-friday A good issue to work on on Quality Friday

Comments

@abarganier
Copy link
Contributor

abarganier commented Jun 22, 2023

Is your feature request related to a problem? Please describe.
We recently dealt with some deadlocks due to code written in this style:

func myFunc() error {
  // do some work...
  p.Lock()
  if err := somethingThatCanPanic(); err != nil {
    p.Unlock()
    return err
  }
  p.Unlock()
  // do some more work...
  return nil
}

A safer way to write this code would be something like:

func myFunc() error {
  // do some work...
  doRiskyWork := func() error {
    p.Lock()
    defer p.Unlock()
    return somethingThatCanPanic()
  }
  if err := doRiskyWork(); err != nil {
    return err
  }
  // do some more work...
  return nil
}

It would be nice to have a lint rule that enforces .Unlock() calls on mutexes use the defer keyword, to help avoid this pitfall in the future (and identify lingering cases of it).

We recognize that in some cases, using defer is not ideal, so we should ensure that programmers have the option to use lint:ignore for this rule. However, as the organization scales, we should set the default expectation to help avoid this pattern from being used in the future.

Jira issue: CRDB-29001

@abarganier abarganier added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) A-linters labels Jun 22, 2023
@abarganier abarganier self-assigned this Jun 22, 2023
@dhartunian dhartunian added the O-postmortem Originated from a Postmortem action item. label Jun 22, 2023
@tbg
Copy link
Member

tbg commented Jul 7, 2023

This would've caught #106078 (comment)

@tbg
Copy link
Member

tbg commented Jul 11, 2023

Returning to this issue due to #106568. Just reiterating that it could be high-value to introduce this lint since it might point us at the location of the replica mutex leak referred to in that issue.

@tbg
Copy link
Member

tbg commented Jul 11, 2023

re: the UX that results from the linter, it is often cumbersome to introduce anonymous functions to guide the defer calls. Another approach could be the following:

// package syncutil
type IdempotentUnlocker struct{mu interface{ Unlock() }}

func (iu *IdempotentUnlocker) IdempotentUnlock() {
  if iu.mu == nil { return }
  iu.mu.Unlock()
  iu.mu = nil
}

func Lock(mu *sync.Mutex) IdempotentUnlocker() {
  mu.Lock()
  return IdempotentUnlocker{mu}
}

// something similar for RLock
func (r *Replica) foo() {
  mu := syncutil.Lock(r.mu)
  defer mu.IdempotentUnlock()

  if riskyWorkFailed() {
    return nil
  }

  mu.IdempotentUnlock()
  // do some more work outside of crit section
}

I think that one will not allocate (have to check though) so it could be suitable for hot code paths as well.

@tbg tbg added the quality-friday A good issue to work on on Quality Friday label Jul 20, 2023
@exalate-issue-sync exalate-issue-sync bot removed db-cy-23 quality-friday A good issue to work on on Quality Friday labels Jul 20, 2023
@exalate-issue-sync exalate-issue-sync bot added db-cy-23 quality-friday A good issue to work on on Quality Friday labels Jul 20, 2023
@abarganier
Copy link
Contributor Author

@tbg the team is picking this up as part of Code Yellow. Thanks for the pointers above 🙏 we will post updates here!

@srosenberg
Copy link
Member

srosenberg commented Jul 24, 2023

I wrote a crude linter using revive framework.

The linter performs super naive flow propagation to track if a lock is held during function calls. It doesn't perform a proper data flow propagation (needed for precision), nor is it aware of control-flow reachability. Instead, it uses an exclusion list (see prettyPrintLockExprs) of hand-vetted functions which don't panic. (Reachability analysis performed by govulncheck can be extended to do this automatically; I'll come back to it when I have more time.) The linter counts the number of function calls and ifStmts while a lock is held and outputs them, e.g.,

lockExpr: w.latency.Current.RecordValue
lockExpr: elapsed.Nanoseconds
lockExpr: z.zipfGenMu.r.Float64
lockExpr: math.Pow
found "Unlock" outside of defer at /Users/srosenberg/workspace/go/src/github.com/cockroachdb/cockroach/pkg/workload/ycsb/zipfgenerator.go:146:2
2 lockExprs, 3 ifStmts

The linter also has another mode which performs a crude analysis to determine if a lock is potentially leaked. It does this by maintaining a stack (using lexical order) and popping each time Unlock is seen (outside of defer). If the stack isn't empty on return, we have a potentially leaked lock, e.g.,

lock "ex.mu" may have leaked at /Users/srosenberg/workspace/go/src/github.com/cockroachdb/cockroach/pkg/sql/conn_executor.go:2927:4

Despite being super naive, it was able to find a recently fixed leak [1], as well as, some new ones. (I'll open separate issues for those [2].)

To run,

./revive -config ./test.toml $GOPATH/src/github.com/cockroachdb/cockroach/pkg/... > ~/unsafe_lock.txt

As you can imagine, this type of linter will produce a lot of false positives,

grep "found \"Unlock" ~/unsafe_lock.txt|wc -l
     273

A manual audit is probably still warranted. However, with a bit more work (e.g., reachability analysis to exclude all functions that don't panic), I am confident we could reduce the false positives to a more manageable size.

Attaching the output for both modes against master,

leaked_mutex.txt
unsafe_lock.txt

[1] e3c0767
[2] #107433

@Santamaura
Copy link
Contributor

Santamaura commented Jul 24, 2023

I wrote a crude linter using revive framework.

The linter performs super naive flow propagation to track if a lock is held during function calls. It doesn't perform a proper data flow propagation (needed for precision), nor is it aware of control-flow reachability. Instead, it uses an exclusion list (see prettyPrintLockExprs) of hand-vetted functions which don't panic. (Reachability analysis performed by govulncheck can be extended to do this automatically; I'll come back to it when I have more time.) The linter counts the number of function calls and ifStmts while a lock is held and outputs them, e.g.,

lockExpr: w.latency.Current.RecordValue
lockExpr: elapsed.Nanoseconds
lockExpr: z.zipfGenMu.r.Float64
lockExpr: math.Pow
found "Unlock" outside of defer at /Users/srosenberg/workspace/go/src/github.com/cockroachdb/cockroach/pkg/workload/ycsb/zipfgenerator.go:146:2
2 lockExprs, 3 ifStmts

The linter also has another mode which performs a crude analysis to determine if a lock is potentially leaked. It does this by maintaining a stack (using lexical order) and popping each time Unlock is seen (outside of defer). If the stack isn't empty on return, we have a potentially leaked lock, e.g.,

lock "ex.mu" may have leaked at /Users/srosenberg/workspace/go/src/github.com/cockroachdb/cockroach/pkg/sql/conn_executor.go:2927:4

Despite being super naive, it was able to find a recently fixed leak [1], as well as, some new ones. (I'll open separate issues for those [2].)

To run,

./revive -config ./test.toml $GOPATH/src/github.com/cockroachdb/cockroach/pkg/... > ~/unsafe_lock.txt

As you can imagine, this type of linter will produce a lot of false positives,

grep "found \"Unlock" ~/unsafe_lock.txt|wc -l
     273

A manual audit is probably still warranted. However, with a bit more work (e.g., reachability analysis to exclude all functions that don't panic), I am confident we could reduce the false positives to a more manageable size.

Attaching the output for both modes against master,

leaked_mutex.txt unsafe_lock.txt

[1] e3c0767 [2] #107433

@srosenberg this is great! So would the path forward on this issue be adding the reachability analysis, doing the manual audit, moving this logic into an analyzer so it can be used in CI?

Santamaura added a commit to Santamaura/cockroach that referenced this issue Jul 25, 2023
This commit adds a new pass which checks for
`...Unlock()` expressions without `defer` which
could result in a lock being held indefinitely.

Resolves cockroachdb#105366

Release note: None
Santamaura added a commit to Santamaura/cockroach that referenced this issue Jul 26, 2023
This commit adds a new pass which checks for
`...Unlock()` expressions without `defer` which
could result in a lock being held indefinitely.

Resolves cockroachdb#105366

Release note: None
Santamaura added a commit to Santamaura/cockroach that referenced this issue Jul 26, 2023
This commit adds a new pass which checks for
`...Unlock()` expressions without `defer` which
could result in a lock being held indefinitely.

Resolves cockroachdb#105366

Release note: None
Santamaura added a commit to Santamaura/cockroach that referenced this issue Jul 26, 2023
This commit adds a new pass which checks for
`...Unlock()` expressions without `defer` which
could result in a lock being held indefinitely.

Resolves cockroachdb#105366

Release note: None
Santamaura added a commit to Santamaura/cockroach that referenced this issue Aug 10, 2023
This commit adds a new pass which checks for
`...Unlock()` expressions without `defer` which
could result in a lock being held indefinitely.

Resolves cockroachdb#105366

Release note: None
andrewbaptist added a commit to andrewbaptist/cockroach that referenced this issue May 14, 2024
Extract a method to enable the use of deferred lock release.

Release note: None

Part of: cockroachdb#105366
andrewbaptist added a commit to andrewbaptist/cockroach that referenced this issue May 14, 2024
andrewbaptist added a commit to andrewbaptist/cockroach that referenced this issue May 14, 2024
Allows the use of deferred unlock.

Release note: None

Part of: cockroachdb#105366
andrewbaptist added a commit to andrewbaptist/cockroach that referenced this issue May 14, 2024
Allows the use of deferred unlock.

Release note: None

Part of: cockroachdb#105366
andrewbaptist added a commit to andrewbaptist/cockroach that referenced this issue May 14, 2024
Allows the use of deferred unlocks.

Release note: None

Part of: cockroachdb#105366
andrewbaptist added a commit to andrewbaptist/cockroach that referenced this issue May 14, 2024
andrewbaptist added a commit to andrewbaptist/cockroach that referenced this issue May 14, 2024
Extract into a method to allow the use of defer unlock.

Part of: cockroachdb#105366

Release note: None
andrewbaptist added a commit to andrewbaptist/cockroach that referenced this issue May 14, 2024
andrewbaptist added a commit to andrewbaptist/cockroach that referenced this issue May 14, 2024
andrewbaptist added a commit to andrewbaptist/cockroach that referenced this issue May 14, 2024
andrewbaptist added a commit to andrewbaptist/cockroach that referenced this issue May 14, 2024
This commit doesn't change the behavior of the method, but makes it more
explicit that the unlock calls will be specifically tied to the replicas
that are subsumed. This method intentionally "leaks" an unlocked raftMu,
therefore it is still necessary to add `nolint:deferunlockcheck`.

Part of: cockroachdb#105366

Release note: None
andrewbaptist added a commit to andrewbaptist/cockroach that referenced this issue May 14, 2024
As part of cockroachdb#107577 an exclusion was added to replica_raft.go. After this
PR this exclusion is no longer necessary.

Part of: cockroachdb#105366

Release note: None
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-linters A-observability-inf C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) db-cy-23 O-postmortem Originated from a Postmortem action item. quality-friday A good issue to work on on Quality Friday
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants