fix: TimeoutTicker returns wrong value/timeout pair when timeouts are… #67

Merged 1 commit on May 23, 2024

@PaddyMc PaddyMc commented May 23, 2024

… scheduled at ~approximately the same time (backport cometbft#3092) (cometbft#3107)

cometbft#3091

The problem is an edge case: we should drain the timer channel, but in certain race conditions, when two timeouts are scheduled close together, we "let it slide". This can produce unsafe timeout behavior, as demonstrated in the GitHub issue, and likely in other spots in consensus.

Notice that, aside from NewTimer and OnStop, all timer accesses come from the same goroutine. In NewTimer we can block until the timer is drained (which is very quick, up to goroutine scheduling). In OnStop we don't need to guarantee draining before the method returns; we can just send something into the channel that will kill it.

In the main timer goroutine, we can safely maintain this "timerActive" variable and force a drain when it's active. This removes the edge case.

The test I created does fail on main.


PR checklist

  • Tests written/updated
  • Changelog entry added in .changelog (we use unclog to manage our changelog)
  • Updated relevant documentation (docs/ or spec/) and code comments


---

#### PR checklist

- [X] Tests written/updated
- [x] Changelog entry added in `.changelog` (we use
[unclog](https://github.com/informalsystems/unclog) to manage our
changelog)
- [x] Updated relevant documentation (`docs/` or `spec/`) and code
comments
- [X] Title follows the [Conventional
Commits](https://www.conventionalcommits.org/en/v1.0.0/) spec

This is an automatic backport of pull request cometbft#3092 done by
[Mergify](https://mergify.com).

---------

Co-authored-by: Dev Ojha <[email protected]>
Co-authored-by: Sergio Mena <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
@PaddyMc PaddyMc added the S:backport/v25 backport to the osmo-v25/v0.37.4 branch label May 23, 2024
@ValarDragon ValarDragon merged commit 925143e into osmo/v0.37.4 May 23, 2024
17 of 18 checks passed