Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

storage: WAL failover secondary duration negative #136317

Closed
jbowens opened this issue Nov 27, 2024 · 0 comments
Closed

storage: WAL failover secondary duration negative #136317

jbowens opened this issue Nov 27, 2024 · 0 comments
Assignees
Labels
A-storage Relating to our storage engine (Pebble) on-disk storage. branch-master Failures and bugs on the master branch. branch-release-24.3 Used to mark GA and release blockers, technical advisories, and bugs for 24.3 C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-storage Storage Team

Comments

@jbowens
Copy link
Collaborator

jbowens commented Nov 27, 2024

In #136284, n2 died with a panic of a CrdbTestBuild assertion:

* ERROR: a panic has occurred!
* panic: Counters should not decrease, prev: 0, new: -1919127.
* (1) attached stack trace
*   -- stack trace:
*   | runtime.gopanic
*   | 	GOROOT/src/runtime/panic.go:770
*   | github.com/cockroachdb/cockroach/pkg/util/metric.(*Counter).Update
*   | 	pkg/util/metric/metric.go:775
*   | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*StoreMetrics).updateEngineMetrics
*   | 	pkg/kv/kvserver/metrics.go:3898
*   | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Store).computeMetricsLocked
*   | 	pkg/kv/kvserver/store.go:3616
*   | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Store).computeMetrics
*   | 	pkg/kv/kvserver/store.go:3600
*   | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Store).ComputeMetricsPeriodically
*   | 	pkg/kv/kvserver/store.go:3641
*   | github.com/cockroachdb/cockroach/pkg/server.(*Node).computeMetricsPeriodically.func1
*   | 	pkg/server/node.go:1214
*   | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Stores).VisitStores.func1
*   | 	pkg/kv/kvserver/stores.go:144
*   | github.com/cockroachdb/cockroach/pkg/util/syncutil.(*Map[...]).Range
*   | 	pkg/util/syncutil/map.go:391
*   | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Stores).VisitStores
*   | 	pkg/kv/kvserver/stores.go:143
*   | github.com/cockroachdb/cockroach/pkg/server.(*Node).computeMetricsPeriodically
*   | 	pkg/server/node.go:1213
*   | github.com/cockroachdb/cockroach/pkg/server.(*Node).startComputePeriodicMetrics.func2
*   | 	pkg/server/node.go:1105
*   | github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx.func2
*   | 	pkg/util/stop/stopper.go:498
*   | runtime.goexit
*   | 	src/runtime/asm_amd64.s:1695
* Wraps: (2) panic: Counters should not decrease, prev: 0, new: -1919127.
* Error types: (1) *withstack.withStack (2) *errutil.leafError

Jira issue: CRDB-44971

@jbowens jbowens added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-storage Relating to our storage engine (Pebble) on-disk storage. branch-master Failures and bugs on the master branch. T-storage Storage Team branch-release-23.2 Used to mark GA and release blockers, technical advisories, and bugs for 23.2 labels Nov 27, 2024
jbowens added a commit to jbowens/pebble that referenced this issue Nov 27, 2024
The WAL failover monitor tracks the duration spent writing to the primary and
secondary WAL directories. Previously it was possible for updates to these
durations to be negative (despite implicitly using the time.Time's monotonic
time), because the time was retrieved before acquiring the mutex guarding the
accumulated durations. A goroutine that measured the current time first was not
necessarily the first to enter the critical section and update the monitor's
lastAccumulateIntoDurations field.

This commit moves these time measurements under the mutex and adds an assertion
that the monitor's understanding of time moves monotonically forward.

Informs cockroachdb/cockroach#136317.
jbowens added a commit to cockroachdb/pebble that referenced this issue Nov 27, 2024
The WAL failover monitor tracks the duration spent writing to the primary and
secondary WAL directories. Previously it was possible for updates to these
durations to be negative (despite implicitly using the time.Time's monotonic
time), because the time was retrieved before acquiring the mutex guarding the
accumulated durations. A goroutine that measured the current time first was not
necessarily the first to enter the critical section and update the monitor's
lastAccumulateIntoDurations field.

This commit moves these time measurements under the mutex and adds an assertion
that the monitor's understanding of time moves monotonically forward.

Informs cockroachdb/cockroach#136317.
@jbowens jbowens added branch-release-24.3 Used to mark GA and release blockers, technical advisories, and bugs for 24.3 and removed branch-release-23.2 Used to mark GA and release blockers, technical advisories, and bugs for 23.2 labels Nov 27, 2024
jbowens added a commit to jbowens/pebble that referenced this issue Nov 27, 2024
The WAL failover monitor tracks the duration spent writing to the primary and
secondary WAL directories. Previously it was possible for updates to these
durations to be negative (despite implicitly using the time.Time's monotonic
time), because the time was retrieved before acquiring the mutex guarding the
accumulated durations. A goroutine that measured the current time first was not
necessarily the first to enter the critical section and update the monitor's
lastAccumulateIntoDurations field.

This commit moves these time measurements under the mutex and adds an assertion
that the monitor's understanding of time moves monotonically forward.

Informs cockroachdb/cockroach#136317.
jbowens added a commit to cockroachdb/pebble that referenced this issue Dec 3, 2024
The WAL failover monitor tracks the duration spent writing to the primary and
secondary WAL directories. Previously it was possible for updates to these
durations to be negative (despite implicitly using the time.Time's monotonic
time), because the time was retrieved before acquiring the mutex guarding the
accumulated durations. A goroutine that measured the current time first was not
necessarily the first to enter the critical section and update the monitor's
lastAccumulateIntoDurations field.

This commit moves these time measurements under the mutex and adds an assertion
that the monitor's understanding of time moves monotonically forward.

Informs cockroachdb/cockroach#136317.
jbowens added a commit to jbowens/cockroach that referenced this issue Dec 3, 2024
Changes:

 * [`a7d74870`](cockroachdb/pebble@a7d74870) wal: prevent negative durations

Release note: none.
Epic: none.
Fix cockroachdb#136317
@jbowens jbowens closed this as completed Dec 4, 2024
@github-project-automation github-project-automation bot moved this from Incoming to Done in [Deprecated] Storage Dec 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-storage Relating to our storage engine (Pebble) on-disk storage. branch-master Failures and bugs on the master branch. branch-release-24.3 Used to mark GA and release blockers, technical advisories, and bugs for 24.3 C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-storage Storage Team
Projects
No open projects
Archived in project
Development

No branches or pull requests

1 participant