grunning: `grunning.Time()` is non-monotonic #95529

irfansharif · 2023-01-19T18:38:21Z

Describe the problem

In #93952 we observed non-monotonicity in grunning.Time(). The problem is due to an accounting bug in the patched runtime we're now running with. To maintain the nanoseconds spent per-g in the _Grunning state, we want to intercept all transitions in and out of the _Grunning state.

When we transition into the _Grunning state, we maintain a timestamp on type g struct to record when we were scheduled:

cockroach/build/teamcity/internal/release/build-and-publish-patched-go/diff.patch

Lines 11 to 16 in f27fa9d

    
           +	now := nanotime() 
        
           +	if newval == _Grunning { 
        
           +		// We're transitioning into the running state, record the timestamp for 
        
           +		// subsequent use. 
        
           +		gp.lastsched = now 
        
           +	}

When we transition out of the Grunning state, we accumulate the time we ran in the most recent interval:

cockroach/build/teamcity/internal/release/build-and-publish-patched-go/diff.patch

Lines 18 to 21 in f27fa9d

    
            	if oldval == _Grunning { 
        
           +		// We're transitioning out of running, record how long we were in the 
        
           +		// state. 
        
           +		gp.runningnanos += now - gp.lastsched

As for surfacing how much time the calling goroutine has spent running, we use the last scheduled timestamp, accumulated counter when last descheduled, and a current monotonic clock reading, to get to the following:

cockroach/build/teamcity/internal/release/build-and-publish-patched-go/diff.patch

Lines 45 to 51 in f27fa9d

    
           +// grunningnanos returns the wall time spent by current g in the running state. 
        
           +// A goroutine may be running on an OS thread that's descheduled by the OS 
        
           +// scheduler, this time still counts towards the metric. 
        
           +func grunningnanos() int64 { 
        
           +	gp := getg() 
        
           +	return gp.runningnanos + nanotime() - gp.lastsched 
        
           +}

This works, except that the patch does not actually intercept all transitions in and out of the _Grunning state. Specifically, we're missing the following points in the Go scheduler code:

diff --git i/src/runtime/proc.go w/src/runtime/proc.go
diff --git i/src/runtime/proc.go w/src/runtime/proc.go
index 4f1bed23d2..76871ad52e 100644
--- i/src/runtime/proc.go
+++ w/src/runtime/proc.go
@@ -929,6 +929,9 @@ func casfrom_Gscanstatus(gp *g, oldval, newval uint32) {
                dumpgstatus(gp)
                throw("casfrom_Gscanstatus: gp->status is not in scan state")
        }
+       if newval == _Grunning {
+               gp.lastsched = nanotime()
+       }
        releaseLockRank(lockRankGscan)
 }

@@ -944,6 +947,9 @@ func castogscanstatus(gp *g, oldval, newval uint32) bool {
                        r := atomic.Cas(&gp.atomicstatus, oldval, newval)
                        if r {
                                acquireLockRank(lockRankGscan)
+                               if oldval == _Grunning {
+                                       gp.runningnanos += nanotime() - gp.lastsched
+                               }
                        }
                        return r

@@ -1065,6 +1071,7 @@ func casGToPreemptScan(gp *g, old, new uint32) {
        }
        acquireLockRank(lockRankGscan)
        for !atomic.Cas(&gp.atomicstatus, _Grunning, _Gscan|_Gpreempted) {
+               gp.runningnanos += nanotime() - gp.lastsched
        }
 }

With it, the reproduction below no longer reproduces the regression.

To Reproduce

With that branch, and the following assertion of monotonicity, it reliably fails after a few seconds when using dev test --stress pkg/sql/opt/exec/explain -f TestCPUTimeEndToEnd:

--- a/pkg/sql/colflow/stats.go
+++ b/pkg/sql/colflow/stats.go
@@ -23,6 +23,7 @@ import (
        "github.com/cockroachdb/cockroach/pkg/sql/execstats"
        "github.com/cockroachdb/cockroach/pkg/util/buildutil"
        "github.com/cockroachdb/cockroach/pkg/util/grunning"
+       "github.com/cockroachdb/cockroach/pkg/util/log"
        "github.com/cockroachdb/cockroach/pkg/util/mon"
        "github.com/cockroachdb/cockroach/pkg/util/syncutil"
        "github.com/cockroachdb/cockroach/pkg/util/timeutil"
@@ -123,7 +124,14 @@ func (bic *batchInfoCollector) Next() coldata.Batch {
        bic.stopwatch.Start()
        // Wrap the call to Next() with a panic catcher in order to get the correct
        // execution time (e.g. in the statement bundle).
+       startCPUTime := grunning.Time()
        err := colexecerror.CatchVectorizedRuntimeError(bic.next)
+       endCPUTime := grunning.Time()
+       if endCPUTime < startCPUTime {
+               log.Fatalf(context.TODO(), "non-monotonicity! endCPUTime (%d) < startCPUTime (%d) ",
+                       endCPUTime, startCPUTime,
+               )
+       }
        bic.stopwatch.Stop()
        if err != nil {
                colexecerror.InternalError(err)

Impact

Before updating the patch, and all the testing/{micro,macro}benchmarking it entails, lets assess the impact. Since we're missing interception points in and out of _Grunning, what could go wrong? Given grunning.Time() is powered by a grunningnanos(), which is gp.runningnanos + nanotime() - gp.lastsched, we can break this down to the following:

gp.runningnanos is maintained when leaving _Grunning. So it might be older/smaller than it actually is. So grunningnanos() can be an under estimate, and thus non-monotonic.
gp.lastsched is maintained when entering _Grunning. So it might be older/smaller than it actually is. So grunningnanos() can be an over-estimate. It might also lead to gp.runningnanos being higher than it's supposed to be (since we use gp.runningnanos += nanotime() - gp.lastsched to keep it up-to-date), so grunningnanos() is again an over-estimate.

If the analysis above holds, given our current uses of grunning.Time(), we could paper over the non-monotonicity (zero out differences between end/start time if the latter is smaller than the former), and document its inaccuracy.

+cc @DrewKimball, @kvoli, @sumeerbhola.

Jira issue: CRDB-23571

Epic CRDB-34249

The text was updated successfully, but these errors were encountered:

Elapsed returns the running time spent doing some piece of work, with grunning.Time() measurements from the start and end. NB: This only exists due to grunning.Time()'s non-monotonicity, a bug in our runtime patch: cockroachdb#95529. We can get rid of this, keeping just grunning.Difference(), if that bug is fixed. The bug results in slight {over,under}-estimation of the running time (the latter breaking monotonicity), but is livable with our current uses of this library. Release note: None

kvoli · 2023-01-19T20:00:45Z

I haven't noticed inconsistency with the current patch that has resulted in negative results for using replica cpu acct for rebalancing. This is over 100+ hrs of tests. Perhaps the error is less impactful over a large number of samples - or it doesn't matter that much for the use case.

The assumption that gp.lastsched should always be up to date when calling grunningnanos() confused me for a moment. It got me thinking, how bad could things really get - do we deschedule and schedule goroutines frequently enough that we should avoid large blips?

I'm fine with papering over for the rebalance usecase - if we are comfortable with the missing entry/exit points being infrequent

sumeerbhola · 2023-01-19T20:14:15Z

gp.runningnanos + nanotime() - gp.lastsched, we can break this down to the following:

gp.runningnanos is maintained when leaving _Grunning. So it might be older/smaller than it actually is. So grunningnanos() can be an under estimate, and thus non-monotonic.

IIUC, gp.runningnanos even though it can be an underestimate (missing some state transitions, so missing some additions) or overestimate (finally seeing a state transition out of running and using an old value of lastsched), is monotonic (its accumulating non-negative values). So the issue arises because we can have one reading that counts nanotime() - gp.lastsched but subsequently we miss the transition out of running, and see the transition into running, and so update gp.lastsched (without accounting for the running time), and thereby return a smaller value in a later computation. Is my understanding correct?

irfansharif · 2023-01-19T21:05:52Z

So the issue arises because we can have one reading that counts nanotime() - gp.lastsched but subsequently we miss the transition out of running, and see the transition into running, and so update gp.lastsched (without accounting for the running time), and thereby return a smaller value in a later computation. Is my understanding correct?

Yes, exactly. Take two calls to grunningnanos(), G1 and G2. Between G1 and G2 we transition out of _Grunning and back again, but miss the first transition. At G2, gp.runningnanos was was what it was at G1 because we missed the transition out of _Grunning. But since we did observe re-entering _Grunning, we updated gp.lastsched.

At G1 grunningnanos() = gp.runningnanosG1 + nanotimeG1() - gp.lastschedG1
At G2 grunningnanos() = gp.runningnanosG1 + nanotimeG2() - gp.lastschedG2
G2 - G1 = (nanotimeG2() - gp.lastschedG2) - (nanotimeG1() - gp.lastschedG1). Which is time since last scheduled. This is non-monotonic.

95518: roachtest/awsdms: add no pk full load test case r=Jeremyyang920 a=Jeremyyang920 This commit adds a new test case that will attempt a full load on a table with no primary key. We will assert that there is a table error with the full load. Fixes: #95328 Release note: None 95526: multitenant: append `WITH REPLICATION STATUS` columns to `SHOW TENANT` columns r=knz a=ecwall Informs #87851 `WITH CAPABILITIES` will be adding another set of columns so append relevant columns instead of creating all possible column sets ahead of time. Release note: None 95536: grunning: add grunning.Elapsed() r=irfansharif a=irfansharif Elapsed returns the running time spent doing some piece of work, with grunning.Time() measurements from the start and end. This only exists due to grunning.Time()'s non-monotonicity, a bug in our runtime patch: #95529. We can get rid of this, keeping just grunning.Difference(), if that bug is fixed. The bug results in slight {over,under}-estimation of the running time (the latter breaking monotonicity), but is livable with our current uses of this library. Release note: None Co-authored-by: Jeremy Yang <[email protected]> Co-authored-by: Evan Wall <[email protected]> Co-authored-by: irfan sharif <[email protected]>

grunning.Elapsed() is the API to use when measuring the running time spent doing some piece of work, with measurements from the start and end. This only exists due to grunning.Time()'s non-monotonicity, a bug in our runtime patch: cockroachdb#95529. The bug results in slight {over,under}-estimation of the running time (the latter breaking monotonicity), but is livable with our current uses of this library, including the one here in cpustopwatch. grunning.Elapsed() papers over this bug by 0-ing out when grunning.Time()stamps regress. This is unlike grunning.Difference() which would return the absolute value of the regression -- not what we want here. Release note: None

…95564 #95583 #95606 90222: server: add api for decommission pre-flight checks r=AlexTalks a=AlexTalks While we have an API for checking the status of an in-progress decommission, we did not previously have an API to execute sanity checks prior to requesting a node to move into the `DECOMMISSIONING` state. This adds an API to do just that, intended to be called by the CLI prior to issuing a subsequent `Decommission` RPC request. Fixes #91568. Release note: None 91458: roachtest/mixed-versions-compat: use corpus for master r=fqazi a=fqazi Informs: #91350 The CI build scripts take advantage of the branch name to uplaod the corpus to GCS. Unfortunately, we have no way of know if the current build is master inside the roachtest. To address this, this patch supports fetching the master corpus as a fallback. Release note: None 92826: multiregionccl: add a cold start latency test r=ajwerner a=ajwerner This commit adds a test which creates an MR serverless cluster and then boots the sql pods in each region while disallowing connectivity to other regions. It also simulates latency to make sure the routing logic works and to provide a somewhat realistic picture of what to expect. Epic: CRDB-18596 Release note: None 93758: server: evaluate decommission pre-checks r=kvoli a=AlexTalks This adds support for the evaluation of the decommission readiness of a node (or set of nodes), by simulating their liveness to have the DECOMMISSIONING status and utilizing the allocator to ensure that we are able to perform any actions needed to repair the range. This supports a "strict" mode, in which case we expect all ranges to only need replacement or removal due to the decommissioning status, or a more permissive "non-strict" mode, which allows for other actions needed, as long as they do not encounter errors in finding a suitable allocation target. The non-strict mode allows us to permit situations where a range may have more than one action needed to repair it, such as a range that needs to reach its replication factor before the decommissioning replica can be replaced, or a range that needs to finalize an atomic replication change. Depends on #94024. Part of #91568 95007: admission: CPU slot adjustment and utilization metrics r=irfansharif a=sumeerbhola Our existing metrics are gauges (total and used slots) which don't give us insight into what is happening at smaller time scales. This creates uncertainty when we observe admission queueing but the gauge samples show total slots consistenly greater than used slots. Additionally, if total slots is steady during queuing, it doesn't tell us whether that was because of roughly matching increments or decrements, or no increments/decrements. The following metrics are added: - admission.granter.slots_exhausted_duration.kv: cumulative duration when the slots were exhausted. This can give insight into how much exhaustion was occurring. It is insufficient to tell us whether 0.5sec/sec of exhaustion is due to a long 500ms of exhaustion and then non-exhaustion or alternating 1ms of exhaustion and non-exhaustion. But this is an improvement over what we have. - admission.granter.slot_adjuster_{increments,decrements}.kv: Counts the increments and decrements of the total slots. - admission.granter.cpu_load_{short,long}_period_duration.kv: cumulative duration of short and long ticks, as indicated by the period in the CPULoad callback. We don't expect long period ticks when admission control is active (and we explicitly disable enforcement during long period ticks), but it helps us eliminate some hypothesis during incidents (e.g. long period ticks alternating with short period ticks causing a slow down in how fast we increment slots). Additionally, the sum of the rate of these two, if significantly < 1, would indicate that CPULoad frequency is lower than expected, say due to CPU overload. Fixes #92673 Epic: none Release note: None 95145: sql/stats: include partial stats in results of statsCache.GetTableStats r=rytaft a=michae2 We were not including partial stats in the list of table statistics returned by `statsCache.GetTableStats`. This was fine for the optimizer, which currently cannot use partial stats directly, but it was a problem for backup. We'd like to use partial stats directly in the optimizer eventually, so this commit goes ahead and adds them to the results of `GetTableStats`. The optimizer then must filter them out. To streamline this we add some helper functions. Finally, in an act of overzealous refactoring, this commit also changes `MergedStatistics` and `ForecastTableStatistics` to accept partial statistics and full statistics mixed together in the same input list. This simplifies the code that calls these functions. Fixes: #95056 Part of: #93983 Epic: CRDB-19449 Release note: None 95387: kvserver: separate loadstats cpu nanos to raft/req r=andrewbaptist a=kvoli Previously, loadstats tracked replica raft/request cpu nanos per second separately but returned both summed together in `load.ReplicaLoadStats`. This patch separates `RaftCPUNanosPerSecond` and `RequestCPUNanosPerSecond` in the returned `load.ReplicaLoadStats` so that they may be used independently. Informs #95380 Release note: None 95557: sql: Fix testing_optimizer_disable_rule_probability usage with vtables r=cucaroach a=cucaroach If a vtable scan query tries to use the dummy "0" column the exec builder errors out, this typically won't happen thanks to prune columns normalization rules and those rules are marked as "essential" but the logic allowing those rules to be applied was flawed. Epic: CRDB-20535 Informs: #94890 Release note: None 95559: rpc: fix comment r=andreimatei a=andreimatei This copy-pasta comment was mentioning a KV node, which was not right. Release note: None Epic: None 95564: cpustopwatch: s/grunning.Difference/grunning.Elapsed r=irfansharif a=irfansharif `grunning.Elapsed()` is the API to use when measuring the running time spent doing some piece of work, with measurements from the start and end. This only exists due to `grunning.Time()`'s non-monotonicity, a bug in our runtime patch: #95529. The bug results in slight {over,under}-estimation of the running time (the latter breaking monotonicity), but is livable with our current uses of this library, including the one here in cpustopwatch. `grunning.Elapsed()` papers over this bug by 0-ing out when `grunning.Time()`stamps regress. This is unlike `grunning.Difference()` which would return the absolute value of the regression -- not what we want here. Release note: None 95583: sql: fix cluster setting propagation flake take 2 r=cucaroach a=cucaroach Previously we tried to fix this with one retry but that was insufficient. Extend it to all queries in this section of the test. Release note: None Epic: CRDB-20535 95606: backupccl: deflake TestScheduleChainingLifecycle r=adityamaru a=msbutler This patch will skip the test if the machine clock is close to midnight, and increases the frequency of incremental backups to run every 2 minutes. Previously, the backup schedule in this test used the following crontab recurrence: '*/5 * * * *'. In english, this means "run a full backup now, and then run a full backup every day at midnight, and an incremental every 5 minutes. This test also relies on an incremental backup running after the first full backup. But, what happens if the first full backup gets scheduled to run within 5 minutes of midnight? A second full backup may get scheduled before the expected incremental backup, breaking the invariant the test expects. Fixes #88575 #95186 Release note: None Co-authored-by: Alex Sarkesian <[email protected]> Co-authored-by: Faizan Qazi <[email protected]> Co-authored-by: Andrew Werner <[email protected]> Co-authored-by: sumeerbhola <[email protected]> Co-authored-by: Michael Erickson <[email protected]> Co-authored-by: Austen McClernon <[email protected]> Co-authored-by: Tommy Reilly <[email protected]> Co-authored-by: Andrei Matei <[email protected]> Co-authored-by: irfan sharif <[email protected]> Co-authored-by: Michael Butler <[email protected]>

In our previous version of this patch, we missed two entry points and one exit point into the grunning state of a Goroutine. This led to `grunning.Time()` being non-monotonic. This new patch adds those missing integration points. Fixes cockroachdb#95529. Release note: None

118907: build: update patch with bug fix to runtime grunning r=sumeerbhola,rickystewart a=aadityasondhi In our previous version of this patch, we missed two entry points and one exit point into the grunning state of a Goroutine. This led to `grunning.Time()` being non-monotonic. This new patch adds those missing integration points. Fixes #95529. Release note: None 119356: jobs: remove double-load from metrics poller r=msbutler a=stevendanna All but the last commit is #119355 Previously, this code loaded the job twice, once via LoadJobWithTxn and again via the call to Unpaused. Here, we re-arrange the code so that it only loads it once. Epic: none Release note: None 119480: opt: reduce allocations for filter implication with disjunctions r=mgartner a=mgartner This commit reduces allocations in `Implicator.FiltersImplyPredicate` by using stack-allocated arrays when there are five or less adjacent disjunctions in the filter or partial index predicate. This should cover the most common cases of filters and predicates with disjunctions. Epic: None Release note: None 119519: ttljob: skip TestSpanToQueryBoundsCompositeKeys under stress r=fqazi a=annrpom This patch skips TestSpanToQueryBoundsCompositeKeys under stress/stressrace as it often hits OOM issues under these conditions. Epic: none Release note: None Co-authored-by: Aaditya Sondhi <[email protected]> Co-authored-by: Steven Danna <[email protected]> Co-authored-by: Marcus Gartner <[email protected]> Co-authored-by: Annie Pompa <[email protected]>

This was fixed in cockroachdb#118907. This patch adds an assertion for test builds to ensure monotonic grunning time. Informs cockroachdb#95529 Release note: None

126424: kvadmission: remove grunning non-monotonic comment r=aadityasondhi a=aadityasondhi This was fixed in #118907. This patch adds an assertion for test builds to ensure monotonic grunning time. Informs #95529 Release note: None Co-authored-by: Aaditya Sondhi <[email protected]>

irfansharif added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label Jan 19, 2023

irfansharif mentioned this issue Jan 19, 2023

grunning: add grunning.Elapsed() #95536

Merged

irfansharif mentioned this issue Jan 20, 2023

cpustopwatch: s/grunning.Difference/grunning.Elapsed #95564

Merged

kvoli mentioned this issue Mar 6, 2023

grunning: determine whether grunning includes gc assist time #98110

Closed

irfansharif mentioned this issue Aug 24, 2023

admission,kvadmission: use tenant cpu consumption for inter-tenant fa… #108364

Merged

aadityasondhi self-assigned this Nov 9, 2023

aadityasondhi added A-admission-control T-admission-control Admission Control labels Dec 13, 2023

aadityasondhi mentioned this issue Feb 7, 2024

build: update patch with bug fix to runtime grunning #118907

Merged

craig bot closed this as completed in 6af173f Feb 22, 2024

aadityasondhi mentioned this issue Jun 28, 2024

kvadmission: remove grunning non-monotonic comment #126424

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

grunning: `grunning.Time()` is non-monotonic #95529

grunning: `grunning.Time()` is non-monotonic #95529

irfansharif commented Jan 19, 2023 •

edited by exalate-issue-sync bot

Loading

kvoli commented Jan 19, 2023 •

edited

Loading

sumeerbhola commented Jan 19, 2023

irfansharif commented Jan 19, 2023 •

edited

Loading

grunning: grunning.Time() is non-monotonic #95529

grunning: grunning.Time() is non-monotonic #95529

Comments

irfansharif commented Jan 19, 2023 • edited by exalate-issue-sync bot Loading

kvoli commented Jan 19, 2023 • edited Loading

sumeerbhola commented Jan 19, 2023

irfansharif commented Jan 19, 2023 • edited Loading

grunning: `grunning.Time()` is non-monotonic #95529

grunning: `grunning.Time()` is non-monotonic #95529

irfansharif commented Jan 19, 2023 •

edited by exalate-issue-sync bot

Loading

kvoli commented Jan 19, 2023 •

edited

Loading

irfansharif commented Jan 19, 2023 •

edited

Loading