storage: consistency checker queue continued to increase after workload stopped #28377

Closed
robert-s-lee opened this issue Aug 8, 2018 · 3 comments

@robert-s-lee
Contributor

Read through #21824; I want to confirm the expected behavior.

BUG REPORT

Created test case per #28376

  • Which version of CockroachDB are you using?

v2.1.0-alpha.20180730 on macOS

  • What did you do?

for i in `seq 10000`; do
  cockroach sql --insecure --database test -e "create database if not exists db${i}; use db${i}; create table if not exists test${i} (id int primary key); insert into test${i} values (1),(2),(3);"
done

Stopped the test after creating 8,000 databases.

  • What did you expect to see?

That, even on a small MacBook, the database would remain operational.

  • What did you see instead?

Checked for abnormal metrics and saw that the consistency checker queue continued to increase after the workload stopped (a small sketch for polling this metric outside the Admin UI follows the log excerpt below).

[two screenshots, 2018-08-08 ~9:10 AM: consistency checker queue metric graphs]

I180808 13:11:07.180262 175 storage/replica_proposal.go:222  [n1,s1,r22/1:/Table/5{1-2}] new range lease repl=(n1,s1):1 seq=4 start=1533733867.175762791,0 epo=1 pro=1533733867.175800343,0 following repl=(n3,s3):2 seq=3 start=1533647315.171325321,1 epo=1 pro=1533647316.120851806,0
I180808 13:11:07.464286 185 storage/replica_proposal.go:222  [n1,s1,r11463/1:/Table/1145{7-8}] new range lease repl=(n1,s1):1 seq=5 start=1533733867.461558025,0 epo=1 pro=1533733867.461561871,0 following repl=(n3,s3):2 seq=4 start=1533725275.949570639,0 epo=2 pro=1533725275.949596737,0
I180808 13:11:08.520167 142 storage/replica_proposal.go:222  [n1,s1,r5279/1:/Table/527{3-4}] new range lease repl=(n1,s1):1 seq=5 start=1533733868.516460494,0 epo=1 pro=1533733868.516464838,0 following repl=(n3,s3):2 seq=4 start=1533725275.949570639,0 epo=2 pro=1533725275.949596737,0
I180808 13:11:09.537130 145 storage/replica_proposal.go:222  [n1,s1,r5310/1:/Table/530{4-5}] new range lease repl=(n1,s1):1 seq=5 start=1533733869.533432123,0 epo=1 pro=1533733869.533435833,0 following repl=(n3,s3):2 seq=4 start=1533725275.949570639,0 epo=2 pro=1533725275.949596737,0
I180808 13:11:10.636281 183 storage/replica_proposal.go:222  [n1,s1,r12234/1:/Table/1222{8-9}] new range lease repl=(n1,s1):1 seq=5 start=1533733870.632955126,0 epo=1 pro=1533733870.632959801,0 following repl=(n3,s3):2 seq=4 start=1533725275.949570639,0 epo=2 pro=1533725275.949596737,0
I180808 13:11:15.481832 229 server/status/runtime.go:425  [n1] runtime stats: 606 MiB RSS, 173 goroutines, 311 MiB/1.2 GiB/1.8 GiB GO alloc/idle/total, 226 MiB/270 MiB CGO alloc/total, 190.00cgo/sec, 0.13/0.03 %(u/s)time, 0.00 %gc (1x)
I180808 13:11:16.042702 177 storage/replica_proposal.go:222  [n1,s1,r11195/1:/Table/111{89-90}] new range lease repl=(n1,s1):1 seq=5 start=1533733876.040034707,0 epo=1 pro=1533733876.040038989,0 following repl=(n3,s3):2 seq=4 start=1533725275.949570639,0 epo=2 pro=1533725275.949596737,0
I180808 13:11:18.227249 180 storage/replica_proposal.go:222  [n1,s1,r6809/1:/Table/680{3-4}] new range lease repl=(n1,s1):1 seq=5 start=1533733878.223059716,0 epo=1 pro=1533733878.223064142,0 following repl=(n3,s3):2 seq=4 start=1533725275.949570639,0 epo=2 pro=1533725275.949596737,0
I180808 13:11:20.962705 522080 rpc/nodedialer/nodedialer.go:90  [n1] connection to n2 established
I180808 13:11:22.432859 140 storage/replica_proposal.go:222  [n1,s1,r13805/1:/Table/13{799-800}] new range lease repl=(n1,s1):1 seq=5 start=1533733882.428899568,0 epo=1 pro=1533733882.428905510,0 following repl=(n3,s3):2 seq=4 start=1533725275.949570639,0 epo=2 pro=1533725275.949596737,0
I180808 13:11:23.440705 141 storage/replica_proposal.go:222  [n1,s1,r14739/1:/Table/1473{3-4}] new range lease repl=(n1,s1):1 seq=5 start=1533733883.436914841,0 epo=1 pro=1533733883.436919395,0 following repl=(n3,s3):2 seq=4 start=1533725275.949570639,0 epo=2 pro=1533725275.949596737,0
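
A minimal sketch for polling this metric outside the Admin UI, assuming an insecure node on the default HTTP port 8080 and assuming the pending count is exported at the Prometheus endpoint `/_status/vars` under the name `queue_consistency_pending` (both the port and the metric name are assumptions here, not taken from the report):

```go
// Sketch: poll the consistency checker queue's pending count from the
// node's Prometheus metrics endpoint once a minute.
// Assumptions: insecure node, default HTTP port 8080, metric exported
// at /_status/vars as `queue_consistency_pending`.
package main

import (
	"bufio"
	"fmt"
	"net/http"
	"strings"
	"time"
)

func main() {
	const url = "http://localhost:8080/_status/vars" // Prometheus-format metrics endpoint
	for {
		resp, err := http.Get(url)
		if err != nil {
			fmt.Println("fetch failed:", err)
		} else {
			scanner := bufio.NewScanner(resp.Body)
			for scanner.Scan() {
				line := scanner.Text()
				// Keep only the consistency checker queue's pending-count line.
				if strings.HasPrefix(line, "queue_consistency_pending") {
					fmt.Println(time.Now().Format(time.RFC3339), line)
				}
			}
			resp.Body.Close()
		}
		time.Sleep(time.Minute)
	}
}
```

Its output is one timestamped line per minute with the current pending count, which should trace the same curve as the graphs above.
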
  • What was the impact?
@robert-s-lee
Contributor Author

The consistency checker queue looks to be declining slowly. The decline seems to match the absence of leaseholder transfers (from the Range Operations graph). What is the relationship between the two metrics? Is this normal? Is there any downside to having a high consistency checker queue?

[two screenshots, 2018-08-08 ~10:24 AM: consistency checker queue and Range Operations graphs]

@tbg
Member

tbg commented Aug 8, 2018

I haven't checked the specific numbers, but the consistency checker works differently from most other queues. It tries to spread its load out over its full checking cycle (~24h), yet replicas are queued as soon as they become eligible for a check (after ~24h). So when lots of ranges are created around the same time, they all become eligible at roughly the same time, and the bump should only disperse after ~1 day. If you're seeing this in a cluster that has had an approximately constant number of ranges for multiple days, there's a problem; otherwise it's expected. From your description ("created 8k databases") it seems expected.
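
To make that shape concrete, here is a rough simulation of the behavior described above. It is not the actual storage-package code, just a toy model under assumptions stated in the comments: an ~24h check interval, a queue paced to cover every range once per cycle, and 8,000 ranges created at the same instant.

```go
// Toy model of the consistency checker queue behavior described above:
// each range becomes eligible for a check ~24h after its last check, but
// the queue paces itself so the work is spread over a full cycle.
// NOT the real implementation; just an illustration of why a batch of
// ranges created together produces a pending spike that drains over ~1 day.
package main

import (
	"fmt"
	"time"
)

func main() {
	const (
		numRanges = 8000           // ranges created at roughly the same time
		cycle     = 24 * time.Hour // assumed consistency check interval
		step      = time.Hour      // simulation resolution
	)

	// Pacing: cover all ranges once per cycle, i.e. ~numRanges/cycle checks
	// per unit of time.
	checksPerStep := int(int64(numRanges) * int64(step) / int64(cycle))

	// nextEligible[i] is when range i next becomes eligible for a check.
	// All ranges were created (and thus last "checked") at t=0.
	nextEligible := make([]time.Duration, numRanges)
	for i := range nextEligible {
		nextEligible[i] = cycle
	}

	for t := time.Duration(0); t <= 3*cycle; t += step {
		// Pending = eligible but not yet re-checked.
		var pending []int
		for i, e := range nextEligible {
			if e <= t {
				pending = append(pending, i)
			}
		}
		if int(t.Hours())%6 == 0 {
			fmt.Printf("t=%3.0fh pending=%d\n", t.Hours(), len(pending))
		}

		// The paced queue checks at most checksPerStep ranges this step;
		// each checked range becomes eligible again a full cycle later.
		for n := 0; n < checksPerStep && n < len(pending); n++ {
			nextEligible[pending[n]] = t + cycle
		}
	}
}
```

In this toy model the pending count jumps to 8,000 one cycle after creation, drains at roughly 330 ranges per hour over the following ~24h, and then settles at a small steady state, matching the bump-then-disperse shape described above.
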

@robert-s-lee
Contributor Author

@tschottdorf thank you for the info. Makes sense.
