
Test that qps doesn't dip when gracefully draining a node #23274

Closed

a-robinson opened this issue Mar 1, 2018 · 7 comments
Assignee: asubiotto
Labels: A-testing (Testing tools and infrastructure), C-enhancement (Solution expected to add code/behavior + preserve backward-compat; pg compat issues are exception)

Comments

@a-robinson (Contributor)

This is an important scenario that could really use some regression test coverage, as indicated by the fact that nobody noticed or followed up on #22573 until more than 4 months after 1.1 was released.

This seems like a good fit to be a workload test -- run something like kv with its -max-rate flag set, then gracefully stop a node and expect QPS to not dip more than a few percent below the specified -max-rate. If we wanted to make this extra rigorous, #23202 could be used to pin all leases on the node that we stop before we stop it.

cc @asubiotto
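A rough sketch of the core assertion such a test would make, assuming per-second QPS samples collected while the node drains (the names and thresholds here are illustrative, not actual roachtest code):

```go
package main

import "fmt"

// assertNoDip fails if any per-second QPS sample falls more than
// maxDipFrac below the rate the load generator was pinned to with
// -max-rate. How the samples get collected is left abstract here.
func assertNoDip(samples []float64, maxRate, maxDipFrac float64) error {
	floor := maxRate * (1 - maxDipFrac)
	for i, qps := range samples {
		if qps < floor {
			return fmt.Errorf("sample %d: qps %.0f fell below floor %.0f (max-rate %.0f, allowed dip %.0f%%)",
				i, qps, floor, maxRate, maxDipFrac*100)
		}
	}
	return nil
}

func main() {
	// Hypothetical per-second readings taken while a node was draining.
	samples := []float64{995, 991, 998, 720, 993}
	if err := assertNoDip(samples, 1000, 0.05); err != nil {
		fmt.Println("FAIL:", err) // the 720 sample exceeds the 5% dip budget
	}
}
```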

a-robinson added this to the 2.1 milestone Mar 1, 2018
@tbg (Member) commented Mar 1, 2018

+1, this seems pretty important. One requirement here is being able to programmatically obtain the load generator statistics (ideally while the load runs). I wonder if workload should export an HTTP interface for that. Or we can query the cluster statement statistics (this is nice because users should be able to access this information for their workload, too). Or workload could insert periodically into a statistics table that we can then query.
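As a sketch of that last option, the load generator could periodically write into a table the test polls (the workload_stats table, its schema, and the connection string are made up for illustration; CockroachDB speaks the Postgres wire protocol, hence lib/pq):

```go
package main

import (
	"database/sql"
	"log"
	"time"

	_ "github.com/lib/pq" // CockroachDB is wire-compatible with Postgres
)

// reportQPS periodically records the generator's observed QPS so the
// test harness can query it with plain SQL while the load is running.
func reportQPS(db *sql.DB, currentQPS func() float64) error {
	if _, err := db.Exec(
		`CREATE TABLE IF NOT EXISTS workload_stats (ts TIMESTAMP DEFAULT now(), qps FLOAT)`,
	); err != nil {
		return err
	}
	t := time.NewTicker(time.Second)
	defer t.Stop()
	for range t.C {
		if _, err := db.Exec(`INSERT INTO workload_stats (qps) VALUES ($1)`, currentQPS()); err != nil {
			log.Printf("stats insert failed: %v", err)
		}
	}
	return nil
}

func main() {
	db, err := sql.Open("postgres", "postgresql://root@localhost:26257/test?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	// Wire currentQPS up to the generator's real counter; stubbed here.
	log.Fatal(reportQPS(db, func() float64 { return 0 }))
}
```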

@petermattis (Collaborator)

Do you need cluster statement statistics, or access to some of the internal time series metrics? For a specific metric, the time series are already programmatically available (though the specific magic incantation is a bit involved).
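For reference, the incantation is roughly of this shape, assuming the /ts/query HTTP endpoint the admin UI uses; the metric name and JSON field names below are illustrative and may not match the exact proto:

```go
package main

import (
	"bytes"
	"fmt"
	"io/ioutil"
	"net/http"
	"time"
)

func main() {
	now := time.Now().UnixNano()
	// Ask for the last minute of one node-level SQL metric. The JSON shape
	// mirrors the TimeSeriesQueryRequest proto, simplified for illustration.
	body := fmt.Sprintf(`{
	  "start_nanos": %d,
	  "end_nanos": %d,
	  "queries": [{"name": "cr.node.sql.select.count"}]
	}`, now-int64(time.Minute), now)

	resp, err := http.Post("http://localhost:8080/ts/query",
		"application/json", bytes.NewBufferString(body))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	raw, _ := ioutil.ReadAll(resp.Body)
	fmt.Println(string(raw)) // datapoints come back as JSON
}
```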

@tbg (Member) commented Mar 1, 2018

I think we'd want to be able to run a few different load gens eventually, and some of them might be so low in qps that their dip could be shadowed by a faster one (say kv). Maybe statement statistics can do well enough for starters.

@petermattis (Collaborator)

> I think we'd want to be able to run a few different load gens eventually, and some of them might be so low in qps that their dip could be shadowed by a faster one (say kv).

Good point, though for an initial test a single QPS metric would suffice.

> Maybe statement statistics can do well enough for starters.

Yeah. I've forgotten the specifics of when these are reset and the info they contain, but it certainly seems possible they could work.
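A sketch of reading those from SQL, assuming the crdb_internal.node_statement_statistics virtual table (the column names here are from memory, so treat them as illustrative):

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/lib/pq"
)

func main() {
	db, err := sql.Open("postgres", "postgresql://root@localhost:26257/?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	// These stats are per-node and in-memory, and they reset periodically,
	// so a test would need to diff successive reads rather than trust
	// absolute counts.
	rows, err := db.Query(
		`SELECT application_name, key, count FROM crdb_internal.node_statement_statistics`)
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()
	for rows.Next() {
		var app, stmt string
		var count int64
		if err := rows.Scan(&app, &stmt, &count); err != nil {
			log.Fatal(err)
		}
		fmt.Printf("%s\t%d\t%s\n", app, count, stmt)
	}
}
```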

asubiotto self-assigned this Mar 6, 2018
knz added the A-testing and C-enhancement labels Jul 21, 2018
@nvanbenschoten (Member)

Most of this was addressed by #26542, which gracefully shuts down a third of a cluster and watches the QPS of kv. It asserts that QPS did not drop by more than 20%. Is there anything more to this issue that we should address for 2.1?

@petermattis (Collaborator)

> Is there anything more to this issue that we should address for 2.1?

I think that test is sufficient.

@a-robinson (Contributor, Author)

I don't think that test actually tests this? That test is basically:

  1. Run some load and measure the qps
  2. Stop the load
  3. Stop a node
  4. Run some load and measure the qps
  5. Stop the load
  6. Compare the qps results

It is explicitly not trying to measure how an ongoing load is affected by the process of draining a node, and almost certainly would not have found the various bugs in the node draining logic that motivated this issue.
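In other words, the sampling has to overlap the drain. A sketch of that shape, where sampleQPS and the drain command are stand-ins for whichever mechanism gets picked:

```go
package main

import (
	"fmt"
	"os/exec"
	"time"
)

// sampleQPS is a stand-in for any of the sources discussed above
// (timeseries endpoint, statement statistics, a workload stats table).
func sampleQPS() float64 { return 1000 }

func main() {
	done := make(chan struct{})
	samplesCh := make(chan []float64, 1)
	go func() { // sampler keeps running *through* the drain
		ticker := time.NewTicker(time.Second)
		defer ticker.Stop()
		var samples []float64
		for {
			select {
			case <-done:
				samplesCh <- samples
				return
			case <-ticker.C:
				samples = append(samples, sampleQPS())
			}
		}
	}()

	// Gracefully drain one node mid-run. `cockroach quit` was the graceful
	// shutdown command at the time; the host flag here is illustrative.
	time.Sleep(30 * time.Second) // let the load reach steady state first
	if err := exec.Command("cockroach", "quit", "--host=node3:26257").Run(); err != nil {
		fmt.Println("drain failed:", err)
	}
	time.Sleep(30 * time.Second) // keep sampling after the drain completes

	close(done)
	fmt.Printf("collected %d samples spanning the drain\n", len(<-samplesCh))
}
```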

a-robinson reopened this Jul 23, 2018
petermattis removed this from the 2.1 milestone Oct 5, 2018
a-robinson added a commit to a-robinson/cockroach that referenced this issue Dec 17, 2018
The test verifies that QPS isn't affected by a node being gracefully
drained and shut down.

Fixes cockroachdb#23274

Release note: None
craig bot pushed a commit that referenced this issue Dec 24, 2018
33188: roachtest: Add test of graceful draining during shutdown r=a-robinson a=a-robinson

The test verifies that QPS isn't affected by a node being gracefully
drained and shut down.

Fixes #23274

Release note: None

Co-authored-by: Alex Robinson <[email protected]>
craig bot closed this as completed in #33188 Dec 24, 2018
a-robinson added a commit to a-robinson/cockroach that referenced this issue Dec 24, 2018