
Test that qps doesn't dip when gracefully draining a node #23274

Closed

a-robinson opened this issue Mar 1, 2018 · 7 comments
Assignee: asubiotto
Labels: A-testing (Testing tools and infrastructure), C-enhancement (Solution expected to add code/behavior + preserve backward-compat; pg compat issues are exception)

Comments

@a-robinson (Contributor)

This is an important scenario that could really use some regression test coverage, as indicated by the fact that nobody noticed or followed up on #22573 until more than 4 months after 1.1 was released.

This seems like a good fit to be a workload test -- run something like kv with its -max-rate flag set, then gracefully stop a node and expect QPS to not dip more than a few percent below the specified -max-rate. If we wanted to make this extra rigorous, #23202 could be used to pin all leases on the node that we stop before we stop it.

cc @asubiotto
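A rough sketch of the core assertion such a test would make, assuming per-second QPS samples collected while the node drains (the names and thresholds here are illustrative, not actual roachtest code):

```go
package main

import "fmt"

// assertNoDip fails if any per-second QPS sample falls more than
// maxDipFrac below the rate the load generator was pinned to with
// -max-rate. How the samples get collected is left abstract here.
func assertNoDip(samples []float64, maxRate, maxDipFrac float64) error {
	floor := maxRate * (1 - maxDipFrac)
	for i, qps := range samples {
		if qps < floor {
			return fmt.Errorf("sample %d: qps %.0f fell below floor %.0f (max-rate %.0f, allowed dip %.0f%%)",
				i, qps, floor, maxRate, maxDipFrac*100)
		}
	}
	return nil
}

func main() {
	// Hypothetical per-second readings taken while a node was draining.
	samples := []float64{995, 991, 998, 720, 993}
	if err := assertNoDip(samples, 1000, 0.05); err != nil {
		fmt.Println("FAIL:", err) // the 720 sample exceeds the 5% dip budget
	}
}
```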

a-robinson added this to the 2.1 milestone Mar 1, 2018
@tbg (Member) commented Mar 1, 2018

+1, this seems pretty important. One requirement here is being able to programmatically obtain the load generator statistics (ideally while the load runs). I wonder if workload should export an HTTP interface for that. Or we can query the cluster statement statistics (this is nice because users should be able to access this information for their workload, too). Or workload could insert periodically into a statistics table that we can then query.
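As a sketch of that last option, the load generator could periodically write into a table the test polls (the workload_stats table, its schema, and the connection string are made up for illustration; CockroachDB speaks the Postgres wire protocol, hence lib/pq):

```go
package main

import (
	"database/sql"
	"log"
	"time"

	_ "github.com/lib/pq" // CockroachDB is wire-compatible with Postgres
)

// reportQPS periodically records the generator's observed QPS so the
// test harness can query it with plain SQL while the load is running.
func reportQPS(db *sql.DB, currentQPS func() float64) error {
	if _, err := db.Exec(
		`CREATE TABLE IF NOT EXISTS workload_stats (ts TIMESTAMP DEFAULT now(), qps FLOAT)`,
	); err != nil {
		return err
	}
	t := time.NewTicker(time.Second)
	defer t.Stop()
	for range t.C {
		if _, err := db.Exec(`INSERT INTO workload_stats (qps) VALUES ($1)`, currentQPS()); err != nil {
			log.Printf("stats insert failed: %v", err)
		}
	}
	return nil
}

func main() {
	db, err := sql.Open("postgres", "postgresql://root@localhost:26257/test?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	// Wire currentQPS up to the generator's real counter; stubbed here.
	log.Fatal(reportQPS(db, func() float64 { return 0 }))
}
```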

@petermattis (Collaborator)

Do you need cluster statement statistics, or access to some of the internal time series metrics? For a specific metric, the time series are already programmatically available (though the specific magic incantation is a bit involved).
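For reference, the incantation is roughly of this shape, assuming the /ts/query HTTP endpoint the admin UI uses; the metric name and JSON field names below are illustrative and may not match the exact proto:

```go
package main

import (
	"bytes"
	"fmt"
	"io/ioutil"
	"net/http"
	"time"
)

func main() {
	now := time.Now().UnixNano()
	// Ask for the last minute of one node-level SQL metric. The JSON shape
	// mirrors the TimeSeriesQueryRequest proto, simplified for illustration.
	body := fmt.Sprintf(`{
	  "start_nanos": %d,
	  "end_nanos": %d,
	  "queries": [{"name": "cr.node.sql.select.count"}]
	}`, now-int64(time.Minute), now)

	resp, err := http.Post("http://localhost:8080/ts/query",
		"application/json", bytes.NewBufferString(body))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	raw, _ := ioutil.ReadAll(resp.Body)
	fmt.Println(string(raw)) // datapoints come back as JSON
}
```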

@tbg (Member) commented Mar 1, 2018

I think we'd want to be able to run a few different load gens eventually, and some of them might be so low in qps that their dip could be shadowed by a faster one (say kv). Maybe statement statistics can do well enough for starters.

@petermattis (Collaborator)

> I think we'd want to be able to run a few different load gens eventually, and some of them might be so low in qps that their dip could be shadowed by a faster one (say kv).

Good point, though for an initial test a single QPS metric would suffice.

> Maybe statement statistics can do well enough for starters.

Yeah. I've forgotten the specifics of when these are reset and the info they contain, but it certainly seems possible they could work.
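A sketch of reading those from SQL, assuming the crdb_internal.node_statement_statistics virtual table (the column names here are from memory, so treat them as illustrative):

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/lib/pq"
)

func main() {
	db, err := sql.Open("postgres", "postgresql://root@localhost:26257/?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	// These stats are per-node and in-memory, and they reset periodically,
	// so a test would need to diff successive reads rather than trust
	// absolute counts.
	rows, err := db.Query(
		`SELECT application_name, key, count FROM crdb_internal.node_statement_statistics`)
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()
	for rows.Next() {
		var app, stmt string
		var count int64
		if err := rows.Scan(&app, &stmt, &count); err != nil {
			log.Fatal(err)
		}
		fmt.Printf("%s\t%d\t%s\n", app, count, stmt)
	}
}
```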

asubiotto self-assigned this Mar 6, 2018
knz added the A-testing and C-enhancement labels Jul 21, 2018
@nvanbenschoten (Member)

Most of this was addressed by #26542, which gracefully shuts down a third of a cluster and watches the QPS of kv. It asserts that QPS did not drop by more than 20%. Is there anything more to this issue that we should address for 2.1?

@petermattis (Collaborator)

> Is there anything more to this issue that we should address for 2.1?

I think that test is sufficient.

@a-robinson (Contributor, Author)

I don't think that test actually tests this? That test is basically:

  1. Run some load and measure the qps
  2. Stop the load
  3. Stop a node
  4. Run some load and measure the qps
  5. Stop the load
  6. Compare the qps results

It is explicitly not trying to measure how an ongoing load is affected by the process of draining a node, and almost certainly would not have found the various bugs in the node draining logic that motivated this issue.
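In other words, the sampling has to overlap the drain. A sketch of that shape, where sampleQPS and the drain command are stand-ins for whichever mechanism gets picked:

```go
package main

import (
	"fmt"
	"os/exec"
	"time"
)

// sampleQPS is a stand-in for any of the sources discussed above
// (timeseries endpoint, statement statistics, a workload stats table).
func sampleQPS() float64 { return 1000 }

func main() {
	done := make(chan struct{})
	samplesCh := make(chan []float64, 1)
	go func() { // sampler keeps running *through* the drain
		ticker := time.NewTicker(time.Second)
		defer ticker.Stop()
		var samples []float64
		for {
			select {
			case <-done:
				samplesCh <- samples
				return
			case <-ticker.C:
				samples = append(samples, sampleQPS())
			}
		}
	}()

	// Gracefully drain one node mid-run. `cockroach quit` was the graceful
	// shutdown command at the time; the host flag here is illustrative.
	time.Sleep(30 * time.Second) // let the load reach steady state first
	if err := exec.Command("cockroach", "quit", "--host=node3:26257").Run(); err != nil {
		fmt.Println("drain failed:", err)
	}
	time.Sleep(30 * time.Second) // keep sampling after the drain completes

	close(done)
	fmt.Printf("collected %d samples spanning the drain\n", len(<-samplesCh))
}
```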

a-robinson reopened this Jul 23, 2018
petermattis removed this from the 2.1 milestone Oct 5, 2018
a-robinson added a commit to a-robinson/cockroach that referenced this issue Dec 17, 2018
The test verifies that QPS isn't affected by a node being gracefully
drained and shut down.

Fixes cockroachdb#23274

Release note: None
craig bot pushed a commit that referenced this issue Dec 24, 2018
33188: roachtest: Add test of graceful draining during shutdown r=a-robinson a=a-robinson

The test verifies that QPS isn't affected by a node being gracefully
drained and shut down.

Fixes #23274

Release note: None

Co-authored-by: Alex Robinson <[email protected]>
craig bot closed this as completed in #33188 Dec 24, 2018
a-robinson added a commit to a-robinson/cockroach that referenced this issue Dec 24, 2018