Load Testing Improvements #3589

Open · 4 tasks · kevindelgado opened this issue Mar 14, 2019 · 8 comments
Comments

@kevindelgado (Contributor) commented Mar 14, 2019

This discussion issue is a bit of a catch-all for improving our load/benchmark testing.

End Goal / Desired Outcome:

As a pachyderm developer, I want to be able to specify a pachyderm version/commit, a cloud provider, and a suite of load tests, and have those tests run at scale on a production-grade cluster, with the cluster torn down afterwards and easy-to-digest results provided on both performance benchmarks and cost.

Background:

Currently we have a single benchmark test (at src/testing/loadtest/split) that attempts to run pachyderm at scale (see the readme for more info). We believe we can build on this foundation to create more robust load testing that helps us 1) gain confidence that new features do not introduce performance regressions, 2) verify that performance improvements actually deliver the gains we expect, and 3) provide a tool for reproducing and debugging performance issues we see out in the wild.

Proposal:

Build out infrastructure to simplify and automate the running of the tests:

  • Create a cluster launcher service to streamline the deployment and teardown of a pachyderm cluster (cloud k8s cluster creation, pach cluster deploy, teardown of cluster).
  • Create load-testing infrastructure that consumes the cluster launcher service, automating both the running of loadtests on the resulting pach cluster and the collection of their results.
  • Make new/different loadtests pluggable into the streamlined deploy process.
  • Automate the full loadtest lifecycle, configurable by pachyderm version, cloud provider, and test suite (see the sketch after this list).
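
As a rough illustration of what that lifecycle automation might look like, here is a minimal Go sketch; all names in it (RunConfig, ClusterLauncher, Loadtest, Result) are hypothetical placeholders for discussion, not existing Pachyderm APIs.

```go
// Hypothetical sketch of the cluster-launcher / loadtest lifecycle API.
package loadtest

import "context"

// RunConfig captures the three axes a developer specifies up front.
type RunConfig struct {
	PachydermVersion string   // release tag or commit SHA to deploy
	CloudProvider    string   // e.g. "gke", "eks", "aks"
	TestSuite        []string // names of the pluggable loadtests to run
}

// ClusterLauncher streamlines deploy and teardown of a pach cluster.
type ClusterLauncher interface {
	// Launch creates the cloud k8s cluster and deploys pachyderm at the
	// requested version, returning an address the loadtests can target.
	Launch(ctx context.Context, cfg RunConfig) (pachdAddress string, err error)
	// Teardown destroys the pach cluster and the underlying k8s cluster.
	Teardown(ctx context.Context) error
}

// Result is whatever we decide a digestible summary looks like.
type Result struct {
	Metrics map[string]float64
	CostUSD float64
}

// Loadtest is the pluggable unit: any new benchmark implements this.
type Loadtest interface {
	Name() string
	// Run executes the test against the given cluster and returns
	// easy-to-digest results (latency, throughput, cost estimate, ...).
	Run(ctx context.Context, pachdAddress string) (Result, error)
}
```

The full lifecycle would then be: Launch, run each Loadtest named in TestSuite, collect the Results, and Teardown, regardless of which cloud provider or pachyderm version was requested.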

Manually create a few more loadtests that reproduce some specific scaling issues we've encountered, and run them on the above-mentioned infrastructure:

  • A test that demonstrates the improvements from batched GetFiles.
  • A test that reproduces the CPU pegging issue that some customers are seeing.
  • A test that requires at least 10+ nodes, 20+ workers, and 30+ minutes.

@ysimonson (Contributor) commented Mar 14, 2019

Maybe out of scope, but I'd love to see fuzz tests for large pachyderm clusters, à la chaos monkey or jepsen. Related ticket: #310

@kevindelgado (Contributor, Author)
That's a great idea. Part of the goal here is to make it easy enough for anyone to add new benchmark/load tests, so enabling others to create tests like the ones you've described is a good way to test the usability of this infrastructure. We'll keep this use case in mind as we design what this new infra should look like.

@ysimonson (Contributor)
@kevindelgado yesterday you brought up using this as a way of tracking/preventing performance regressions. It'd be great to have some sort of dashboard where you could:

  • Kick off load tests for a particular branch or tagged release
  • Track changes over time

Something like pypy's speed center, but way, way simpler.

@kevindelgado (Contributor, Author)
This is awesome. I had a rough idea of what something like this could look like, but hadn't dug around to see what's out there in this space; this gives us a very concrete, fairly simple solution to model off of.

ysimonson self-assigned this Sep 23, 2019
@gabrielgrant (Contributor) commented Sep 23, 2019

Note that PyPy's speed center is based on this non-PyPy-specific open-source project: https://github.com/tobami/codespeed/

Generally when we've talked about load tests, we've kind of been referring to a couple different things:

  1. tests/tracking of perf improvements/degradation over time as new code lands
  2. tests for specific correctness issues that only show up when running at scale, and so are impractical to include in the standard CI test suite

A recent example of the latter is #4117, which reports that datum output of ~100GB fails to upload to object storage. (More generally, the breakdown points throughout the system for large single files are a load-test axis that I don't believe we've talked much about, beyond perhaps some work on improving initial file ingress.)
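
As a hedged sketch of the first category (tracking perf over time), each automated run could push its numbers to a codespeed instance. The snippet below assumes codespeed's standard /result/add/ endpoint; the host name, field values, and metric are made-up placeholders.

```go
// Hypothetical: report one benchmark result from an automated loadtest run
// to a codespeed instance via its /result/add/ endpoint.
package main

import (
	"fmt"
	"net/http"
	"net/url"
)

func main() {
	form := url.Values{
		"commitid":     {"abc1234"},        // pachyderm commit under test
		"branch":       {"master"},
		"project":      {"pachyderm"},
		"executable":   {"pachd"},
		"benchmark":    {"split-loadtest"}, // name of the pluggable loadtest
		"environment":  {"gke-10-node"},    // cluster shape used for the run
		"result_value": {fmt.Sprintf("%f", 123.4)}, // e.g. seconds to complete
	}
	// speed.internal.example is a placeholder for wherever codespeed would be hosted.
	resp, err := http.PostForm("http://speed.internal.example/result/add/", form)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println("codespeed responded:", resp.Status)
}
```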

@ghost commented Sep 24, 2019

I can get you guys something like a 50x genome setup for doing bioinf load testing if you want.

@ysimonson (Contributor)
@Jamesthegiantpeach thanks for the offer! For now, we're just going to get our built-in split tests off the ground, but I may reach out in the future as we expand the scope of this.

ysimonson removed their assignment Nov 22, 2019
@pappasilenus (Contributor)
Referencing this internal discussion of performance testing, which covers tuning the load on etcd vs. the object store by creating a stress test you can run against an arbitrary cluster: https://pachyderm.slack.com/archives/CGU393QP2/p1613608743034900
