Load Testing Improvements #3589
Maybe out of scope, but I'd love to see fuzz tests for large Pachyderm clusters, à la Chaos Monkey or Jepsen. Related ticket: #310
That's a great idea. Part of the goal here is to make it easy enough for anyone to add new benchmark/load tests, so enabling others to create tests like the ones you've described is a good way to test the usability of this infrastructure. Will keep this use case in mind as we design what the new infra should look like.
@kevindelgado yesterday you brought up using this as a way of tracking/preventing performance regressions. It'd be great to have some sort of dashboard where you could:
Something like PyPy's speed center, but way, way simpler.
This is awesome. I had a rough idea of what something like this could look like but hadn't dug around to see what's out there in this space; this gives a very concrete, fairly simple solution that we can model off of.
Note that PyPy's speed center is based on this non-PyPy-specific open-source project: https://github.com/tobami/codespeed/

Generally when we've talked about load tests, we've kind of been referring to a couple different things:
A recent example of the latter of those two is #4117, which reports that datum output of ~100GB fails to upload to object storage. (In general, the breakdown points throughout the system for large single files are a load-test axis that I don't believe we have talked much about, beyond perhaps some work on improving initial file ingress.)
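For concreteness, here is a minimal sketch of pushing one measurement to a codespeed instance, assuming its documented /result/add/ endpoint and a project/environment already registered in its admin UI. The field names follow the codespeed README; the host and the values themselves are made up:

```go
// post_result.go: push one benchmark measurement to a codespeed instance.
// A minimal sketch, assuming codespeed's /result/add/ endpoint and a local
// instance at localhost:8000; values below are illustrative only.
package main

import (
	"fmt"
	"net/http"
	"net/url"
)

func main() {
	data := url.Values{
		"commitid":     {"abc123"},            // pachyderm commit being measured
		"branch":       {"master"},
		"project":      {"pachyderm"},
		"executable":   {"pachd"},
		"benchmark":    {"loadtest-split"},
		"environment":  {"gke-n1-standard-8"}, // must already exist in codespeed
		"result_value": {"73.5"},              // e.g. seconds to finish the run
	}

	resp, err := http.PostForm("http://localhost:8000/result/add/", data)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println("codespeed responded:", resp.Status)
}
```

A CI job could post one such result per commit, which is enough to draw the regression trend lines the dashboard idea above calls for.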
I can get you guys something like a 50x genome setup for doing bioinf load testing if you want. |
@Jamesthegiantpeach thanks for the offer! For now, we're going to just get our built-in split tests off the ground, but I may reach out in the future as we expand the scope of this. |
Referencing this internal discussion on performance testing, which proposes a stress test you can run against an arbitrary cluster and tune to shift load between etcd and the object store: https://pachyderm.slack.com/archives/CGU393QP2/p1613608743034900
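As a rough illustration of the kind of knobs such a stress test exposes (a hypothetical sketch, not the script from that thread; it assumes pachctl's 1.9+ `put file` syntax and an existing repo): many tiny files skew the load toward etcd metadata, a few huge files skew it toward the object store.

```go
// stress_put.go: skew load toward etcd (many tiny files) or object storage
// (few huge files) by turning two knobs. Hypothetical sketch; assumes
// "pachctl put file repo@branch:/path -f file" syntax and an existing repo.
package main

import (
	"crypto/rand"
	"flag"
	"fmt"
	"os"
	"os/exec"
	"path/filepath"
)

func main() {
	repo := flag.String("repo", "stress", "target Pachyderm repo")
	numFiles := flag.Int("files", 10000, "file count (high = etcd-heavy metadata load)")
	fileSize := flag.Int("size", 1024, "bytes per file (high = object-store-heavy load)")
	flag.Parse()

	tmp := filepath.Join(os.TempDir(), "stress-payload")
	buf := make([]byte, *fileSize)
	for i := 0; i < *numFiles; i++ {
		rand.Read(buf) // random payload defeats dedup/compression
		if err := os.WriteFile(tmp, buf, 0o644); err != nil {
			panic(err)
		}
		dest := fmt.Sprintf("%s@master:/f-%06d", *repo, i)
		out, err := exec.Command("pachctl", "put", "file", dest, "-f", tmp).CombinedOutput()
		if err != nil {
			panic(fmt.Sprintf("put file %s failed: %v\n%s", dest, err, out))
		}
	}
}
```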
This discussion issue is a bit of a catch-all for improving our load/benchmark testing.
End Goal / Desired Outcome:
As a Pachyderm developer, I want to be able to specify a version/commit of Pachyderm, a cloud provider, and a suite of load tests; have those tests run at scale on a production cluster; have the cluster torn down afterwards; and get easy-to-digest results on both performance benchmarks and cost.
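As a strawman for what the input to such a tool might look like (every name here is hypothetical; nothing like this exists in the repo today):

```go
// loadtestspec.go: one possible shape for what a developer hands to the
// (not-yet-existing) load-test runner. All names are hypothetical.
package main

import "fmt"

// Spec describes a single automated load-test run.
type Spec struct {
	PachydermRef string   // version tag or commit SHA to deploy
	Cloud        string   // "gke", "eks", or "aks"
	Nodes        int      // cluster size to provision
	Suites       []string // which load tests to run, e.g. "split"
	TearDown     bool     // destroy the cluster when the run finishes
}

// Result is what the developer gets back: benchmarks plus cost.
type Result struct {
	Suite       string
	DurationSec float64
	CostUSD     float64
}

func main() {
	spec := Spec{
		PachydermRef: "master",
		Cloud:        "gke",
		Nodes:        10,
		Suites:       []string{"split"},
		TearDown:     true,
	}
	fmt.Printf("would provision %d %s nodes, deploy %s, run %v\n",
		spec.Nodes, spec.Cloud, spec.PachydermRef, spec.Suites)
}
```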
Background:
Currently we have a single benchmark test (at src/testing/loadtest/split) that attempts to run Pachyderm at scale (see its README for more info). We believe we can build on this foundation to get more robust load testing that helps us 1) be more confident that new features do not introduce performance regressions, 2) prove that performance improvements actually deliver what we expect, and 3) provide a tool for reproducing and debugging performance issues we see out in the wild.
Proposal:
Build out infrastructure to simplify and automate running the tests:
Manually create a few more load tests that reproduce specific scaling issues we've encountered, and run them in the above-mentioned infrastructure (a minimal timing harness these tests could share is sketched after this list). For example:
• A test that demonstrates the improvements from batched GetFiles.
• A test that reproduces the CPU-pegging issue that some customers are seeing.
• A test that requires 10+ nodes, 20+ workers, and 30+ minutes.
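The shared timing harness mentioned above could be as simple as the sketch below; the workload body is a placeholder, not the real GetFile or CPU-pegging reproduction:

```go
// harness.go: a minimal timing wrapper the load tests above could share;
// the workload passed in is a placeholder, not actual cluster code.
package main

import (
	"fmt"
	"time"
)

// timed runs a workload and reports its wall-clock duration, so two runs
// (e.g. batched vs. unbatched GetFiles, or two pachyderm commits) can be
// compared directly.
func timed(name string, workload func() error) (time.Duration, error) {
	start := time.Now()
	err := workload()
	elapsed := time.Since(start)
	fmt.Printf("%s: %v (err=%v)\n", name, elapsed, err)
	return elapsed, err
}

func main() {
	// Placeholder workload; a real test would hit a running cluster,
	// e.g. download N files through the Go client or pachctl.
	timed("getfile-batch", func() error {
		time.Sleep(50 * time.Millisecond)
		return nil
	})
}
```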