Load Testing Improvements #3589

Open · 4 tasks · kevindelgado opened this issue Mar 14, 2019 · 8 comments
Comments

@kevindelgado (Contributor) commented Mar 14, 2019

This discussion issue is a bit of a catch-all for improving our load/benchmark testing.

End Goal / Desired Outcome:

As a pachyderm developer, I want to be able to specify a pachyderm version/commit, a cloud provider, and a suite of load tests, and have those tests run at scale on a production-grade cluster, with the cluster torn down afterwards and easy-to-digest results provided on both performance benchmarks and cost.

Background:

Currently we have a single benchmark test (at src/testing/loadtest/split) that attempts to run pachyderm at scale (see the readme for more info). We believe we can build on this foundation to create more robust load testing that helps us 1) gain confidence that new features do not introduce performance regressions, 2) verify that performance improvements actually deliver the gains we expect, and 3) provide a tool for reproducing and debugging performance issues we see out in the wild.

Proposal:

Build out infrastructure to simplify and automate the running of the tests:

  • Create a cluster launcher service to streamline the deployment and teardown of a pachyderm cluster (cloud k8s cluster creation, pach cluster deploy, teardown of cluster).
  • Create load-testing infrastructure that consumes the cluster launcher service, automating both the running of loadtests on the resulting pach cluster and the collection of their results.
  • Make new/different loadtests pluggable into the streamlined deploy process.
  • Automate the full loadtest lifecycle, configurable by pachyderm version, cloud provider, and test suite (see the sketch after this list).
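
As a rough illustration of what that lifecycle automation might look like, here is a minimal Go sketch; all names in it (RunConfig, ClusterLauncher, Loadtest, Result) are hypothetical placeholders for discussion, not existing Pachyderm APIs.

```go
// Hypothetical sketch of the cluster-launcher / loadtest lifecycle API.
package loadtest

import "context"

// RunConfig captures the three axes a developer specifies up front.
type RunConfig struct {
	PachydermVersion string   // release tag or commit SHA to deploy
	CloudProvider    string   // e.g. "gke", "eks", "aks"
	TestSuite        []string // names of the pluggable loadtests to run
}

// ClusterLauncher streamlines deploy and teardown of a pach cluster.
type ClusterLauncher interface {
	// Launch creates the cloud k8s cluster and deploys pachyderm at the
	// requested version, returning an address the loadtests can target.
	Launch(ctx context.Context, cfg RunConfig) (pachdAddress string, err error)
	// Teardown destroys the pach cluster and the underlying k8s cluster.
	Teardown(ctx context.Context) error
}

// Result is whatever we decide a digestible summary looks like.
type Result struct {
	Metrics map[string]float64
	CostUSD float64
}

// Loadtest is the pluggable unit: any new benchmark implements this.
type Loadtest interface {
	Name() string
	// Run executes the test against the given cluster and returns
	// easy-to-digest results (latency, throughput, cost estimate, ...).
	Run(ctx context.Context, pachdAddress string) (Result, error)
}
```

The full lifecycle would then be: Launch, run each Loadtest named in TestSuite, collect the Results, and Teardown, regardless of which cloud provider or pachyderm version was requested.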

Manually create a few more loadtests that reproduce some specific scaling issues we've encountered, and run them on the above-mentioned infrastructure:

  • A test that demonstrates the improvements from batched GetFiles.
  • A test that reproduces the CPU pegging issue that some customers are seeing.
  • A test that requires at least 10+ nodes, 20+ workers, and 30+ minutes.

@ysimonson (Contributor) commented Mar 14, 2019

Maybe out of scope, but I'd love to see fuzz tests for large pachyderm clusters, à la chaos monkey or jepsen. Related ticket: #310

@kevindelgado (Contributor, Author)
That's a great idea. Part of the goal here is to make it easy enough for anyone to add new benchmark/load tests, so enabling others to create tests like the ones you've described is a good way to test the usability of this infrastructure. We'll keep this use case in mind as we design what this new infra should look like.

@ysimonson (Contributor)
@kevindelgado yesterday you brought up using this as a way of tracking/preventing performance regressions. It'd be great to have some sort of dashboard where you could:

  • Kick off load tests for a particular branch or tagged release
  • Track changes over time

Something like pypy's speed center, but way, way simpler.

@kevindelgado (Contributor, Author)
This is awesome. I had a rough idea of what something like this could look like, but hadn't dug around to see what's out there in this space; this gives us a very concrete, fairly simple solution to model off of.

ysimonson self-assigned this Sep 23, 2019
@gabrielgrant (Contributor) commented Sep 23, 2019

Note that PyPy's speed center is based on this non-PyPy-specific open-source project: https://github.com/tobami/codespeed/

Generally when we've talked about load tests, we've kind of been referring to a couple different things:

  1. tests/tracking of perf improvements/degradation over time as new code lands
  2. tests for specific correctness issues that only show up when running at scale, and so are impractical to include in the standard CI test suite

A recent example of the latter is #4117, which reports that datum output of ~100GB fails to upload to object storage. (More generally, the breakdown points throughout the system for large single files are a load-test axis that I don't believe we've talked much about, beyond perhaps some work on improving initial file ingress.)
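
As a hedged sketch of the first category (tracking perf over time), each automated run could push its numbers to a codespeed instance. The snippet below assumes codespeed's standard /result/add/ endpoint; the host name, field values, and metric are made-up placeholders.

```go
// Hypothetical: report one benchmark result from an automated loadtest run
// to a codespeed instance via its /result/add/ endpoint.
package main

import (
	"fmt"
	"net/http"
	"net/url"
)

func main() {
	form := url.Values{
		"commitid":     {"abc1234"},        // pachyderm commit under test
		"branch":       {"master"},
		"project":      {"pachyderm"},
		"executable":   {"pachd"},
		"benchmark":    {"split-loadtest"}, // name of the pluggable loadtest
		"environment":  {"gke-10-node"},    // cluster shape used for the run
		"result_value": {fmt.Sprintf("%f", 123.4)}, // e.g. seconds to complete
	}
	// speed.internal.example is a placeholder for wherever codespeed would be hosted.
	resp, err := http.PostForm("http://speed.internal.example/result/add/", form)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println("codespeed responded:", resp.Status)
}
```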

@ghost commented Sep 24, 2019

I can get you guys something like a 50x genome setup for doing bioinf load testing if you want.

@ysimonson (Contributor)
@Jamesthegiantpeach thanks for the offer! For now, we're just going to get our built-in split tests off the ground, but I may reach out in the future as we expand the scope of this.

ysimonson removed their assignment Nov 22, 2019
@pappasilenus (Contributor)
Referencing this internal discussion of performance testing, which covers tuning the load on etcd vs. the object store by creating a stress test you can run against an arbitrary cluster: https://pachyderm.slack.com/archives/CGU393QP2/p1613608743034900
