Fault injection #310

Open
derekchiang opened this issue Apr 15, 2016 · 6 comments

Comments

@derekchiang
Contributor

derekchiang commented Apr 15, 2016

Most mature distributed systems projects have some sort of fault injection framework for testing purposes. Some examples:

There are also more general-purpose tools, such as Jepsen and Chaos Monkey, that can be used to inject network faults.

Using tools such as these will help us identify bugs that would otherwise be found in production.

@derekchiang derekchiang modified the milestone: futu Apr 15, 2016
@derekchiang derekchiang added this to the v1.1 milestone May 5, 2016
@derekchiang
Contributor Author

Made some progress for 1.1 by adding tests for dynamic membership. Removing the 1.1 label as we will be revisiting this issue post 1.1.

@derekchiang derekchiang removed this from the v1.1 milestone Jun 14, 2016
@jdoliner
Member

We should aim to do more of this in 1.2.
In particular, I'd like to see some faults that affect running jobs, and see how the system handles them.

@jdoliner jdoliner added this to the v1.2 milestone Jul 11, 2016
@sjezewski
Contributor

We should also add coverage for the places where we support exponential backoff, specifically PutBlock and DeleteBlock.

@sjezewski sjezewski modified the milestone: v1.2 Sep 6, 2016
@sjezewski sjezewski mentioned this issue Sep 19, 2016
@derekchiang
Contributor Author

The need for this tool is growing: more people are pumping serious amounts of data into Pachyderm, and bugs are being caught in production. Most of those bugs only manifest when certain requests fail due to transient network failures, which such a tool can simulate.

@derekchiang derekchiang self-assigned this Dec 2, 2016
@derekchiang derekchiang added this to the v1.4 milestone Dec 2, 2016
@derekchiang
Contributor Author

derekchiang commented Dec 2, 2016

Here's how Kubernetes does it: https://github.com/kubernetes/kubernetes/tree/master/pkg/client/chaosclient

I think this makes sense for us as a first step as well. Basically, the faults are injected on the client side; in our case that would be code under src/client/...

@sjezewski sjezewski modified the milestone: v1.4 May 24, 2017
@sjezewski
Contributor

We have a recent need for testing of this nature:

#2675

This will be a critical path for a customer's deployment.
