functional-test: add advance network failure cases #6918

fanminshi · 2016-12-01T01:13:37Z

add more network failures such as packet corruption, reordering, loss, and network partition.

resolve #5614

heyitsanthony · 2016-12-01T03:51:48Z

pkg/netutil/isolate_linux.go

+}
+
+// SetPacketReordering reorders packets. rp% of packets (with a correlation of cp%) gets send immediately. The rest will be delayed for ms millisecond
+func SetPacketReordering(rp int, cp int, ms int) error {


this only tests the tcp stack; etcd will still see everything in order, so why have it?

heyitsanthony · 2016-12-01T03:52:03Z

pkg/netutil/isolate_linux.go

+}
+
+// SetPackLoss randomly drop packet at p% probability
+func SetPackLoss(p int) error {


how is this any different from injecting random latencies? the tcp stack will retransmit

heyitsanthony · 2016-12-01T03:53:31Z

tools/functional-tester/docker/docker-compose.yml

@@ -19,10 +19,5 @@ tester:
    - /etcd-tester
    - -agent-endpoints
    - "172.20.0.2:9027,172.20.0.3:9027,172.20.0.4:9027"
-    - -limit 


why change this?

should the functional-tester run on the docker image mirror the one we run using goreman?

heyitsanthony · 2016-12-01T03:53:44Z

tools/functional-tester/docker/Dockerfile

@@ -1,6 +1,6 @@
 FROM alpine
 RUN apk update 
-RUN apk add -v iptables sudo
+RUN apk --update add iptables bash iproute2


why add bash?

probably don't need.

heyitsanthony · 2016-12-01T03:56:20Z

tools/functional-tester/etcd-tester/failure_agent.go

-	slowNetworkLatency = 500 // 500 millisecond
-	randomVariation    = 50
+	snapshotCount               = 10000
+	slowNetworkLatency          = 500   // 500 millisecond


probably use 500 * time.Millisecond (same for others) instead of having to comment about the units
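A minimal sketch of that suggestion, assuming the constant keeps the name from the diff. The helper `toMillis` is hypothetical and only shows how a caller that still needs an integer millisecond count (for example, one that builds a tc command line) could get it back.

```go
package main

import (
	"fmt"
	"time"
)

// The unit is carried by the type, so no "// 500 millisecond" comment is needed.
const slowNetworkLatency = 500 * time.Millisecond

// toMillis is a hypothetical helper: it converts a duration to whole
// milliseconds for callers that still take a plain int.
func toMillis(d time.Duration) int {
	return int(d / time.Millisecond)
}

func main() {
	fmt.Println(slowNetworkLatency)           // prints "500ms"
	fmt.Println(toMillis(slowNetworkLatency)) // prints "500"
}
```

With a `time.Duration`, a mismatch such as passing seconds where milliseconds are expected shows up in the printed value or at the conversion site instead of hiding behind a comment.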

heyitsanthony · 2016-12-01T04:00:31Z

pkg/netutil/isolate_linux.go

@@ -41,6 +41,82 @@ func RecoverPort(port int) error {
 	return err
 }

+// SetPacketCorruption corrupts packets at p%
+func SetPacketCorruption(p int) error {


most of this will be corrected by tcp checksums. For the packets that aren't, I don't see how etcd would be able to pass its checks (e.g., suppose a lease key is corrupted; when the lease checker looks for the intended key, it's gone)
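To make the checksum argument concrete, here is a small sketch of the 16-bit one's-complement checksum (RFC 1071) that TCP uses, computed over payload bytes only (the real TCP checksum also covers the header and a pseudo-header, which is omitted here). The function name `internetChecksum` and the sample payloads are illustrative assumptions, not code from this PR. A single flipped bit always changes the sum, so the segment is discarded and retransmitted; a change that swaps two aligned 16-bit words leaves the sum unchanged and would reach the application.

```go
package main

import "fmt"

// internetChecksum computes the 16-bit one's-complement checksum
// described in RFC 1071 over data (no TCP header or pseudo-header).
func internetChecksum(data []byte) uint16 {
	var sum uint32
	// Add the data as big-endian 16-bit words.
	for i := 0; i+1 < len(data); i += 2 {
		sum += uint32(data[i])<<8 | uint32(data[i+1])
	}
	// Pad a trailing odd byte with zero.
	if len(data)%2 == 1 {
		sum += uint32(data[len(data)-1]) << 8
	}
	// Fold the carries back into the low 16 bits.
	for sum>>16 != 0 {
		sum = (sum & 0xffff) + (sum >> 16)
	}
	return ^uint16(sum)
}

func main() {
	original := internetChecksum([]byte("abcd"))
	flipped := internetChecksum([]byte("abce")) // one bit flipped in the last byte
	swapped := internetChecksum([]byte("cdab")) // two 16-bit words swapped

	fmt.Println(original != flipped) // prints "true": the bit flip is detected
	fmt.Println(original == swapped) // prints "true": the swap goes undetected
}
```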

heyitsanthony · 2016-12-01T06:52:46Z

Some of these faults make sense but not by manipulating data frames with tc:

packet corruption-- tcp corrects for it most of the time. If there's a crc collision and bad data gets through, there's no controlling for it (e.g., bit in key name is flipped).
packet loss-- tcp corrects or drops the connection; etcd will only see delays or disconnects.
packet reordering-- same as above

Since tcp corrects most of the faults, these work better at the tcp level:

packet corruption-- Intercept data over a tcp connection and retransmit with a bit flipped over tcp. When running in TLS mode, if any packet is corrupted, the connection must be dropped. Otherwise, it's open to MITM attacks.
packet loss-- Mix of latency and disconnections (unclear what the tuneables would be to hit interesting paths).
packet reordering-- Intercept data from many connections, buffer for a while, then retransmit the buffered data in some controlled order (e.g., round robin). This would reorder messages across the entire cluster.

There's already a small proxy that does the above, but it's not wired to the functional-tester:
https://github.com/coreos/etcd/tree/master/tools/local-tester/bridge
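The two tcp-level faults described above can be sketched as pure functions over buffered data. This is not the code in the bridge tool linked above; `flipBit` and `roundRobin` are hypothetical names, and the sketch leaves out the actual socket handling a proxy would need.

```go
package main

import "fmt"

// flipBit returns a copy of buf with one bit inverted. Bit 0 is the
// least significant bit of buf[0]. The input slice is left untouched.
func flipBit(buf []byte, bit int) []byte {
	out := make([]byte, len(buf))
	copy(out, buf)
	out[bit/8] ^= byte(1) << uint(bit%8)
	return out
}

// roundRobin drains per-connection message buffers one message at a
// time, in turn, so that messages from different connections are
// interleaved when they are retransmitted.
func roundRobin(conns [][]string) []string {
	var out []string
	for i := 0; ; i++ {
		progressed := false
		for _, msgs := range conns {
			if i < len(msgs) {
				out = append(out, msgs[i])
				progressed = true
			}
		}
		if !progressed {
			return out
		}
	}
}

func main() {
	// 'k' (0x6b) becomes 'j' (0x6a) when its lowest bit is flipped.
	fmt.Printf("%q\n", flipBit([]byte("key"), 0)) // prints "jey"

	buffered := [][]string{{"a1", "a2"}, {"b1"}, {"c1", "c2"}}
	fmt.Println(roundRobin(buffered)) // prints "[a1 b1 c1 a2 c2]"
}
```

Because the proxy rewrites data above the tcp layer, the corrupted or reordered bytes carry valid checksums and reach etcd, which is the point of injecting the fault there rather than with tc.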

xiang90 · 2016-12-01T17:13:50Z

@fanminshi

See #5614 (comment).

I agree with @heyitsanthony. The more interesting test is reordering between multiple connections. You can do this at the packet level, but most of the time you will reorder packets within one tcp connection.

@heyitsanthony aggressive packet loss, corruption, and reordering might create some interesting corner cases randomly, but I am not convinced we should prioritize this now.

fanminshi added area/functional-testing WIP labels Dec 1, 2016

functional-test: add advance network failure cases

69b7117

fanminshi force-pushed the add_advanced_network_failure_injections branch from 676df8e to 69b7117 Compare December 1, 2016 01:25

heyitsanthony reviewed Dec 1, 2016

gyuho force-pushed the master branch 2 times, most recently from 44ca396 to 4301f49 Compare June 2, 2017 15:53

gyuho mentioned this pull request Jan 2, 2018

*: run network fault tests with proxy #9081

Merged

gyuho removed area/functional-testing WIP - DO NOT MERGE labels Apr 3, 2018

gyuho closed this Apr 3, 2018
