functional-test: add advance network failure cases #6918

fanminshi
Member

add more network failures such as packet corruption, reordering, loss, and network partition.

resolve #5614

@fanminshi fanminshi force-pushed the add_advanced_network_failure_injections branch from 676df8e to 69b7117 Compare December 1, 2016 01:25
}

// SetPacketReordering reorders packets. rp% of packets (with a correlation of cp%) get sent immediately. The rest are delayed by ms milliseconds.
func SetPacketReordering(rp int, cp int, ms int) error {
Contributor

this only tests the tcp stack; etcd will still see everything in order, so why have it?

}

// SetPackLoss randomly drops packets with p% probability
func SetPackLoss(p int) error {
Contributor

how is this any different from injecting random latencies? the tcp stack will retransmit

@@ -19,10 +19,5 @@ tester:
- /etcd-tester
- -agent-endpoints
- "172.20.0.2:9027,172.20.0.3:9027,172.20.0.4:9027"
- -limit
Contributor

why change this?

Member Author

should the functional-tester run in the docker image mirror the one we run using goreman?

@@ -1,6 +1,6 @@
FROM alpine
RUN apk update
RUN apk add -v iptables sudo
RUN apk --update add iptables bash iproute2
Contributor

why add bash?

Member Author

probably don't need it.

slowNetworkLatency = 500 // 500 millisecond
randomVariation = 50
snapshotCount = 10000
slowNetworkLatency = 500 // 500 millisecond
Contributor

probably use 500 * time.Millisecond (same for others) instead of having to comment about the units
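A minimal sketch of that suggestion, assuming the constants are only ever used as delays. The constant names follow the diff; treating `randomVariation` as 50 milliseconds is an assumption about its intended unit, and `toMillis` is a hypothetical helper for callers that still need a plain integer.

```go
package main

import (
	"fmt"
	"time"
)

// Typed durations carry their unit, so a trailing "// 500 millisecond"
// comment is no longer needed. The 50ms unit for randomVariation is assumed.
const (
	slowNetworkLatency = 500 * time.Millisecond
	randomVariation    = 50 * time.Millisecond
)

// toMillis converts a duration to whole milliseconds for callers that must
// pass an integer (for example when formatting a "<n>ms" argument).
func toMillis(d time.Duration) int64 {
	return int64(d / time.Millisecond)
}

func main() {
	fmt.Printf("delay %dms +/- %dms\n", toMillis(slowNetworkLatency), toMillis(randomVariation))
}
```

With typed constants, a mismatch such as passing seconds where milliseconds are expected becomes visible at the call site rather than in a comment.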

@@ -41,6 +41,82 @@ func RecoverPort(port int) error {
return err
}

// SetPacketCorruption corrupts packets at p%
func SetPacketCorruption(p int) error {
Contributor

most of this will be corrected by tcp checksums; for the packets that aren't, I don't see how etcd would be able to pass its checks (e.g., suppose a lease key is corrupted; when the lease checker looks for the intended key, it's gone)

@heyitsanthony
Contributor

Some of these faults make sense but not by manipulating data frames with tc:

  • packet corruption-- tcp corrects for it most of the time. If there's a crc collision and bad data gets through, there's no controlling for it (e.g., bit in key name is flipped).
  • packet loss-- tcp corrects or drops the connection; etcd will only see delays or disconnects.
  • packet reordering-- same as above
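To illustrate how bad data can slip past the transport check: TCP's checksum is a 16-bit ones' complement sum, so some corruptions leave it unchanged. The sketch below is a simplified, payload-only version (the real checksum also covers the TCP header and a pseudo-header); `internetChecksum` is a name chosen here for illustration.

```go
package main

import "fmt"

// internetChecksum computes a 16-bit ones' complement checksum over data,
// in the style of RFC 1071. Odd-length input is padded with a zero byte.
// Simplification: only the payload is summed, not the TCP header.
func internetChecksum(data []byte) uint16 {
	var sum uint32
	for i := 0; i+1 < len(data); i += 2 {
		sum += uint32(data[i])<<8 | uint32(data[i+1])
	}
	if len(data)%2 == 1 {
		sum += uint32(data[len(data)-1]) << 8
	}
	// Fold the carries back into the low 16 bits.
	for sum>>16 != 0 {
		sum = (sum & 0xffff) + (sum >> 16)
	}
	return ^uint16(sum)
}

func main() {
	original := []byte("key1key2")
	// Two aligned 16-bit words ("y1" and "y2") have swapped places.
	corrupted := []byte("key2key1")
	fmt.Println(internetChecksum(original) == internetChecksum(corrupted))
}
```

Because the sum does not depend on word order, the swapped payload carries the same checksum as the original, which is the kind of undetected change the first bullet describes.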

Since tcp corrects most of the faults, these work better at the tcp level:

  • packet corruption-- Intercept data over a tcp connection and retransmit with a bit flipped over tcp. When running in TLS mode, if any packet is corrupted, the connection must be dropped. Otherwise, it's open to MITM attacks.
  • packet loss-- Mix of latency and disconnections (unclear what the tunables would be to hit interesting paths).
  • packet reordering-- Intercept data from many connections, buffer for a while, then retransmit the buffered data in some controlled order (e.g., round robin). This would reorder messages across the entire cluster.

There's already a small proxy that does the above, but it's not wired to the functional-tester:
https://github.com/coreos/etcd/tree/master/tools/local-tester/bridge
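As a rough sketch of the tcp-level corruption idea above (not the bridge tool's actual code), a proxy could wrap the stream it relays in a reader that flips a bit at a controlled rate. `flipReader` and `corrupt` are hypothetical names, and flipping the lowest bit of every nth byte is an arbitrary, deterministic choice made so the example is reproducible.

```go
package main

import (
	"fmt"
	"io"
	"strings"
)

// flipReader wraps a stream and flips the lowest bit of every nth byte.
// Hypothetical helper for illustration only.
type flipReader struct {
	r     io.Reader
	every int // flip one bit in every `every`-th byte; must be > 0
	seen  int // total bytes read so far
}

func (f *flipReader) Read(p []byte) (int, error) {
	n, err := f.r.Read(p)
	for i := 0; i < n; i++ {
		f.seen++
		if f.seen%f.every == 0 {
			p[i] ^= 0x01
		}
	}
	return n, err
}

// corrupt runs a string through flipReader. The error is ignored because
// reading from a strings.Reader does not fail.
func corrupt(s string, every int) string {
	out, _ := io.ReadAll(&flipReader{r: strings.NewReader(s), every: every})
	return string(out)
}

func main() {
	fmt.Println(corrupt("lease/key", 4))
}
```

In a proxy, the same wrapper could sit between two connections, e.g. `io.Copy(dst, &flipReader{r: src, every: n})`, so the corrupted bytes are retransmitted over a valid tcp stream and reach the application.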

@xiang90
Contributor

xiang90 commented Dec 1, 2016

@fanminshi

See #5614 (comment).

I agree with @heyitsanthony. The more interesting test is reordering between multiple connections. You can do this at the pkg level, but most of the time you will reorder pkgs within one tcp connection.
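A small sketch of what reordering between connections could look like, assuming messages have already been buffered per connection. The `msg` type and `roundRobin` function are hypothetical; the release policy (round robin) is the example order mentioned earlier in the thread.

```go
package main

import "fmt"

// msg is a buffered chunk tagged with the index of the connection it came from.
type msg struct {
	conn    int
	payload string
}

// roundRobin takes per-connection buffers, each in arrival order, and returns
// one release order that interleaves the connections. Order within a single
// connection is preserved, as tcp would guarantee; only the order across
// connections changes.
func roundRobin(buffers [][]string) []msg {
	var out []msg
	for i := 0; ; i++ {
		emitted := false
		for c, buf := range buffers {
			if i < len(buf) {
				out = append(out, msg{conn: c, payload: buf[i]})
				emitted = true
			}
		}
		if !emitted {
			return out
		}
	}
}

func main() {
	order := roundRobin([][]string{{"a1", "a2"}, {"b1"}})
	for _, m := range order {
		fmt.Println(m.conn, m.payload)
	}
}
```

Swapping the policy (random, reverse, starve one peer) only requires changing how `roundRobin` walks the buffers.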

@heyitsanthony aggressive pkg loss, corruption, and reordering might create some interesting corner cases randomly, but I am not convinced we should prioritize this now.

Successfully merging this pull request may close these issues.

functional-tests: Provide more advanced network failure injections
4 participants