We allowed our last test to pass by including a quorum flag on reads, but in order to see the original stale-reads bug, we have to edit the source code again, flipping the flag to false. It'd be nice if we could adjust that from the command line. Jepsen provides several command-line options by default in jepsen.cli, but we can add our own options by passing an :opt-spec to cli/single-test-cmd.
(def cli-opts
  "Additional command line options."
  [["-q" "--quorum" "Use quorum reads, instead of reading from any primary."]])
CLI options are a collection of vectors, giving a short name, a full name, a documentation string, and options which affect how that option is parsed, its default value, etc. These are passed to tools.cli, the standard Clojure library for option handling.
Now, let's pass that option specification to the CLI:
(defn -main
  "Handles command line arguments. Can either run a test, or a web server for
  browsing results."
  [& args]
  (cli/run! (merge (cli/single-test-cmd {:test-fn  etcd-test
                                         :opt-spec cli-opts})
                   (cli/serve-cmd))
            args))
If we re-run our test with lein run test -q ..., we'll see a new :quorum option in our test map:
10:02:42.532 [main] INFO jepsen.cli - Test options:
{:concurrency 10,
:test-count 1,
:time-limit 30,
:quorum true,
...
Jepsen parsed our -q option, found the option specification we provided, and added a :quorum true pair to the options map. That options map was passed to etcd-test, which merged it into the test map. Voilà! We have a :quorum key in our test!
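The flow described above can be sketched directly against tools.cli. This is only an illustration, not what jepsen.cli does internally: it assumes org.clojure/tools.cli is on the classpath (as it is in a Jepsen project), and base-test is a hypothetical stand-in for the real test map.

```clojure
(require '[clojure.tools.cli :as tools-cli])

(def cli-opts
  "Same option spec as above: short name, long name, docstring."
  [["-q" "--quorum" "Use quorum reads, instead of reading from any primary."]])

;; parse-opts returns a map with :options, :arguments, :summary, and :errors.
(def parsed (tools-cli/parse-opts ["-q"] cli-opts))

(:options parsed)
;; => {:quorum true}

;; Hypothetical stand-in for the test map that etcd-test starts from.
(def base-test {:name "noop", :concurrency 10})

;; Merging the parsed options into the test map adds the :quorum key.
(merge base-test (:options parsed))
;; => {:name "noop", :concurrency 10, :quorum true}
```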
Now, let's use that quorum option to control whether the client issues quorum reads, in the Client invoke function:
(case (:f op)
  :read (let [value (-> conn
                        (v/get k {:quorum? (:quorum test)})
                        parse-long)]
          (assoc op :type :ok, :value (independent/tuple k value)))
Let's try lein run with and without quorum reads, and see whether it lets us see the stale reads bug again.
$ lein run test -q ...
...
$ lein run test ...
...
clojure.lang.ExceptionInfo: throw+: {:errorCode 209, :message "Invalid field", :cause "invalid value for \"quorum\"", :index 0, :status 400}
...
Huh. Let's double-check what the value was for :quorum in the test map. It's logged at the beginning of every Jepsen run:
2018-02-04 09:53:24,867{GMT} INFO [jepsen test runner] jepsen.core: Running test:
{:concurrency 10,
:db
#object[jepsen.etcdemo$db$reify__4946 0x15a8bbe5 "jepsen.etcdemo$db$reify__4946@15a8bbe5"],
:name "etcd",
:start-time
#object[org.joda.time.DateTime 0x54a5799f "2018-02-04T09:53:24.000-06:00"],
:net
#object[jepsen.net$reify__3493 0x2a2b3aff "jepsen.net$reify__3493@2a2b3aff"],
:client {:conn nil},
:barrier
#object[java.util.concurrent.CyclicBarrier 0x6987b74e "java.util.concurrent.CyclicBarrier@6987b74e"],
:ssh
{:username "root",
:password "root",
:strict-host-key-checking false,
:private-key-path nil},
:checker
#object[jepsen.checker$compose$reify__3220 0x71098fb3 "jepsen.checker$compose$reify__3220@71098fb3"],
:nemesis
#object[jepsen.nemesis$partitioner$reify__3601 0x47c15468 "jepsen.nemesis$partitioner$reify__3601@47c15468"],
:active-histories #<Atom@18bf1bad: #{}>,
:nodes ["n1" "n2" "n3" "n4" "n5"],
:test-count 1,
:generator
#object[jepsen.generator$time_limit$reify__1996 0x483fe83a "jepsen.generator$time_limit$reify__1996@483fe83a"],
:os
#object[jepsen.os.debian$reify__2908 0x8aa1562 "jepsen.os.debian$reify__2908@8aa1562"],
:time-limit 30,
:model {:value nil}}
Oh. That's odd. There... isn't a :quorum key here. Option flags only appear in the options map if they're present on the command line; if they're left out of the command line, they're left out of the option map too. When we ask for (:quorum test), and test has no :quorum option, we'll get nil.
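A quick REPL sketch shows the same thing with plain maps. The map below is a hypothetical, trimmed-down test map with only two keys, standing in for the one logged above.

```clojure
;; Hypothetical test map from a run without -q: there is no :quorum key at all.
(def test-without-flag {:name "etcd", :time-limit 30})

(contains? test-without-flag :quorum)
;; => false

;; Keyword lookup on a missing key returns nil rather than false.
(:quorum test-without-flag)
;; => nil

;; So the client builds this options map for the read:
{:quorum? (:quorum test-without-flag)}
;; => {:quorum? nil}
```

A nil there is neither true nor false, which is consistent with the "invalid value" error etcd returned.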
There are a few easy ways to fix this. We could coerce nil to false by using (boolean (:quorum test)), either at the client or in etcd-test. Or we could force the opt spec to provide a default value when the flag is omitted, by adding :default false to the quorum opt-spec. We'll apply boolean in etcd-test, just in case someone calls it directly, instead of through the CLI.
(defn etcd-test
  "Given an options map from the command line runner (e.g. :nodes, :ssh,
  :concurrency ...), constructs a test map. Special options:
      :quorum     Whether to use quorum reads"
  [opts]
  (let [quorum (boolean (:quorum opts))]
    (merge tests/noop-test
           opts
           {:name   (str "etcd q=" quorum)
            :quorum quorum
            ...
We're binding quorum to a variable here so that we can use its boolean value in two places. We add it to the test's name, which makes it easy to tell at a glance which tests used quorum reads. We also add it to the :quorum option. Since we merge opts before that, our boolean version of :quorum will take precedence over whatever is in opts. Now, without -q, our test can find errors again:
$ lein run test --time-limit 60 --concurrency 100 -q
...
Everything looks good! ヽ(‘ー`)ノ
$ lein run test --time-limit 60 --concurrency 100
...
Analysis invalid! (ノಥ益ಥ)ノ ┻━┻
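To see why the merge order in etcd-test matters, here is a stripped-down sketch that keeps only the :quorum handling. Both base-test and sketch-test are hypothetical names used for illustration; the real function builds a much larger map.

```clojure
;; Hypothetical stand-in for tests/noop-test.
(def base-test {:name "noop"})

(defn sketch-test
  "Simplified version of etcd-test: only the :quorum handling is kept."
  [opts]
  (let [quorum (boolean (:quorum opts))]
    ;; Later maps win in merge, so the coerced :quorum overrides opts.
    (merge base-test
           opts
           {:name   (str "etcd q=" quorum)
            :quorum quorum})))

;; Flag omitted: the key is missing from opts, but present (false) in the test.
(sketch-test {})
;; => {:name "etcd q=false", :quorum false}

;; Flag given on the command line.
(sketch-test {:quorum true})
;; => {:name "etcd q=true", :quorum true}

;; Called directly with an explicit nil: still coerced to false.
(sketch-test {:quorum nil})
;; => {:name "etcd q=false", :quorum false}
```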
Depending on how powerful your computer is, you may have noticed some tests get stuck on painfully slow analyses. It's hard to control this up-front--the difficulty of a test goes like ~n!, where n is the number of concurrent processes. A couple of crashed processes can make the difference between seconds and days to check.
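As a rough feel for how quickly n! grows, the snippet below tabulates it for a few process counts. It only illustrates the growth rate; it is not a cost model of the checker itself.

```clojure
(defn factorial
  "n!, using auto-promoting multiplication so large n does not overflow."
  [n]
  (reduce *' 1 (range 1 (inc n))))

;; Pairs of [n n!] for a few values of n.
(map (juxt identity factorial) [5 10 20])
;; => ([5 120] [10 3628800] [20 2432902008176640000])
```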
To help with this problem, let's add some tuning options to our test which control the number of operations you can perform on any single key, and how fast operations are generated.
In the generator, let's change our hardcoded 1/10 delay to a parameter, given as a rate per second, and change our hardcoded limit on each key's generator to a configurable parameter.
(defn etcd-test
  "Given an options map from the command line runner (e.g. :nodes, :ssh,
  :concurrency ...), constructs a test map. Special options:
      :quorum       Whether to use quorum reads
      :rate         Approximate number of requests per second, per thread
      :ops-per-key  Maximum number of operations allowed on any given key."
  [opts]
  (let [quorum (boolean (:quorum opts))]
    (merge tests/noop-test
           opts
           {:name      (str "etcd q=" quorum)
            :quorum    quorum
            :os        debian/os
            :db        (db "v3.1.5")
            :client    (Client. nil)
            :nemesis   (nemesis/partition-random-halves)
            :model     (model/cas-register)
            :checker   (checker/compose
                         {:perf     (checker/perf)
                          :linear   (independent/checker (checker/linearizable))
                          :timeline (independent/checker (timeline/html))})
            :generator (->> (independent/concurrent-generator
                              10
                              (range)
                              (fn [k]
                                (->> (gen/mix [r w cas])
                                     (gen/stagger (/ (:rate opts)))
                                     (gen/limit (:ops-per-key opts)))))
                            (gen/nemesis
                              (gen/seq (cycle [(gen/sleep 5)
                                               {:type :info, :f :start}
                                               (gen/sleep 5)
                                               {:type :info, :f :stop}])))
                            (gen/time-limit (:time-limit opts)))})))
And add corresponding command-line options:
(def cli-opts
  "Additional command line options."
  [["-q" "--quorum" "Use quorum reads, instead of reading from any primary."]
   ["-r" "--rate HZ" "Approximate number of requests per second, per thread."
    :default  10
    :parse-fn read-string
    :validate [#(and (number? %) (pos? %)) "Must be a positive number"]]
   [nil "--ops-per-key NUM" "Maximum number of operations on any given key."
    :default  100
    :parse-fn parse-long
    :validate [pos? "Must be a positive integer."]]])
We don't have to provide a short name for every option: we use nil to indicate that --ops-per-key has no short form. The capital words after each flag (e.g. "HZ" and "NUM") are arbitrary placeholders for values that you would pass. They'll be printed as a part of the usage documentation. We provide a :default for both options, which is used if there's no flag at the command line. For rates, we want to allow integers, decimals, and fractions, so we'll use Clojure's built-in read-string function to parse all three. Then we'll validate that the result is both a number and positive, to keep people from passing strings, negative numbers, zero rates, etc.
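Here is a small REPL sketch of how the :parse-fn and :validate entries for --rate behave on a few inputs. The name valid-rate? is introduced only for this illustration; it wraps the same predicate used in the opt spec.

```clojure
(def valid-rate?
  "The same predicate used in the :validate entry for --rate."
  #(and (number? %) (pos? %)))

;; read-string handles integers, decimals, and ratios alike.
(map read-string ["10" "2.5" "1/5"])
;; => (10 2.5 1/5)

;; Zero, negative numbers, and non-numeric input all fail validation.
;; ("fast" reads as a symbol, which is not a number.)
(map (comp valid-rate? read-string) ["10" "2.5" "1/5" "0" "-3" "fast"])
;; => (true true true false false false)

;; The generator passes the reciprocal of the rate to gen/stagger, so a
;; rate of 1/5 per second becomes a stagger delay of 5 seconds.
(= 5 (/ (read-string "1/5")))
;; => true
```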
Now, if we want to run a less aggressive test, we can try
$ lein run test --time-limit 10 --concurrency 10 --ops-per-key 10 -r 1/5
...
Everything looks good! ヽ(‘ー`)ノ
Looking through the history for each key, we can see that operations proceeded very slowly, and there are only 10 per key. This test is much easier to check! However, it also fails to find the bug! This is an inherent tension in Jepsen: we have to be aggressive to find errors, but verifying those aggressive histories can be much more difficult--even impossible.
Linearizability checking is NP-hard; there's no way around that. We can design somewhat more efficient checkers, but eventually, that exponential cliff is going to bite us. Perhaps, however... we could verify a weaker property. Something in linear or logarithmic time. Let's add a commutative test