
QCheck: improve int shrinker fast path #173

Closed
wants to merge 8 commits

Conversation

jmid
Collaborator

@jmid jmid commented Sep 7, 2021

Here is an example of what #172 opens up for:

A simple change (inspired by QCheck2) that tries 0 first on each int shrinker invocation has drastic improvements:

  • fun_last_foldleftright_qcheck.expected's line count is cut by a factor of 10 (read: 10x fewer shrink attempts!)
  • fun_first_foldleftright_qcheck is halved - and even finds a smaller counterexample
  • list_shorter_100_qcheck and list_size_shorter_100_qcheck are reduced to a quarter
  • list_shorter_10_qcheck is reduced to a third

You can also see that the shrink counts in test/core/QCheck_expect_test.expected all improve - except for the int ones, which increase by one (the user-reported shrink steps only count the successful ones).

A quick timing from test/core of

dune clean
time dune test --force --no-buffer

improves (before):

real	0m28.844s
user	1m2.870s
sys	0m4.937s

to (after):

real	0m23.954s
user	0m51.232s
sys	0m4.980s

This is a tradeoff, naturally:
int_zero_qcheck and int_smaller_209609 spend more shrinking steps because they repeatedly try 0 (as QCheck2's int shrinker does).
Given the above, I think this tradeoff is worth it though.
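
For context, QCheck shrinkers are iterator-style functions of type 'a -> ('a -> unit) -> unit that call yield on each shrinking candidate. A minimal sketch of the "try 0 first" idea (illustrative only, not the exact patch) could look like this:

(* Illustrative sketch - not the exact QCheck source. *)
let shrink_int x yield =
  if x <> 0 then yield 0;          (* fast path: propose 0 immediately *)
  let y = ref x in
  while abs !y > 2 do              (* then walk back towards x by halving *)
    y := !y / 2;
    yield (x - !y)
  done;
  if x > 0 then yield (x - 1);     (* finally try the immediate neighbours *)
  if x < 0 then yield (x + 1)

In this sketch, the initial yield 0 is the only difference from a plain halving shrinker.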

@jmid jmid requested review from c-cube and Gbury September 7, 2021 18:42
@Gbury
Collaborator

Gbury commented Sep 9, 2021

I see the point of trying 0 first, but as you pointed out, in some reasonable cases (such as when the shrinker has to find a minimum bound that is not 0), half of the tests run might be on the int 0, which is... definitely not great. At least the current behavior avoids that kind of problem of repeatedly trying 0 (and even if it's a bit longer, it should still be quite fast, considering that you only need a logarithmic number of steps to get down to 0).

The speed improvements on the examples are mostly because of how simple the examples are and because, in most cases, trying 0 first short-circuits a lot of computation; I don't know whether that's realistic or not. Imagine a case where 0 is not a minimal counter-example and the test takes a bit of time; I'm not sure the strategy of trying 0 first is beneficial there (and that might be a more realistic test case).

However, maybe this is a sign that we might need a somewhat more complex shrinking strategy, i.e. instead of shrinking one counter-example into a sequence, and then restarting from the first counter-example in the sequence, we might want to add some things such as:

* either a cache that remembers (at least some of) the values tried which aren't counter-examples, so that these are not re-tested again [...]

* or maybe we need a shrinker with more structure, where the shrinking function is passed some more information about what happened up to this point in the shrinking (i.e. make the shrinking function aware of the sequence of events).

@jmid
Collaborator Author

jmid commented Sep 9, 2021

I see the point of trying 0 first, but as you pointed out, in some reasonable cases (such as when the shrinker has to find a minimum bound that is not 0), half of the tests run might be on the int 0, which is... definitely not great. At least the current behavior avoids that kind of problem of repeatedly trying 0 (and even if it's a bit longer, it should still be quite fast, considering that you only need a logarithmic number of steps to get down to 0).

There are many things to say here. I've tried to summarize some of them starting here: #153 (comment)
(the first comment starts out discussing the list shrinker, but it turns out to involve int shrinkers).

Overall, those observations convinced me that trying 0 first is actually the fast path.
As soon as we have ints in lists, functions, tuples etc., some (most) of them will be irrelevant, so quickly reducing
them will be a speed-up.

For a list of length n where the element values do not matter, repeated halving will spend O(n log n) reduction steps,
whereas the aggressive shrinker will only spend O(n) steps. This does not change, e.g., if a few list elements have to be different
from 0.
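
As a back-of-the-envelope illustration of this point (hypothetical helper functions, not QCheck code), the per-element step counts compare roughly like this:

(* Hypothetical counts: shrink attempts needed to reduce one irrelevant element v to 0. *)
let halving_steps v =                             (* repeated halving: O(log v) *)
  let rec go v acc = if v = 0 then acc else go (v / 2) (acc + 1) in
  go (abs v) 0

let zero_first_steps v = if v = 0 then 0 else 1   (* "try 0 first": O(1) *)

For a list of n irrelevant elements this gives roughly n * log v attempts versus n, which is the O(n log n) versus O(n) comparison above.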

Now imagine functions over ints, where the shrinker will have to repeatedly shrink both the key and the value of each irrelevant key-value entry.

The speed improvements on the examples are mostly because of how simple the examples are and because, in most cases, trying 0 first short-circuits a lot of computation; I don't know whether that's realistic or not. Imagine a case where 0 is not a minimal counter-example and the test takes a bit of time; I'm not sure the strategy of trying 0 first is beneficial there (and that might be a more realistic test case).

I completely agree that trying 0 first short-circuits a lot of computation - that is precisely the point.
For your (potentially) hypothetical example, a counterexample should preferably involve only ints
that matter to the counterexample. As soon as it involves an additional int that needs shrinking,
we will spend log n + log n expensive shrinking steps. For anything involving additional irrelevant ints
that need shrinking, the aggressive approach will be a win.

Here are some numbers (with wc counting steps; *.expected_old denotes the "old" expected outputs, *.expected the new):

      64 fun_first_foldleftright_qcheck.expected
     130 fun_first_foldleftright_qcheck.expected_old
      99 fun_last_foldleftright_qcheck.expected
     994 fun_last_foldleftright_qcheck.expected_old
     196 int_smaller_209609_qcheck.expected
     144 int_smaller_209609_qcheck.expected_old
     125 int_zero_qcheck.expected
      64 int_zero_qcheck.expected_old
       6 list_empty_qcheck.expected
      10 list_empty_qcheck.expected_old
   54030 list_shorter_100_qcheck.expected
  196830 list_shorter_100_qcheck.expected_old
     306 list_shorter_10_qcheck.expected
     930 list_shorter_10_qcheck.expected_old
   52000 list_size_shorter_100_qcheck.expected
  209080 list_size_shorter_100_qcheck.expected_old
      15 list_unique_elems_qcheck.expected
      15 list_unique_elems_qcheck.expected_old

As we can see, only the int_* tests got worse (also the nat_ one, but I forgot to save that output).
For int_zero_qcheck it is indeed a factor of 2, but not for int_smaller_209609_qcheck.
Why? The old QCheck shrinker only restarts (and thus redundantly retries 0) after a successful reduction step,
and only about 1/4 of the steps in int_smaller_209609_qcheck appear to be successful.

If 209080 shrink attempts are hard to relate to, consider the file sizes:

$ for file in *qcheck.expected_old; do ls -hs ${file%_old} $file; done | grep -v total
4.0K fun_first_foldleftright_qcheck.expected
8.0K fun_first_foldleftright_qcheck.expected_old
1.2M fun_last_foldleftright_qcheck.expected
 12M fun_last_foldleftright_qcheck.expected_old
4.0K int_smaller_209609_qcheck.expected
4.0K int_smaller_209609_qcheck.expected_old
4.0K int_zero_qcheck.expected
4.0K int_zero_qcheck.expected_old
4.0K list_empty_qcheck.expected
4.0K list_empty_qcheck.expected_old
1.5M list_shorter_100_qcheck.expected
5.3M list_shorter_100_qcheck.expected_old
12K list_shorter_10_qcheck.expected
32K list_shorter_10_qcheck.expected_old
1.4M list_size_shorter_100_qcheck.expected
5.6M list_size_shorter_100_qcheck.expected_old
4.0K list_unique_elems_qcheck.expected
4.0K list_unique_elems_qcheck.expected_old

12M for fun_last_foldleftright_qcheck for only 994 shrink steps!
5.6M and 5.3M for list_size_shorter_100_qcheck and list_shorter_100_qcheck!
This is a lot of data, some of which is kept in memory needlessly long,
thus stressing the OCaml GC.

However, maybe this is a sign that we might need a somewhat more complex shrinking strategy, i.e. instead of shrinking one counter-example into a sequence, and then restarting from the first counter-example in the sequence, we might want to add some things such as:

* either a cache that remembers (at least some of) the values tried which aren't counter-examples, so that these are not re-tested again [...]

* or maybe we need a shrinker with more structure, where the shrinking function is passed some more information about what happened up to this point in the shrinking (i.e. make the shrinking function aware of the sequence of events).

I actually suggested something like a 1-element cache (context awareness) in the QCheck2 shrinker for eliminating
the redundant 0s at the end of this comment: #153 (comment)
I even threw together a prototype (which ended up slowing the QCheck2 int shrinker down!).
I can share it here if you'd like to try it out.
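
For illustration only (this is not that prototype), a 1-element cache could be wrapped around any iterator-style shrinker of type 'a -> ('a -> unit) -> unit roughly like this:

(* Illustration: skip a candidate if it equals the last one already proposed,
   e.g. the redundant 0 yielded again after a restart. *)
let with_last_cache shrink =
  let last = ref None in
  fun x yield ->
    shrink x (fun c ->
        if !last <> Some c then begin
          last := Some c;
          yield c
        end)

As noted above, though, such bookkeeping can itself end up costing more than it saves.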

Overall, this is a trade-off between practical bang-for-buck and clever algorithms, and QCheck2's choice wins it.
Sure, I would be happy to avoid redundant 0-tests, but currently the aggressive approach actually pays off.
Even with this fix there are additional trade-offs to be made in QCheck's list shrinker and QCheck2's function shrinker:

$ wc -l *_qcheck*.expected | grep -v total
     19 char_never_abcdef_qcheck2.expected
     18 char_never_abcdef_qcheck.expected
     32 fun_first_foldleftright_qcheck2.expected
     64 fun_first_foldleftright_qcheck.expected
    601 fun_last_foldleftright_qcheck2.expected
     99 fun_last_foldleftright_qcheck.expected
      4 int_smaller_209609_qcheck2.expected
    196 int_smaller_209609_qcheck.expected
    125 int_zero_qcheck2.expected
    125 int_zero_qcheck.expected
      7 list_empty_qcheck2.expected
      6 list_empty_qcheck.expected
    126 list_shorter_100_qcheck2.expected
  54030 list_shorter_100_qcheck.expected
     27 list_shorter_10_qcheck2.expected
    306 list_shorter_10_qcheck.expected
    150 list_size_shorter_100_qcheck2.expected
  52000 list_size_shorter_100_qcheck.expected
      8 list_unique_elems_qcheck2.expected
     15 list_unique_elems_qcheck.expected
    106 nat_smaller_5001_qcheck2.expected
     90 nat_smaller_5001_qcheck.expected
      7 string_empty_qcheck2.expected
      5 string_empty_qcheck.expected
    373 string_never_has_000_char_qcheck2.expected
    406 string_never_has_000_char_qcheck.expected

I've tried to argue for why trying 0 first is preferable. Generally, this doesn't cost much - despite the redundancy.
Without the shrink logs we wouldn't even be having this discussion, as the successful shrink counts only improved: 😄

$ git diff -U0 6cf99fd95f48c8c52902af7945ba0dc1d9aa5664 ../QCheck_expect_test.expected | grep -v @@
diff --git a/test/core/QCheck_expect_test.expected b/test/core/QCheck_expect_test.expected
index 795ca38..3e1896b 100644
--- a/test/core/QCheck_expect_test.expected
+++ b/test/core/QCheck_expect_test.expected
-Test should_fail_sort_id failed (18 shrink steps):
+Test should_fail_sort_id failed (15 shrink steps):
-Test should_error_raise_exn errored on (63 shrink steps):
+Test should_error_raise_exn errored on (1 shrink steps):
-Test long_shrink failed (149 shrink steps):
+Test long_shrink failed (89 shrink steps):
-Test ints arent 0 mod 3 failed (84 shrink steps):
+Test ints arent 0 mod 3 failed (1 shrink steps):
--21
+0
-Test ints are 0 failed (62 shrink steps):
+Test ints are 0 failed (63 shrink steps):
-Test lists are empty failed (11 shrink steps):
+Test lists are empty failed (8 shrink steps):
-Test lists shorter than 10 failed (50 shrink steps):
+Test lists shorter than 10 failed (17 shrink steps):
-Test lists shorter than 432 failed (1696 shrink steps):
+Test lists shorter than 432 failed (455 shrink steps):
-Test fail_pred_map_commute failed (127 shrink steps):
+Test fail_pred_map_commute failed (128 shrink steps):
-([3], {_ -> 0}, {3 -> false; _ -> true})
+([0], {_ -> -1}, {0 -> false; _ -> true})
-Test fold_left fold_right failed (25 shrink steps):
+Test fold_left fold_right failed (23 shrink steps):
-Test fold_left fold_right uncurried failed (111 shrink steps):
+Test fold_left fold_right uncurried failed (104 shrink steps):
-({(5, 7) -> 0; _ -> 7}, 0, [5; 0])
+({(7, 0) -> 1; _ -> 0}, 0, [7])
-Test fold_left fold_right uncurried fun last failed (26 shrink steps):
+Test fold_left fold_right uncurried fun last failed (23 shrink steps):
-Test false fold, fun first failed (40 shrink steps):
+Test false fold, fun first failed (37 shrink steps):

In any case we have a situation where:

  • the QCheck2 int shrinker aggressively shrinks ints by trying 0 first (and later sometimes redundantly)
  • the QCheck int shrinker currently doesn't - but doing so would pay off
    (the current shrinker actually has an unrelated redundancy at the end - see the new unit tests in PR Test.check_exn and more unit tests #174)

I think we should work to align these two int shrinking approaches.

@Gbury
Collaborator

Gbury commented Sep 9, 2021

I don't disagree with your points, and I agree that, in absolute terms, trying 0 as the first shrink step for integer shrinking is most likely the right solution. The main problem I have is that in a non-negligible number of tests, with the current implementation this means that sometimes up to half of the cases tried by the shrinker will be 0, so basically up to half of the work done by the shrinker is useless and wasted.

This is not a problem reflected by the testsuite, because the tests in it are too simple. Most (if not all) of the list tests you mention basically do not care about the individual elements and are mainly tests of the length of the list. Additionally, the property tested is almost always trivial (or rather, extremely quick) to compute. That explains why, in your experiments to add a cache, the performance was worse: basically, the cache check was more costly than the property to check, which I don't think is very usual in real-life QCheck uses. All of that is not to say that these tests are bad; in the context of a testsuite it's good to have simple tests such as these. However, they are not a good benchmark to use when comparing shrinking strategies.

@jmid
Collaborator Author

jmid commented Sep 11, 2021

I don't disagree with your points, and I agree that, in absolute terms, trying 0 as the first shrink step for integer shrinking is most likely the right solution. The main problem I have is that in a non-negligible number of tests, with the current implementation this means that sometimes up to half of the cases tried by the shrinker will be 0, so basically up to half of the work done by the shrinker is useless and wasted.

Spending half the attempts repeatedly trying 0 will only happen for properties functionally equivalent to the ints are zero test. Whenever the property holds for a tested shrinking candidate, the shrinker moves on to the next yielded candidate (rather than restarting on 0), which brings the redundant attempts below 50%. int_smaller_209609 illustrates this.

This is not a problem reflected by the testsuite, because the tests in it are too simple. Most (if not all) of the list tests you mention basically do not care about the individual elements and are mainly tests of the length of the list. Additionally, the property tested is almost always trivial (or rather, extremely quick) to compute. That explains why, in your experiments to add a cache, the performance was worse: basically, the cache check was more costly than the property to check, which I don't think is very usual in real-life QCheck uses. All of that is not to say that these tests are bad; in the context of a testsuite it's good to have simple tests such as these. However, they are not a good benchmark to use when comparing shrinking strategies.

I am happy that this effort calls for a performance test suite for QCheck. I think it would be a very nice addition (hint, hint).
In this comment @sir4ur0n talks of a 10x performance degradation from QCheck1 to QCheck2: #135 (comment)
That would indeed be nice to quantify and compare across the two versions - both now and going forward.

Up until now, these tests are what we have had, and as such we have used them to benchmark our shrinkers in the past.
Now these shrinker tests (and others) have been collected into a test suite, and we have a shrinking logger set up to help understand their behaviour (which is sometimes puzzling).

Constructively and concretely: What would be a good benchmark to compare shrinking strategies in your opinion?

@jmid jmid mentioned this pull request Sep 12, 2021
@jmid jmid mentioned this pull request Apr 3, 2022
@jmid
Collaborator Author

jmid commented Apr 16, 2022

Closing as this is subsumed by #235

@jmid jmid closed this Apr 16, 2022