
6.91.2: flaky tests/quality/test_discovery_ability.py::test_can_produce_multi_line_strings #3829

Closed · roehling opened this issue Jan 3, 2024 · 3 comments · Fixed by #3841
Labels: flaky-tests (for when our tests only sometimes pass)

roehling (Contributor) commented Jan 3, 2024

Recently, I started to see intermittent test failures for test_can_produce_multi_line_strings in Debian package builds. I don't know if this is some sort of regression in version 6.92.2 or if this test has always been a bit flaky and merely became more likely to fail because Debian runs the test suite twice at the moment (Python 3.11 and Python 3.12).

Example failure for Python 3.11:

_____________________ test_can_produce_multi_line_strings ______________________
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/_pytest/runner.py", line 341, in from_call
    result: Optional[TResult] = func()
                                ^^^^^^
  File "/usr/lib/python3/dist-packages/_pytest/runner.py", line 262, in <lambda>
    lambda: ihook(item=item, **kwds), when=when, reraise=reraise
            ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/pluggy/_hooks.py", line 493, in __call__
    return self._hookexec(self.name, self._hookimpls, kwargs, firstresult)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/pluggy/_manager.py", line 115, in _hookexec
    return self._inner_hookexec(hook_name, methods, kwargs, firstresult)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/pluggy/_callers.py", line 152, in _multicall
    return outcome.get_result()
           ^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/pluggy/_result.py", line 114, in get_result
    raise exc.with_traceback(exc.__traceback__)
  File "/usr/lib/python3/dist-packages/pluggy/_callers.py", line 77, in _multicall
    res = hook_impl.function(*args)
          ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/_pytest/runner.py", line 177, in pytest_runtest_call
    raise e
  File "/usr/lib/python3/dist-packages/_pytest/runner.py", line 169, in pytest_runtest_call
    item.runtest()
  File "/usr/lib/python3/dist-packages/_pytest/python.py", line 1792, in runtest
    self.ihook.pytest_pyfunc_call(pyfuncitem=self)
  File "/usr/lib/python3/dist-packages/pluggy/_hooks.py", line 493, in __call__
    return self._hookexec(self.name, self._hookimpls, kwargs, firstresult)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/pluggy/_manager.py", line 115, in _hookexec
    return self._inner_hookexec(hook_name, methods, kwargs, firstresult)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/pluggy/_callers.py", line 113, in _multicall
    raise exception.with_traceback(exception.__traceback__)
  File "/usr/lib/python3/dist-packages/pluggy/_callers.py", line 77, in _multicall
    res = hook_impl.function(*args)
          ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/_pytest/python.py", line 194, in pytest_pyfunc_call
    result = testfunction(**testargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/<<PKGBUILDDIR>>/.pybuild/cpython3_3.11_hypothesis/build/tests/quality/test_discovery_ability.py", line 118, in run_test
    raise HypothesisFalsified(
tests.quality.test_discovery_ability.HypothesisFalsified: P(lambda x: "\n" in x) ~ 47 / 98 = 0.48 < 0.50; rejected

Full Buildlog

Example failure for Python 3.12:

_____________________ test_can_produce_multi_line_strings ______________________
    def run_test():
        if condition is None:
    
            def _condition(x):
                return True
    
            condition_string = ""
        else:
            _condition = condition
            condition_string = strip_lambda(
                reflection.get_pretty_function_description(condition)
            )
    
        def test_function(data):
            with BuildContext(data):
                try:
                    value = data.draw(specifier)
                except UnsatisfiedAssumption:
                    data.mark_invalid()
                if not _condition(value):
                    data.mark_invalid()
                if predicate(value):
                    data.mark_interesting()
    
        successes = 0
        actual_runs = 0
        for actual_runs in range(1, RUNS + 1):
            # We choose the max_examples a bit larger than default so that we
            # run at least 100 examples outside of the small example generation
            # part of the generation phase.
            runner = ConjectureRunner(
                test_function,
                settings=Settings(
                    max_examples=150,
                    phases=no_shrink,
                    suppress_health_check=suppress_health_check,
                ),
            )
            runner.run()
            if runner.interesting_examples:
                successes += 1
                if successes >= required_runs:
                    return
    
            # If we reach a point where it's impossible to hit our target even
            # if every remaining attempt were to succeed, give up early and
            # report failure.
            if (required_runs - successes) > (RUNS - actual_runs):
                break
    
        event = reflection.get_pretty_function_description(predicate)
        if condition is not None:
            event += "|"
            event += condition_string
    
>       raise HypothesisFalsified(
            f"P({event}) ~ {successes} / {actual_runs} = "
            f"{successes / actual_runs:.2f} < {required_runs / RUNS:.2f}; "
            "rejected"
        )
E       tests.quality.test_discovery_ability.HypothesisFalsified: P(lambda x: "\n" in x) ~ 44 / 95 = 0.46 < 0.50; rejected
tests/quality/test_discovery_ability.py:118: HypothesisFalsified

Full Buildlog

tybug (Member) commented Jan 6, 2024

Thanks for the report! This seems to have regressed in 6.91.2 (#3801). Checked by running the following shell loop until it failed:

while pytest hypothesis-python/tests/ -k test_can_produce_multi_line_strings; do :; done

We could just accept that the distribution has changed slightly and lower the probability required for a pass here. Regardless, I'll take a look at the distribution of text() before and after the above PR to make sure nothing else that we care about in the distribution changed.
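
For context, the quality tests in test_discovery_ability.py are statistical: they run the strategy many times and require the predicate to be found in at least a given fraction of runs, so "lowering the probability" means relaxing that fraction. A rough sketch of the shape of such a check (illustrative only; the helper below and its threshold are assumptions, not the actual test code):

# Illustrative sketch only, not the real test helper: require that `predicate`
# is satisfied by some generated example in at least `required_p` of RUNS
# independent attempts. Lowering `required_p` relaxes the test.
from hypothesis import HealthCheck, find, settings
from hypothesis.errors import NoSuchExample
from hypothesis.strategies import text

RUNS = 100

def found_fraction(strategy, predicate, required_p):
    successes = 0
    for _ in range(RUNS):
        try:
            find(
                strategy,
                predicate,
                settings=settings(
                    max_examples=150,
                    suppress_health_check=list(HealthCheck),
                ),
            )
            successes += 1
        except NoSuchExample:
            pass
    return successes / RUNS >= required_p

# e.g. relaxing the multi-line-string requirement from 0.50 to 0.35:
assert found_fraction(text(), lambda s: "\n" in s, required_p=0.35)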

tybug added the flaky-tests label on Jan 6, 2024
Zac-HD (Member) commented Jan 8, 2024

Thanks Liam! I think lowering the probability (substantially) is a fine solution here; the important thing is that we remain unlikely to miss bugs which only trigger on multi-line strings.

tybug (Member) commented Jan 13, 2024

This was caused by 9283da3 specifically: 6.91.1 has a pass rate of ~0.75, and 6.91.2 of ~0.55.

Given how quickly our distributions change, I'm not going to dig much deeper as long as they aren't obviously incorrect. The distributions are going to change again with #3818, for instance; indeed, the pass rate on that branch is back up to ~0.68.
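
For reproducing pass-rate numbers like these, a loop that reruns the test and counts successful exits is enough; here is a sketch (the path and run count are illustrative):

# Sketch: estimate the flaky test's pass rate by invoking pytest repeatedly
# and counting the runs that exit successfully. Path and RUNS are illustrative.
import subprocess

RUNS = 100
passes = 0
for _ in range(RUNS):
    result = subprocess.run(
        [
            "pytest", "-q",
            "hypothesis-python/tests/quality/test_discovery_ability.py",
            "-k", "test_can_produce_multi_line_strings",
        ],
        capture_output=True,
    )
    passes += result.returncode == 0

print(f"pass rate: {passes / RUNS:.2f}")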

A brief distribution investigation:

from hypothesis import given, settings
from hypothesis.strategies import text
import matplotlib.pyplot as plot

# Collect the code point of every generated character, split into "small"
# (< max_ord) and "large" (>= max_ord) buckets.
small_ords = []
large_ords = []
max_ord = 1000

@given(text())
@settings(max_examples=20_000)
def f(s):
    for c in s:
        o = ord(c)
        if o < max_ord:
            small_ords.append(o)
        else:
            large_ords.append(o)

f()

print(f"small ords: {len(small_ords)}")
print(f"large ords: {len(large_ords)}")

# Histogram of the small code points, then of the large ones.
plot.hist(small_ords, bins=max_ord // 2)
plot.show()

plot.hist(large_ords, bins=100)
plot.show()

# settings
# --------
# max_examples = 20_000
# max_ord = 1_000
#
# 6.91.2
# ------
# small ords:     122077
# large ords:     25508
#
# 6.91.1
# ------
# small ords:     83599
# large ords:     32362

which shows that 6.91.2 is actually more likely to generate small ords, at least in the long run. This contradicts the above (ord("\n") == 10), but maybe this relationship only shows up with larger example budgets. I'm not sure what to make of it, and I'm not going to dig into why at the moment. (I wonder if this is why we had to increase the example budget for a dtype encoding bug, which triggers on surrogates, i.e. relatively high code points?)
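
A complementary check would be to measure directly how often text() emits a string containing a newline, which is closer to what the quality test rewards than raw code-point counts; something along these lines (just a sketch, not a measurement actually run here):

# Sketch (not the measurement run above): fraction of generated strings that
# contain a newline.
from hypothesis import given, settings
from hypothesis.strategies import text

counts = {"total": 0, "newline": 0}

@given(text())
@settings(max_examples=20_000)
def count_newlines(s):
    counts["total"] += 1
    counts["newline"] += "\n" in s

count_newlines()
print(f"P(newline in s) ~ {counts['newline'] / counts['total']:.3f}")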

I also looked at graphs for string length and ord distribution, and while they were slightly changed, nothing was obviously wrong.

I'm going to decrease the probability and call it a day. Any time here is probably better spent working on the IR and obsoleting this distribution!

tybug changed the title from "6.92.2: flaky tests/quality/test_discovery_ability.py::test_can_produce_multi_line_strings" to "6.91.2: flaky tests/quality/test_discovery_ability.py::test_can_produce_multi_line_strings" on Jan 13, 2024