
6.91.2: flaky tests/quality/test_discovery_ability.py::test_can_produce_multi_line_strings #3829

Closed · roehling opened this issue Jan 3, 2024 · 3 comments · Fixed by #3841
Labels: flaky-tests (for when our tests only sometimes pass)

roehling (Contributor) commented Jan 3, 2024

Recently, I started to see intermittent test failures for test_can_produce_multi_line_strings in Debian package builds. I don't know if this is some sort of regression in version 6.92.2 or if this test has always been a bit flaky and merely became more likely to fail because Debian runs the test suite twice at the moment (Python 3.11 and Python 3.12).

Example failure for Python 3.11:

_____________________ test_can_produce_multi_line_strings ______________________
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/_pytest/runner.py", line 341, in from_call
    result: Optional[TResult] = func()
                                ^^^^^^
  File "/usr/lib/python3/dist-packages/_pytest/runner.py", line 262, in <lambda>
    lambda: ihook(item=item, **kwds), when=when, reraise=reraise
            ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/pluggy/_hooks.py", line 493, in __call__
    return self._hookexec(self.name, self._hookimpls, kwargs, firstresult)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/pluggy/_manager.py", line 115, in _hookexec
    return self._inner_hookexec(hook_name, methods, kwargs, firstresult)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/pluggy/_callers.py", line 152, in _multicall
    return outcome.get_result()
           ^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/pluggy/_result.py", line 114, in get_result
    raise exc.with_traceback(exc.__traceback__)
  File "/usr/lib/python3/dist-packages/pluggy/_callers.py", line 77, in _multicall
    res = hook_impl.function(*args)
          ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/_pytest/runner.py", line 177, in pytest_runtest_call
    raise e
  File "/usr/lib/python3/dist-packages/_pytest/runner.py", line 169, in pytest_runtest_call
    item.runtest()
  File "/usr/lib/python3/dist-packages/_pytest/python.py", line 1792, in runtest
    self.ihook.pytest_pyfunc_call(pyfuncitem=self)
  File "/usr/lib/python3/dist-packages/pluggy/_hooks.py", line 493, in __call__
    return self._hookexec(self.name, self._hookimpls, kwargs, firstresult)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/pluggy/_manager.py", line 115, in _hookexec
    return self._inner_hookexec(hook_name, methods, kwargs, firstresult)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/pluggy/_callers.py", line 113, in _multicall
    raise exception.with_traceback(exception.__traceback__)
  File "/usr/lib/python3/dist-packages/pluggy/_callers.py", line 77, in _multicall
    res = hook_impl.function(*args)
          ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/_pytest/python.py", line 194, in pytest_pyfunc_call
    result = testfunction(**testargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/<<PKGBUILDDIR>>/.pybuild/cpython3_3.11_hypothesis/build/tests/quality/test_discovery_ability.py", line 118, in run_test
    raise HypothesisFalsified(
tests.quality.test_discovery_ability.HypothesisFalsified: P(lambda x: "\n" in x) ~ 47 / 98 = 0.48 < 0.50; rejected

Full Buildlog

Example failure for Python 3.12:

_____________________ test_can_produce_multi_line_strings ______________________
    def run_test():
        if condition is None:
    
            def _condition(x):
                return True
    
            condition_string = ""
        else:
            _condition = condition
            condition_string = strip_lambda(
                reflection.get_pretty_function_description(condition)
            )
    
        def test_function(data):
            with BuildContext(data):
                try:
                    value = data.draw(specifier)
                except UnsatisfiedAssumption:
                    data.mark_invalid()
                if not _condition(value):
                    data.mark_invalid()
                if predicate(value):
                    data.mark_interesting()
    
        successes = 0
        actual_runs = 0
        for actual_runs in range(1, RUNS + 1):
            # We choose the max_examples a bit larger than default so that we
            # run at least 100 examples outside of the small example generation
            # part of the generation phase.
            runner = ConjectureRunner(
                test_function,
                settings=Settings(
                    max_examples=150,
                    phases=no_shrink,
                    suppress_health_check=suppress_health_check,
                ),
            )
            runner.run()
            if runner.interesting_examples:
                successes += 1
                if successes >= required_runs:
                    return
    
            # If we reach a point where it's impossible to hit our target even
            # if every remaining attempt were to succeed, give up early and
            # report failure.
            if (required_runs - successes) > (RUNS - actual_runs):
                break
    
        event = reflection.get_pretty_function_description(predicate)
        if condition is not None:
            event += "|"
            event += condition_string
    
>       raise HypothesisFalsified(
            f"P({event}) ~ {successes} / {actual_runs} = "
            f"{successes / actual_runs:.2f} < {required_runs / RUNS:.2f}; "
            "rejected"
        )
E       tests.quality.test_discovery_ability.HypothesisFalsified: P(lambda x: "\n" in x) ~ 44 / 95 = 0.46 < 0.50; rejected
tests/quality/test_discovery_ability.py:118: HypothesisFalsified

Full Buildlog

tybug (Member) commented Jan 6, 2024

Thanks for the report! This seems to have regressed in 6.91.2 (#3801). Checked by running the following shell loop until it failed:

while pytest hypothesis-python/tests/ -k test_can_produce_multi_line_strings; do :; done

We could just accept that the distribution has changed slightly and lower the probability required for a pass here. Regardless, I'll take a look at the distribution of text() before and after the above PR to make sure nothing else that we care about in the distribution changed.
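
For context, the quality tests in test_discovery_ability.py are statistical: they run the strategy many times and require the predicate to be found in at least a given fraction of runs, so "lowering the probability" means relaxing that fraction. A rough sketch of the shape of such a check (illustrative only; the helper below and its threshold are assumptions, not the actual test code):

# Illustrative sketch only, not the real test helper: require that `predicate`
# is satisfied by some generated example in at least `required_p` of RUNS
# independent attempts. Lowering `required_p` relaxes the test.
from hypothesis import HealthCheck, find, settings
from hypothesis.errors import NoSuchExample
from hypothesis.strategies import text

RUNS = 100

def found_fraction(strategy, predicate, required_p):
    successes = 0
    for _ in range(RUNS):
        try:
            find(
                strategy,
                predicate,
                settings=settings(
                    max_examples=150,
                    suppress_health_check=list(HealthCheck),
                ),
            )
            successes += 1
        except NoSuchExample:
            pass
    return successes / RUNS >= required_p

# e.g. relaxing the multi-line-string requirement from 0.50 to 0.35:
assert found_fraction(text(), lambda s: "\n" in s, required_p=0.35)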

tybug added the flaky-tests label on Jan 6, 2024
Zac-HD (Member) commented Jan 8, 2024

Thanks Liam! I think lowering the probability (substantially) is a fine solution here; the important thing is that we remain unlikely to miss bugs which only trigger on multi-line strings.

tybug (Member) commented Jan 13, 2024

This was caused by 9283da3 specifically: 6.91.1 has a pass rate of ~0.75, and 6.91.2 of ~0.55.

Given how quickly our distributions change, I'm not going to dig much deeper as long as they aren't obviously incorrect. The distributions are going to change again with #3818, for instance; indeed, the pass rate on that branch is back up to ~0.68.
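
For reproducing pass-rate numbers like these, a loop that reruns the test and counts successful exits is enough; here is a sketch (the path and run count are illustrative):

# Sketch: estimate the flaky test's pass rate by invoking pytest repeatedly
# and counting the runs that exit successfully. Path and RUNS are illustrative.
import subprocess

RUNS = 100
passes = 0
for _ in range(RUNS):
    result = subprocess.run(
        [
            "pytest", "-q",
            "hypothesis-python/tests/quality/test_discovery_ability.py",
            "-k", "test_can_produce_multi_line_strings",
        ],
        capture_output=True,
    )
    passes += result.returncode == 0

print(f"pass rate: {passes / RUNS:.2f}")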

A brief distribution investigation:

from hypothesis import given, settings
from hypothesis.strategies import text
import matplotlib.pyplot as plot

# Collect the code point of every generated character, split into "small"
# (< max_ord) and "large" (>= max_ord) buckets.
small_ords = []
large_ords = []
max_ord = 1000

@given(text())
@settings(max_examples=20_000)
def f(s):
    for c in s:
        o = ord(c)
        if o < max_ord:
            small_ords.append(o)
        else:
            large_ords.append(o)

f()

print(f"small ords: {len(small_ords)}")
print(f"large ords: {len(large_ords)}")

# Histogram of the small code points, then of the large ones.
plot.hist(small_ords, bins=max_ord // 2)
plot.show()

plot.hist(large_ords, bins=100)
plot.show()

# settings
# --------
# max_examples = 20_000
# max_ord = 1_000
#
# 6.91.2
# ------
# small ords:     122077
# large ords:     25508
#
# 6.91.1
# ------
# small ords:     83599
# large ords:     32362

which shows that 6.91.2 is actually more likely to generate small ords, at least in the long run. This contradicts the above (ord("\n") == 10), but maybe this relationship only shows up with larger example budgets. I'm not sure what to make of it, and I'm not going to dig into why at the moment. (I wonder if this is why we had to increase the example budget for a dtype encoding bug, which triggers on surrogates, i.e. relatively high code points?)
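
A complementary check would be to measure directly how often text() emits a string containing a newline, which is closer to what the quality test rewards than raw code-point counts; something along these lines (just a sketch, not a measurement actually run here):

# Sketch (not the measurement run above): fraction of generated strings that
# contain a newline.
from hypothesis import given, settings
from hypothesis.strategies import text

counts = {"total": 0, "newline": 0}

@given(text())
@settings(max_examples=20_000)
def count_newlines(s):
    counts["total"] += 1
    counts["newline"] += "\n" in s

count_newlines()
print(f"P(newline in s) ~ {counts['newline'] / counts['total']:.3f}")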

I also looked at graphs for string length and ord distribution, and while they were slightly changed, nothing was obviously wrong.

I'm going to decrease the probability and call it a day. Any time here is probably better spent working on the IR and obsoleting this distribution!

tybug changed the title from "6.92.2: flaky tests/quality/test_discovery_ability.py::test_can_produce_multi_line_strings" to "6.91.2: flaky tests/quality/test_discovery_ability.py::test_can_produce_multi_line_strings" on Jan 13, 2024