Weighting of emoji in text strategies? #1401
Given the sheer number of codepoints and combinations of edge cases in unicode, I don't think this is too bad as default behaviour! If you know you're after emoji characters though, it's easy enough to create a specialised strategy:

```python
# Note: sample sorted list rather than dict, so order is stable between runs
emoji_characters = st.sampled_from(sorted(emoji.UNICODE_EMOJI))
```
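A hedged sketch of how such a specialised strategy might be wired into a test (the property and names are illustrative; it assumes an emoji release where emoji.UNICODE_EMOJI is a flat mapping, as in the snippet above, and joins the pieces rather than passing them as an alphabet because some entries are multi-codepoint sequences):

```python
from hypothesis import given, strategies as st
import emoji  # third-party package; UNICODE_EMOJI exists in older releases

emoji_characters = st.sampled_from(sorted(emoji.UNICODE_EMOJI))

# Guarantee at least one emoji per example by sandwiching a sampled emoji
# between two arbitrary chunks of text, then joining the pieces.
text_with_emoji = st.tuples(st.text(), emoji_characters, st.text()).map("".join)


@given(text_with_emoji)
def test_handles_emoji(s):
    # Placeholder property; substitute the code under test here.
    assert s == s.encode("utf-8").decode("utf-8")
```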
Yes, but this brings in the emoji package as an extra dependency. I think my main point is that realistic Unicode use is probably heavily weighted towards low code points plus emoji, and an ideal text strategy would almost guarantee that both show up.
I guess it's arguable whether you should treat emoji specially, since it's relatively rare to have special emoji-handling code (as opposed to general Unicode handling), but given how common they are in real use it's a bit surprising to me that by default they are included in at most 1 in 4 text strategies.
Right, I see what you mean. Note that 'low codepoints plus emoji' is only representative for English and similar languages though - if your code has users in Asia you'll probably handle a lot of mostly-high-codepoint text too! The distribution of unicode data is tricky, especially when we might have a ...
FWIW I think that getting better unicode support is going to involve the sort of magic that actual magicians do - basically doing an unreasonable amount of work behind the scenes. The structure of unicode is a bit too haphazard to do anything more principled - e.g. there's no way we could tell those emoji codepoints are in any way special without knowing something about emoji. I do agree that getting a better distribution of unicode characters in general would be good. We could, e.g., start by doing something analogous to what we do with floats: keep a hard-coded list of deliberately nasty unicode characters that we include with high probability.
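A minimal sketch of what that could look like, assuming a hand-picked list (the specific characters, names, and weighting here are purely illustrative, not Hypothesis's actual internals):

```python
from hypothesis import strategies as st

# Illustrative hand-picked "nasty" characters: a zero-width joiner, the
# replacement character, a combining accent, an emoji, and a right-to-left
# mark.  A real list would be much longer and more carefully curated.
NASTY_CHARACTERS = ["\u200d", "\ufffd", "\u0301", "\U0001f600", "\u200f"]

# Each character is drawn either from the nasty list or from the full
# unicode range - loosely analogous to the hard-coded nasty floats.
characters_with_nasties = st.one_of(
    st.sampled_from(NASTY_CHARACTERS),
    st.characters(),
)

text_with_nasties = st.text(alphabet=characters_with_nasties)
```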
Another place this has come up: this is essentially the script I used when developing datetime.fromisoformat:

```python
from hypothesis import strategies as st
from hypothesis import given
from datetime import datetime


# Round-trip an arbitrary datetime through isoformat()/fromisoformat()
# with an arbitrary single-character separator.
@given(dt=st.datetimes(), sep=st.characters())
def test_separator(dt, sep):
    dt_rt = datetime.fromisoformat(dt.isoformat(sep=sep))
    assert dt == dt_rt
```

But since CPython doesn't have a hypothesis dependency, this only ever ran as a one-off script. Not blaming hypothesis for my own failure to realize that this could be a problem, mind you, but this is totally the kind of bug it would be perfect for finding, and I actually used hypothesis in this case, so it was a missed opportunity to pre-emptively fix a crash bug in CPython.
Likewise not blaming you, but I am curious about your workflow here. Finding the bug 25% of the time doesn't sound too bad to me, though obviously it would be nice if it were better, and I feel like most common usage patterns would have noticed a bug that crops up that frequently.
In this and in the original issue's case, there was little hope of me actually incorporating hypothesis into the target project's CI (in which case 25% chance of finding the bug would be acceptable), and there was a costly build step involved, so I wrote a little hypothesis side-test to see if there were any obvious test cases to include in the main test suite. In the CPython case, my guess is that I did run the tests many times, but likely the early runs found more obvious bugs, and when I got a run that worked on the hypothesis tests, I considered the "finding edge cases" portion of the testing over and switched to using the CI suite. The fact that there would be a whole class of missing edge cases like this never even occurred to me because every other time I've used hypothesis, it has immediately found the most fiddly and persnickety bugs on the first run.
My very strong recommendation would be to up the number of examples run for one-off tests. If you're only running them on an ad hoc basis then the difference between 100 and 10,000 doesn't really matter that much. It's true that Hypothesis often finds bugs early on, but that's only because there are often relatively easy-to-find bugs. When the search space is complicated, the default configuration is basically always going to miss bugs with some reasonable chance, even if we make this particular bug trigger reliably (which I'm not against doing). In some recent experiments I've been doing in testing SymPy, Hypothesis was still finding novel bugs after a couple of million test cases!
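For anyone reading along, a hedged sketch of what raising the example count looks like for a one-off run (the 10,000 figure simply echoes the numbers above rather than being an official recommendation, and the property is a placeholder):

```python
from hypothesis import given, settings, strategies as st


@settings(max_examples=10_000)  # the default is 100; cheap to raise for ad hoc runs
@given(st.text())
def test_round_trip(s):
    # Placeholder property - the point here is the settings decorator above.
    assert s == s.encode("utf-8").decode("utf-8")
```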
Yes, I agree, and I certainly know to do that now. I think my main point with this issue, and with my follow-up example of where it has bitten me, is that I believe there is a smarter weighting function (even if it's a bit messy and hacky, like with floats) for the text() strategy. The examples of where it has bitten me are just to show that taking the time to develop that weighting function is probably worth it, because it may catch real bugs earlier in the process (e.g. even if this had been run as part of the CI, it would be quite possible for a PR to go through that only hits one of these edge cases as part of an unrelated build, after the original has been merged - though quite possibly before it hits production).
Agreed - like Paul, I was thinking essentially that we change
We should explicitly document this somewhere, to give people an anchor for what constitutes "a lot of test cases" - see e.g. https://danluu.com/testing/ under "Test Time". |
I do not believe that this can be the case without a lot more structure on the part of the strategy. The problem is that fundamentally Conjecture can only tell that something is interesting if it has tried it. The behaviour of its input is fairly opaque. While it is able to make some fairly good educated guesses about that behaviour, until it has seen a behaviour occur at least once it knows literally nothing about it. This makes it easy for it to boost the probability of rare-ish events (though it does not currently do a great job of that), but hard for it to boost the probability of events that are so rare that it will not see them on typical runs. This is definitely something where more supporting features in Conjecture could help, but unless I'm missing something very clever I think the solution will look more like "hairy hand-written generator for nasty unicode behaviours" than it will "Conjecture pulls a rabbit out of the hat and the rabbit starts suggesting some really nasty codepoints for us to test on".
You're not thinking of "Swarm Testing" by Groce et al., are you? (It doesn't sound like it, but it's a similar-ish paper in this space.)
Without writing pseudocode, I was thinking of something else, at least partly inspired by the swarm testing paper but based on a less discrete decision space. Taking "features" to correspond to "subsets of unicode that we expect to trigger similar behavior" (waves hands), basically we would vary the probability of drawing from each, but use a continuous measure rather than binary in-or-out for each case. And if, instead of drawing independent probabilities for each and then normalizing, we start by drawing the distribution type, we're up to meta-meta-swarm-testing (:sob:). This could actually be written today; it would just be less helpful to the shrinker than we might like. I'm not expecting a talking rabbit though; if nothing else there are so many codepoints and categories and ways for string code to go wrong that I'd expect to need thousands of test cases minimum to have checked every kind of issue!
FWIW, this is how Hypothesis used to work. It got lost in the move to Conjecture, but I've always vaguely wondered about resurrecting it. Every strategy had two stages (actually more, but we can ignore that): First you drew a parameter, then you drew a value conditional on that parameter. This could be used to e.g. induce correlation in lists because you would draw children from the same parameter.
I think we might have actually used to do literally that, though I wouldn't swear to it. More likely we just had the binary version.
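A rough sketch of what that parameter-then-value, continuous-swarm idea could look like with today's public API (entirely illustrative: the feature groupings, weighting scheme, and names are mine, not anything Hypothesis does or used to do):

```python
from hypothesis import strategies as st

# Illustrative "features": subsets of unicode we might expect to trigger
# similar behaviour.  Real groupings would need far more care.
FEATURES = [
    st.characters(max_codepoint=0x7F),                         # ASCII
    st.characters(min_codepoint=0x80, max_codepoint=0xFFFF),   # rest of the BMP
    st.characters(min_codepoint=0x10000),                      # astral planes
]


@st.composite
def swarmish_text(draw, max_size=20):
    # Stage one: draw a continuous weight per feature (the "parameter").
    weights = [draw(st.floats(0.01, 1.0)) for _ in FEATURES]
    # Stage two: draw characters conditional on those weights (the "value").
    # Crude weighting: repeat each feature index roughly in proportion to it.
    picker = st.sampled_from(
        [i for i, w in enumerate(weights) for _ in range(int(w * 10) + 1)]
    )
    size = draw(st.integers(0, max_size))
    return "".join(draw(FEATURES[draw(picker)]) for _ in range(size))
```

Because each example commits to one weight vector up front, the characters within a single string are correlated - the same effect the old shared-parameter draw gave lists - but, as noted above, this does nothing clever for the shrinker.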
I've thought a bit about how to do shrinker-friendly swarm testing under the current model. It's actually not too bad, though I haven't actually tried it, so there are probably some annoying bits that I'm missing. The main points are:
Basically if you can then do
I'm going to open a few PRs to improve the variety of text we generate, then move discussion of how to drive that with swarm testing on the backend to a new issue.
Late response, but you can also use |
The documented contract of this function is that it is the inverse of `datetime.isoformat()`. See GH HypothesisWorks/hypothesis#1401, HypothesisWorks/hypothesis#69 and to a lesser extent HypothesisWorks/hypothesis#2273 for background on why I have set max_examples so high.
I was recently attempting to test a bug that seemed to only manifest when it encountered emoji (and not necessarily other high-unicode codepoints). I figured that strategies.text() would just naturally include some emoji, but I've found that the following tests only fail about 25% of the time (clearing .hypothesis between runs).

There are 2623 keys in emoji.UNICODE_EMOJI. It seems to me that this is probably a very common source of unicode characters and may deserve special handling. One problem I see with this is that emoji do not seem to be a continuous block, and some are created by combining multiple characters, so it's not terribly easy to specify them by putting a filter on the codepoint range.

Is it worth more heavily weighting emoji in the draw, or possibly adding some sort of special-case strategy for retrieving emoji?
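The tests referred to above weren't preserved in this copy of the issue; a minimal sketch of the kind of probe described (it fails only when Hypothesis happens to generate an emoji, reportedly about a quarter of runs), assuming an emoji release with a flat UNICODE_EMOJI mapping:

```python
from hypothesis import given, strategies as st
import emoji  # third-party; UNICODE_EMOJI is a flat mapping in older releases


@given(st.text())
def test_text_never_contains_emoji(s):
    # Deliberately "wrong" assertion: it fails as soon as any generated
    # character is an emoji, which probes how often st.text() produces one.
    assert not any(ch in emoji.UNICODE_EMOJI for ch in s)
```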