
Weighting of emoji in text strategies? #1401

Open
pganssle opened this issue Jul 16, 2018 · 14 comments
Labels
docs: documentation could *always* be better
enhancement: it's not broken, but we want it to be better

Comments

@pganssle
Contributor

I was recently trying to write a test for a bug that seemed to manifest only when it encountered emoji (and not necessarily other high-Unicode codepoints). I figured that strategies.text() would naturally include some emoji, but I've found that the following tests only fail about 25% of the time (clearing .hypothesis between runs):

from hypothesis import given
from hypothesis import strategies as st

import emoji

# Fails only if Hypothesis generates at least one emoji character.
@given(txt=st.text())
def test_contains_no_emoji(txt):
    assert not any(char in emoji.UNICODE_EMOJI for char in txt)

# Fails only if the single generated character is an emoji.
@given(c=st.characters())
def test_not_emoji(c):
    assert c not in emoji.UNICODE_EMOJI

There are 2623 keys in emoji.UNICODE_EMOJI. It seems to me that this is a very common source of Unicode characters and might deserve special handling. One complication is that emoji do not form a contiguous block, and some are built by combining multiple characters, so they're not easy to capture with a filter on the codepoint range.

Is it worth more heavily weighting emoji in the draw, or possibly adding some sort of special case strategy for retrieving emoji?

@Zac-HD Zac-HD added the question (not sure it's a bug? questions welcome) label Jul 17, 2018
@Zac-HD
Member

Zac-HD commented Jul 17, 2018

Given the sheer number of codepoints and combinations of edge cases in unicode, I don't think this is too bad as default behaviour! If you know you're after emoji characters though, it's easy enough to create a specialised strategy:

# Note: sample sorted list rather than dict, so order is stable between runs
emoji_characters = st.sampled_from(sorted(emoji.UNICODE_EMOJI))
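Building on that (and reusing the imports from the snippets above), whole strings of emoji can also be generated via lists and "".join, since some keys in emoji.UNICODE_EMOJI are multi-codepoint sequences. A sketch only; emoji_text and test_handles_emoji are illustrative names, not part of the suggestion above:

# Some UNICODE_EMOJI keys are multi-codepoint sequences, so join sampled
# emoji into strings rather than treating them as single characters.
emoji_text = st.lists(emoji_characters, min_size=1).map("".join)

@given(txt=emoji_text)
def test_handles_emoji(txt):
    ...  # exercise the code under test with emoji-heavy input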

@pganssle
Contributor Author

Yes, but this brings in the emoji dependency, which I only included for the sake of the example. Though in fairness, test-only dependencies are not a huge deal.

I think my main point is that realistic Unicode use is probably heavily weighted towards low code points plus emoji, and an ideal text strategy would almost guarantee:

  1. Some ASCII
  2. Some "special characters" ($&_/\t)
  3. Some use of combining characters (á, ł)
  4. Emoji
  5. Some characters of each byte length (1, 2, 3, 4).

I guess it's arguable whether you should treat emoji specially, since it's relatively rare to have special emoji-handling code (as opposed to general Unicode handling), but given how common they are in real use it's a bit surprising to me that by default they show up in at most 1 in 4 test runs.
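As a rough illustration of that wish list, a mixture can be hand-rolled with one_of today. A sketch only, not a proposed default: the codepoint ranges, the special-character list, and the equal mixing proportions are made up for the example, and the emoji range covers only one of several emoji blocks:

from hypothesis import strategies as st

# Hand-rolled mixture roughly matching the wish list: ASCII, a few "special"
# characters, combining marks, one emoji block, and higher codepoints that
# need 3 bytes in UTF-8.  Lone surrogates can also fall out of the last range.
realistic_text = st.text(
    alphabet=st.one_of(
        st.characters(min_codepoint=0x20, max_codepoint=0x7E),       # printable ASCII
        st.sampled_from(list("$&_/\t")),                              # "special" characters
        st.characters(min_codepoint=0x0300, max_codepoint=0x036F),    # combining diacritics
        st.characters(min_codepoint=0x1F300, max_codepoint=0x1F64F),  # one emoji block
        st.characters(min_codepoint=0x0800, max_codepoint=0xFFFD),    # 3-byte UTF-8 range
    )
)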

@Zac-HD
Member

Zac-HD commented Jul 17, 2018

Right, I see what you mean. Note that 'low codepoints plus emoji' is only representative for English and similar languages though - if your code has users in Asia you'll probably handle a lot of mostly-high-codepoint text too!

The distribution of unicode data is tricky, especially when a user-supplied characters(...) strategy might constrain it further (see e.g. #341). Personally I can't see a fix that's even a net improvement, let alone worth the maintenance and opportunity cost, so my current plan is to just leave it alone until @DRMacIver does some new magic which unlocks a better option. No ETA on that though 😢

@DRMacIver
Member

until @DRMacIver does some new magic which unlocks a better option

FWIW I think that getting better unicode support is going to involve the sort of magic that actual magicians do - basically doing an unreasonable amount of work behind the scenes. The structure of unicode is a bit too haphazard to do anything more principled - e.g. there's no way we could tell those emoji codepoints are in any way special without knowing something about emoji.

I do agree that getting a better distribution of unicode characters in general would be good. We could e.g. start by doing something analogous to what we do with floats: keep a hard-coded list of deliberately nasty unicode characters that we include with high probability.
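As a sketch of what that might look like (the particular characters below are made up for the example, not a list Hypothesis actually uses):

from hypothesis import strategies as st

# Hypothetical hard-coded list of deliberately nasty characters: NUL, a BOM,
# a lone surrogate, a zero-width joiner, a right-to-left override, an emoji.
NASTY_CHARACTERS = ["\x00", "\ufeff", "\ud800", "\u200d", "\u202e", "\U0001F600"]

# Mixing these in alongside the general characters() strategy makes the nasty
# cases show up far more often than they would by chance.
nasty_characters = st.one_of(
    st.sampled_from(NASTY_CHARACTERS),
    st.characters(),
)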

@pganssle
Contributor Author

pganssle commented Aug 23, 2018

Another place this has come up - hypothesis should definitely put a high weight on surrogate characters.

This is essentially the script I used when developing datetime.fromisoformat for CPython:

from hypothesis import strategies as st
from hypothesis import given

from datetime import datetime

@given(dt=st.datetimes(), sep=st.characters())
def test_separator(dt, sep):
    dt_rt = datetime.fromisoformat(dt.isoformat(sep=sep))

    assert dt == dt_rt

But since CPython doesn't have a hypothesis dependency, I didn't include it in the test suite. 75%+ of the time this works correctly, but if the separator is a surrogate character, it's a segfault in 3.7.0. We've got a fix in there now, but in the future it would be nice to have some confidence that we're getting at least one character from each of these character types, in addition to the emoji stuff. I dunno if that should stay in this issue or move to a new one like "improved unicode sampling distribution".
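For reference, the crashing case can be forced on every example by drawing the separator from the surrogate block explicitly. A sketch: the codepoint bounds below are just the standard surrogate range, and the rest mirrors the test above.

from datetime import datetime

from hypothesis import given
from hypothesis import strategies as st

# Restrict the separator to lone surrogates (U+D800..U+DFFF) so the segfault
# case is hit on every example rather than on ~25% of runs.
@given(dt=st.datetimes(), sep=st.characters(min_codepoint=0xD800, max_codepoint=0xDFFF))
def test_separator_surrogate(dt, sep):
    assert dt == datetime.fromisoformat(dt.isoformat(sep=sep))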

Not blaming hypothesis for my own failure to realize that this could be a problem mind you, but this is totally the kind of bug it would be perfect for finding and I actually used hypothesis in this case, so it was a missed opportunity to pre-emptively fix a crash bug in CPython.

@DRMacIver
Member

Not blaming hypothesis for my own failure to realize that this could be a problem mind you, but this is totally the kind of bug it would be perfect for finding and I actually used hypothesis in this case, so it was a missed opportunity to pre-emptively fix a crash bug in CPython.

Likewise not blaming you, but I am curious about your workflow here. Finding the bug 25% of the time doesn't sound too bad to me, though obviously it would be nice if it were better, and I feel like most common usage patterns would have noticed a bug that crops up that frequently.

@pganssle
Contributor Author

Likewise not blaming you, but I am curious about your workflow here.

In this and in the original issue's case, there was little hope of me actually incorporating hypothesis into the target project's CI (in which case 25% chance of finding the bug would be acceptable), and there was a costly build step involved, so I wrote a little hypothesis side-test to see if there were any obvious test cases to include in the main test suite.

In the CPython case, my guess is that I did run the tests many times, but likely the early runs found more obvious bugs, and when I got a run that worked on the hypothesis tests, I considered the "finding edge cases" portion of the testing over and switched to using the CI suite. The fact that there would be a whole class of missing edge cases like this never even occurred to me because every other time I've used hypothesis, it has immediately found the most fiddly and persnickety bugs on the first run.

@DRMacIver
Member

The fact that there would be a whole class of missing edge cases like this never even occurred to me because every other time I've used hypothesis, it has immediately found the most fiddly and persnickety bugs on the first run.

My very strong recommendation would be to up the number of examples run for one-off tests. If you're only running them on an ad hoc basis then the difference between 100 and 10,000 doesn't really matter that much.

It's true that Hypothesis often finds bugs early on, but that's only because there are often relatively easy to find bugs. When the search space is complicated the default configuration is basically always going to miss bugs with some reasonable chance, even if we make this particular bug trigger reliably (which I'm not against doing). On some recent experiments I've been doing in testing SymPy, Hypothesis was still finding novel bugs after a couple of million test cases!
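For a one-off run that is just a settings change (a minimal sketch of the recommendation above; the 10,000 figure is arbitrary):

from hypothesis import given, settings
from hypothesis import strategies as st

# Run far more examples than the default for an ad hoc exploratory test.
@settings(max_examples=10_000)
@given(txt=st.text())
def test_exploratory(txt):
    ...  # the property under investigation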

@pganssle
Contributor Author

My very strong recommendation would be to up the number of examples run for one-off tests. If you're only running them on an ad hoc basis then the difference between 100 and 10,000 doesn't really matter that much.

Yes, I agree and I certainly know to do that now.

I think my main point with this issue and with my follow-up example of where it has bitten me is that I believe that there is a smarter weighting function (even if it's a bit messy and hacky like with floats) for the characters and text strategies that allows for greater coverage with fewer examples.

The examples of where it has bitten me are just to show that taking the time to develop that weighting function is probably worth it because it may catch real bugs earlier in the process (e.g. even if this had been run as part of the CI, it would be quite possible for a PR to go through that only hits one of these edge cases as part of an unrelated build, after the original has been merged - though quite possibly before it hits production).

@Zac-HD Zac-HD added the enhancement (it's not broken, but we want it to be better) and docs (documentation could *always* be better) labels and removed the question (not sure it's a bug? questions welcome) label Aug 25, 2018
@Zac-HD
Member

Zac-HD commented Aug 25, 2018

better unicode support is going to involve the sort of magic that actual magicians do - basically doing an unreasonable amount of work behind the scenes.

Agreed - like Paul, I was thinking essentially that we change characters() to have some more structure, and the magic part would be from Conjecture doing some cool stuff to the data distribution (ie weighting function).

  • Instead of 'draw an index into the string of allowed characters', we could implement characters() in terms of a one_of over various categories of characters, whether Unicode categories or something else. If we really want to keep the existing shrink behavior we could use old|new, then calculate how the character would be generated from the old-style strategy and write that back to the buffer.
  • Distribution is tricky, because we want a distribution strongly biased towards unusual results - that is, characters in a string should not be chosen independently. I remember reading a paper with a two- or three-level scheme of selecting a 'mixedness' parameter (i.e. which distribution to use), then distribution parameters, then finally drawing the elements. Unfortunately the only term I remember them using was alpha, which is totally impossible to search for 😭. Similar in concept to some of Taleb's stuff about fractal patterns, but not actually him and more about combinatorics. (Sorry for the vagueness; hoping David might know more...)

When the search space is complicated the default configuration is basically always going to miss bugs with some reasonable chance ... On some experiments testing SymPy, Hypothesis was still finding novel bugs after a couple of million test cases!

We should explicitly document this somewhere, to give people an anchor for what constitutes "a lot of test cases" - see e.g. https://danluu.com/testing/ under "Test Time".

@DRMacIver
Member

Agreed - like Paul, I was thinking essentially that we change characters() to have some more structure, and the magic part would be from Conjecture doing some cool stuff to the data distribution (ie weighting function).

I do not believe that this can be the case without a lot more structure on the part of characters().

The problem is that fundamentally Conjecture can only tell that something is interesting if it has tried it. The behaviour of its input is fairly opaque. While it is able to make some fairly good educated guesses about that behaviour, until it has seen a behaviour occur at least once it knows literally nothing about it. This makes it easy for it to boost the probability of rare-ish events (though it does not currently do a great job of that), but hard for it to boost the probability of events that are so rare that it will not see them on typical runs.

This is definitely something where more supporting features in Conjecture could help, but unless I'm missing something very clever I think the solution will look more like "Hairy hand-written generator for nasty unicode behaviours" than it will "Conjecture pulls a rabbit out of the hat and the rabbit starts suggesting some really nasty codepoints for us to test on".

I remember reading a paper

You're not thinking of "Swarm Testing" by Groce et al. are you? (it doesn't sound like it, but it's a similar-ish paper in this space)

@Zac-HD
Member

Zac-HD commented Aug 25, 2018

Without writing pseudocode: I was thinking of something else, at least partly inspired by the swarm testing paper but based on a less discrete decision space.

Taking "features" to correspond to "subsets of unicode that we expect to trigger similar behavior" (waves hands), basically we would vary the probability of drawing from each but use a continuous measure rather than binary in-or-out for the case. And if instead of drawing independent probabilities for each then normalizing we start by drawing the distribution type, we're up to meta-meta-swarm-testing (:sob:).

This could actually be written today, it would just be less helpful to the shrinker than we might like.

I'm not expecting a talking rabbit though; if nothing else there are so many codepoints and categories and ways for string code to go wrong that I'd expect to need thousands of test cases minimum to have checked every kind of issue!

@DRMacIver
Member

DRMacIver commented Aug 25, 2018

Without writing pseudocode: I was thinking of something else, at least partly inspired by the swarm testing paper but based on a less discrete decision space.

FWIW, this is how Hypothesis used to work. It got lost in the move to Conjecture, but I've always vaguely wondered about resurrecting it. Every strategy had two stages (actually more, but we can ignore that): First you drew a parameter, then you drew a value conditional on that parameter. This could be used to e.g. induce correlation in lists because you would draw children from the same parameter.

basically we would vary the probability of drawing from each but use a continuous measure rather than binary in-or-out for the case.

I think we might have actually used to do literally that, though I wouldn't swear to it. More likely we just had the binary version.

This could actually be written today, it would just be less helpful to the shrinker than we might like.

I've thought a bit about how to do shrinker-friendly swarm testing under the current model. It's actually not too bad, though I haven't actually tried it, so there are probably some annoying bits that I'm missing. The main points are:

  1. You need swarm flags to "shrink open", so that once the shrinker has run to completion all flags are enabled; e.g. you could do this by generating a set of banned flags.
  2. You need to use rejection sampling rather than anything more clever.

Basically, if you can then do characters().filter(character_class_is_enabled), what happens during shrinking is as follows:

  1. We delete all of the initial iterations of the loop. Now, as if by magic, we just happen to have picked only values that are enabled.
  2. The flags now shrink open, so we've left swarm mode and everything is now enabled.
  3. We can now shrink the values as normal characters.

The big "???" that is hard to make shrinker friendly is when we choose the swarm flags. It would be nice to generate them on first use rather than up front before we know if we need them, but if we do that then we end up getting into a pickle when we try to delete the first point where a swarm flag is used. I have some ideas about how to fix this, but they require experimentation. No, this has an easy fix with no shrinker fuckery: We just ensure that every time we check whether a flag is enabled, if it's already been set we call data.write to record the flag in the data stream, so that if we delete the first use the subsequent uses turn into an initialisation.

@Zac-HD Zac-HD self-assigned this Oct 4, 2018
@Zac-HD
Member

Zac-HD commented Oct 4, 2018

I'm going to open a few PRs to improve the variety of text we generate, then move discussion of how to drive that with swarm testing on the backend to a new issue.

(clearing .hypothesis between runs)

Late response, but you can also use @settings(database=None) to disable the database for that test.
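For example (a minimal sketch, reusing the test from the original report):

from hypothesis import given, settings
from hypothesis import strategies as st

import emoji

# Disable the example database so each run starts fresh, instead of deleting
# the .hypothesis directory by hand.
@settings(database=None)
@given(txt=st.text())
def test_contains_no_emoji(txt):
    assert not any(char in emoji.UNICODE_EMOJI for char in txt)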

@Zac-HD Zac-HD removed their assignment Dec 17, 2018
pganssle added a commit to pganssle/stdlib-property-tests that referenced this issue Jan 30, 2020
The documented contract of this function is that it is the inverse of
`datetime.isoformat()`.

See GH HypothesisWorks/hypothesis#1401, HypothesisWorks/hypothesis#69
and to a lesser extent HypothesisWorks/hypothesis#2273 for background on
why I have set max_examples so high.