
Weighting of emoji in text strategies? #1401

Open
pganssle opened this issue Jul 16, 2018 · 14 comments
Labels
docs: documentation could *always* be better
enhancement: it's not broken, but we want it to be better

Comments

@pganssle
Contributor

I was recently trying to write a test for a bug that seemed to manifest only when it encountered emoji (and not necessarily other high-Unicode codepoints). I figured that strategies.text() would naturally include some emoji, but I've found that the following tests only fail about 25% of the time (clearing .hypothesis between runs):

from hypothesis import given
from hypothesis import strategies as st

import emoji

# Fails only if Hypothesis generates at least one emoji character.
@given(txt=st.text())
def test_contains_no_emoji(txt):
    assert not any(char in emoji.UNICODE_EMOJI for char in txt)

# Fails only if the single generated character is an emoji.
@given(c=st.characters())
def test_not_emoji(c):
    assert c not in emoji.UNICODE_EMOJI

There are 2623 keys in emoji.UNICODE_EMOJI. It seems to me that this is a very common source of Unicode characters and might deserve special handling. One complication is that emoji do not form a contiguous block, and some are built by combining multiple characters, so they're not easy to capture with a filter on the codepoint range.

Is it worth more heavily weighting emoji in the draw, or possibly adding some sort of special case strategy for retrieving emoji?

@Zac-HD Zac-HD added the question (not sure it's a bug? questions welcome) label Jul 17, 2018
@Zac-HD
Member

Zac-HD commented Jul 17, 2018

Given the sheer number of codepoints and combinations of edge cases in unicode, I don't think this is too bad as default behaviour! If you know you're after emoji characters though, it's easy enough to create a specialised strategy:

# Note: sample sorted list rather than dict, so order is stable between runs
emoji_characters = st.sampled_from(sorted(emoji.UNICODE_EMOJI))
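Building on that (and reusing the imports from the snippets above), whole strings of emoji can also be generated via lists and "".join, since some keys in emoji.UNICODE_EMOJI are multi-codepoint sequences. A sketch only; emoji_text and test_handles_emoji are illustrative names, not part of the suggestion above:

# Some UNICODE_EMOJI keys are multi-codepoint sequences, so join sampled
# emoji into strings rather than treating them as single characters.
emoji_text = st.lists(emoji_characters, min_size=1).map("".join)

@given(txt=emoji_text)
def test_handles_emoji(txt):
    ...  # exercise the code under test with emoji-heavy input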

@pganssle
Contributor Author

Yes, but this brings in the emoji dependency, which I only included for the sake of the example. Though in fairness, test-only dependencies are not a huge deal.

I think my main point is that realistic Unicode use is probably heavily weighted towards low code points plus emoji, and an ideal text strategy would almost guarantee:

  1. Some ASCII
  2. Some "special characters" ($&_/\t)
  3. Some use of combining characters (á, ł)
  4. Emoji
  5. Some characters of each byte length (1, 2, 3, 4).

I guess it's arguable whether you should treat emoji specially, since it's relatively rare to have special emoji-handling code (as opposed to general Unicode handling), but given how common they are in real use it's a bit surprising to me that by default they show up in at most 1 in 4 test runs.
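As a rough illustration of that wish list, a mixture can be hand-rolled with one_of today. A sketch only, not a proposed default: the codepoint ranges, the special-character list, and the equal mixing proportions are made up for the example, and the emoji range covers only one of several emoji blocks:

from hypothesis import strategies as st

# Hand-rolled mixture roughly matching the wish list: ASCII, a few "special"
# characters, combining marks, one emoji block, and higher codepoints that
# need 3 bytes in UTF-8.  Lone surrogates can also fall out of the last range.
realistic_text = st.text(
    alphabet=st.one_of(
        st.characters(min_codepoint=0x20, max_codepoint=0x7E),       # printable ASCII
        st.sampled_from(list("$&_/\t")),                              # "special" characters
        st.characters(min_codepoint=0x0300, max_codepoint=0x036F),    # combining diacritics
        st.characters(min_codepoint=0x1F300, max_codepoint=0x1F64F),  # one emoji block
        st.characters(min_codepoint=0x0800, max_codepoint=0xFFFD),    # 3-byte UTF-8 range
    )
)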

@Zac-HD
Member

Zac-HD commented Jul 17, 2018

Right, I see what you mean. Note that 'low codepoints plus emoji' is only representative for English and similar languages though - if your code has users in Asia you'll probably handle a lot of mostly-high-codepoint text too!

The distribution of unicode data is tricky, especially when a user-supplied characters(...) strategy might constrain it further (see e.g. #341). Personally I can't see a fix that's even a net improvement, let alone worth the maintenance and opportunity cost, so my current plan is to just leave it alone until @DRMacIver does some new magic which unlocks a better option. No ETA on that though 😢

@DRMacIver
Member

until @DRMacIver does some new magic which unlocks a better option

FWIW I think that getting better unicode support is going to involve the sort of magic that actual magicians do - basically doing an unreasonable amount of work behind the scenes. The structure of unicode is a bit too haphazard to do anything more principled - e.g. there's no way we could tell those emoji codepoints are in any way special without knowing something about emoji.

I do agree that getting a better distribution of unicode characters in general would be good. We could e.g. start by doing something analogous to what we do with floats: keep a hard-coded list of deliberately nasty unicode characters that we include with high probability.
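As a sketch of what that might look like (the particular characters below are made up for the example, not a list Hypothesis actually uses):

from hypothesis import strategies as st

# Hypothetical hard-coded list of deliberately nasty characters: NUL, a BOM,
# a lone surrogate, a zero-width joiner, a right-to-left override, an emoji.
NASTY_CHARACTERS = ["\x00", "\ufeff", "\ud800", "\u200d", "\u202e", "\U0001F600"]

# Mixing these in alongside the general characters() strategy makes the nasty
# cases show up far more often than they would by chance.
nasty_characters = st.one_of(
    st.sampled_from(NASTY_CHARACTERS),
    st.characters(),
)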

@pganssle
Contributor Author

pganssle commented Aug 23, 2018

Another place this has come up - hypothesis should definitely put a high weight on surrogate characters.

This is essentially the script I used when developing datetime.fromisoformat for CPython:

from hypothesis import strategies as st
from hypothesis import given

from datetime import datetime

@given(dt=st.datetimes(), sep=st.characters())
def test_separator(dt, sep):
    dt_rt = datetime.fromisoformat(dt.isoformat(sep=sep))

    assert dt == dt_rt

But since CPython doesn't have a hypothesis dependency, I didn't include it in the test suite. 75%+ of the time this works correctly, but if the separator is a surrogate character, it's a segfault in 3.7.0. We've got a fix in there now, but in the future it would be nice to have some confidence that we're getting at least one character from each of these character types, in addition to the emoji stuff. I dunno if that should stay in this issue or move to a new one like "improved unicode sampling distribution".
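For reference, the crashing case can be forced on every example by drawing the separator from the surrogate block explicitly. A sketch: the codepoint bounds below are just the standard surrogate range, and the rest mirrors the test above.

from datetime import datetime

from hypothesis import given
from hypothesis import strategies as st

# Restrict the separator to lone surrogates (U+D800..U+DFFF) so the segfault
# case is hit on every example rather than on ~25% of runs.
@given(dt=st.datetimes(), sep=st.characters(min_codepoint=0xD800, max_codepoint=0xDFFF))
def test_separator_surrogate(dt, sep):
    assert dt == datetime.fromisoformat(dt.isoformat(sep=sep))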

Not blaming hypothesis for my own failure to realize that this could be a problem mind you, but this is totally the kind of bug it would be perfect for finding and I actually used hypothesis in this case, so it was a missed opportunity to pre-emptively fix a crash bug in CPython.

@DRMacIver
Member

Not blaming hypothesis for my own failure to realize that this could be a problem mind you, but this is totally the kind of bug it would be perfect for finding and I actually used hypothesis in this case, so it was a missed opportunity to pre-emptively fix a crash bug in CPython.

Likewise not blaming you, but I am curious about your workflow here. Finding the bug 25% of the time doesn't sound too bad to me, though obviously it would be nice if it were better, and I feel like most common usage patterns would have noticed a bug that crops up that frequently.

@pganssle
Contributor Author

Likewise not blaming you, but I am curious about your workflow here.

In this and in the original issue's case, there was little hope of me actually incorporating hypothesis into the target project's CI (in which case 25% chance of finding the bug would be acceptable), and there was a costly build step involved, so I wrote a little hypothesis side-test to see if there were any obvious test cases to include in the main test suite.

In the CPython case, my guess is that I did run the tests many times, but likely the early runs found more obvious bugs, and when I got a run that worked on the hypothesis tests, I considered the "finding edge cases" portion of the testing over and switched to using the CI suite. The fact that there would be a whole class of missing edge cases like this never even occurred to me because every other time I've used hypothesis, it has immediately found the most fiddly and persnickety bugs on the first run.

@DRMacIver
Member

The fact that there would be a whole class of missing edge cases like this never even occurred to me because every other time I've used hypothesis, it has immediately found the most fiddly and persnickety bugs on the first run.

My very strong recommendation would be to up the number of examples run for one-off tests. If you're only running them on an ad hoc basis then the difference between 100 and 10,000 doesn't really matter that much.

It's true that Hypothesis often finds bugs early on, but that's only because there are often relatively easy to find bugs. When the search space is complicated the default configuration is basically always going to miss bugs with some reasonable chance, even if we make this particular bug trigger reliably (which I'm not against doing). On some recent experiments I've been doing in testing SymPy, Hypothesis was still finding novel bugs after a couple of million test cases!
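For a one-off run that is just a settings change (a minimal sketch of the recommendation above; the 10,000 figure is arbitrary):

from hypothesis import given, settings
from hypothesis import strategies as st

# Run far more examples than the default for an ad hoc exploratory test.
@settings(max_examples=10_000)
@given(txt=st.text())
def test_exploratory(txt):
    ...  # the property under investigation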

@pganssle
Contributor Author

My very strong recommendation would be to up the number of examples run for one-off tests. If you're only running them on an ad hoc basis then the difference between 100 and 10,000 doesn't really matter that much.

Yes, I agree and I certainly know to do that now.

I think my main point with this issue and with my follow-up example of where it has bitten me is that I believe that there is a smarter weighting function (even if it's a bit messy and hacky like with floats) for the characters and text strategies that allows for greater coverage with fewer examples.

The examples of where it has bitten me are just to show that taking the time to develop that weighting function is probably worth it because it may catch real bugs earlier in the process (e.g. even if this had been run as part of the CI, it would be quite possible for a PR to go through that only hits one of these edge cases as part of an unrelated build, after the original has been merged - though quite possibly before it hits production).

@Zac-HD Zac-HD added the enhancement (it's not broken, but we want it to be better) and docs (documentation could *always* be better) labels and removed the question (not sure it's a bug? questions welcome) label Aug 25, 2018
@Zac-HD
Member

Zac-HD commented Aug 25, 2018

better unicode support is going to involve the sort of magic that actual magicians do - basically doing an unreasonable amount of work behind the scenes.

Agreed - like Paul, I was thinking essentially that we change characters() to have some more structure, and the magic part would be from Conjecture doing some cool stuff to the data distribution (ie weighting function).

  • Instead of 'draw an index into the string of allowed characters', we could implement characters() in terms of a one_of over various categories of characters, whether Unicode categories or something else. If we really want to keep the existing shrink behavior we could use old|new, then calculate how the character would be generated from the old-style strategy and write that back to the buffer.
  • Distribution is tricky, because we want a distribution strongly biased towards unusual results - that is, characters in a string should not be chosen independently. I remember reading a paper with a two- or three-level scheme of selecting a 'mixedness' parameter (i.e. which distribution to use), then distribution parameters, then finally drawing the elements. Unfortunately the only term I remember them using was alpha, which is totally impossible to search for 😭. Similar in concept to some of Taleb's stuff about fractal patterns, but not actually him and more about combinatorics. (Sorry for the vagueness; hoping David might know more...)

When the search space is complicated the default configuration is basically always going to miss bugs with some reasonable chance ... On some experiments testing SymPy, Hypothesis was still finding novel bugs after a couple of million test cases!

We should explicitly document this somewhere, to give people an anchor for what constitutes "a lot of test cases" - see e.g. https://danluu.com/testing/ under "Test Time".

@DRMacIver
Member

Agreed - like Paul, I was thinking essentially that we change characters() to have some more structure, and the magic part would be from Conjecture doing some cool stuff to the data distribution (ie weighting function).

I do not believe that this can be the case without a lot more structure on the part of characters().

The problem is that fundamentally Conjecture can only tell that something is interesting if it has tried it. The behaviour of its input is fairly opaque. While it is able to make some fairly good educated guesses about that behaviour, until it has seen a behaviour occur at least once it knows literally nothing about it. This makes it easy for it to boost the probability of rare-ish events (though it does not currently do a great job of that), but hard for it to boost the probability of events that are so rare that it will not see them on typical runs.

This is definitely something where more supporting features in Conjecture could help, but unless I'm missing something very clever I think the solution will look more like "Hairy hand-written generator for nasty unicode behaviours" than it will "Conjecture pulls a rabbit out of the hat and the rabbit starts suggesting some really nasty codepoints for us to test on".

I remember reading a paper

You're not thinking of "Swarm Testing" by Groce et al. are you? (it doesn't sound like it, but it's a similar-ish paper in this space)

@Zac-HD
Member

Zac-HD commented Aug 25, 2018

Without writing pseudocode: I was thinking of something else, at least partly inspired by the swarm testing paper but based on a less discrete decision space.

Taking "features" to correspond to "subsets of unicode that we expect to trigger similar behavior" (waves hands), basically we would vary the probability of drawing from each but use a continuous measure rather than binary in-or-out for the case. And if instead of drawing independent probabilities for each then normalizing we start by drawing the distribution type, we're up to meta-meta-swarm-testing (:sob:).

This could actually be written today, it would just be less helpful to the shrinker than we might like.

I'm not expecting a talking rabbit though; if nothing else there are so many codepoints and categories and ways for string code to go wrong that I'd expect to need thousands of test cases minimum to have checked every kind of issue!

@DRMacIver
Member

DRMacIver commented Aug 25, 2018

Without writing pseudocode: I was thinking of something else, at least partly inspired by the swarm testing paper but based on a less discrete decision space.

FWIW, this is how Hypothesis used to work. It got lost in the move to Conjecture, but I've always vaguely wondered about resurrecting it. Every strategy had two stages (actually more, but we can ignore that): First you drew a parameter, then you drew a value conditional on that parameter. This could be used to e.g. induce correlation in lists because you would draw children from the same parameter.

basically we would vary the probability of drawing from each but use a continuous measure rather than binary in-or-out for the case.

I think we might have actually used to do literally that, though I wouldn't swear to it. More likely we just had the binary version.

This could actually be written today, it would just be less helpful to the shrinker than we might like.

I've thought a bit about how to do shrinker-friendly swarm testing under the current model. It's actually not too bad, though I haven't actually tried it, so there are probably some annoying bits that I'm missing. The main points are:

  1. You need swarm flags to "shrink open", so that once the shrinker has run to completion all flags are enabled; e.g. you could do this by generating a set of banned flags.
  2. You need to use rejection sampling rather than anything more clever.

Basically, if you can then do characters().filter(character_class_is_enabled), what happens during shrinking is as follows:

  1. We delete all of the initial iterations of the loop. Now, as if by magic, we just happen to have picked only values that are enabled.
  2. The flags now shrink open, so we've left swarm mode and everything is now enabled.
  3. We can now shrink the values as normal characters.

The big "???" that is hard to make shrinker friendly is when we choose the swarm flags. It would be nice to generate them on first use rather than up front before we know if we need them, but if we do that then we end up getting into a pickle when we try to delete the first point where a swarm flag is used. I have some ideas about how to fix this, but they require experimentation. No, this has an easy fix with no shrinker fuckery: We just ensure that every time we check whether a flag is enabled, if it's already been set we call data.write to record the flag in the data stream, so that if we delete the first use the subsequent uses turn into an initialisation.

@Zac-HD Zac-HD self-assigned this Oct 4, 2018
@Zac-HD
Member

Zac-HD commented Oct 4, 2018

I'm going to open a few PRs to improve the variety of text we generate, then move discussion of how to drive that with swarm testing on the backend to a new issue.

(clearing .hypothesis between runs)

Late response, but you can also use @settings(database=None) to disable the database for that test.
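For example (a minimal sketch, reusing the test from the original report):

from hypothesis import given, settings
from hypothesis import strategies as st

import emoji

# Disable the example database so each run starts fresh, instead of deleting
# the .hypothesis directory by hand.
@settings(database=None)
@given(txt=st.text())
def test_contains_no_emoji(txt):
    assert not any(char in emoji.UNICODE_EMOJI for char in txt)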

@Zac-HD Zac-HD removed their assignment Dec 17, 2018
pganssle added a commit to pganssle/stdlib-property-tests that referenced this issue Jan 30, 2020
The documented contract of this function is that it is the inverse of
`datetime.isoformat()`.

See GH HypothesisWorks/hypothesis#1401, HypothesisWorks/hypothesis#69
and to a lesser extent HypothesisWorks/hypothesis#2273 for background on
why I have set max_examples so high.