Better heuristics for generating weird unicode strings #3127

Open
Zac-HD opened this issue Oct 25, 2021 · 1 comment
Labels
enhancement it's not broken, but we want it to be better internals Stuff that only Hypothesis devs should ever see

Comments

@Zac-HD
Member

Zac-HD commented Oct 25, 2021

Generating strings which find all the possible bugs in a program is hard - even at a codepoint-by-codepoint level like in #1401. Worse, many bugs are only triggered by sequences of codepoints (e.g. combining characters or emoji composition) or by even more structured strings like XSS attacks.

Eventually, I would like to 'make our own luck' by teaching text() to pick from a list of known-weird strings (or templates for weird things) and then shrink the result as if we had randomly generated that sequence of codepoints. The mechanism for this is already on the wishlist in #3086; once that exists, it's mostly a matter of vendoring e.g. https://github.com/minimaxir/big-list-of-naughty-strings and whatever else we can think of based on e.g. Text Rendering Hates You, Text Editing Hates You Too, and so on (ligatures, RTL/LTR/TTB text directions, mixed-direction text, emoji modifiers, EICAR test string, ...).
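
To make the idea concrete, here is a rough standard-library sketch, not Hypothesis internals. `KNOWN_WEIRD`, `draw_codepoints`, `draw_text`, and the `p_pool` probability are all hypothetical names chosen for illustration. The point is that a pooled string is returned as a plain list of codepoints, so anything downstream (including a shrinker that works on codepoints) cannot tell it apart from a randomly generated one.

```python
import random

# Hypothetical pool of known-weird strings; a real one might be vendored
# from a curated list rather than written by hand.
KNOWN_WEIRD = [
    "\u202eabc",                  # right-to-left override before ASCII
    "e\u0301",                    # 'e' followed by a combining acute accent
    "\U0001F44D\U0001F3FD",       # thumbs-up plus a skin-tone modifier
    "<script>alert(1)</script>",  # minimal XSS-style payload
]


def draw_codepoints(rng, pool=KNOWN_WEIRD, p_pool=0.1, max_len=10):
    """Return a list of codepoints, occasionally copied from the pool."""
    if pool and rng.random() < p_pool:
        return [ord(c) for c in rng.choice(pool)]
    codepoints = []
    for _ in range(rng.randint(0, max_len)):
        cp = rng.randint(0, 0x10FFFF)
        while 0xD800 <= cp <= 0xDFFF:  # redraw surrogates
            cp = rng.randint(0, 0x10FFFF)
        codepoints.append(cp)
    return codepoints


def draw_text(rng, **kwargs):
    """Join the drawn codepoints into a str."""
    return "".join(map(chr, draw_codepoints(rng, **kwargs)))
```

The 10% default for `p_pool` is arbitrary; how often to draw from the pool is exactly the kind of heuristic this issue is about.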

@Zac-HD Zac-HD added enhancement it's not broken, but we want it to be better internals Stuff that only Hypothesis devs should ever see labels Oct 25, 2021
@Zac-HD
Member Author

Zac-HD commented Dec 1, 2021

It should also be possible to add more strings (and perhaps also ints, floats, and bytes) to this pool at runtime, to help with project-specific magic strings, much like AFL's "dictionaries" of interesting tokens.
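
A minimal sketch of what such a runtime hook could look like, assuming one pool per literal type. Nothing here is an existing Hypothesis API: `_POOLS`, `register_interesting`, and `interesting_values` are hypothetical names.

```python
# Hypothetical per-type pools of interesting values.
_POOLS = {str: [], bytes: [], int: [], float: []}


def register_interesting(*values):
    """Add project-specific values to the pool for their exact type."""
    for value in values:
        kind = type(value)
        if kind not in _POOLS:  # also rejects bool, which subclasses int
            raise TypeError(f"unsupported literal type: {kind.__name__}")
        if value not in _POOLS[kind]:  # keep insertion order, skip duplicates
            _POOLS[kind].append(value)


def interesting_values(kind):
    """Return the registered values of one type as an immutable snapshot."""
    return tuple(_POOLS[kind])
```

A project could then call something like `register_interesting("DROP TABLE", b"\x89PNG", 2**31)` once at import time, in the same spirit as passing a dictionary file to a fuzzer.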

As an extension, we could automate the "run strings" trick of reading interesting literals out of the program under test, in our case by grabbing the AST of loaded modules and walking it in search of short literals or statically-evaluable expressions.
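
The AST walk could be sketched roughly as follows, using only the standard `ast` and `inspect` modules. The function names and the `max_len` cutoff are assumptions for illustration. This version only collects plain constants; it does not fold statically-evaluable expressions, so `-1` is seen as the constant `1` under a unary minus.

```python
import ast
import inspect


def collect_literals(source, max_len=20):
    """Map each literal type to the set of short literals found in `source`."""
    found = {}
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Constant):
            continue
        value = node.value
        # bool is a subclass of int, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, (str, bytes, int, float)):
            continue
        if isinstance(value, (str, bytes)) and len(value) > max_len:
            continue
        found.setdefault(type(value), set()).add(value)
    return found


def literals_from_module(module, max_len=20):
    """Best-effort variant for an already-imported module."""
    try:
        source = inspect.getsource(module)
    except (OSError, TypeError):  # built-ins, C extensions, missing source
        return {}
    return collect_literals(source, max_len)
```

Running `literals_from_module` over the relevant entries of `sys.modules` would give a per-type set of candidates to feed into the pool. Short docstrings are picked up too, which may or may not be desirable.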
