Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ruff_textwrap accepts illegal Python whitespace #4991

Closed
Tracked by #4972
addisoncrump opened this issue Jun 9, 2023 · 3 comments · Fixed by #4994
Closed
Tracked by #4972

ruff_textwrap accepts illegal Python whitespace #4991

addisoncrump opened this issue Jun 9, 2023 · 3 comments · Fixed by #4994
Assignees
Labels
bug Something isn't working

Comments

@addisoncrump
Copy link
Contributor

addisoncrump commented Jun 9, 2023

Python does not consider U+00a0 to be whitespace, but this character is whitespace according to Rust's trim_start (which uses Unicode character classes): https://doc.rust-lang.org/std/string/struct.String.html#method.trim_start

In addition, Rust's str::len() method returns the length in bytes, not in characters. This combination can lead to a confusion in ruff_textwrap: https://github.com/astral-sh/ruff/blob/main/crates/ruff_textwrap/src/lib.rs#L95

This issue potentially indicates incorrect processing of leading whitespace in ruff_textwrap::dedent. For example, the mixing of tabs and spaces, the use of non-ASCII whitespace, etc. will be misprocessed by dedent currently.


How this was discovered

Since trim_start removes UTF-8 whitespace, U+00a0 used as leading whitespace leads to a panic if an odd number of single-byte whitespace (e.g., space) characters are used as leading whitespace later. The string slicing attempts to slice between UTF-8 code points because prefix_len is computed by number of bytes, not in characters.

Here is a reproducing (illegal python syntax, don't pay attention to that) file which triggers this behaviour: illegal-whitespace.txt

@charliermarsh
Copy link
Member

Thank you!

@charliermarsh charliermarsh self-assigned this Jun 10, 2023
charliermarsh added a commit that referenced this issue Jun 10, 2023
…tespace (#4994)

## Summary

We use `.trim()` and friends in a bunch of places, to strip whitespace
from source code. However, not all Unicode whitespace characters are
considered "whitespace" in Python, which only supports the standard
space, tab, and form-feed characters.

This PR audits our usages of `.trim()`, `.trim_start()`, `.trim_end()`,
and `char::is_whitespace`, and replaces them as appropriate with a new
`.trim_whitespace()` analogues, powered by a `PythonWhitespace` trait.

In general, the only place that should continue to use `.trim()` is
content within docstrings, which don't need to adhere to Python's
semantic definitions of whitespace.

Closes #4991.
@addisoncrump
Copy link
Contributor Author

Looks like this was a potentially incomplete fix; I am still able to trigger this behaviour. See this line: https://github.com/astral-sh/ruff/blob/main/crates/ruff_textwrap/src/lib.rs#L117

@charliermarsh
Copy link
Member

I'm confused, I fixed that and added a test all in that same PR, but somehow it didn't end up in the diff on GitHub? Operator error somewhere.

konstin pushed a commit that referenced this issue Jun 13, 2023
…tespace (#4994)

## Summary

We use `.trim()` and friends in a bunch of places, to strip whitespace
from source code. However, not all Unicode whitespace characters are
considered "whitespace" in Python, which only supports the standard
space, tab, and form-feed characters.

This PR audits our usages of `.trim()`, `.trim_start()`, `.trim_end()`,
and `char::is_whitespace`, and replaces them as appropriate with a new
`.trim_whitespace()` analogues, powered by a `PythonWhitespace` trait.

In general, the only place that should continue to use `.trim()` is
content within docstrings, which don't need to adhere to Python's
semantic definitions of whitespace.

Closes #4991.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants