ruff_textwrap accepts illegal Python whitespace #4991

addisoncrump · 2023-06-09T21:23:41Z

Python does not consider U+00a0 (no-break space) to be whitespace, but Rust's trim_start does, because it strips every character with the Unicode White_Space property: https://doc.rust-lang.org/std/string/struct.String.html#method.trim_start
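
The mismatch can be checked with a few lines of Rust. This is a minimal sketch, not code from ruff: `trim_start_python` is a hypothetical helper, and it assumes that Python's leading whitespace is limited to space, tab, and form feed.

```rust
/// Hypothetical helper (not part of ruff): strip only the characters assumed
/// here to be valid Python leading whitespace: space, tab, and form feed.
fn trim_start_python(s: &str) -> &str {
    s.trim_start_matches(|c: char| matches!(c, ' ' | '\t' | '\x0C'))
}

/// Rust's built-in `trim_start` strips every character with the Unicode
/// White_Space property, which includes U+00A0 (no-break space).
fn trim_start_unicode(s: &str) -> &str {
    s.trim_start()
}
```

For the input `"\u{a0} x"`, `trim_start_unicode` returns `"x"`, while `trim_start_python` returns the input unchanged because the first character is not Python whitespace.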

In addition, Rust's str::len() method returns the length in bytes, not in characters. Combined with the above, this can confuse ruff_textwrap: https://github.com/astral-sh/ruff/blob/main/crates/ruff_textwrap/src/lib.rs#L95
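
To make the byte/character distinction concrete, here is a small sketch (the function name `byte_and_char_len` is made up for illustration). U+00A0 is encoded as two bytes in UTF-8, so it counts as 2 in `len()` but as 1 character.

```rust
/// Returns (length in bytes, number of characters) for a string.
/// `str::len` counts UTF-8 bytes; `chars().count()` counts Unicode scalar values.
fn byte_and_char_len(s: &str) -> (usize, usize) {
    (s.len(), s.chars().count())
}
```

`byte_and_char_len("\u{a0}")` gives `(2, 1)`, whereas an ASCII space gives `(1, 1)`.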

This issue potentially indicates that ruff_textwrap::dedent processes leading whitespace incorrectly in general. For example, mixed tabs and spaces, non-ASCII whitespace, and similar inputs are currently mishandled by dedent.

How this was discovered

Since trim_start removes Unicode whitespace, U+00a0 used as leading whitespace leads to a panic if an odd number of single-byte whitespace characters (e.g., spaces) are used as leading whitespace later. The string slicing then attempts to slice in the middle of a multi-byte UTF-8 code point, because prefix_len is computed in bytes, not in characters.
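
The failure mode can be sketched in isolation. The following is a simplified stand-in, not the actual ruff_textwrap code: both function names are hypothetical, and `str::get` is used so that the bad slice shows up as `None` instead of a panic.

```rust
/// Byte length of the leading whitespace, computed as total bytes minus the
/// bytes remaining after `trim_start` (which also strips U+00A0).
fn leading_ws_bytes(line: &str) -> usize {
    line.len() - line.trim_start().len()
}

/// Hypothetical stand-in for the slicing step: drop `prefix_len` bytes from
/// the front of the line. `str::get` returns `None` in exactly the cases
/// where `&line[prefix_len..]` would panic (index not on a char boundary,
/// or out of range).
fn strip_prefix_bytes(line: &str, prefix_len: usize) -> Option<&str> {
    line.get(prefix_len..)
}
```

With one line indented by U+00A0 (2 bytes) and another by a single space (1 byte), the smaller prefix is 1 byte. Cutting 1 byte from the first line lands inside the two-byte encoding of U+00A0, so `strip_prefix_bytes("\u{a0}x", 1)` is `None`; the equivalent direct slice would panic.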

Here is a reproducing file that triggers this behaviour (it is not valid Python syntax; ignore that): illegal-whitespace.txt


charliermarsh · 2023-06-09T21:38:31Z

Thank you!

…tespace (#4994)

## Summary

We use `.trim()` and friends in a bunch of places to strip whitespace from source code. However, not all Unicode whitespace characters are considered "whitespace" in Python, which only supports the standard space, tab, and form-feed characters. This PR audits our usages of `.trim()`, `.trim_start()`, `.trim_end()`, and `char::is_whitespace`, and replaces them as appropriate with new `.trim_whitespace()` analogues, powered by a `PythonWhitespace` trait. In general, the only place that should continue to use `.trim()` is content within docstrings, which doesn't need to adhere to Python's semantic definitions of whitespace.

Closes #4991.
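
As a rough illustration of the approach described in that summary, a trait of this kind could look like the sketch below. Only the names `PythonWhitespace` and `.trim_whitespace()` come from the summary; the other method names, the helper `is_python_whitespace`, and all bodies are assumptions and may differ from what was merged.

```rust
/// Sketch of a trait in the spirit of the `PythonWhitespace` trait named in
/// the PR summary. The method set and bodies are assumptions, not ruff's code.
trait PythonWhitespace {
    fn trim_whitespace(&self) -> &Self;
    fn trim_whitespace_start(&self) -> &Self;
    fn trim_whitespace_end(&self) -> &Self;
}

/// Hypothetical predicate: the three characters the summary lists as
/// Python whitespace (space, tab, form feed).
fn is_python_whitespace(c: char) -> bool {
    matches!(c, ' ' | '\t' | '\x0C')
}

impl PythonWhitespace for str {
    fn trim_whitespace(&self) -> &Self {
        self.trim_matches(is_python_whitespace)
    }

    fn trim_whitespace_start(&self) -> &Self {
        self.trim_start_matches(is_python_whitespace)
    }

    fn trim_whitespace_end(&self) -> &Self {
        self.trim_end_matches(is_python_whitespace)
    }
}
```

Because the predicate never matches U+00A0, a call such as `"\u{a0} x".trim_whitespace_start()` leaves the string untouched, so any prefix length derived from it stays on a character boundary.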

addisoncrump · 2023-06-10T02:09:08Z

Looks like this was a potentially incomplete fix; I am still able to trigger this behaviour. See this line: https://github.com/astral-sh/ruff/blob/main/crates/ruff_textwrap/src/lib.rs#L117

charliermarsh · 2023-06-10T02:17:52Z

I'm confused, I fixed that and added a test all in that same PR, but somehow it didn't end up in the diff on GitHub? Operator error somewhere.


addisoncrump mentioned this issue Jun 9, 2023

Use ruff_fix_validity to catch regressions in CI not detected by unit tests #4972

Open

24 tasks

charliermarsh added the bug label Jun 9, 2023

charliermarsh self-assigned this Jun 10, 2023

charliermarsh mentioned this issue Jun 10, 2023

Introduce PythonWhitespace to confine trim operations to Python whitespace #4994

Merged

charliermarsh closed this as completed in #4994 Jun 10, 2023
