-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ruff_textwrap accepts illegal Python whitespace #4991
Labels
bug
Something isn't working
Comments
24 tasks
Thank you! |
charliermarsh
added a commit
that referenced
this issue
Jun 10, 2023
…tespace (#4994) ## Summary We use `.trim()` and friends in a bunch of places, to strip whitespace from source code. However, not all Unicode whitespace characters are considered "whitespace" in Python, which only supports the standard space, tab, and form-feed characters. This PR audits our usages of `.trim()`, `.trim_start()`, `.trim_end()`, and `char::is_whitespace`, and replaces them as appropriate with a new `.trim_whitespace()` analogues, powered by a `PythonWhitespace` trait. In general, the only place that should continue to use `.trim()` is content within docstrings, which don't need to adhere to Python's semantic definitions of whitespace. Closes #4991.
Looks like this was a potentially incomplete fix; I am still able to trigger this behaviour. See this line: https://github.com/astral-sh/ruff/blob/main/crates/ruff_textwrap/src/lib.rs#L117 |
I'm confused, I fixed that and added a test all in that same PR, but somehow it didn't end up in the diff on GitHub? Operator error somewhere. |
konstin
pushed a commit
that referenced
this issue
Jun 13, 2023
…tespace (#4994) ## Summary We use `.trim()` and friends in a bunch of places, to strip whitespace from source code. However, not all Unicode whitespace characters are considered "whitespace" in Python, which only supports the standard space, tab, and form-feed characters. This PR audits our usages of `.trim()`, `.trim_start()`, `.trim_end()`, and `char::is_whitespace`, and replaces them as appropriate with a new `.trim_whitespace()` analogues, powered by a `PythonWhitespace` trait. In general, the only place that should continue to use `.trim()` is content within docstrings, which don't need to adhere to Python's semantic definitions of whitespace. Closes #4991.
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Python does not consider
U+00a0
to be whitespace, but this character is whitespace according to Rust'strim_start
(which uses Unicode character classes): https://doc.rust-lang.org/std/string/struct.String.html#method.trim_startIn addition, Rust's str::len() method returns the length in bytes, not in characters. This combination can lead to a confusion in ruff_textwrap: https://github.com/astral-sh/ruff/blob/main/crates/ruff_textwrap/src/lib.rs#L95
This issue potentially indicates incorrect processing of leading whitespace in ruff_textwrap::dedent. For example, the mixing of tabs and spaces, the use of non-ASCII whitespace, etc. will be misprocessed by dedent currently.
How this was discovered
Since trim_start removes UTF-8 whitespace,
U+00a0
used as leading whitespace leads to a panic if an odd number of single-byte whitespace (e.g., space) characters are used as leading whitespace later. The string slicing attempts to slice between UTF-8 code points becauseprefix_len
is computed by number of bytes, not in characters.Here is a reproducing (illegal python syntax, don't pay attention to that) file which triggers this behaviour: illegal-whitespace.txt
The text was updated successfully, but these errors were encountered: