`_INDENT_RE` is too narrow - `'\x0c '` is strange but valid indent whitespace #446

Zac-HD · 2021-01-12T12:48:31Z

While working on my shed autoformatter, which includes some libcst-based passes, Hypothesmith uncovered a bug:

Line 20 in c22ed6a

_INDENT_RE: Pattern[str] = re.compile(r"[ \t]+")

For example, 'class A:\n\x0c pass\n' is a valid class-declaration. It can even be parsed by libcst.parse_module(), but the refactoring tooling chokes on it.
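A minimal sketch of the mismatch, using only the standard library. `_INDENT_RE` mirrors the pattern quoted above; `WIDER_INDENT_RE` is a hypothetical widened pattern shown purely for comparison, not a proposed fix for libcst:

```python
import re

# Pattern quoted in the issue (copied here so the example is self-contained).
_INDENT_RE = re.compile(r"[ \t]+")

# Hypothetical wider pattern that also accepts form feed; an assumption for
# illustration only, not what libcst does.
WIDER_INDENT_RE = re.compile(r"[ \t\x0c]+")

SOURCE = "class A:\n\x0c pass\n"
INDENT = "\x0c "  # the indent whitespace in front of `pass`

# CPython itself accepts the source; compile() raises SyntaxError otherwise.
code = compile(SOURCE, "<example>", "exec")

# The narrow pattern rejects the indent, the wider one accepts it.
narrow_matches = _INDENT_RE.fullmatch(INDENT) is not None
wide_matches = WIDER_INDENT_RE.fullmatch(INDENT) is not None
print(narrow_matches, wide_matches)
```

Accepting the character in a regex is only part of the story, since the form feed also changes how the indent width is counted.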

(this is a terrible bug to report and I hope it never happens in the wild, but there you go!)


zsol · 2021-01-12T12:54:01Z

uhhhh... thanks? :) 🥇

Zac-HD · 2021-01-12T12:56:36Z

You're, uh, welcome. I really am sorry 😅

FWIW I've worked around this, so if you want to just close it as out-of-scope I will be perfectly fine with that.

bgw · 2021-01-25T06:47:28Z

If anyone is curious...

Documentation about the formfeed character is here: https://docs.python.org/3.9/reference/lexical_analysis.html#indentation

It gets handled here in tokenizer.c: https://github.com/python/cpython/blob/v3.9.1/Parser/tokenizer.c#L1198

It looks like it was originally added to accommodate this emacs usecase, but the exact details about how it's handled are a bit weird (it resets the column counter).

aleivag · 2023-05-26T01:05:29Z

Ultracrepidarian comment here: the character \x0c (a.k.a. ^L) is used to reset the column count, so the only whitespace that matters is whatever comes after the last \x0c on the line. For instance, ' \x0c ' counts as a single space, and ' \x0c \x0c ' also counts as 1 space.

Here is a good repro: 'def foo():\n \x0c class A:\n\x0c  pass\n return A' is basically equivalent to
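The counting rule can be sketched as a small helper. This is an illustration based on the indentation rules in the Python language reference (tabs advance to the next multiple of 8, form feed resets the count), not CPython's or libcst's actual code; `indent_column` is a hypothetical name:

```python
def indent_column(prefix: str, tabsize: int = 8) -> int:
    """Return the effective indent column of a line's leading whitespace.

    Hypothetical helper for illustration: a space adds one column, a tab
    advances to the next multiple of `tabsize`, and a form feed resets
    the column to zero.
    """
    col = 0
    for ch in prefix:
        if ch == " ":
            col += 1
        elif ch == "\t":
            col = (col // tabsize + 1) * tabsize
        elif ch == "\x0c":
            col = 0  # form feed: forget everything counted so far
        else:
            break  # first non-indent character ends the prefix
    return col


print(indent_column(" \x0c "))       # whitespace before \x0c is discarded
print(indent_column(" \x0c \x0c "))  # only the final space counts
print(indent_column("\x0c  "))       # two spaces after the reset
```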

def foo():
 class A:
  pass
 return A

as shown in:

~/code/libcst >>> ./venv/bin/hatch run ipython                                                                                     ±[●][main]
Python 3.10.10 (main, Mar  5 2023, 22:26:53) [GCC 12.2.1 20230201]
Type 'copyright', 'credits' or 'license' for more information
IPython 8.13.2 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import libcst as cst

In [2]: MOD = 'def foo():\n                        \x0c class A:\n\x0c  pass\n return A'

In [3]: c = compile(MOD, "__main__", "exec")

In [4]: exec(c)

In [5]: foo()
Out[5]: __main__.foo.<locals>.A
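As a further check using only the standard library, the form-feed source and a plainly indented version can be compared at the AST level. This sketch assumes CPython's own parser; `PLAIN` is the hand-normalized equivalent written out for this example:

```python
import ast

# The repro string from the session above.
MOD = 'def foo():\n                        \x0c class A:\n\x0c  pass\n return A'

# Hand-normalized equivalent without any form feeds (an assumption of what
# the indentation reduces to: 1 column, 2 columns, 1 column).
PLAIN = 'def foo():\n class A:\n  pass\n return A'

# ast.dump() omits line and column attributes by default, so the two dumps
# are equal exactly when the tree structure is the same.
same_tree = ast.dump(ast.parse(MOD)) == ast.dump(ast.parse(PLAIN))
print(same_tree)

# Both versions also execute and define a callable foo() returning a class.
namespace = {}
exec(compile(MOD, "<mod>", "exec"), namespace)
returned = namespace["foo"]()
print(returned.__name__)
```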

now passing MOD through parse_module

In [6]: cst.parse_module(MOD)
---------------------------------------------------------------------------
ParserSyntaxError                         Traceback (most recent call last)
Cell In[6], line 1
----> 1 cst.parse_module(MOD)

File ~/code/libcst/libcst/_parser/entrypoints.py:109, in parse_module(source, config)
     94 def parse_module(
     95     source: Union[str, bytes],  # the only entrypoint that accepts bytes
     96     config: PartialParserConfig = _DEFAULT_PARTIAL_PARSER_CONFIG,
     97 ) -> Module:
     98     """
     99     Accepts an entire python module, including all leading and trailing whitespace.
    100 
   (...)
    107     attribute.
    108     """
--> 109     result = _parse(
    110         "file_input",
    111         source,
    112         config,
    113         detect_trailing_newline=True,
    114         detect_default_newline=True,
    115     )
    116     assert isinstance(result, Module)
    117     return result

File ~/code/libcst/libcst/_parser/entrypoints.py:55, in _parse(entrypoint, source, config, detect_trailing_newline, detect_default_newline)
     52     else:
     53         raise ValueError(f"Unknown parser entry point: {entrypoint}")
---> 55     return parse(source_str)
     56 return _pure_python_parse(
     57     entrypoint,
     58     source,
   (...)
     61     detect_default_newline=detect_default_newline,
     62 )

ParserSyntaxError: Syntax Error @ 1:1.
tokenizer error: no matching outer block for dedent

def foo():
^

I think this is a good starting point for looking into this.

Zac-HD mentioned this issue Feb 21, 2021

Black fails to tokenise files ending with a backslash psf/black#1012

Closed

bgw mentioned this issue Mar 14, 2021

[native] Add a rust implementation of whitespace_parser #452

Closed

zsol added the bug label May 19, 2021

zsol added the parsing label Jun 16, 2022

Zac-HD mentioned this issue May 26, 2023

Real-world code snippets which libcst fails to parse #930

Closed

Zac-HD mentioned this issue Aug 29, 2023

New falsifying example for test_isort_is_idempotent PyCQA/isort#2171

Open
