_INDENT_RE is too narrow - '\x0c ' is strange but valid indent whitespace #446

Open
Zac-HD opened this issue Jan 12, 2021 · 4 comments
Labels
bug Something isn't working parsing Converting source code into CST nodes

Comments

@Zac-HD
Contributor

Zac-HD commented Jan 12, 2021

While working on my shed autoformatter, which includes some libcst-based passes, Hypothesmith uncovered a bug:

_INDENT_RE: Pattern[str] = re.compile(r"[ \t]+")

For example, 'class A:\n\x0c pass\n' is a valid class declaration, because CPython accepts a form feed (\x0c) as indentation whitespace. libcst.parse_module() can even parse it, but the refactoring tooling chokes on it.

(this is a terrible bug to report and I hope it never happens in the wild, but there you go!)
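For anyone who wants to see the mismatch without libcst installed, here is a minimal stdlib-only sketch. It copies the pattern quoted above into a local name and compares it with a widened pattern. `WIDER_INDENT_RE` is a hypothetical name used for illustration only; it is an assumption, not libcst's actual fix.

```python
import re

# The pattern quoted above: it accepts spaces and tabs only.
INDENT_RE = re.compile(r"[ \t]+")

source = "class A:\n\x0c pass\n"
indent = "\x0c "  # form feed followed by a single space

# CPython itself accepts the form feed as part of the indentation,
# so compiling the snippet does not raise.
compile(source, "<example>", "exec")

# The narrow pattern cannot match at the start of the indent,
# because the first character is a form feed.
narrow_match = INDENT_RE.match(indent)

# Hypothetical widened pattern (illustration only, not libcst's code).
WIDER_INDENT_RE = re.compile(r"[ \t\x0c]+")
wide_match = WIDER_INDENT_RE.fullmatch(indent)

print(narrow_match)         # no match object
print(repr(wide_match[0]))  # the whole indent string
```

Widening the character class only makes the regex accept the indent. It does not capture how the form feed changes the indentation width, so it is probably not sufficient on its own.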

@zsol
Member

zsol commented Jan 12, 2021

uhhhh... thanks? :) 🥇

@Zac-HD
Contributor Author

Zac-HD commented Jan 12, 2021

You're, uh, welcome. I really am sorry 😅

FWIW I've worked around this, so if you want to just close it as out-of-scope I will be perfectly fine with that.

@bgw
Contributor

bgw commented Jan 25, 2021

If anyone is curious...

Documentation about the formfeed character is here: https://docs.python.org/3.9/reference/lexical_analysis.html#indentation

It gets handled here in tokenizer.c: https://github.com/python/cpython/blob/v3.9.1/Parser/tokenizer.c#L1198

It looks like it was originally added to accommodate this emacs use case, but the exact details of how it's handled are a bit weird: a form feed in the leading whitespace resets the column counter to zero.
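To make the column-reset behaviour concrete, here is a rough Python sketch of the counting rule as described in the lexical analysis docs and the linked tokenizer code: a space adds one column, a tab advances to the next multiple of eight, and a form feed sets the column back to zero. `indent_column` is a hypothetical helper written for this thread, not a function in CPython or libcst.

```python
def indent_column(prefix: str, tabsize: int = 8) -> int:
    """Approximate the indentation column CPython computes for a line prefix.

    Hypothetical helper for illustration; it mirrors the rule described in
    the docs, not any actual libcst or CPython API.
    """
    col = 0
    for ch in prefix:
        if ch == " ":
            col += 1
        elif ch == "\t":
            # Advance to the next tab stop.
            col = (col // tabsize + 1) * tabsize
        elif ch == "\x0c":
            # Form feed: forget everything counted so far.
            col = 0
        else:
            # First non-whitespace character ends the indentation.
            break
    return col


print(indent_column("\x0c "))        # 1
print(indent_column(" \x0c "))       # 1
print(indent_column("    \x0c\t"))   # 8
```

So whatever comes before the last form feed in the indentation does not contribute to the indentation level.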

@zsol zsol added the bug Something isn't working label May 19, 2021
@zsol zsol added the parsing Converting source code into CST nodes label Jun 16, 2022
@aleivag
Contributor

aleivag commented May 26, 2023

Ultracrepidarian comment here: the character \x0c (a.k.a. ^L) is used to reset the column count, so the only whitespace that matters is what comes after the last \x0c in the line's indentation. For instance, ' \x0c ' counts as a single space, and ' \x0c \x0c ' also counts as 1 space.

Here is a good repro: 'def foo():\n \x0c class A:\n\x0c  pass\n return A'. It's basically equivalent to

def foo():
 class A:
  pass
 return A

as shown in:

~/code/libcst >>> ./venv/bin/hatch run ipython                                                                                     ±[●][main]
Python 3.10.10 (main, Mar  5 2023, 22:26:53) [GCC 12.2.1 20230201]
Type 'copyright', 'credits' or 'license' for more information
IPython 8.13.2 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import libcst as cst

In [2]: MOD = 'def foo():\n                        \x0c class A:\n\x0c  pass\n return A'

In [3]: c = compile(MOD, "__main__", "exec")

In [4]: exec(c)

In [5]: foo()
Out[5]: __main__.foo.<locals>.A

Now passing MOD through parse_module:

In [6]: cst.parse_module(MOD)
---------------------------------------------------------------------------
ParserSyntaxError                         Traceback (most recent call last)
Cell In[6], line 1
----> 1 cst.parse_module(MOD)

File ~/code/libcst/libcst/_parser/entrypoints.py:109, in parse_module(source, config)
     94 def parse_module(
     95     source: Union[str, bytes],  # the only entrypoint that accepts bytes
     96     config: PartialParserConfig = _DEFAULT_PARTIAL_PARSER_CONFIG,
     97 ) -> Module:
     98     """
     99     Accepts an entire python module, including all leading and trailing whitespace.
    100 
   (...)
    107     attribute.
    108     """
--> 109     result = _parse(
    110         "file_input",
    111         source,
    112         config,
    113         detect_trailing_newline=True,
    114         detect_default_newline=True,
    115     )
    116     assert isinstance(result, Module)
    117     return result

File ~/code/libcst/libcst/_parser/entrypoints.py:55, in _parse(entrypoint, source, config, detect_trailing_newline, detect_default_newline)
     52     else:
     53         raise ValueError(f"Unknown parser entry point: {entrypoint}")
---> 55     return parse(source_str)
     56 return _pure_python_parse(
     57     entrypoint,
     58     source,
   (...)
     61     detect_default_newline=detect_default_newline,
     62 )

ParserSyntaxError: Syntax Error @ 1:1.
tokenizer error: no matching outer block for dedent

def foo():
^

I think this is a good starting point for looking into this.
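Until this is fixed, one possible workaround is to normalize the source before parsing by dropping everything up to and including the last form feed in each line's leading whitespace. The sketch below is stdlib-only and `strip_formfeed_indent` is a hypothetical helper, not part of libcst. It is lossy (the form feeds are gone from the output) and it would also rewrite matching lines inside multi-line string literals, so treat it as a rough idea rather than a safe transformation.

```python
import re

# Leading run of spaces, tabs, and form feeds, up to the LAST form feed
# in that run (the greedy match backtracks to the final \x0c).
_FORMFEED_PREFIX = re.compile(r"^[ \t\x0c]*\x0c", re.MULTILINE)


def strip_formfeed_indent(source: str) -> str:
    """Drop indentation that a form feed would have reset anyway.

    Hypothetical helper for illustration only. It does not tokenize, so it
    also touches lines inside multi-line strings.
    """
    return _FORMFEED_PREFIX.sub("", source)


MOD = "def foo():\n                        \x0c class A:\n\x0c  pass\n return A"
cleaned = strip_formfeed_indent(MOD)
print(repr(cleaned))

# Both versions compile, and the cleaned one contains no form feeds.
compile(MOD, "<original>", "exec")
compile(cleaned, "<cleaned>", "exec")
```

The cleaned string is the plain-space version of the repro shown earlier in this comment, which is the kind of input one would expect to be safe to hand to parse_module.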
