Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ast.unparse produces bad code for identifiers that become keywords #90678

Open
Kodiologist mannequin opened this issue Jan 25, 2022 · 17 comments
Open

ast.unparse produces bad code for identifiers that become keywords #90678

Kodiologist mannequin opened this issue Jan 25, 2022 · 17 comments
Labels
3.9 only security fixes 3.10 only security fixes 3.11 only security fixes stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error

Comments

@Kodiologist
Copy link
Mannequin

Kodiologist mannequin commented Jan 25, 2022

BPO 46520
Nosy @terryjreedy, @benjaminp, @JelleZijlstra, @Kodiologist, @pablogsal, @isidentical, @sobolevn
PRs
  • bpo-46520: Handle identifiers that look like keywords in ast.unparse #31012
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = None
    created_at = <Date 2022-01-25.15:06:17.083>
    labels = ['type-bug', 'library', '3.9', '3.10', '3.11']
    title = 'ast.unparse produces bad code for identifiers that become keywords'
    updated_at = <Date 2022-01-29.18:06:22.414>
    user = 'https://github.com/Kodiologist'

    bugs.python.org fields:

    activity = <Date 2022-01-29.18:06:22.414>
    actor = 'Kodiologist'
    assignee = 'none'
    closed = False
    closed_date = None
    closer = None
    components = ['Library (Lib)']
    creation = <Date 2022-01-25.15:06:17.083>
    creator = 'Kodiologist'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 46520
    keywords = ['patch']
    message_count = 7.0
    messages = ['411606', '411669', '411742', '411780', '411781', '412045', '412078']
    nosy_count = 7.0
    nosy_names = ['terry.reedy', 'benjamin.peterson', 'JelleZijlstra', 'Kodiologist', 'pablogsal', 'BTaskaya', 'sobolevn']
    pr_nums = ['31012']
    priority = 'normal'
    resolution = None
    stage = 'patch review'
    status = 'open'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue46520'
    versions = ['Python 3.9', 'Python 3.10', 'Python 3.11']

    @Kodiologist
    Copy link
    Mannequin Author

    Kodiologist mannequin commented Jan 25, 2022

    This works:

    𝕕𝕖𝕗 = 1
    

    This raises SyntaxError:

        import ast
        exec(ast.unparse(ast.parse("𝕕𝕖𝕗 = 1")))

    It looks like ast.parse creates a Name node with id='def', which is correct per PEP-3131, but ast.unparse doesn't know it needs to mangle the output somehow, as "𝕕𝕖𝕗" or a similar Unicode replacement.

    @Kodiologist Kodiologist mannequin added 3.10 only security fixes stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error labels Jan 25, 2022
    @sobolevn
    Copy link
    Member

    I can confirm that it happens on all versions from 3.9 to 3.11 (main).

    Python 3.9.9 (main, Dec 21 2021, 11:35:28) 
    [Clang 11.0.0 (clang-1100.0.33.16)] on darwin
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import ast
    >>> ast.unparse(ast.parse("𝕕𝕖𝕗 = 1"))
    'def = 1'
    >>> exec(ast.unparse(ast.parse("𝕕𝕖𝕗 = 1")))  # SyntaxError
    
    Python 3.11.0a4+ (heads/main-dirty:ef3ef6fa43, Jan 20 2022, 20:48:25) [Clang 11.0.0 (clang-1100.0.33.16)] on darwin
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import ast
    >>> ast.unparse(ast.parse("𝕕𝕖𝕗 = 1"))
    'def = 1'
    >>> exec(ast.unparse(ast.parse("𝕕𝕖𝕗 = 1")))  # SyntaxError
    

    @sobolevn sobolevn added 3.9 only security fixes 3.11 only security fixes labels Jan 25, 2022
    @isidentical
    Copy link
    Member

    Technically, this is a bug on the fact that it breaks the only guarantee of ast.unparse:

    Unparse an ast.AST object and generate a string with code that would produce an equivalent ast.AST object if parsed back with ast.parse().

    But I am not really sure if it should be handled at all, since we don't have access to the original form of the identifier in the AST due to the parser's normalization behavior.

    If we want to only create a source that would give the same AST, abusing the fact that original keywords are always basic ASCII we could embed a map of characters that convert ASCII 'a', 'b', 'c', ... to their most similar unicode versions (https://util.unicode.org/UnicodeJsps/confusables.jsp). But I feel like this is a terrible idea, with no possible gain (very limited use case) and very prone to a lot of confusions.

    I think just adding a warning to the documentation regarding this should be the definite resolution, unless @pablogsal has any other idea.

    @Kodiologist
    Copy link
    Mannequin Author

    Kodiologist mannequin commented Jan 26, 2022

    I've done very little work on CPython, but I do a lot of AST construction and call ast.unparse a lot in my work on Hylang, and I think this is a wart worth fixing. The real mistake was letting the user say 𝕕𝕖𝕗 = 1, but that's been legal Python syntax for a long time, so I doubt a change to that would be welcome, especially one affecting old stable versions of Python like 3.9. Python has made its bed and now it must lie in it.

    I think that with a pretty small amount of code (using code-point arithmetic instead of a dictionary with every ASCII letter), I can add Unicode "escaping" of reserved words to the part of ast.unparse that renders variable names. Would a patch of this kind be welcome?

    @Kodiologist
    Copy link
    Mannequin Author

    Kodiologist mannequin commented Jan 26, 2022

    And yes, while this behavior will look strange, the only code that will parse to AST nodes that require it will be code that uses exactly the same trick.

    @terryjreedy
    Copy link
    Member

    'Reserved words' include all double underscore words, like __reserved__. Using such is allowed, but we reserve the right to break such code by adding a use for the word. 'def' is a keyword. Using identifier normalization to smuggle keywords into compiled code is a clever hack. But I am not sure that there is an actionable bug anywhere.

    The Unicode normalization rules are not defined by us. Changing how we use them or creating a custom normalization form is not to be done lightly.

    Should ast.parse raise?  The effect is the same as "globals()['𝕕𝕖𝕗']=1" (which is the same as passing 'def' or anything else that normalizes to it) and that in turn allows ">>> 𝕕𝕖𝕗", which returns 1.  Should such identifiers be outlawed?

    https://docs.python.org/3/reference/lexical_analysis.html#identifiers says "All identifiers are converted into the normal form NFKC while parsing; comparison of identifiers is based on NFKC." This does not say when an identifier is compared to the keyword set, before or after normalization. Currently is it before. Changing this to after could be considered a backwards-incompatible feature change that would require a deprecation period with syntax warnings. (Do other implementations also compare before normalization?)

    Batuhan already quoted https://docs.python.org/3/library/ast.html#ast.unparse and I mostly agree with his comments. The "would produce" part is contingent upon the result having no syntax errors, and that cannot be guaranteed. What could be done is to check every identifier against keywords and change the first character to a chosen NFKD equivalent. Although 'fixing' the ast this way would make unparse seem to work better succeed in this case, there are other fixes that might also be suggested for the same reason.

    Until this is done in CPython, anyone who cares could write an AST visitor to make the same change before calling unparse. Example code could be attached to this issue.

    @terryjreedy terryjreedy changed the title ast.unparse produces syntactically illegal code for identifiers that look like reserved words ast.unparse produces bad code for identifiers that become keywords Jan 29, 2022
    @terryjreedy terryjreedy changed the title ast.unparse produces syntactically illegal code for identifiers that look like reserved words ast.unparse produces bad code for identifiers that become keywords Jan 29, 2022
    @Kodiologist
    Copy link
    Mannequin Author

    Kodiologist mannequin commented Jan 29, 2022

    (Hilariously, I couldn't post this comment on bugs.python.org due to some kind of Unicode bug ("Edit Error: 'utf8' codec can't decode bytes in position 208-210: invalid continuation byte"), so I've rendered "\U0001D555\U0001D556\U0001D557" as "DEF" in the below.)

    Thanks for clarifying the terminology re: reserved words vs. keywords.

    The effect is the same as "globals()['DEF']=1" (which is the same as
    passing 'def' or anything else that normalizes to it) and that in
    turn allows ">>> DEF", which returns 1.

    This doesn't quite seem to be the case, at least on Pythons 3.9 and 3.10:

        >>> globals()['DEF']=1
        >>> DEF
        Traceback (most recent call last):
          File "<stdin>", line 1, in <module>
        NameError: name 'def' is not defined
        >>> globals()['def']=1
        >>> DEF
        1

    It looks the dictionary interface to globals doesn't normalize like the parser does.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    @Kodiologist
    Copy link
    Contributor

    The PR to fix this is a few months old. Can we get it moving?

    @isidentical
    Copy link
    Member

    I am not inclined to accept it. I don't want to quickly reject the issue, but want to listen the opinions of other core developers before. @pablogsal @serhiy-storchaka any feedback on this?

    @Kodiologist
    Copy link
    Contributor

    But then how else would the bug be fixed? Should we make 𝕕𝕖𝕗 = 1 syntactically illegal?

    @serhiy-storchaka
    Copy link
    Member

    Names like 𝕕𝕖𝕗 are just confusing. I would raise an error if a non-ASCII name is normalized to ASCII. Is there a valid use case for them?

    @Kodiologist
    Copy link
    Contributor

    In Hylang, they provide a way for us to produce Python code for Hy code like (send :from obj). send(from = obj) is illegal Python, so we need to produce something like send(𝕗𝕣𝕠𝕞 = obj) instead. Lisp doesn't need most reserved words in the way that C-like languages do, and our previous strategy of trying to determine what names would be Python-illegal where was error-prone.

    @Kodiologist
    Copy link
    Contributor

    That, in turn, would lead to fun surprises where a Hy program that was compiled into AST and run directly would work fine, but if you tried to produce a Python program for it and run that, it wouldn't work.

    @serhiy-storchaka
    Copy link
    Member

    All "bad" identifier characters:

    >>> ''.join(c for c in map(chr, range(0x80, 0x110000)) if ('a'+c).isidentifier() and unicodedata.normalize('NFKC', c) < '\u0080')
    

    'ªºIJijĿŀſDŽDždžLJLjljNJNjnjDZDzdzʰʲʳʷʸˡˢˣᴬᴮᴰᴱᴳᴴᴵᴶᴷᴸᴹᴺᴼᴾᴿᵀᵁᵂᵃᵇᵈᵉᵍᵏᵐᵒᵖᵗᵘᵛᵢᵣᵤᵥᶜᶠᶻẚⁱⁿₐₑₒₓₕₖₗₘₙₚₛₜℂℊℋℌℍℎℐℑℒℓℕℙℚℛℜℝℤℨKℬℭℯℰℱℳℴℹⅅⅆⅇⅈⅉⅠⅡⅢⅣⅤⅥⅦⅧⅨⅩⅪⅫⅬⅭⅮⅯⅰⅱⅲⅳⅴⅵⅶⅷⅸⅹⅺⅻⅼⅽⅾⅿⱼⱽꟲꟳꟴfffiflffifflſtst︳︴﹍﹎﹏0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz𐞥𝐀𝐁𝐂𝐃𝐄𝐅𝐆𝐇𝐈𝐉𝐊𝐋𝐌𝐍𝐎𝐏𝐐𝐑𝐒𝐓𝐔𝐕𝐖𝐗𝐘𝐙𝐚𝐛𝐜𝐝𝐞𝐟𝐠𝐡𝐢𝐣𝐤𝐥𝐦𝐧𝐨𝐩𝐪𝐫𝐬𝐭𝐮𝐯𝐰𝐱𝐲𝐳𝐴𝐵𝐶𝐷𝐸𝐹𝐺𝐻𝐼𝐽𝐾𝐿𝑀𝑁𝑂𝑃𝑄𝑅𝑆𝑇𝑈𝑉𝑊𝑋𝑌𝑍𝑎𝑏𝑐𝑑𝑒𝑓𝑔𝑖𝑗𝑘𝑙𝑚𝑛𝑜𝑝𝑞𝑟𝑠𝑡𝑢𝑣𝑤𝑥𝑦𝑧𝑨𝑩𝑪𝑫𝑬𝑭𝑮𝑯𝑰𝑱𝑲𝑳𝑴𝑵𝑶𝑷𝑸𝑹𝑺𝑻𝑼𝑽𝑾𝑿𝒀𝒁𝒂𝒃𝒄𝒅𝒆𝒇𝒈𝒉𝒊𝒋𝒌𝒍𝒎𝒏𝒐𝒑𝒒𝒓𝒔𝒕𝒖𝒗𝒘𝒙𝒚𝒛𝒜𝒞𝒟𝒢𝒥𝒦𝒩𝒪𝒫𝒬𝒮𝒯𝒰𝒱𝒲𝒳𝒴𝒵𝒶𝒷𝒸𝒹𝒻𝒽𝒾𝒿𝓀𝓁𝓂𝓃𝓅𝓆𝓇𝓈𝓉𝓊𝓋𝓌𝓍𝓎𝓏𝓐𝓑𝓒𝓓𝓔𝓕𝓖𝓗𝓘𝓙𝓚𝓛𝓜𝓝𝓞𝓟𝓠𝓡𝓢𝓣𝓤𝓥𝓦𝓧𝓨𝓩𝓪𝓫𝓬𝓭𝓮𝓯𝓰𝓱𝓲𝓳𝓴𝓵𝓶𝓷𝓸𝓹𝓺𝓻𝓼𝓽𝓾𝓿𝔀𝔁𝔂𝔃𝔄𝔅𝔇𝔈𝔉𝔊𝔍𝔎𝔏𝔐𝔑𝔒𝔓𝔔𝔖𝔗𝔘𝔙𝔚𝔛𝔜𝔞𝔟𝔠𝔡𝔢𝔣𝔤𝔥𝔦𝔧𝔨𝔩𝔪𝔫𝔬𝔭𝔮𝔯𝔰𝔱𝔲𝔳𝔴𝔵𝔶𝔷𝔸𝔹𝔻𝔼𝔽𝔾𝕀𝕁𝕂𝕃𝕄𝕆𝕊𝕋𝕌𝕍𝕎𝕏𝕐𝕒𝕓𝕔𝕕𝕖𝕗𝕘𝕙𝕚𝕛𝕜𝕝𝕞𝕟𝕠𝕡𝕢𝕣𝕤𝕥𝕦𝕧𝕨𝕩𝕪𝕫𝕬𝕭𝕮𝕯𝕰𝕱𝕲𝕳𝕴𝕵𝕶𝕷𝕸𝕹𝕺𝕻𝕼𝕽𝕾𝕿𝖀𝖁𝖂𝖃𝖄𝖅𝖆𝖇𝖈𝖉𝖊𝖋𝖌𝖍𝖎𝖏𝖐𝖑𝖒𝖓𝖔𝖕𝖖𝖗𝖘𝖙𝖚𝖛𝖜𝖝𝖞𝖟𝖠𝖡𝖢𝖣𝖤𝖥𝖦𝖧𝖨𝖩𝖪𝖫𝖬𝖭𝖮𝖯𝖰𝖱𝖲𝖳𝖴𝖵𝖶𝖷𝖸𝖹𝖺𝖻𝖼𝖽𝖾𝖿𝗀𝗁𝗂𝗃𝗄𝗅𝗆𝗇𝗈𝗉𝗊𝗋𝗌𝗍𝗎𝗏𝗐𝗑𝗒𝗓𝗔𝗕𝗖𝗗𝗘𝗙𝗚𝗛𝗜𝗝𝗞𝗟𝗠𝗡𝗢𝗣𝗤𝗥𝗦𝗧𝗨𝗩𝗪𝗫𝗬𝗭𝗮𝗯𝗰𝗱𝗲𝗳𝗴𝗵𝗶𝗷𝗸𝗹𝗺𝗻𝗼𝗽𝗾𝗿𝘀𝘁𝘂𝘃𝘄𝘅𝘆𝘇𝘈𝘉𝘊𝘋𝘌𝘍𝘎𝘏𝘐𝘑𝘒𝘓𝘔𝘕𝘖𝘗𝘘𝘙𝘚𝘛𝘜𝘝𝘞𝘟𝘠𝘡𝘢𝘣𝘤𝘥𝘦𝘧𝘨𝘩𝘪𝘫𝘬𝘭𝘮𝘯𝘰𝘱𝘲𝘳𝘴𝘵𝘶𝘷𝘸𝘹𝘺𝘻𝘼𝘽𝘾𝘿𝙀𝙁𝙂𝙃𝙄𝙅𝙆𝙇𝙈𝙉𝙊𝙋𝙌𝙍𝙎𝙏𝙐𝙑𝙒𝙓𝙔𝙕𝙖𝙗𝙘𝙙𝙚𝙛𝙜𝙝𝙞𝙟𝙠𝙡𝙢𝙣𝙤𝙥𝙦𝙧𝙨𝙩𝙪𝙫𝙬𝙭𝙮𝙯𝙰𝙱𝙲𝙳𝙴𝙵𝙶𝙷𝙸𝙹𝙺𝙻𝙼𝙽𝙾𝙿𝚀𝚁𝚂𝚃𝚄𝚅𝚆𝚇𝚈𝚉𝚊𝚋𝚌𝚍𝚎𝚏𝚐𝚑𝚒𝚓𝚔𝚕𝚖𝚗𝚘𝚙𝚚𝚛𝚜𝚝𝚞𝚟𝚠𝚡𝚢𝚣𝟎𝟏𝟐𝟑𝟒𝟓𝟔𝟕𝟖𝟗𝟘𝟙𝟚𝟛𝟜𝟝𝟞𝟟𝟠𝟡𝟢𝟣𝟤𝟥𝟦𝟧𝟨𝟩𝟪𝟫𝟬𝟭𝟮𝟯𝟰𝟱𝟲𝟳𝟴𝟵𝟶𝟷𝟸𝟹𝟺𝟻𝟼𝟽𝟾𝟿🯰🯱🯲🯳🯴🯵🯶🯷🯸🯹'

    @serhiy-storchaka
    Copy link
    Member

    In Hylang, they provide a way for us to produce Python code for Hy code like (send :from obj). send(from = obj) is illegal Python, so we need to produce something like send(𝕗𝕣𝕠𝕞 = obj) instead.

    It is a common problem of interoperability between different programming languages. For example, Tkinter allows you to use from_ and class_ to specify Tk options -from and -class. You can also write it as send(**{'from': obj}), but it does not looks nice.

    I do not think that requiring users to type send(𝕗𝕣𝕠𝕞 = obj) in their text editors is a very good idea too.

    @Kodiologist
    Copy link
    Contributor

    Yes, which I guess comes to show it's more useful for generated code (as in Hy's case, where the user just types (send :from obj) and the compiler does the Unicode-mangling) than hand-written code.

    @pablogsal
    Copy link
    Member

    I am of the same opinion as @serhiy-storchaka: raising an error if a non-ASCII name is normalized to ASCII

    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    3.9 only security fixes 3.10 only security fixes 3.11 only security fixes stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error
    Projects
    None yet
    Development

    No branches or pull requests

    6 participants