Adding a new language construct ‐ Python 3.6 Formatted String Literals

chu23465 edited this page Feb 17, 2024 · 1 revision

To explain how to add new Python language features to uncompyle6, I'll walk through an example: adding Python 3.6 Formatted String Literals (PEP 498). I don't cover the full spec, just a simplified version of it.

In July 2016, Python 3.6 had not been fully released and was still using bytecode instead of wordcode. At that time, if you wrote:

def fn(var1, var2):
    return f'{var1}py36_string_interpolation{var2}'

this got translated to:

     2           0  LOAD_CONST                ''
                 3  LOAD_ATTR                 'join'
                 6  LOAD_FAST                 'var1'
                 9  FORMAT_VALUE           0  ''
                12  LOAD_CONST                'py36_string_interpolation'
                15  LOAD_FAST                 'var2'
                18  FORMAT_VALUE           0  ''
                21  BUILD_LIST             3  ''
                24  CALL_FUNCTION          1  '1 positional, 0 keyword pair'
                27  RETURN_VALUE

A literal decompilation of the opcodes would be something like:

''.join([fv('var1'), 'py36_string_interpolation', fv('var2')])
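As a rough illustration of why this reading is plausible, here is a minimal sketch in which `fv` is a hypothetical stand-in for what FORMAT_VALUE does when its flag is 0. The name `fv` comes from the translation above; its body, and the fact that it receives the variable rather than a string constant, are my assumptions, not anything uncompyle6 or CPython defines.

```python
def fv(value, spec=''):
    # Hypothetical stand-in for FORMAT_VALUE with flag 0:
    # format the value with an empty format specifier.
    return format(value, spec)


def fn(var1, var2):
    return f'{var1}py36_string_interpolation{var2}'


def fn_literal(var1, var2):
    # The "literal decompilation": join the formatted pieces.
    return ''.join([fv(var1), 'py36_string_interpolation', fv(var2)])


print(fn(1, 'x'))
print(fn_literal(1, 'x'))
```

Both functions should return the same string for the same arguments.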

Try running uncompyle6 --tree on my rough translation above to get a feel for what the grammar looks like. It's long so I will not copy all of it here.

The tree is a little different from the opcodes above, though. Instead of:

  call (3)
       0. expr
            6  LOAD_NAME      1  'fv'
       1. expr
           10  LOAD_CONST     1  'var1'
       2.  12  CALL_FUNCTION  1

the call is replaced by:

    LOAD_FAST      'var1'

In other words, a FORMAT_VALUE opcode was added as a special case of a particular kind of function call.

After the 3.6 release, when things changed to wordcode instead of bytecode, the generated code changed to:

                 0  LOAD_FAST                 'var1'
                 2  FORMAT_VALUE           0  ''
                 4  LOAD_CONST                'py36_string_interpolation'
                 6  LOAD_FAST                 'var2'
                 8  FORMAT_VALUE           0  ''
                10  BUILD_STRING           3  ''
                12  RETURN_VALUE

The BUILD_STRING opcode seems to have been added, and it replaces the function call to a string join.
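To check what your own interpreter generates, the standard `dis` module can list the instructions. This is only a sketch for exploration: offsets and some opcode names (in particular the one used for the formatting step) vary between Python versions, so the output will not necessarily match the listing above.

```python
import dis


def fn(var1, var2):
    return f'{var1}py36_string_interpolation{var2}'


# Collect (opcode name, argument value) pairs for the compiled function.
ops = [(ins.opname, ins.argval) for ins in dis.get_instructions(fn)]

for opname, argval in ops:
    print(f'{opname:<20} {argval!r}')
```

On an interpreter that uses BUILD_STRING, its argument should be 3 here: two formatted values plus one constant string.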

Note: bytecode generation can change while a new feature is being added, as happened here. For that reason we don't support intermediate, "dev", or "release-candidate" versions of Python.

In Python's AST, this construct is called a FormattedValue. We'll use formatted_value to be consistent with the existing grammar conventions. With this, we have:

   formatted_value ::= LOAD_FAST FORMAT_VALUE

The list that makes up a "Formatted String Literal" also contains constant strings, which the AST calls Str. You can see one above in the LOAD_CONST 'py36_string_interpolation' instruction. What we should do is add a transformation inside the ingester step to change LOAD_CONST to LOAD_STRING whenever the value loaded is a string. However, that's too much work for now, so we will use the more general expr instead of having a str.

Since this is currently for Python 3.6 only, we add these grammar rules to class Python36Parser, in the docstring of a method whose name starts with p_.
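The docstring convention can be sketched without the real parser. Everything below is a self-contained mock: `Python36ParserSketch`, `p_36_fstring_sketch`, and `collect_rules` are hypothetical names of mine, and the real Python36Parser relies on SPARK to gather rules, not on this helper. The two rules in the docstring are the ones from this page.

```python
class Python36ParserSketch:
    """Mock parser class; the real one derives from uncompyle6's parsers."""

    def p_36_fstring_sketch(self, args):
        """
        formatted_value ::= LOAD_FAST FORMAT_VALUE
        expr            ::= formatted_value
        """


def collect_rules(parser_class):
    """Gather grammar rules from docstrings of methods named p_*."""
    rules = []
    for name in sorted(dir(parser_class)):
        if not name.startswith('p_'):
            continue
        doc = getattr(parser_class, name).__doc__ or ''
        for line in doc.splitlines():
            line = line.strip()
            if line and not line.startswith('#'):
                lhs, rhs = line.split('::=')
                rules.append((lhs.strip(), tuple(rhs.split())))
    return rules


for lhs, rhs in collect_rules(Python36ParserSketch):
    print(lhs, '::=', ' '.join(rhs))
```

The point of the convention is that a grammar addition is just text: a new rule needs no new parsing code, only a docstring line in the right class.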

In the AST, you'll see that a list of formatted values and/or strings is combined into a list called JoinedStr, so let's make grammar rules for that:

   expr       ::= formatted_value
   joined_str ::= expr+ BUILD_STRING

And now I get to the first set of technical issues to discuss.

First, the simple SPARK parser doesn't have nice operators like | or grouping of grammar symbols. It does have a + and a *, but these can be applied only where there is a single nonterminal on the right-hand side.

So instead we need to write this as the more cumbersome:

exprs       ::=  expr+
joined_str  ::=  exprs BUILD_STRING

I originally had something like this, and it often worked, until I had a tuple that contained a formatted value as one of its items:

('a', 'b', f'{foo}')

The grammar doesn't separate the tuple entry boundaries from the joined_str boundary.

There is another more subtle problem with using expr+ which is that it can lead to exponential parsing time. We need to ensure that we keep grammar parsing efficient. See for details

So instead, when uncompyle6's "ingest" method sees the BUILD_STRING, it notices that the operand count is 3 and changes the opcode name to BUILD_STRING_3. Then, during parsing, a custom rule is added. For this example it would be:

   joined_str ::= expr expr expr BUILD_STRING_3

and to hook this into the rest of the grammar:

   expr ::= joined_str
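The two steps, renaming the token and generating a matching rule, can be sketched as follows. This is not uncompyle6's code: the `Token` tuple and both function names are hypothetical, and the real ingester and parser customization do considerably more. It only shows how the operand count becomes part of the token name so that the rule has a fixed number of expr items.

```python
from collections import namedtuple

# Hypothetical, simplified token: opcode name plus operand.
Token = namedtuple('Token', 'kind attr')


def specialize_build_string(tokens):
    """Rename BUILD_STRING to BUILD_STRING_<n>, where n is its operand."""
    out = []
    for tok in tokens:
        if tok.kind == 'BUILD_STRING':
            tok = tok._replace(kind='BUILD_STRING_%d' % tok.attr)
        out.append(tok)
    return out


def custom_joined_str_rules(tokens):
    """Create one joined_str rule per distinct BUILD_STRING_<n> seen."""
    rules = set()
    for tok in tokens:
        if tok.kind.startswith('BUILD_STRING_'):
            n = int(tok.kind.rsplit('_', 1)[1])
            rules.add('joined_str ::= %sBUILD_STRING_%d' % ('expr ' * n, n))
    return sorted(rules)


tokens = [
    Token('LOAD_FAST', 'var1'),
    Token('FORMAT_VALUE', 0),
    Token('LOAD_CONST', 'py36_string_interpolation'),
    Token('LOAD_FAST', 'var2'),
    Token('FORMAT_VALUE', 0),
    Token('BUILD_STRING', 3),
    Token('RETURN_VALUE', None),
]

specialized = specialize_build_string(tokens)
for rule in custom_joined_str_rules(specialized):
    print(rule)
```

Because each rule names exactly n expr items, the parser no longer has to guess where the joined string starts, which is what went wrong with the tuple example.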

Finally come the semantic rules, which take the AST and produce the right text.

The text for nonterminal formatted_value has braces around it, so we add it to TABLE_DIRECT as:

    'formatted_value':	( '{%c}', 0),

But this really works only if the FORMAT_VALUE has attribute value 0 (no conversion such as !r or !s, and no format specifier).
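For reference, here is a small sketch of how the FORMAT_VALUE operand can be decoded, following the flag layout described in the `dis` documentation for Python 3.6: the low two bits select the conversion and bit 0x04 says a format specifier is on the stack. The function name and the returned tuple shape are my own choices for illustration.

```python
# Low two bits of the FORMAT_VALUE operand select the conversion.
CONVERSIONS = {0: '', 1: '!s', 2: '!r', 3: '!a'}


def decode_format_value_flags(flags):
    """Return (conversion suffix, whether a format spec is on the stack)."""
    conversion = CONVERSIONS[flags & 0x03]
    has_format_spec = bool(flags & 0x04)
    return conversion, has_format_spec


for flags in (0, 1, 2, 3, 4, 6):
    print(flags, decode_format_value_flags(flags))
```

Only the first case, operand 0, is covered by the simple '{%c}' template.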

An interpolation rule to first approximation might be:

  'joined_str':	( "f'%C'", (0, -1, '') ),
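To make the two template entries concrete, here is a toy interpreter for just the %c and %C specifiers. It is a heavily simplified sketch and not uncompyle6's template engine: the `render` function, the tuple-based tree, the pass-through 'expr' entry, and the choice to print leaf tokens as bare text are all assumptions made for illustration. I am also assuming that %C takes (low, high, separator) and renders the children in the slice low:high, so (0, -1, '') skips the trailing BUILD_STRING_3.

```python
def render(node, table):
    """Render a (kind, children) tree; plain strings are leaf text."""
    if isinstance(node, str):
        return node
    kind, children = node
    template, *args = table[kind]
    out = []
    arg_index = 0
    i = 0
    while i < len(template):
        if template[i] == '%' and template[i + 1:i + 2] in ('c', 'C'):
            arg = args[arg_index]
            arg_index += 1
            if template[i + 1] == 'c':
                # %c: render the single child at the given index.
                out.append(render(children[arg], table))
            else:
                # %C: render a slice of children joined by a separator.
                low, high, sep = arg
                out.append(sep.join(render(c, table)
                                    for c in children[low:high]))
            i += 2
        else:
            out.append(template[i])
            i += 1
    return ''.join(out)


table = {
    'expr': ('%c', 0),  # pass-through entry, added for this sketch
    'formatted_value': ('{%c}', 0),
    'joined_str': ("f'%C'", (0, -1, '')),
}

tree = ('joined_str', [
    ('expr', [('formatted_value', ['var1', 'FORMAT_VALUE'])]),
    ('expr', ['py36_string_interpolation']),
    ('expr', [('formatted_value', ['var2', 'FORMAT_VALUE'])]),
    'BUILD_STRING_3',
])

result = render(tree, table)
print(result)
```

Note that the sketch sidesteps a real difficulty by printing the constant without quotes; a real constant string would normally be rendered with its own quotes.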

However, the above table rules are not quite right. Format strings can have braces and quotes in them, so those need to be escaped. Instead, we need special procedures called n_formatted_value() and n_joined_str() for this. Consult the code for the full details.
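The kind of escaping involved can be sketched like this. It is not the code in n_joined_str(): both function names and the (kind, text) piece representation are hypothetical, and the sketch ignores newlines, other quote styles, and format specifiers. It only shows the idea that literal pieces need their braces doubled and their quotes escaped, while formatted values get a single pair of braces.

```python
def escape_literal(text):
    """Escape a constant string piece for use inside f'...'."""
    # Backslashes first, then the quote character used by the literal.
    text = text.replace('\\', '\\\\').replace("'", "\\'")
    # Doubled braces are read as literal braces, not replacement fields.
    return text.replace('{', '{{').replace('}', '}}')


def render_joined_str(pieces):
    """pieces is a list of (kind, text); kind is 'formatted_value' or 'str'."""
    out = []
    for kind, text in pieces:
        if kind == 'formatted_value':
            out.append('{%s}' % text)
        else:
            out.append(escape_literal(text))
    return "f'" + ''.join(out) + "'"


simple = render_joined_str([
    ('formatted_value', 'var1'),
    ('str', 'py36_string_interpolation'),
    ('formatted_value', 'var2'),
])
tricky = render_joined_str([
    ('str', "it's {x}"),
    ('formatted_value', 'y'),
])
print(simple)
print(tricky)
```

A quick sanity check is to evaluate the generated source and compare it with the string the pieces were meant to produce.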