Speed up frame handling in Python-to-Python calls. #111

markshannon · 2021-11-08T13:02:31Z

For Python-to-Python calls we avoid consuming the C stack by making the call within the _PyEval_EvalFrameDefault function.
However, the handling of frames is not as efficient as it could be.

Tightening this up would have a few benefits:

Speed up Python-to-Python calls (probably by only a small amount).
Allow cleanup frames to be inserted cheaply enough to make specialization worthwhile for calls to Python special methods that need cleanup when called from a specialized instruction (__init__, __setitem__, etc.).
Allow artificial frames to be inserted cheaply for compiled code that wants to have nice tracebacks and debuggability (e.g. Cython code).

In order to speed up frame handling we need to reduce the amount of work done in pushing the frame, and when clearing the frame.

The frame consists of three parts:

The "specials": code object, globals, builtins and (slow) locals, link pointers and saved offsets for calls.
The local variables area.
The (evaluation) stack.

The stack is empty on both entry and exit, so has no cost apart from setting the stacktop on entry. This is about as efficient as it can be.

The use of local variables could be tracked in the compiler to create a bitmap describing which locals need to be cleared on exit. However, without a lot of additional work in the compiler, the bitmap will not be precise, so we would gain little from it.
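To make the bitmap idea concrete, here is a minimal sketch in Python rather than C. The helper names `build_clear_bitmap` and `clear_locals` are hypothetical and are not part of CPython; the list stands in for the fast locals array and `None` stands in for a NULL slot.

```python
def build_clear_bitmap(maybe_assigned):
    """Return an int with bit i set if local i may hold a reference on exit."""
    bitmap = 0
    for index in maybe_assigned:
        bitmap |= 1 << index
    return bitmap


def clear_locals(fast_locals, bitmap):
    """Clear only the locals flagged in the bitmap; return how many were cleared."""
    cleared = 0
    index = 0
    while bitmap:
        if bitmap & 1:
            fast_locals[index] = None  # models Py_CLEAR on one slot
            cleared += 1
        bitmap >>= 1
        index += 1
    return cleared
```

If the compiler cannot prove which locals are unassigned, every bit ends up set and the loop does the same work as clearing all locals, which is the imprecision the paragraph above warns about.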

That leaves the specials. Most of the cost is in initializing and clearing the four fields:

    PyObject *f_globals;
    PyObject *f_builtins;
    PyObject *f_locals;
    PyCodeObject *f_code;

Not only do these need to be copied from the function on entry, they each need an INCREF on entry and (more expensively) a DECREF on exit. Combining them into a single object would save this work on call and return.

typedef struct _frame_scopes {
    PyObject_HEAD
    PyObject *f_globals;
    PyObject *f_builtins;
    PyObject *f_locals;
    PyCodeObject *f_code;
} PyFrameScopes;

typedef struct _interpreter_frame {
    PyObject *f_globals;
    PyObject *f_builtins;
    PyObject *f_locals;
    PyCodeObject *f_code;
    ...

Would become

typedef struct _interpreter_frame {
    PyFrameScopes *scopes;
    ...

and initializing the "specials" part of the frame would become considerably cheaper, and use less space.
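The saving can be illustrated with a toy model that simply counts reference-count operations per call. This is a sketch in Python, not the C implementation: `Obj`, `call_with_four_fields` and `call_with_scopes` are hypothetical names, and the model assumes all four fields are non-NULL.

```python
class Obj:
    """Toy stand-in for a reference-counted C object."""

    def __init__(self, name):
        self.name = name
        self.refcount = 1


def incref(obj, stats):
    obj.refcount += 1
    stats["incref"] += 1


def decref(obj, stats):
    obj.refcount -= 1
    stats["decref"] += 1


def call_with_four_fields(fields):
    """Model a call where the frame holds globals, builtins, locals and code."""
    stats = {"incref": 0, "decref": 0}
    frame = list(fields)       # copy four pointers into the frame
    for obj in frame:          # frame entry: one INCREF per field
        incref(obj, stats)
    for obj in frame:          # frame exit: one DECREF per field
        decref(obj, stats)
    return stats


def call_with_scopes(scopes):
    """Model a call where the frame holds a single pointer to a scopes object."""
    stats = {"incref": 0, "decref": 0}
    incref(scopes, stats)      # frame entry: one INCREF in total
    decref(scopes, stats)      # frame exit: one DECREF in total
    return stats
```

In this model each call costs four INCREF/DECREF pairs with separate fields and one pair with the combined object. The fields inside the combined object are only touched when that object is created or destroyed.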

There are some downsides to creating this object, however:

Extra complexity and overhead when creating a function (possibly negatively impacting the performance of creating a closure)
Changing the unstable API. We would need to move the PyFunctionObject to the internal headers to make that explicit.
Additional overhead for LOAD_GLOBAL due to the extra indirection. Hopefully the cost of the extra memory load in LOAD_GLOBAL will be outweighed by saving many indirections and branches in each call.
Although f_locals is always NULL for functions, it is non-NULL and cannot be shared when executing module or class level code. Each call to module or class level code would need a new PyFrameScopes to be created.


JunyiXie · 2021-11-10T10:55:01Z

> Not only do these need to be copied from the function on entry, they each need an INCREF on entry and (more expensively) a DECREF on exit. Combining them into a single object would save this work on call and return.

Even when they are combined into a single object, I think these fields still need an INCREF and a DECREF. Is this work really saved?

brandtbucher · 2021-11-10T21:39:31Z

> Even when they are combined into a single object, I think these fields still need an INCREF and a DECREF. Is this work really saved?

I don't think that's true. For example, changing the refcount of a list doesn't modify the refcounts of each of its contained items (that only happens when the list is created or destroyed). Same idea here.
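The list analogy can be checked directly. The sketch below relies on `sys.getrefcount`, so the exact numbers are specific to CPython; `refcount_deltas` is a hypothetical helper name used only for this illustration.

```python
import sys


def refcount_deltas(n_aliases=1000):
    """Add n_aliases references to a list and report how refcounts changed.

    Returns (change in the item's refcount, change in the list's refcount).
    """
    item = object()
    container = [item]
    item_before = sys.getrefcount(item)
    container_before = sys.getrefcount(container)

    # Each element of `aliases` is a new reference to the same list object.
    aliases = [container] * n_aliases

    item_delta = sys.getrefcount(item) - item_before
    container_delta = sys.getrefcount(container) - container_before
    return item_delta, container_delta
```

On CPython the list gains `n_aliases` references while the contained item's count does not change, which is the behaviour a single combined scopes object would share.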

markshannon · 2021-11-16T11:33:19Z

The PyFunctionObject struct is (debatably) part of the C-API. So this will have to wait until it isn't, or we find some workaround.

markshannon · 2022-04-17T15:44:02Z

An alternative approach, which doesn't require a new object, is as follows:

New frame layout

specials (just func and code; globals and builtins will be accessed through func)
fast locals
linkage and f_locals
stack

Placing the code object first allows us to find the stack base without needing an extra register.

Changes to the bytecode

The advantage of this layout becomes clear if we do two extra things:

Push an additional NULL in the calling sequence
Set the oparg of YIELD_VALUE to the stack depth.

Call sequence

This gives us efficient calls.

For calls, the top of the stack looks like:

NULL
func
arg 0
...

By overwriting the NULL with the code object, we create the specials and arguments of the callee frame in place, without any copying.
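The in-place construction can be modelled with a Python list standing in for the contiguous stack memory. This is only a sketch of the idea: `push_call` and `enter_callee_in_place` are hypothetical names, and `None` plays the role of NULL.

```python
NULL = None  # stand-in for a C NULL pointer


def push_call(stack, func, args):
    """Caller side: push NULL, the function, then the arguments.

    Returns the index of the NULL slot, which becomes the callee's frame base.
    """
    frame_base = len(stack)
    stack.append(NULL)
    stack.append(func)
    stack.extend(args)
    return frame_base


def enter_callee_in_place(stack, frame_base):
    """Callee side: turn the pushed values into the start of the new frame.

    Only the NULL slot is written; func and the arguments stay where they are
    and become the remaining special and the first fast locals.
    """
    func = stack[frame_base + 1]
    stack[frame_base] = func.__code__
    return frame_base
```

After `enter_callee_in_place` the slots starting at the frame base read code, func, arg 0, and so on, matching the "specials, fast locals" prefix of the new layout, and the stack has not grown.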

Return and yield sequence

To return or yield we need to access the linkage section, for which we need to access the stack base.
For returns, the stack is empty, so stack_base = stack_pointer.
For yields, we store the stack depth in oparg, so stack_base = stack_pointer - oparg.

We can avoid any copying and reduce stack consumption when calling by storing the code object in the NULL slot.
When returning we can quickly find the linkage section as it is directly under the stack. For returns, the stack will be empty. For yields, we know the stack depth from the oparg.
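The stack base arithmetic can be sketched as follows, using list indices in place of pointers. `LINKAGE_SIZE` and the function names are assumptions made for this illustration; the thread does not say how many words the linkage section occupies.

```python
LINKAGE_SIZE = 3  # hypothetical: previous frame, saved offset, f_locals


def stack_base_on_return(stack_pointer):
    """On return the evaluation stack is empty, so the base is the pointer."""
    return stack_pointer


def stack_base_on_yield(stack_pointer, oparg):
    """On yield, oparg carries the stack depth recorded by the compiler."""
    return stack_pointer - oparg


def linkage_of(memory, stack_base):
    """The linkage section sits directly under the evaluation stack."""
    return memory[stack_base - LINKAGE_SIZE:stack_base]
```

For example, with a frame laid out as code, func, two locals, three linkage words, and two values on the evaluation stack, the stack pointer is 9 and a YIELD_VALUE with oparg 2 recovers a stack base of 7, from which the linkage words are the three slots just below.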

markshannon · 2022-06-13T11:51:26Z

The additional complexity of #111 (comment) seems to be causing a slowdown, not a speedup.

markshannon · 2023-03-16T16:49:53Z

We seem to have run out of obvious improvements to frame layout.

The only enhancement I can think of is to move the "slow" locals into a "fast" local, and access it via LOAD_FAST x; GET_DICT_ENTRY instead of LOAD_NAME.
That would require fairly extensive changes to the compiler and introspection code, and would need PEP 667 to be implemented.
That is a lot of work to save 1 word.

A possibly cheaper alternative would be to track whether f_locals is initialized in the flags bits, to save a memory write when calling a Python function. This is probably too fragile to make sense for such a small performance advantage.

Overall, I think we should just call this "done", and work on other stuff.

markshannon added the deferred label Nov 16, 2021

gramster added this to Fancy CPython Board Jan 10, 2022

gramster moved this to Todo in Fancy CPython Board Jan 10, 2022

gramster moved this from Todo to Other in Fancy CPython Board Jan 10, 2022

gramster moved this from Other to Todo in Fancy CPython Board Jan 24, 2022

markshannon mentioned this issue Apr 6, 2022

bpo-46543: add sys._getcaller python/cpython#30950

Closed

markshannon mentioned this issue Sep 15, 2022

Expose _PyInterpreterFrame_GetLine in the private API python/cpython#96803

Open

markshannon closed this as completed Mar 16, 2023

github-project-automation bot moved this from Todo to Done in Fancy CPython Board Mar 16, 2023
