Speed up frame handling in Python-to-Python calls. #111

Closed
markshannon opened this issue Nov 8, 2021 · 6 comments

@markshannon
Member

For Python-to-Python calls we avoid consuming the C stack by making the call within the already running _PyEval_EvalFrameDefault function, rather than recursing into it.
However, the handling of frames is not as efficient as it could be.

Tightening this up would have a few benefits:

  1. Speed up Python-to-Python calls (probably by only a small amount)
  2. Allow cleanup frames to be inserted cheaply enough that it becomes worthwhile to specialize calls to Python special methods that need cleanup after the call (__init__, __setitem__, etc.)
  3. Allow artificial frames to be inserted cheaply for compiled code that wants to have nice tracebacks and debuggability (e.g. Cython code).

In order to speed up frame handling we need to reduce the amount of work done in pushing the frame, and when clearing the frame.

The frame consists of three parts:

  1. The "specials": code object, globals, builtins and (slow) locals, link pointers and saved offsets for calls.
  2. The local variables area.
  3. The (evaluation) stack.

The stack is empty on both entry and exit, so has no cost apart from setting the stacktop on entry. This is about as efficient as it can be.

The use of local variables could be tracked in the compiler to create a bitmap describing which locals need to be cleared on exit. However, without a lot of additional work in the compiler, the bitmap will not be precise, so we would gain little from it.
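
To make the bitmap idea concrete, here is a minimal sketch, assuming at most 64 locals. `Obj` is a stand-in for a reference-counted object and `clear_locals` is a hypothetical helper; neither is CPython code. Because an imprecise bitmap has to over-approximate, the NULL check stays, which is part of why the gain would be small.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for a reference-counted object; not CPython's PyObject. */
typedef struct { int refcnt; } Obj;

/* Hypothetical helper: release only the locals whose bit is set in
 * `live_mask` (bit i set means local i may hold a reference on exit).
 * Returns the number of slots that had to be visited. */
static int clear_locals(Obj **locals, uint64_t live_mask)
{
    int visited = 0;
    for (int i = 0; live_mask != 0; i++, live_mask >>= 1) {
        if ((live_mask & 1) == 0) {
            continue;               /* compiler proved this slot is empty */
        }
        visited++;
        if (locals[i] != NULL) {    /* still needed: the bitmap is imprecise */
            locals[i]->refcnt--;
            locals[i] = NULL;
        }
    }
    return visited;
}
```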

That leaves the specials. Most of the cost is in initializing and clearing the four fields:

    PyObject *f_globals;
    PyObject *f_builtins;
    PyObject *f_locals;
    PyCodeObject *f_code;

Not only do these need to be copied from the function on entry, they each need an INCREF on entry and (more expensively) a DECREF on exit. Combining them into a single object would save this work on call and return.

/* Proposed new object holding the four fields */
typedef struct _frame_scopes {
    PyObject_HEAD
    PyObject *f_globals;
    PyObject *f_builtins;
    PyObject *f_locals;
    PyCodeObject *f_code;
} PyFrameScopes;

/* Current start of the frame */
typedef struct _interpreter_frame {
    PyObject *f_globals;
    PyObject *f_builtins;
    PyObject *f_locals;
    PyCodeObject *f_code;
    ...

Would become

typedef struct _interpreter_frame {
    PyFrameScopes *scopes;
    ...

and initializing the "specials" part of the frame would become considerably cheaper, and use less space.
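
A rough sketch of the saving, using stand-in types rather than CPython's: `Obj`, `Scopes`, `OldFrame`, `NewFrame` and the push/pop helpers are all hypothetical names, and `refcount_ops` exists only to count calls. With separate fields, a push and pop touch four refcounts each; with a combined object, they touch only the refcount of the combined object, and the fields it contains are left alone.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for a reference-counted object; not CPython's PyObject. */
typedef struct { int refcnt; } Obj;

/* Counts refcount operations so the two layouts can be compared. */
static int refcount_ops = 0;
static void incref(Obj *o) { refcount_ops++; if (o) o->refcnt++; }
static void decref(Obj *o) { refcount_ops++; if (o) o->refcnt--; }

/* Hypothetical combined object, modelled on PyFrameScopes. */
typedef struct {
    Obj base;          /* the combined object's own refcount */
    Obj *f_globals;
    Obj *f_builtins;
    Obj *f_locals;     /* NULL for ordinary functions */
    Obj *f_code;
} Scopes;

typedef struct { Obj *f_globals, *f_builtins, *f_locals, *f_code; } OldFrame;
typedef struct { Scopes *scopes; } NewFrame;

/* Current scheme: copy and INCREF each field, DECREF each on exit. */
static void old_push(OldFrame *f, const Scopes *src)
{
    f->f_globals = src->f_globals;   incref(f->f_globals);
    f->f_builtins = src->f_builtins; incref(f->f_builtins);
    f->f_locals = src->f_locals;     incref(f->f_locals);
    f->f_code = src->f_code;         incref(f->f_code);
}

static void old_pop(OldFrame *f)
{
    decref(f->f_globals);
    decref(f->f_builtins);
    decref(f->f_locals);
    decref(f->f_code);
}

/* Proposed scheme: one pointer copy and one INCREF/DECREF pair. */
static void new_push(NewFrame *f, Scopes *s) { f->scopes = s; incref(&s->base); }
static void new_pop(NewFrame *f) { decref(&f->scopes->base); f->scopes = NULL; }
```

In this model a push/pop pair costs eight refcount operations with the current layout and two with the combined object.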

There are some downsides to creating this object, however:

  1. Extra complexity and overhead when creating a function (possibly negatively impacting the performance of creating a closure)
  2. Changing the unstable API. We would need to move the PyFunctionObject to the internal headers to make that explicit.
  3. Additional overhead for LOAD_GLOBAL due to the extra indirection. Hopefully the cost of the extra memory load in LOAD_GLOBAL will be outweighed by saving many indirections and branches in each call.
  4. Although f_locals is always NULL for functions, it is non-NULL and cannot be shared when executing module or class level code. Each call to module or class level code would need a new PyFrameScopes to be created.

@JunyiXie

> Not only do these need to be copied from the function on entry, they each need an INCREF on entry and (more expensively) a DECREF on exit. Combining them into a single object would save this work on call and return.

Even after combining them into a single object, I think these fields still need an INCREF and a DECREF. Is this work really saved?

@brandtbucher
Member

> Even after combining them into a single object, I think these fields still need an INCREF and a DECREF. Is this work really saved?

I don't think that's true. For example, changing the refcount of a list doesn't modify the refcounts of each of its contained items (that only happens when the list is created or destroyed). Same idea here.

@markshannon
Member Author

The PyFunctionObject struct is (debatably) part of the C-API. So this will have to wait until it isn't, or we find some workaround.

@markshannon
Member Author

markshannon commented Apr 17, 2022

An alternative approach, which doesn't require a new object, is as follows:

New frame layout

  • specials (just func and code; globals and builtins will be accessed through func)
  • fast locals
  • linkage and f_locals
  • stack

Placing the code object first allows us to find the stack base without needing an extra register.

Changes to the bytecode

The advantage of this layout becomes clear if we do two extra things:

  1. Push an additional NULL in the calling sequence
  2. Set the oparg of YIELD_VALUE to the stack depth.

Call sequence

This gives us efficient calls.

For calls, the top of the stack looks like:

  • NULL
  • func
  • arg 0
  • ...

By setting the NULL to code, we have created the specials and arguments of the callee frame in place, without any copying.
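
As a sketch of that in-place construction, the frame can be modelled as a run of pointer-sized slots on the caller's stack. `Code`, `Func`, the `FRAME_*` indices and `enter_callee` are hypothetical names for illustration, not CPython's.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins for the code and function objects; not CPython's types. */
typedef struct { int id; } Code;
typedef struct { Code *code; } Func;

/* Hypothetical slot indices at the start of the callee frame. */
enum { FRAME_CODE = 0, FRAME_FUNC = 1, FRAME_LOCALS = 2 };

/* `stack_pointer` points one past the last argument, and `nargs`
 * arguments sit above the [NULL, func] pair pushed by the caller.
 * Returns the callee frame, which overlays the caller's stack slots. */
static void **enter_callee(void **stack_pointer, int nargs)
{
    void **frame = stack_pointer - nargs - 2;
    assert(frame[FRAME_CODE] == NULL);      /* the extra NULL from the call sequence */
    Func *func = (Func *)frame[FRAME_FUNC];
    frame[FRAME_CODE] = func->code;         /* the only write needed */
    return frame;                           /* arguments are already the first locals */
}
```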

Return and yield sequence

To return or yield we need to access the linkage section, for which we need to access the stack base.
For returns, the stack is empty, so stack_base = stack_pointer.
For yields, we store the stack depth in oparg, so stack_base = stack_pointer - oparg.

We can avoid any copying and reduce stack consumption when calling by inserting code into the NULL slot.
When returning we can quickly find the linkage section as it is directly under the stack. For returns, the stack will be empty. For yields, we know the stack depth from the oparg.
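
The two cases can be sketched as follows, again treating the frame as an array of pointer-sized slots. `LINKAGE_WORDS` and the helper names are assumptions made for the example; the real size of the linkage section is not stated here.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed size of the linkage section, in pointer-sized slots. */
enum { LINKAGE_WORDS = 2 };

/* The linkage section sits directly under the evaluation stack. */
static void **linkage_from_base(void **stack_base)
{
    return stack_base - LINKAGE_WORDS;
}

/* On return the stack is empty, so the stack pointer is the stack base. */
static void **linkage_on_return(void **stack_pointer)
{
    return linkage_from_base(stack_pointer);
}

/* On yield the oparg carries the stack depth. */
static void **linkage_on_yield(void **stack_pointer, int oparg)
{
    return linkage_from_base(stack_pointer - oparg);
}
```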

@markshannon
Member Author

The additional complexity of the alternative layout described in #111 (comment) seems to be causing a slowdown, not a speedup.

@markshannon
Member Author

We seem to have run out of obvious improvements to frame layout.

The only enhancement I can think of is to move the "slow" locals into a "fast" local, and access it via LOAD_FAST x; GET_DICT_ENTRY instead of LOAD_NAME.
That would require fairly extensive changes to the compiler and introspection code, and would need PEP 667 to be implemented.
That is a lot of work to save 1 word.

A possibly cheaper alternative would be to track whether f_locals is initialized in the flag bits, to save a memory write when calling a Python function. This is probably too fragile to make sense for such a small performance advantage.
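
A minimal sketch of what that would look like, assuming the frame has a flags word that is written on every push anyway. The `Frame` struct, the `FRAME_HAS_LOCALS_DICT` bit and the helpers are hypothetical. It also shows where the fragility comes from: any code that reads f_locals without checking the flag would see a stale pointer.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flag bit: set once f_locals holds a valid value. */
enum { FRAME_HAS_LOCALS_DICT = 1 << 0 };

typedef struct {
    unsigned flags;    /* assumed to be written on every frame push anyway */
    void *f_locals;    /* only meaningful when FRAME_HAS_LOCALS_DICT is set */
} Frame;

/* Pushing a frame leaves f_locals unwritten, saving one store. */
static void push_frame(Frame *f, unsigned initial_flags)
{
    f->flags = initial_flags & ~(unsigned)FRAME_HAS_LOCALS_DICT;
}

/* Every reader has to go through this check. */
static void *get_locals(const Frame *f)
{
    return (f->flags & FRAME_HAS_LOCALS_DICT) ? f->f_locals : NULL;
}

static void set_locals(Frame *f, void *locals)
{
    f->f_locals = locals;
    f->flags |= FRAME_HAS_LOCALS_DICT;
}
```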

Overall, I think we should just call this "done", and work on other stuff.
