Use two call stacks instead of one. #675

markshannon · 2024-05-08T09:04:02Z

There are two common-ish ways to lay out call stacks in programming languages.

A single stack, where control and data are interleaved. The archetype language for this is C.
Two stacks, where the control is on one stack and the data is on another. The archetype language for this is Forth.

A single stack is by far the most common and is often faster as it only needs a single stack pointer.

So why consider a two stack approach?

It has advantages for profilers, as the control stack can be guaranteed to be a fixed-size array of fixed-size entries.
It provides freedom for the VM to put data where it can be accessed more efficiently, for example
It can be made as fast as the single stack approach with some careful placement of the control stack.

Adding a control stack without losing performance

Pointers

Currently, the interpreter and JIT maintain three pointers in registers:

Frame pointer. Points to the base of the current frame.
Stack pointer. Points to the top of the evaluation stack.
Thread pointer. Points to the current thread state.

We can add a control stack pointer without using another register with a memory layout trick.
By placing the control stack at the end of the thread state, and ensuring that the thread state is aligned on a power of two boundary,
the thread pointer can be found by zeroing the low bits of the control stack pointer.

tstate = control_pointer & ALIGNMENT_MASK
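A minimal sketch of the masking trick, under stated assumptions: the 64 KiB alignment, the `ThreadState` layout, and the helper names `thread_state_new` and `tstate_from_control_pointer` are all hypothetical and chosen only for illustration. Here `ALIGNMENT_MASK` is taken to be the mask with the low bits clear.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical alignment: 64 KiB, a power of two. */
#define TSTATE_ALIGNMENT ((uintptr_t)1 << 16)
/* Mask with the low 16 bits clear, so AND-ing zeroes the low bits. */
#define ALIGNMENT_MASK   (~(TSTATE_ALIGNMENT - 1))

typedef struct ControlFrame {
    void **frame_pointer;
} ControlFrame;

/* Hypothetical thread state: ordinary fields first, control stack last. */
typedef struct ThreadState {
    int id;                        /* stand-in for the real thread-state fields */
    ControlFrame control_stack[];  /* occupies the rest of the aligned block */
} ThreadState;

/* Allocate one aligned block holding the thread state and its control stack. */
static ThreadState *thread_state_new(int id) {
    /* C11 aligned_alloc: the size must be a multiple of the alignment. */
    ThreadState *ts = aligned_alloc(TSTATE_ALIGNMENT, TSTATE_ALIGNMENT);
    if (ts != NULL) {
        ts->id = id;
    }
    return ts;
}

/* Recover the thread state from any pointer into its control stack. */
static inline ThreadState *tstate_from_control_pointer(ControlFrame *cp) {
    return (ThreadState *)((uintptr_t)cp & ALIGNMENT_MASK);
}
```

This only works if the control pointer always refers to an entry inside the aligned block. A one-past-the-end pointer at the block boundary would mask to the wrong address, so the sketch assumes the pointer refers to the current entry.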

Calls and returns.

To make a call, the VM needs to fill in all the fields of a new _PyInterpreterFrame.
That will mostly not change, except:

There will be no need for the previous pointer, as the control stack is a contiguous array.
The control stack entry will need a copy of the frame pointer.
The recursion limit check can be made cheaper. Taking advantage of the memory alignment of the thread state,
the current recursion depth can be calculated without any memory reads and no writes are required to maintain it.

Overall, we save one write on calls, and one read and one write on returns.
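The points above can be sketched roughly as follows. This is an illustration, not the proposed implementation: `CONTROL_STACK_OFFSET`, the 64 KiB alignment, and the helpers `recursion_depth`, `control_push`, and `control_pop` are hypothetical names. The depth is derived from the low bits of the control pointer alone, and the caller's frame is found by stepping back one entry rather than by following a previous pointer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout constants, for illustration only. */
#define TSTATE_ALIGNMENT     ((uintptr_t)1 << 16)  /* thread state alignment */
#define CONTROL_STACK_OFFSET ((uintptr_t)256)      /* offset of entry 0 in the thread state */

typedef struct ControlFrame {
    void **frame_pointer;
} ControlFrame;

/* Depth is the index of the current entry, computed from pointer bits only:
 * no memory is read and no counter needs to be maintained. */
static inline size_t recursion_depth(const ControlFrame *cp) {
    uintptr_t offset = (uintptr_t)cp & (TSTATE_ALIGNMENT - 1);
    return (size_t)((offset - CONTROL_STACK_OFFSET) / sizeof(ControlFrame));
}

/* Call: check the limit, bump the control pointer, record the callee's frame.
 * Returns NULL if the call would exceed the limit. */
static inline ControlFrame *control_push(ControlFrame *cp, void **callee_frame,
                                         size_t limit) {
    if (recursion_depth(cp) + 1 > limit) {
        return NULL;
    }
    cp += 1;
    cp->frame_pointer = callee_frame;
    return cp;
}

/* Return: step back one entry; the entries are contiguous, so the caller's
 * frame pointer is simply the one stored in the entry below. */
static inline ControlFrame *control_pop(ControlFrame *cp, void ***caller_frame) {
    cp -= 1;
    *caller_frame = cp->frame_pointer;
    return cp;
}
```

In a real interpreter the limit check could presumably be reduced to a comparison of the masked offset against a constant, but that refinement is not shown here.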

The control stack

The simplest control stack entry is:

struct ControlFrame {
    PyObject **frame_pointer; /* Contains the frame pointer for the frame */
};

Adding the code object to the control stack makes life easier for profilers, especially out of process profilers, but slows
down entry into generators a tad as the code pointer will need to be copied.

struct ControlFrame {
    PyObject **frame_pointer; /* Contains the frame pointer for the frame */
    PyCodeObject *code; /* The code object for the currently executing function or generator */
};

Another possibility is to move all the control data onto the control stack. Having nothing but (possibly NULL) object references on the data stack should help simplify the code, and simple code is often faster code. However, it means extra copying when entering a generator.

struct ControlFrame {
    PyObject **frame_pointer; /* Contains the frame pointer for the frame */
    PyCodeObject *code; /* The code object for the currently executing function or generator */
    _Py_CODEUNIT *instr_ptr; /* Instruction currently executing (or about to begin) */
    int stacktop;  /* Offset of TOS from localsplus  */
    uint16_t return_offset;  /* Only relevant during a function call */
};
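To illustrate why a fixed-size array of fixed-size entries helps profilers, here is a rough sketch of an in-process sampler walking a control stack that stores the code object. `CodeStub` is a hypothetical stand-in for PyCodeObject, and `sample_stack` is an invented helper; an out-of-process profiler would read the same array from the target's memory instead.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for PyCodeObject: just a name, for illustration. */
typedef struct CodeStub {
    const char *name;
} CodeStub;

typedef struct ControlFrame {
    void **frame_pointer;  /* frame pointer for this frame */
    const CodeStub *code;  /* code object for this frame */
} ControlFrame;

/* Collect code names from the innermost frame (top) outwards to base.
 * Writes at most max_out names and returns how many were written.
 * No pointer chasing between frames: the entries are contiguous. */
static size_t sample_stack(const ControlFrame *base, const ControlFrame *top,
                           const char **out, size_t max_out) {
    size_t depth = (size_t)(top - base) + 1;
    size_t n = depth < max_out ? depth : max_out;
    for (size_t i = 0; i < n; i++) {
        out[i] = (top - i)->code->name;
    }
    return n;
}
```

The walk never touches the data stack, so it does not depend on the layout of locals or the evaluation stack.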

Note: I'm ignoring the C stack in the above discussion. If you want to consider that as well, then we are moving from two to three stacks.


gvanrossum · 2024-05-08T15:31:30Z

Could we look at other VMs to see what they do? (Even if it's unspecified, what does the implementation do?) What does the JVM do? Or v8? Or, perhaps WASM?

Fidget-Spinner · 2024-05-08T19:26:47Z

This is what I was suggesting in #657. But I guess I didn't do a good job explaining it, and I had no alignment tricks there.

Note, the two call stack approach is critical to getting true function inlining deopts working without breaking C profilers. I need a way to reconstruct the stack without inconsistency in VM state. If we can get this to work without perf loss on tier 1, it would be ideal.

The easiest layout for reconstruction would put everything outside of localsplus in the control stack, including f_globals and friends, while the data stack would hold purely operand stack entries (localsplus and stack).

Fidget-Spinner · 2024-08-22T16:39:04Z

I'm assigning this to myself to work on.

gvanrossum · 2024-08-22T16:48:26Z

I take it you discussed this with Mark and he approves? (He should, he likes tricks like this. :-)

Fidget-Spinner · 2024-08-22T16:49:42Z

> I take it you discussed this with Mark and he approves? (He should, he likes tricks like this. :-)

No I haven't discussed this with Mark, but I will implement it according to Mark's proposal here, so I hope he likes it.

Fidget-Spinner · 2024-08-22T16:59:11Z

I thought about this a bit more and I realised I don't need to implement this for out-of-process profilers to work. I thought of yet another hack to make them work with full function inlining.

markshannon · 2024-08-22T17:25:15Z

I'm intrigued about your new approach...

markshannon mentioned this issue Jun 26, 2024

gh-119786: move frames documentation to InternalDocs and add details python/cpython#121009

Merged

Fidget-Spinner self-assigned this Aug 22, 2024

Fidget-Spinner removed their assignment Aug 22, 2024
