Define opcodes in terms of smaller "micro-ops" #454
I think the micro-ops you propose are too low level and try to do several different things:
It also requires local variables, which moves it into the space of an IR similar to LLVM's. From your proof of concept,
We need to bottom out somewhere, and the three primitives
The above are amenable to optimization. Handling tracing, dispatching, etc. are the responsibility of the code generator.
Here's another example,
If we are willing to have non-objects on the stack (which we can, provided the code generator ensures that they never escape),
Note on stack effects.
It is the job of the code generator to make sure that the stack and instruction stream are unchanged after a
Not accomplishing this first bullet would feel like a huge missed opportunity to me in terms of tooling/testing/maintenance, but I admit that my post probably overstated the value of this relative to other, more important design goals.
I mostly agree with your second bullet: parts of the instruction preamble and dispatch sequence can probably be omitted when actually specifying the uop sequences for generation. As you say, "handling tracing, dispatching, etc are the responsibility of the code generator". I included them in the prototype because, well, I am the code generator in this case. ;) If you remove some of that fluff, though, our uop specifications are quite similar. As you point out, the major differences are that you want to avoid local temporaries and explicit stack manipulations in favor of using just an abstract stack that is manipulated directly by many diverse uops (similar to our current instruction set). On the surface, those both seem less desirable to me, in terms of both maintenance and usefulness. As a simple, concrete example, here is how a couple of simple passes might optimize a recorded trace (or superinstruction) for the statement c = a + b:

Uop concatenation

PyObject *value, *left, *right, *sum, *tmp;
// LOAD_FAST:
UOP_JUMP(1);
UOP_WRITE_PREV_INSTR();
UOP_GET_FAST(value, 0);
UOP_STACK_ADJUST(1);
UOP_STACK_SET(1, value);
UOP_INCREF(value);
// LOAD_FAST:
UOP_JUMP(1);
UOP_WRITE_PREV_INSTR();
UOP_GET_FAST(value, 1);
UOP_STACK_ADJUST(1);
UOP_STACK_SET(1, value);
UOP_INCREF(value);
// BINARY_OP:
UOP_JUMP(1);
UOP_WRITE_PREV_INSTR();
UOP_STACK_GET(left, 2);
UOP_STACK_GET(right, 1);
UOP_BINARY_OP(sum, NB_ADD, left, right);
UOP_STACK_SET(2, sum);
UOP_DECREF(left);
UOP_DECREF(right);
UOP_STACK_ADJUST(-1);
UOP_ERROR_IF_NULL(sum);
UOP_JUMP(INLINE_CACHE_ENTRIES_BINARY_OP);
// STORE_FAST:
UOP_JUMP(1);
UOP_WRITE_PREV_INSTR();
UOP_STACK_GET(value, 1);
UOP_STACK_ADJUST(-1);
UOP_GET_FAST(tmp, 2);
UOP_STORE_FAST(2, value);
UOP_XDECREF(tmp);

SSA-ification

PyObject *value_0, *value_1, *left, *right, *sum, *value_2, *tmp;
// LOAD_FAST:
UOP_JUMP(1);
UOP_WRITE_PREV_INSTR();
UOP_GET_FAST(value_0, 0);
UOP_STACK_ADJUST(1);
UOP_STACK_SET(1, value_0);
UOP_INCREF(value_0);
// LOAD_FAST:
UOP_JUMP(1);
UOP_WRITE_PREV_INSTR();
UOP_GET_FAST(value_1, 1);
UOP_STACK_ADJUST(1);
UOP_STACK_SET(1, value_1);
UOP_INCREF(value_1);
// BINARY_OP:
UOP_JUMP(1);
UOP_WRITE_PREV_INSTR();
UOP_STACK_GET(left, 2);
UOP_STACK_GET(right, 1);
UOP_BINARY_OP(sum, NB_ADD, left, right);
UOP_STACK_SET(2, sum);
UOP_DECREF(left);
UOP_DECREF(right);
UOP_STACK_ADJUST(-1);
UOP_ERROR_IF_NULL(sum);
UOP_JUMP(INLINE_CACHE_ENTRIES_BINARY_OP);
// STORE_FAST:
UOP_JUMP(1);
UOP_WRITE_PREV_INSTR();
UOP_STACK_GET(value_2, 1);
UOP_STACK_ADJUST(-1);
UOP_GET_FAST(tmp, 2);
UOP_STORE_FAST(2, value_2);
UOP_XDECREF(tmp);

Stack-effect elimination

PyObject *value_0, *value_1, *left, *right, *sum, *value_2, *tmp;
// LOAD_FAST:
UOP_JUMP(1);
UOP_WRITE_PREV_INSTR();
UOP_GET_FAST(value_0, 0);
UOP_STACK_SET(0, value_0);
UOP_INCREF(value_0);
// LOAD_FAST:
UOP_JUMP(1);
UOP_WRITE_PREV_INSTR();
UOP_GET_FAST(value_1, 1);
UOP_STACK_SET(-1, value_1);
UOP_INCREF(value_1);
// BINARY_OP:
UOP_JUMP(1);
UOP_WRITE_PREV_INSTR();
UOP_STACK_GET(left, 0);
UOP_STACK_GET(right, -1);
UOP_BINARY_OP(sum, NB_ADD, left, right);
UOP_STACK_SET(0, sum);
UOP_DECREF(left);
UOP_DECREF(right);
UOP_ERROR_IF_NULL(sum);
UOP_JUMP(INLINE_CACHE_ENTRIES_BINARY_OP);
// STORE_FAST:
UOP_JUMP(1);
UOP_WRITE_PREV_INSTR();
UOP_STACK_GET(value_2, 0);
UOP_GET_FAST(tmp, 2);
UOP_STORE_FAST(2, value_2);
UOP_XDECREF(tmp);

Stack elimination

PyObject *value_0, *value_1, *sum, *tmp;
// LOAD_FAST:
UOP_JUMP(1);
UOP_WRITE_PREV_INSTR();
UOP_GET_FAST(value_0, 0);
UOP_INCREF(value_0);
// LOAD_FAST:
UOP_JUMP(1);
UOP_WRITE_PREV_INSTR();
UOP_GET_FAST(value_1, 1);
UOP_INCREF(value_1);
// BINARY_OP:
UOP_JUMP(1);
UOP_WRITE_PREV_INSTR();
UOP_BINARY_OP(sum, NB_ADD, value_0, value_1);
UOP_DECREF(value_0);
UOP_DECREF(value_1);
UOP_ERROR_IF_NULL(sum);
UOP_JUMP(INLINE_CACHE_ENTRIES_BINARY_OP);
// STORE_FAST:
UOP_JUMP(1);
UOP_WRITE_PREV_INSTR();
UOP_GET_FAST(tmp, 2);
UOP_STORE_FAST(2, sum);
UOP_XDECREF(tmp);

Refcount elimination

PyObject *value_0, *value_1, *sum, *tmp;
// LOAD_FAST:
UOP_JUMP(1);
UOP_WRITE_PREV_INSTR();
UOP_GET_FAST(value_0, 0);
// LOAD_FAST:
UOP_JUMP(1);
UOP_WRITE_PREV_INSTR();
UOP_GET_FAST(value_1, 1);
// BINARY_OP:
UOP_JUMP(1);
UOP_WRITE_PREV_INSTR();
UOP_BINARY_OP(sum, NB_ADD, value_0, value_1);
UOP_ERROR_IF_NULL(sum);
UOP_JUMP(INLINE_CACHE_ENTRIES_BINARY_OP);
// STORE_FAST:
UOP_JUMP(1);
UOP_WRITE_PREV_INSTR();
UOP_GET_FAST(tmp, 2);
UOP_STORE_FAST(2, sum);
UOP_XDECREF(tmp);

Frame write elimination

PyObject *value_0, *value_1, *sum, *tmp;
// LOAD_FAST:
UOP_JUMP(1);
UOP_GET_FAST(value_0, 0);
// LOAD_FAST:
UOP_JUMP(1);
UOP_GET_FAST(value_1, 1);
// BINARY_OP:
UOP_JUMP(1);
UOP_WRITE_PREV_INSTR();
UOP_BINARY_OP(sum, NB_ADD, value_0, value_1);
UOP_ERROR_IF_NULL(sum);
UOP_JUMP(INLINE_CACHE_ENTRIES_BINARY_OP);
// STORE_FAST:
UOP_JUMP(1);
UOP_WRITE_PREV_INSTR();
UOP_GET_FAST(tmp, 2);
UOP_STORE_FAST(2, sum);
UOP_XDECREF(tmp);

Jump elimination

PyObject *value_0, *value_1, *sum, *tmp;
// LOAD_FAST:
UOP_GET_FAST(value_0, 0);
// LOAD_FAST:
UOP_GET_FAST(value_1, 1);
// BINARY_OP:
UOP_JUMP(3);
UOP_WRITE_PREV_INSTR();
UOP_BINARY_OP(sum, NB_ADD, value_0, value_1);
UOP_ERROR_IF_NULL(sum);
// STORE_FAST:
UOP_JUMP(INLINE_CACHE_ENTRIES_BINARY_OP + 1);
UOP_WRITE_PREV_INSTR();
UOP_GET_FAST(tmp, 2);
UOP_STORE_FAST(2, sum);
UOP_XDECREF(tmp);

Temporary variable elimination

PyObject *sum, *tmp;
// LOAD_FAST:
// LOAD_FAST:
// BINARY_OP:
UOP_JUMP(3);
UOP_WRITE_PREV_INSTR();
UOP_BINARY_OP(sum, NB_ADD, localsplus[0], localsplus[1]);
UOP_ERROR_IF_NULL(sum);
// STORE_FAST:
UOP_JUMP(INLINE_CACHE_ENTRIES_BINARY_OP + 1);
UOP_WRITE_PREV_INSTR();
UOP_GET_FAST(tmp, 2);
UOP_STORE_FAST(2, sum);
UOP_XDECREF(tmp);

Or, if we know that c is unbound:

// LOAD_FAST:
// LOAD_FAST:
// BINARY_OP:
UOP_JUMP(3);
UOP_WRITE_PREV_INSTR();
UOP_BINARY_OP(localsplus[2], NB_ADD, localsplus[0], localsplus[1]);
UOP_ERROR_IF_NULL(localsplus[2]);
// STORE_FAST:
UOP_JUMP(INLINE_CACHE_ENTRIES_BINARY_OP + 1);

The steps for analyzing and optimizing these uops are simple and intuitive to me, and hopefully others. I tried working through the same example with a uop specification closer to what you suggest, and each step seems so much more difficult. I won't straw-man by sharing my attempts, but it seems like it requires either significantly increasing the number of uops, emitting a final result that's less optimized than the one I worked out above, or adding significant complexity to the various passes to handle all of the different possible uops (rather than a tiny subset specific to that pass).
Just a minor note: this looks like a typo, since it includes both old and new code. Maybe copied-and-pasted from a unified diff of the changes to
Yeah, Mark was definitely talking about

TARGET(LOAD_FAST_CHECK) {
// TODO
PyObject *value;
UOP_JUMP(1);
UOP_WRITE_PREV_INSTR();
UOP_UPDATE_STATS();
UOP_GET_FAST(value, oparg);
if (value == NULL) {
goto unbound_local_error;
}
UOP_STACK_ADJUST(1);
UOP_STACK_SET(1, value);
UOP_INCREF(value);
UOP_NEXT_OPCODE();
UOP_NEXT_OPARG();
UOP_LLTRACE();
UOP_CHECK_TRACING();
UOP_DISPATCH();
}

I do find this distractingly verbose. It is also lacking a bit more explanation (e.g. why does every case start with a jump? It seems that you've eliminated the

I am still trying to get my head around the rest of the various proposals; I find Mark's idea, closer to a proper IR, attractive for its compactness. I think the ease of implementation of the various optimization steps is key to which way we go. Looking at Mark's first comment again, he writes
What seems almost hidden here behind the near-C syntax is that this pushes
Yeah, my branch exposes these parts which are currently tucked away in the

The real body of the instruction is just:

PyObject *value;
UOP_GET_FAST(value, oparg);
if (value == NULL) {
goto unbound_local_error;
}
UOP_STACK_ADJUST(1);
UOP_STACK_SET(1, value);
UOP_INCREF(value);

The NULL check could become a uop as well:

PyObject *value;
UOP_GET_FAST(value, oparg);
UOP_CHECK_UNBOUND_LOCAL(value);
UOP_STACK_ADJUST(1);
UOP_STACK_SET(1, value);
UOP_INCREF(value);
So how would the tooling for generating the eval loop work? I imagine there would be a file containing blocks like this (one block per opcode):

// opcode: LOAD_FAST_CHECK
PyObject *value;
UOP_GET_FAST(value, oparg);
UOP_CHECK_UNBOUND_LOCAL(value);
UOP_STACK_ADJUST(1);
UOP_STACK_SET(1, value);
UOP_INCREF(value);

Then the tool would read that file and generate the switch block for the eval loop in another file, which would then be

So far, so good. In order to do the optimization stages you describe (very nicely) in your comment above, we'd need an in-memory representation of the sequence of uops. I guess this could be somewhat similar to the blocks of instructions we currently use in the bytecode compiler. We'd have to have a way to reference variables like

And then what? Once you've reached the final optimized form you have to generate machine code (uops look terrible for an interpreter due to the dispatch overhead). So if we're going any of these routes (this applies to Mark's proposal as well) we probably should first think about our story for generating machine code. Until then, UOPS (even the shortened form you give, augmented with more traditional TARGET() and DISPATCH() macros) look like a readability loss -- I'd much rather read the original code than the uops version, which isn't readable at all until I've memorized most of the uop definitions. The benefit of being able to derive the stack effect and similar things just doesn't warrant the cost of the uops migration to me. So maybe we should not be so hasty to convert to uops before tackling eval loop generation.

(I have some thoughts about Mark's IR too, I'm putting those in a separate comment.)
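To make the "generate the switch block" step a bit more concrete, here is a rough, hypothetical sketch of what the generated case for the LOAD_FAST_CHECK block above might look like if the tool simply expanded each uop in place and wrapped the body in the usual dispatch boilerplate (the exact expansions and macro choices are assumptions, not actual generator output):

TARGET(LOAD_FAST_CHECK) {
    PyObject *value;
    value = GETLOCAL(oparg);            // UOP_GET_FAST(value, oparg)
    if (value == NULL) {                // UOP_CHECK_UNBOUND_LOCAL(value)
        goto unbound_local_error;
    }
    stack_pointer += 1;                 // UOP_STACK_ADJUST(1)
    stack_pointer[-1] = value;          // UOP_STACK_SET(1, value)
    Py_INCREF(value);                   // UOP_INCREF(value)
    DISPATCH();
}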
Regarding Mark's IR ideas, I was trying to see how it compares for Brandt's example of c = a + b.

As a refresher,
There is no explicit

Anyway, I was thinking about what
with IR definitions
Alas, this is neither elegant, nor symmetric with the

Second try:
with IR definitions:
I don't like this definition of
But that still requires some kind of implicit temporary, else the

And even if we had a satisfactory

So something is still missing. :-( I'm guessing this is why Brandt didn't want to post his attempts. But we should look at MLIR and see what they are doing. Maybe we can learn from it. We should also look at the Cinder IRs (they have two,
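For reference, the "implicit temporary" shows up even in the current hand-written C for LOAD_FAST: the loaded value has to be named once so that it can be both increfed and pushed. Roughly (simplified from ceval.c):

PyObject *value = GETLOCAL(oparg);  // the temporary an IR definition would need to make implicit
Py_INCREF(value);
PUSH(value);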
I watched a presentation of theirs from 2020 earlier this week, and it appears on the surface that their framework is sort of overkill for what we're doing here. I guess if we wanted to, we would define our own dialect that maps to instructions/uops and define how to optimize and lower that dialect to LLVM IR. But I figure that at that point, if we want LLVM to be our backend, it may just be easier to optimize the uops ourselves and define a direct translation to LLVM IR. Also, C++... :(
Yeah, trycinder.com is great for exploring this. I preloaded that link with our little example, so you can see all of the passes they perform on it. It appears that they do replace the stack with an SSA form almost immediately. Also interesting is that they include a pass to insert refcounts into IR that initially ignores them, rather than removing incref/decref pairs where possible. If we are defining our opcodes as a sequence of uops, the latter approach seems like it makes more sense, since we would just have that information anyways when we start compiling a trace. It would be interesting to learn why they went that route instead.
Yeah, we should probably not try to use MLIR directly. But we should understand it well enough to be either inspired by it to do something similar ourselves, or able to reject their approach confidently. Right now we can do neither because we don't understand their approach well enough. (At least I don't, although I've dug into their tutorial briefly.)
Oh, cool. I looked at Cinder's Hir source code a bit and it definitely looks like something we should understand better before we design our own.
I was just coming here to say this! Glad you're already aware of it.
We were faced with the task of translating Python bytecode (which hides refcount operations) into our HIR (where we wanted to expose refcount operations), so our choices were "manually handle refcounting in the translation of each Python opcode, and then later try to optimize some of them out" or "encode the externally visible refcount semantics of each HIR opcode (what references does it borrow/steal/create) and then have a pass to insert the minimal refcount operations." The latter approach seemed less error-prone since it avoids manual coding of refcounting operations, which is easy to get wrong.
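As a purely hypothetical sketch (not Cinder's actual code or names), encoding the externally visible refcount behavior per IR opcode might look something like this, with the insertion pass consulting the table to decide where increfs/decrefs are really needed:

typedef enum { REF_BORROWED, REF_STOLEN, REF_NEW } RefKind;

typedef struct {
    RefKind inputs;    // how the opcode treats the references it consumes
    RefKind output;    // what kind of reference it produces (if any)
} RefcountSemantics;

// Hypothetical opcode names; the refcount-insertion pass walks the IR and
// only emits incref/decref where these declared semantics require it.
static const RefcountSemantics refcount_semantics[] = {
    [IR_LOAD_LOCAL]  = { REF_BORROWED, REF_BORROWED },  // reads a frame slot
    [IR_BINARY_ADD]  = { REF_BORROWED, REF_NEW },       // returns a new reference
    [IR_STORE_LOCAL] = { REF_STOLEN,   REF_BORROWED },  // takes over the input reference
};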
I'd love to write/chat more about our refcount insertion pass later, but quickly for now: we have a human-readable design doc for it here. It covers implementation rather than motivation, but what Carl said about making it harder to mess up refcounting operations in HIR pretty much covers it. Having worked on another JIT that went the other way, it's really nice having the freedom to optimize and otherwise manipulate HIR without having to worry about refcount operations. In general, if you
Okay, let's think about it! You know I can't resist a good prototyping opportunity. ;)

Using an awesome x86-64 assembler library I found called PeachPy, I've created a tool that can compile sequences of the uops we've been using here. The compiler itself is just 100 lines of Python, and it's not too difficult to follow once you've gotten used to PeachPy's way of doing things: https://github.com/brandtbucher/cpython/blob/uop-compiler/Tools/uop_compiler/uop_compiler.py

Here are some examples of its output for different forms of the uops for c = a + b.

Assuming a, b, and c are all bound:

MOV r10, rdi ; _PyInterpreterFrame *
MOV r11, [r10 + 56]
ADD r11, 2
ADD r11, 6
LEA r11, [r11 - 2]
MOV [r10 + 56], r11
PUSH r10
MOV rdi, [r10 + 72]
MOV rsi, [r10 + 80]
MOV r10, 94743266116281 ; PyNumber_Add
CALL r10
POP r10
MOV r11, [r10 + 56]
ADD r11, 2
MOV rax, rax
CMP rax, 0
JE error
ADD r11, 4
LEA r11, [r11 - 2]
MOV [r10 + 56], r11
MOV r11, [r10 + 88]
MOV [r10 + 88], rax
DEC [r11]
JNZ end
PUSH r10
MOV rdi, r11
MOV rax, 94743266669343 ; _Py_Dealloc
CALL rax
POP r10
MOV r11, [r10 + 56]
ADD r11, 2
end:
MOV eax, 1
RET
XOR eax, eax
RET
error:
MOV eax, -1
RET

Assuming a is unbound and b and c are bound:

MOV r10, rdi ; _PyInterpreterFrame *
MOV r11, [r10 + 56]
ADD r11, 2
ADD r11, 6
LEA r11, [r11 - 2]
MOV [r10 + 56], r11
PUSH r10
MOV rdi, [r10 + 72]
MOV rsi, [r10 + 80]
MOV r10, 94743266116281 ; PyNumber_Add
CALL r10
POP r10
MOV r11, [r10 + 56]
ADD r11, 2
MOV [r10 + 88], rax
CMP [r10 + 88], 0
JE error
ADD r11, 4
MOV eax, 1
RET
XOR eax, eax
RET
error:
MOV eax, -1
RET

Assuming a, b, and c are all bound, specialized for strings:

MOV r10, rdi ; _PyInterpreterFrame *
MOV r11, [r10 + 56]
ADD r11, 2
ADD r11, 6
LEA r11, [r11 - 2]
MOV [r10 + 56], r11
MOV r11, [r10 + 72]
MOV r9, 94743270261024 ; &PyUnicode_Type
CMP [r11 + 8], r9
JNE deopt
MOV r11, [r10 + 80]
MOV r9, 94743270261024 ; &PyUnicode_Type
CMP [r11 + 8], r9
JNE deopt
PUSH r10
MOV rdi, [r10 + 72]
MOV rsi, [r10 + 80]
MOV r10, 94743266994542 ; PyUnicode_Concat
CALL r10
POP r10
MOV r11, [r10 + 56]
ADD r11, 2
MOV rax, rax
CMP rax, 0
JE error
ADD r11, 4
LEA r11, [r11 - 2]
MOV [r10 + 56], r11
MOV r11, [r10 + 88]
MOV [r10 + 88], rax
DEC [r11]
JNZ end
PUSH r10
MOV rdi, r11
MOV rax, 94743266669343 ; _Py_Dealloc
CALL rax
POP r10
MOV r11, [r10 + 56]
ADD r11, 2
end:
MOV eax, 1
RET
deopt:
XOR eax, eax
RET
error:
MOV eax, -1
RET

Assuming a is unbound and b and c are bound, specialized for strings:

MOV r10, rdi ; _PyInterpreterFrame *
MOV r11, [r10 + 56]
ADD r11, 2
ADD r11, 6
LEA r11, [r11 - 2]
MOV [r10 + 56], r11
MOV r11, [r10 + 72]
MOV r9, 94743270261024 ; &PyUnicode_Type
CMP [r11 + 8], r9
JNE deopt
MOV r11, [r10 + 80]
MOV r9, 94743270261024 ; &PyUnicode_Type
CMP [r11 + 8], r9
JNE deopt
PUSH r10
MOV rdi, [r10 + 72]
MOV rsi, [r10 + 80]
MOV r10, 94743266994542 ; PyUnicode_Concat
CALL r10
POP r10
MOV r11, [r10 + 56]
ADD r11, 2
MOV [r10 + 88], rax
CMP [r10 + 88], 0
JE error
ADD r11, 4
MOV eax, 1
RET
deopt:
XOR eax, eax
RET
error:
MOV eax, -1
RET

If you check out the link, you can see that this machine code actually works! I've set up a little harness that lets you specify a sequence of uops, compile them to assembly, load the assembly, and run it within the context of arbitrary frames. Obviously there's room for improvement and many unanswered questions, but I think this does a good job of answering the question of "then what?" by completing the story of how different specialized forms of our little
Yup, I see that clearly now. So the main question is really whether uops are the right abstraction for optimizations.

From what I've learned so far about Cinder's JIT, in a sense they have their own version of this, HIR (high-level IR), but they don't translate HIR directly into machine code. Instead, they have a few stages where they transform the HIR into more optimal HIR, before they eventually generate machine code. (Their LIR is just an abstraction so they can target different hardware architectures easily.)

So now the next question is, do uops lend themselves to the same optimizations as HIR, or are there things that are easier to do with a representation like HIR? I don't know, but I worry that uops will mostly be useful to eliminate the instruction dispatch overhead, which IIRC is 20-30%, whereas Cinder claims speedups from 1.5-4x.

IIUC, HIR uses a register machine (with unlimited registers) and they have an elegant way of translating bytecode to HIR -- a form of symbolic interpretation where the evaluation stack is replaced by a stack that tells the builder in which register a given stack value referenced by the bytecode lives. The resulting register-based IR feels more powerful as a basis for optimizing transformations than an IR that is a sequence of uops that manipulate the same stack machine as our bytecode.

What am I missing? -- I'm sure there's not one thing, I'm just trying to imagine what's the best approach for optimization passes. My ambition is to eventually (around Python 3.14?) have a JIT in CPython that makes it unnecessary for Cinder to maintain their own JIT.
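A minimal sketch of that "stack of registers" translation idea, using hypothetical emit_* helpers rather than Cinder's actual API: while walking the bytecode, the builder's abstract stack holds virtual register numbers instead of values, so each stack-machine instruction maps directly onto a register-based IR instruction.

static int vstack[256];   // abstract evaluation stack: register ids, not objects
static int sp = 0;
static int next_reg = 0;

static void translate(int opcode, int oparg) {
    switch (opcode) {
    case LOAD_FAST:
        vstack[sp++] = next_reg;
        emit_load_local(next_reg++, oparg);            // vN = local[oparg]
        break;
    case BINARY_OP: {
        int rhs = vstack[--sp];
        int lhs = vstack[--sp];
        vstack[sp++] = next_reg;
        emit_binary_op(next_reg++, oparg, lhs, rhs);   // vN = lhs <op> rhs
        break;
    }
    case STORE_FAST:
        emit_store_local(oparg, vstack[--sp]);         // local[oparg] = vN
        break;
    }
}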
I disagree. Stack-based instructions are much more easily composed than register-based ones.
I'm not saying that this is the best approach, but it works:
An important advantage of using high-level "micro ops" is that the results of passes 1-4 can be executed by an interpreter, meaning that we can test the earlier passes independently and portably. Depending on the IR, we might want to swap 5 and 6. For passes 1-4 we want a quite high level IR, possibly a bit higher level than #454 (comment).
Thinking aloud as always... (There's a question at the end.)

Click to expand my thinking aloud:

1. Region selection

This looks like it takes a number of bytecodes, right? So once it's decided (based on various counters?) that a sequence of bytecodes together form a suitable region, we can create an array of uops by concatenating the lists of uops for each of those bytecodes. Easy-peasy.

2. Guard elimination and strengthening

I think I understand guard elimination -- this would take guards that are repeated and elide all but the first one. For example if the Python code for a trace contains something

I'm guessing that guard strengthening refers to two guards that aren't identical but where one guard implies another, and we replace this by the stronger of the two. Both these optimizations require that we can figure out that two guards refer to the same object (

3. Simple redundancy elimination

Not sure what that is, I'm guessing other redundant ops like repeatedly loading the same variable can be replaced by loading it once and then using

4. Object allocation sinking and unboxing

I've never heard of allocation sinking -- I'm guessing it's this, so maybe this could be done for certain tuples (e.g.

I'll defer thinking about the remaining points. But I have a question: what does it mean that these forms can be executed by an interpreter? That would only be possible if said interpreter did dispatch on every micro-operation, which sounds like it would be quite a lot of overhead. Sure, we can write such an interpreter for testing, but it wouldn't be used for actual user code execution, would it? What am I missing here?

Also, designing an instruction format for uops would be a challenge -- right now they are glorified macros, some of which need additional info (temp variables especially) that would have to be encoded in the uop instruction format.
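As an illustration of the guard-elimination idea, a trace might be rewritten as follows, using a hypothetical UOP_GUARD_TYPE micro-op in the same style as the listings above (and assuming nothing between the two guards can rebind the local or change its type):

// Before guard elimination:
UOP_GET_FAST(value_0, 0);
UOP_GUARD_TYPE(value_0, &PyUnicode_Type);   // guard needed on first use
UOP_BINARY_OP(sum_0, NB_ADD, value_0, value_1);
UOP_GET_FAST(value_2, 0);
UOP_GUARD_TYPE(value_2, &PyUnicode_Type);   // redundant: same local, same type check
UOP_BINARY_OP(sum_1, NB_ADD, value_2, value_3);

// After guard elimination:
UOP_GET_FAST(value_0, 0);
UOP_GUARD_TYPE(value_0, &PyUnicode_Type);
UOP_BINARY_OP(sum_0, NB_ADD, value_0, value_1);
UOP_GET_FAST(value_2, 0);
UOP_BINARY_OP(sum_1, NB_ADD, value_2, value_3);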
Off-line we established that Mark really does intend to interpret the uops. I assume that's why he doesn't want the uops to be quite so granular as in Brandt's prototype, because the instruction decoding overhead would kill any benefits we could get from optimization. It also gives a good reason for sticking with the current stack machine -- deoptimization can then just fall back on the bytecode.

On further thought I think the instruction format isn't a big issue since the uops (in this vision) will never be written to disk -- we can just have an array of structs containing the unpacked args to the uops, so we can write a straightforward decoding loop. Inline caches would just use extra fields in those structs. We'd lose a bit of memory locality but we'd otherwise reduce interpreter overhead which seems important.

pc = ucode; // start of the micro-code array
while (1) {
switch (pc->uopcode) {
case GET_LOCAL:
*sp++ = frame->f_localsplus[pc->uoparg];
break;
... ... ...
}
pc++;
}

It would be an interesting experiment to try and design some of the optimizer passes for the uop code. E.g. how to do guard elimination? What information is needed in the uops to do that effectively?
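A minimal sketch of the "array of structs" format mentioned above (field names and sizes are only illustrative): each decoded micro-instruction carries its unpacked argument plus optional inline-cache fields, and pc in the loop above would point at entries like this.

#include <stdint.h>

typedef struct {
    uint16_t  uopcode;   // which micro-op to execute
    uint16_t  uoparg;    // its argument: local index, stack offset, ...
    void     *cache0;    // optional inline-cache slot (version tag, callable, ...)
    void     *cache1;    // most uops leave the cache fields unused
} MicroInstr;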
Perhaps this is veering off-topic for this thread, but: I like this goal a lot, and judging by reaccs other folks do too :)

The current design directions for the faster-cpython JIT would make this goal quite hard to achieve, though. Until/unless nogil happens, I'm not aware of any way we can efficiently utilize a multi-core server fleet in Python without prefork/multiprocess. And I think, under that execution model, a JIT that has to repeat the same compilation work in every process will have a difficult time competing with one that compiles before fork.

If the faster-cpython JIT were to be a method JIT, then there'd be some possibility of having multiple ways to trigger compilation of a method, either ahead-of-time in the prefork scenario or by call count in other scenarios. If it's a tracing JIT, it's difficult to see how it could be adapted to work efficiently in the prefork scenario at all.
I have to admit I hadn't thought of that until you mentioned it. I guess we will need to plan an AOT interface too to cover that use case. Something that somehow forces an entire function to be compiled when it is first imported (or however Cinder does it, maybe just by calling a "compile this" API with a function or code object). (PS. I refuse to say "JIT" in this context because of the original meaning of that acronym. :-)
It would be great if this were possible! "How Cinder does it" is with an env var that points to a file that lists function qualnames one per line; there's also a "compile this function now" function available. But I don't think we'd be at all picky about exactly how it works, as long as it is possible. If there's a "compile this function" function, anything else we'd need can be built on that.
Fair! I suppose the original meaning (taken literally) could still apply if we observe that in our context "just in time" means "right before we fork" -- anything else would be "too late compilation" :P
I had another idea here. I explained it to Brandt and he said "I like that a lot, go ahead and work on it, just assign the issue to yourself."

The idea is that instead of trying to put the input for the generator in a separate file and add dummy stuff to the top and bottom of that file to make it look like C, we keep the input to the generator in ceval.c, like this:

#ifdef SOME_SYMBOL
switch (opcode) {
case NOP:
// stack-effect: 0
break;
...
}
#else
#include "generated.h"
#endif

The code generator can write the uglified, optimized code to generated.h, e.g. replacing

The code generator could even change the code it generates based on the compiler or platform. Other things the code generator could do:
Anyways, we've come full circle -- this was one of the earliest ideas (#5), eventually we decided to skip it for 3.11, now it's back on the 3.12 workflow as "Generate version of the interpreter better suited to emscripten and web assembly".
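To make the generated.h idea above a bit more concrete, a hypothetical excerpt for the NOP block might look like this (the TARGET/DISPATCH expansion is only an assumption about what the generator would emit, not a committed design):

// generated.h (hypothetical output; do not edit by hand)
TARGET(NOP) {
    // stack-effect: 0 -- checked by the generator against the declared effect
    DISPATCH();
}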
This could maybe be closed since we now have a working code generator -- though the discussion about the level of uops is still useful.
FWIW, I looked at

The only thing we might be able to do is, in the
But these are completely theoretical; I don't think our bytecode compiler would ever generate such code. Also, there's an additional caveat: certain operations guarantee that certain stack elements are not changed even though they appear popped and pushed (e.g.
I removed the 3.12 label for this, because I believe we no longer intend to do this for 3.12.
We now have micro-ops.
We've talked about this concept on-and-off for some time now, so I thought I would create this issue to collect some concrete ideas and start trying to reach a consensus on what the best path to take here is. This is a big, high-risk project, so we should probably start discussing it sooner rather than later.
Currently, our instructions are pretty much defined as "whatever the corresponding blob of C code in ceval.c does". While there are several places (mark_stacks, stack_effect, dis and its docs, the various haswhatever lists in opcode.py) that describe aspects of their behavior, they are scattered all over the place and must be kept in-sync (by hand) with the actual implementation. In other words, they are a burden just as much as they are a useful resource.

It's also not always very clear (for example):

NULL to the stack.
NULL on the stack.

And potentially, in the future:

...and, of course, much more.

One way to improve this situation is by ditching the free-form C code and defining our opcodes in terms of structured, statically-analyzable micro-ops (or "uops"). These uops represent lowered, composable units of work, such as incrementing the instruction pointer, decrementing a value's reference count, writing values to the stack, etc. Ideally, they'd represent a sort of IR that bridges the gap between Python and native code.

Last week I put together a simple proof-of-concept that converts much of the common logic in our instructions to a couple dozen initial uops. There is much more to be done, of course, but it does a pretty good job of communicating what this could look like, and many opcodes have been mostly or entirely converted. Note that GitHub collapses the huge ceval.c diff by default, but most of the interesting stuff is there.

Other random thoughts to spark discussion:

mark_stacks and stack_effect, or eval loops with debugging features added or removed.