Chapter 3 - ILP - Dynamic Scheduling #45
@pveentjer, I made a few changes.
@@ -38,13 +38,22 @@ In real implementations, pipelining introduces several constraints that limit th
A *write-after-read* (WAR) hazard requires a dependent write to execute after a read. It occurs when instruction `x+1` writes a source before instruction `x` reads the source, resulting in the wrong new value being read. A WAR hazard is not a true dependency and is eliminated by a technique called [register renaming](https://en.wikipedia.org/wiki/Register_renaming).[^1] It is a technique that abstracts logical registers from physical registers. CPUs support register renaming by keeping a large number of physical registers. Logical (architectural) registers, the ones that are defined by the ISA, are just aliases over a wider register file. With such decoupling of [architectural state](https://en.wikipedia.org/wiki/Architectural_state),[^3] solving WAR hazards is simple: we just need to use a different physical register for the write operation. For example:
I would move the register renaming part to the 3.3 chapter. Register renaming AFAIK was used first in the Tomasulo algorithm and one can use pipelining without register renaming.
Therefore the more logical place to explain register renaming would be in the 3.3 chapter.
Let me think more about it, but I tend to think that it's good already.
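To make the WAR discussion in this thread concrete, here is a minimal sketch of register renaming in Python. It is an illustration only, not the book's example: the `rename` function, the `(dest, src1, src2)` instruction format, and the unbounded supply of physical registers (no free list, no reclamation) are all assumptions made for brevity.

```python
from itertools import count

def rename(instructions, num_arch_regs=4):
    """Rename (dest, src1, src2) tuples from architectural to physical registers.

    Hypothetical helper: assumes an unlimited pool of physical registers.
    """
    # Initially architectural register rN maps to physical register pN.
    mapping = {f"r{i}": f"p{i}" for i in range(num_arch_regs)}
    fresh = count(num_arch_regs)  # index of the next unused physical register
    renamed = []
    for dest, src1, src2 in instructions:
        # Sources read whichever physical register currently holds the value.
        s1, s2 = mapping[src1], mapping[src2]
        # Every write gets a fresh physical register, so it cannot clobber
        # a value that an earlier instruction still has to read.
        mapping[dest] = f"p{next(fresh)}"
        renamed.append((mapping[dest], s1, s2))
    return renamed

program = [
    ("r1", "r2", "r3"),  # x:   r1 = r2 op r3  (reads r2)
    ("r2", "r0", "r3"),  # x+1: r2 = r0 op r3  (writes r2: WAR hazard on r2)
]
for instr in rename(program):
    print(instr)
```

After renaming, instruction `x` reads `p2` while instruction `x+1` writes `p5`, so the two no longer touch the same physical register and can execute in either order.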
Dynamic scheduling of these instructions is enabled by sophisticated hardware structures such as scoreboards and techniques such as register renaming to reduce data hazards. In the 1960s, some work to support dynamic scheduling and out-of-order execution included the [Tomasulo algorithm](https://en.wikipedia.org/wiki/Tomasulo_algorithm),[^4] implemented in the IBM System/360 Model 91, and [Scoreboarding](https://en.wikipedia.org/wiki/Scoreboarding),[^5] which was implemented in the CDC 6600. Those pioneering efforts have influenced all modern CPU architectures. The scoreboard hardware is used to schedule the in-order retirement and all machine state updates. It keeps track of data dependencies of every instruction and where in the pipeline the data is available. Most implementations strive to balance the hardware cost with the potential return. Typically, the size of the scoreboard determines how far ahead the hardware can look for scheduling such independent instructions.
[TODO]:
Peter: "(not true: on ARM the OOoE of loads/stores is allowed to become visible.. loads for sure. So it depends on the memory model of the ISA)."
Denis: I'm not familiar. What are the conditions in which this could happen? Is it about relaxed consistency?
Yes. It could happen that an earlier load runs into a cache miss and a later load doesn't, and then the two loads don't observe the stores in program order.
On x86 such reorderings are prohibited from becoming architecturally visible, since loads can't be reordered with other loads (every normal load has acquire semantics). But an x86 CPU can speculate that reordering the loads will not lead to problems, so it will still try to execute them out of order.
On ARM this reordering of loads would be perfectly fine, since a normal load (LDR) doesn't have any memory ordering semantics. You need to use e.g. an LDAR/LDAPR for that.
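The effect described above can be illustrated with a toy enumeration of the classic message-passing litmus test. This is a deliberately simplified sketch in Python, not the x86 or ARM memory model: it assumes a single shared memory, keeps the two stores in program order, and only toggles whether the two loads of the reader thread may swap. The names `T0`, `T1`, `interleavings`, and `outcomes` are hypothetical.

```python
# Thread 0 publishes data, then sets a flag.
T0 = [("store", "data", 1), ("store", "flag", 1)]
# Thread 1 reads the flag into r1, then the data into r2.
T1 = [("load", "flag", "r1"), ("load", "data", "r2")]

def interleavings(a, b):
    """Yield every merge of a and b that preserves each list's own order."""
    if not a or not b:
        yield list(a) + list(b)
        return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest

def outcomes(allow_load_load_reordering):
    """Return the set of (r1, r2) values thread 1 can observe in this toy model."""
    t1_orders = [T1]
    if allow_load_load_reordering:
        t1_orders.append(T1[::-1])  # the two independent loads swap places
    results = set()
    for t1 in t1_orders:
        for trace in interleavings(T0, t1):
            mem, regs = {"data": 0, "flag": 0}, {}
            for op, addr, x in trace:
                if op == "store":
                    mem[addr] = x
                else:
                    regs[x] = mem[addr]
            results.add((regs["r1"], regs["r2"]))
    return results

print(sorted(outcomes(allow_load_load_reordering=False)))
print(sorted(outcomes(allow_load_load_reordering=True)))
```

With the loads kept in order, the reader never sees the flag set (`r1 == 1`) together with stale data (`r2 == 0`). Once the loads may swap, that outcome appears, which is the reordering a plain `LDR` permits and an acquire load rules out.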
We should move the whole discussion about memory ordering to its own place.
If you change this:
> CPUs with OOO execution must still give the same result as if all instructions were executed in the program order

to this:
> A CPU with OOO execution should not be able to observe the out-of-order execution of its own instructions. But other CPUs could. For more details see chapter ....

then it is clearer that OOO execution should not influence the CPU in isolation.
I would rather leave it as it is. Let's not overwhelm readers with nuances; they would start second-guessing whether they really understand it. We will have a chance to correct ourselves later.
### Superscalar Engines and VLIW
The OOO execution in the Tomasulo algorithm is implemented using the Reorder Buffer (ROB) and Reservation Station (RS). The ROB is a circular buffer that keeps track of the state of each instruction, and in modern processors it has more than a thousand entries. Typically, the size of the ROB determines how far ahead the hardware can look for scheduling such independent instructions. Instructions are inserted in the ROB in program order, can execute out of order, and retire in program order. Register renaming is done when the instructions are placed in the ROB.
A single reservation station is needed for a single instruction. So a CPU needs to have a large number of reservation stations.
I don't know any processor that has 1000 entries in the ROB. The Apple M1 has something like 630 and that was double that of Intel/AMD. It is safer to say that the ROB capacity is in the hundreds on modern processors.
I think the explanation is now out of order :) The ROB isn't part of the original Tomasulo algorithm. It is an extension to support speculative execution and it isn't needed for out of order execution.
> I don't know any processor that has 1000 entries in the ROB. The Apple M1 has something like 630 and that was double that of Intel/AMD. It is safer to say that the ROB capacity is in the hundreds on modern processors.

I'm getting ahead of myself and I need to sleep more. :) Just checked chips-and-cheese and yes, you're right. Will change.

> A single reservation station is needed for a single instruction. So a CPU needs to have a large number of reservation stations.

I think this is not true. Again, from chips and cheese:

> I think the explanation is now out of order :) The ROB isn't part of the original Tomasulo algorithm. It is an extension to support speculative execution and it isn't needed for out of order execution.

I'm not very well versed in the history here, but I think you still need the ROB to retire instructions in order, no? We can say something like "Modern processors implement derivatives of Tomasulo's algorithm that include ROB and RS ..."
To be honest, I'm also slightly confused about how the original Tomasulo algorithm deals with in-order retirement.
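The behaviour the diff attributes to the ROB (insert in program order, execute in any order, retire in program order) can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not a description of any real core: the `ReorderBuffer` class and its method names are hypothetical, a bounded `deque` stands in for the hardware circular buffer, and renaming, reservation stations, and speculation are left out.

```python
from collections import deque

class ReorderBuffer:
    """Toy ROB: a bounded FIFO whose head is the oldest in-flight instruction."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = deque()  # oldest instruction sits at the left end

    def dispatch(self, name):
        """Insert in program order; return False when the ROB is full."""
        if len(self.entries) >= self.capacity:
            return False
        self.entries.append({"name": name, "done": False})
        return True

    def complete(self, name):
        """Mark an instruction as executed; this may happen in any order."""
        for entry in self.entries:
            if entry["name"] == name:
                entry["done"] = True
                return
        raise KeyError(name)

    def retire(self):
        """Retire finished instructions from the head only, in program order."""
        retired = []
        while self.entries and self.entries[0]["done"]:
            retired.append(self.entries.popleft()["name"])
        return retired

rob = ReorderBuffer(capacity=4)
for name in ["i0", "i1", "i2"]:
    rob.dispatch(name)
rob.complete("i2")      # the youngest instruction finishes first
first = rob.retire()    # nothing retires: i0 at the head is still pending
rob.complete("i0")
second = rob.retire()   # i0 retires; unfinished i1 now blocks i2
rob.complete("i1")
third = rob.retire()    # i1 and i2 retire together
print(first, second, third)
```

The capacity check in `dispatch` is why the ROB size bounds how far ahead the hardware can look: once the buffer is full, no younger instruction can enter until the head retires.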
I cleaned up the PR, @pveentjer please let me know if you have further comments.
It looks good to me. Thank you for the review.
@pveentjer, thanks a lot!
This is work in progress.