Chapter 3 - ILP - Dynamic Scheduling #45

Merged 11 commits on Apr 30, 2024

Conversation

pveentjer
Contributor

This is work in progress.

@dendibakh
Owner

@pveentjer, I made a few changes.

  1. I moved your register renaming examples up into section "3.2 Pipelining".
  2. I stuck with OOO, because I refer to it later in many places in the book, for example, "... OOO engine ..."
  3. I feel like this is not the right place to talk about memory ordering. I would rather save it for section 3.8.3 Load-Store Unit or chapter 12 (section Architecture-Specific Optimizations).
  4. Also, please keep in mind that this is mostly a theoretical discussion about ideas. I'm OK with keeping things abstract here and not getting into details. I have a big section that exemplifies a real implementation: section "3.8 Modern CPU Design". That is where we can discuss nuances. BTW, maybe I can come up with a better name for this section...
  5. I left some TODOs in the text, please check them.

@@ -38,13 +38,22 @@ In real implementations, pipelining introduces several constraints that limit th
A *write-after-read* (WAR) hazard requires a dependent write to execute after a read. It occurs when instruction `x+1` writes a source before instruction `x` reads the source, resulting in the wrong new value being read. A WAR hazard is not a true dependency and is eliminated by a technique called [register renaming](https://en.wikipedia.org/wiki/Register_renaming).[^1] It is a technique that abstracts logical registers from physical registers. CPUs support register renaming by keeping a large number of physical registers. Logical (architectural) registers, the ones that are defined by the ISA, are just aliases over a wider register file. With such decoupling of [architectural state](https://en.wikipedia.org/wiki/Architectural_state),[^3] solving WAR hazards is simple: we just need to use a different physical register for the write operation. For example:
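
To make the renaming idea concrete, here is a minimal C++ sketch. It is not the example from the book: `RenameTable`, `rename`, and the register counts are hypothetical names and numbers chosen for illustration, and free-list recycling of physical registers is left out. It only shows that a later write to an architectural register gets a fresh physical register, so it cannot clobber a value that an earlier instruction still has to read.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical, simplified rename table: 4 architectural registers backed by
// a larger pool of physical registers (recycling of the pool is omitted).
struct RenameTable {
    static constexpr std::size_t kArchRegs = 4;
    std::array<int, kArchRegs> map{};  // architectural -> physical
    int next_free = 0;                 // next unused physical register

    RenameTable() {
        for (std::size_t i = 0; i < kArchRegs; ++i) map[i] = next_free++;
    }
    // A source operand reads whichever physical register currently backs it.
    int read(std::size_t arch) const { return map[arch]; }
    // Every write is given a fresh physical register.
    int write(std::size_t arch) {
        map[arch] = next_free++;
        return map[arch];
    }
};

struct Renamed { int dst, src1, src2; };

// Rename "dst = src1 op src2". Sources are looked up before the destination
// is remapped, so "r1 = r1 + r2" still reads the old mapping of r1.
Renamed rename(RenameTable& t, std::size_t dst, std::size_t s1, std::size_t s2) {
    int p1 = t.read(s1);
    int p2 = t.read(s2);
    int pd = t.write(dst);
    return {pd, p1, p2};
}
```

Renaming `r0 = r1 + r2` followed by `r1 = r3 + r3` gives the second instruction a destination that differs from the physical register the first one reads as `r1`, so the two can execute in either order.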

Contributor Author

@pveentjer pveentjer Apr 12, 2024

I would move the register renaming part to the 3.3 chapter. Register renaming AFAIK was used first in the Tomasulo algorithm and one can use pipelining without register renaming.

Therefore the more logical place to explain register renaming would be in the 3.3 chapter.

Owner

Let me think more about it, but I tend to think that it's good already.

Dynamic scheduling of these instructions is enabled by sophisticated hardware structures such as scoreboards and techniques such as register renaming to reduce data hazards. In the 1960s, some work to support dynamic scheduling and out-of-order execution included the [Tomasulo algorithm](https://en.wikipedia.org/wiki/Tomasulo_algorithm),[^4] implemented in the IBM System/360 Model 91, and [Scoreboarding](https://en.wikipedia.org/wiki/Scoreboarding),[^5] which was implemented in the CDC 6600. Those pioneering efforts have influenced all modern CPU architectures. The scoreboard hardware is used to schedule the in-order retirement and all machine state updates. It keeps track of data dependencies of every instruction and where in the pipeline the data is available. Most implementations strive to balance the hardware cost with the potential return. Typically, the size of the scoreboard determines how far ahead the hardware can look for scheduling such independent instructions.
[TODO]:
Peter: "(not true: on ARM the OOoE of loads/stores is allowed to become visible.. loads for sure. So it depends on the memory model of the ISA). "
Denis: I'm not familiar. What are the conditions in which this could happen? Is it about relaxed consistency?
Contributor Author

@pveentjer pveentjer Apr 12, 2024

Yes. It could happen that an earlier load runs into a cache miss and a later load doesn't, and then the two loads don't observe the stores in program order.

On x86, such reorderings are prohibited from becoming architecturally visible, since loads can't be reordered with other loads (every normal load has acquire semantics). An x86 core can still speculate that reordering the loads won't cause problems, so it will try to execute them out of order.

On ARM, this reordering of loads is perfectly fine, since a normal load (LDR) doesn't have any memory ordering semantics. You need to use e.g. LDAR/LDAPR for that.
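
A minimal message-passing sketch in portable C++ may help show where this matters in software. The function name `message_passing` and the value 42 are made up for illustration. The acquire load on `ready` is what guarantees that the later read of `data` sees the published value; compilers typically lower it to a plain load on x86 and to an LDAR/LDAPR-style instruction on AArch64, but the exact code generation is an assumption about the toolchain, not something this PR specifies.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Message passing: the writer publishes `data`, then sets `ready`.
// Returns the value of `data` that the reader saw after observing `ready`.
int message_passing() {
    int data = 0;                    // plain, non-atomic payload
    std::atomic<bool> ready{false};  // publication flag
    int seen = -1;

    std::thread writer([&] {
        data = 42;
        // Release: the store to `data` becomes visible before `ready`.
        ready.store(true, std::memory_order_release);
    });
    std::thread reader([&] {
        // Acquire: the read of `data` below cannot be ordered before this load.
        while (!ready.load(std::memory_order_acquire)) { }
        seen = data;
    });

    writer.join();
    reader.join();
    return seen;
}
```

With the release/acquire pair the function always returns 42. If both accesses to `ready` used `std::memory_order_relaxed`, the read of `data` would be a data race in the C++ model, and on a weakly ordered machine the reader could observe the flag before the payload.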

Contributor Author

@pveentjer pveentjer Apr 12, 2024

We should move the whole discussion about memory ordering to its own place.

If you change this

> CPUs with OOO execution must still give the same result as if all instructions were executed in the program order

to this

> A CPU with OOO execution should not be able to observe the out-of-order execution of its own instructions. But other CPUs could. For more details see chapter ....

then it is clearer that OOO execution should not influence the CPU in isolation.

Owner

I would rather leave it as it is. Let's not overwhelm readers with nuances; they would start second-guessing whether they really understand it. We will have a chance to correct ourselves later.


### Superscalar Engines and VLIW
The OOO execution in the Tomasulo algorithm is implemented using the Reorder Buffer (ROB) and Reservation Station (RS). The ROB is a circular buffer that keeps track of the state of each instruction, and in modern processors it has more than a thousand entries. Typically, the size of the ROB determines how far ahead the hardware can look for scheduling such independent instructions. Instructions are inserted in the ROB in program order, can execute out of order, and retire in program order. Register renaming is done when the instructions are placed in the ROB.
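
The allocate/complete/retire behavior described in that paragraph can be sketched as a toy model in C++. This is an illustration under assumptions, not how any real core is built: `ReorderBuffer`, `RobEntry`, and the method names are hypothetical, a `std::deque` stands in for the circular buffer, and renaming, operands, and exceptions are ignored.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <vector>

struct RobEntry {
    int id;     // position of the instruction in program order
    bool done;  // set once execution has finished
};

class ReorderBuffer {
    std::deque<RobEntry> entries_;  // front = oldest instruction in flight
    std::size_t capacity_;

public:
    explicit ReorderBuffer(std::size_t capacity) : capacity_(capacity) {}

    // Entries are allocated in program order; a full ROB stalls allocation.
    bool allocate(int id) {
        if (entries_.size() == capacity_) return false;
        entries_.push_back({id, false});
        return true;
    }

    // Execution units may report completion in any order.
    void complete(int id) {
        for (auto& e : entries_) {
            if (e.id == id) e.done = true;
        }
    }

    // Only the oldest entries may retire, so retirement stays in program order.
    std::vector<int> retire() {
        std::vector<int> retired;
        while (!entries_.empty() && entries_.front().done) {
            retired.push_back(entries_.front().id);
            entries_.pop_front();
        }
        return retired;
    }
};
```

If instructions 0, 1, and 2 are allocated and instruction 2 completes first, `retire()` returns nothing until instruction 0 is done. The capacity limit is also why the ROB size bounds how far ahead the hardware can look.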
Contributor Author

A single reservation station is needed for a single instruction. So a CPU needs to have a large number of reservation stations.

Contributor Author

@pveentjer pveentjer Apr 12, 2024

I don't know any processor that has 1000 entries in the ROB. The Apple M1 has something like 630 and that was double that of Intel/AMD. It is safer to say that the ROB capacity is in the hundreds on modern processors.

Contributor Author

I think the explanation is now out of order :) The ROB isn't part of the original Tomasulo algorithm. It is an extension to support speculative execution and it isn't needed for out of order execution.

Owner

> I don't know any processor that has 1000 entries in the ROB. The Apple M1 has something like 630 and that was double that of Intel/AMD. It is safer to say that the ROB capacity is in the hundreds on modern processors.

I'm getting ahead of myself and I need to sleep more. :) Just checked Chips and Cheese and yes, you're right. Will change.

> A single reservation station is needed for a single instruction. So a CPU needs to have a large number of reservation stations.

I think this is not true. Again, from Chips and Cheese:

[image]

> I think the explanation is now out of order :) The ROB isn't part of the original Tomasulo algorithm. It is an extension to support speculative execution and it isn't needed for out of order execution.

I'm not very well versed in the history here, but I think you still need the ROB to retire instructions in order, no? We can say something like "Modern processors implement derivatives of Tomasulo's algorithm that include ROB and RS ..."

Contributor Author

To be honest, I'm also slightly confused about how the original Tomasulo algorithm deals with in-order retirement.

@dendibakh
Owner

I cleaned up the PR, @pveentjer please let me know if you have further comments.

@pveentjer
Contributor Author

pveentjer commented Apr 16, 2024

It looks good to me. Thank you for the review.

@pveentjer pveentjer changed the title from "[WIP] Chapter 3 - ILP - Dynamic Scheduling" to "Chapter 3 - ILP - Dynamic Scheduling" on Apr 16, 2024
@dendibakh
Owner

@pveentjer, thanks a lot!

@dendibakh dendibakh merged commit dd9c74f into dendibakh:main Apr 30, 2024
1 check passed