Chapter 3 - ILP - Dynamic Scheduling #45

pveentjer · 2024-04-03T03:38:01Z

This is work in progress.

chapters/3-CPU-Microarchitecture/3-3 Exploiting ILP.md

dendibakh · 2024-04-11T23:48:00Z

@pveentjer, I made a few changes.

- I moved your register renaming examples up into section "3.2 Pipelining".
- I stuck with "OOO" because I refer to it later in many places in the book, for example, "... OOO engine ...".
- I feel like this is not the right place to talk about memory ordering. I would rather save it for section 3.8.3 Load-Store Unit or chapter 12 (section Architecture-Specific Optimizations).
- Also, please keep in mind that this is mostly a theoretical discussion about ideas. I'm OK with keeping things abstract here and not getting into details. I have a big section that exemplifies a real implementation: section "3.8 Modern CPU Design". That is where we can discuss nuances. BTW, maybe I can come up with a better name for this section...
- I left some TODOs in the text; please check them.

pveentjer · 2024-04-12T01:13:11Z

chapters/3-CPU-Microarchitecture/3-2 Pipelining.md

@@ -38,13 +38,22 @@ In real implementations, pipelining introduces several constraints that limit th
  A *write-after-read* (WAR) hazard requires a dependent write to execute after a read. It occurs when instruction `x+1` writes a source before instruction `x` reads the source, resulting in the wrong new value being read. A WAR hazard is not a true dependency and is eliminated by a technique called [register renaming](https://en.wikipedia.org/wiki/Register_renaming).[^1] It is a technique that abstracts logical registers from physical registers. CPUs support register renaming by keeping a large number of physical registers. Logical (architectural) registers, the ones that are defined by the ISA, are just aliases over a wider register file. With such decoupling of [architectural state](https://en.wikipedia.org/wiki/Architectural_state),[^3] solving WAR hazards is simple: we just need to use a different physical register for the write operation. For example:



I would move the register renaming part to section 3.3. AFAIK, register renaming was first used in the Tomasulo algorithm, and one can use pipelining without register renaming.

Therefore, the more logical place to explain register renaming would be section 3.3.

Let me think more about it, but I tend to think that it's good already.
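
For reference, the WAR case in the quoted paragraph can be illustrated with a toy model. This is only a minimal sketch, not the example from the book: the `rename` function, the register names, and the unbounded supply of physical registers are all assumptions made for illustration (real hardware uses a finite register file with a free list).

```python
def rename(program, arch_regs):
    """Give every write a fresh physical register (toy model, unbounded file)."""
    # Initial alias table: each architectural register maps to one physical register.
    mapping = {reg: f"P{i}" for i, reg in enumerate(arch_regs)}
    next_phys = len(arch_regs)
    renamed = []
    for dst, srcs in program:
        # Sources read the physical register that currently holds the value.
        phys_srcs = tuple(mapping[s] for s in srcs)
        # The write goes to a new physical register, so it cannot clobber
        # a value that an older instruction still has to read.
        mapping[dst] = f"P{next_phys}"
        next_phys += 1
        renamed.append((mapping[dst], phys_srcs))
    return renamed


# Hypothetical sequence with a WAR hazard on R1:
#   i0: R2 = R1 + R0   (reads R1)
#   i1: R1 = R3 + R0   (writes R1)
program = [("R2", ("R1", "R0")), ("R1", ("R3", "R0"))]
renamed = rename(program, ["R0", "R1", "R2", "R3"])
for dst, srcs in renamed:
    print(dst, "<-", srcs)
```

After renaming, `i1` writes `P5` while `i0` still reads `P1`, so in this model the two instructions no longer conflict and could execute in either order.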

pveentjer · 2024-04-12T01:24:58Z

chapters/3-CPU-Microarchitecture/3-3 Exploiting ILP.md

-Dynamic scheduling of these instructions is enabled by sophisticated hardware structures such as scoreboards and techniques such as register renaming to reduce data hazards. In the 1960s, some work to support dynamic scheduling and out-of-order execution included the [Tomasulo algorithm](https://en.wikipedia.org/wiki/Tomasulo_algorithm),[^4] implemented in the IBM360, and [Scoreboading](https://en.wikipedia.org/wiki/Scoreboarding),[^5] which was implemented in the CDC6600. Those pioneering efforts have influenced all modern CPU architectures.The scoreboard hardware is used to schedule the in-order retirement and all machine state updates. It keeps track of data dependencies of every instruction and where in the pipeline the data is available. Most implementations strive to balance the hardware cost with the potential return. Typically, the size of the scoreboard determines how far ahead the hardware can look for scheduling such independent instructions. 
+[TODO]:
+Peter: "(not true: on ARM the OOoE of loads/stores is allowed to become visible.. loads for sure. So it depends on the memory model of the ISA). "
+Denis: I'm not familiar. What are the conditions in which this could happen? Is it about relaxed consistency?


Yes. It could happen that an earlier load runs into a cache miss while a later load doesn't, and then the two loads don't observe the stores in program order.

On x86, such reorderings are prohibited from becoming architecturally visible, since loads can't be reordered with other loads (every normal load has acquire semantics). But an x86 core can still speculate that reordering the loads doesn't lead to problems, so it will try to execute them out of order.

On ARM, this reordering of loads would be perfectly fine, since a normal load (LDR) doesn't have any memory ordering semantics. You need to use e.g. an LDAR/LDAPR for that.
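
To make the load-load case concrete, here is a minimal sketch that enumerates the outcomes of a message-passing litmus test. It is a toy model under loud assumptions: a single shared memory, instantaneous operations, stores always kept in program order, and hypothetical names (`outcomes`, `data`, `flag`). It is not the formal x86 or ARM memory model; it only isolates the effect of allowing two loads to be reordered.

```python
from itertools import permutations

# Message-passing litmus test (hypothetical variable names):
#   writer: data = 1; flag = 1
#   reader: r1 = flag; r2 = data
OPS = [("W", "data"), ("W", "flag"), ("R", "flag"), ("R", "data")]
STORE_DATA, STORE_FLAG, LOAD_FLAG, LOAD_DATA = range(4)


def outcomes(loads_in_order):
    """Return the set of (r1, r2) results over all allowed global orders."""
    results = set()
    for order in permutations(range(4)):
        pos = {op: i for i, op in enumerate(order)}
        if pos[STORE_DATA] > pos[STORE_FLAG]:
            continue  # assumption: stores stay in program order in this model
        if loads_in_order and pos[LOAD_FLAG] > pos[LOAD_DATA]:
            continue  # forbid load-load reordering
        mem = {"data": 0, "flag": 0}
        regs = {}
        for op in order:
            kind, addr = OPS[op]
            if kind == "W":
                mem[addr] = 1
            else:
                regs[addr] = mem[addr]
        results.add((regs["flag"], regs["data"]))
    return results


print(sorted(outcomes(loads_in_order=True)))
print(sorted(outcomes(loads_in_order=False)))
```

With the loads kept in order, the result `(r1, r2) = (1, 0)` never appears: a reader that sees the flag also sees the data. Once the two loads may be reordered, `(1, 0)` becomes possible, which is the visible effect described above.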

We should move the whole discussion about memory ordering to its own place.

If you change this:

> CPUs with OOO execution must still give the same result as if all instructions were executed in the program order

to this:

> A CPU with OOO execution should not be able to observe the out-of-order execution of its own instructions. But other CPUs could. For more details see chapter ....

then it is clearer that OOO execution should not influence the CPU in isolation.

I would rather leave it as it is. Let's not overwhelm readers with nuances; they would second-guess whether they really understand it or not. We will have a chance to correct ourselves later.

pveentjer · 2024-04-12T01:28:26Z

chapters/3-CPU-Microarchitecture/3-3 Exploiting ILP.md


-### Superscalar Engines and VLIW
+The OOO execution in the Tomasulo algorithm is implemented using the Reorder Buffer (ROB) and Reservation Station (RS). The ROB is a circular buffer that keeps track of the state of each instruction, and in modern processors it has more than a thousand entries. Typically, the size of the ROB determines how far ahead the hardware can look for scheduling such independent instructions. Instructions are inserted in the ROB in program order, can execute out of order, and retire in program order. Register renaming is done when the instructions are placed in the ROB. 


A single reservation station is needed for a single instruction. So a CPU needs to have a large number of reservation stations.

I don't know any processor that has 1000 entries in the ROB. The Apple M1 has something like 630 and that was double that of Intel/AMD. It is safer to say that the ROB capacity is in the hundreds on modern processors.

I think the explanation is now out of order :) The ROB isn't part of the original Tomasulo algorithm. It is an extension to support speculative execution and it isn't needed for out of order execution.

> I don't know any processor that has 1000 entries in the ROB. The Apple M1 has something like 630 and that was double that of Intel/AMD. It is safer to say that the ROB capacity is in the hundreds on modern processors.

I'm getting ahead of myself and I need to sleep more. :) Just checked Chips and Cheese and yes, you're right. Will change.

> A single reservation station is needed for a single instruction. So a CPU needs to have a large number of reservation stations.

I think this is not true. Again, from Chips and Cheese:

> I think the explanation is now out of order :) The ROB isn't part of the original Tomasulo algorithm. It is an extension to support speculative execution and it isn't needed for out of order execution.

I'm not very well versed in the history here, but I think you still need a ROB to retire instructions in order, no? We can say something like "Modern processors implement derivatives of Tomasulo's algorithm that include ROB and RS ..."

To be honest, I'm also slightly confused about how the original Tomasulo algorithm deals with in-order retirement.
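
The behavior described in the proposed text (insert in program order, execute out of order, retire in program order) can be sketched with a toy model. This is a minimal illustration, not a description of any real design: the `ReorderBuffer` class and its method names are hypothetical, and a `deque` with a capacity check stands in for the circular buffer.

```python
from collections import deque


class ReorderBuffer:
    """Toy ROB: in-order allocation, out-of-order completion, in-order retirement."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = deque()  # oldest instruction sits at the left end

    def insert(self, name):
        """Allocate an entry in program order; return False if the buffer is full."""
        if len(self.entries) >= self.capacity:
            return False
        self.entries.append({"name": name, "done": False})
        return True

    def complete(self, name):
        """Mark an instruction as executed; this may happen in any order."""
        for entry in self.entries:
            if entry["name"] == name:
                entry["done"] = True
                return
        raise KeyError(name)

    def retire(self):
        """Retire finished instructions from the head only, so order is preserved."""
        retired = []
        while self.entries and self.entries[0]["done"]:
            retired.append(self.entries.popleft()["name"])
        return retired


rob = ReorderBuffer(capacity=4)
for name in ["i0", "i1", "i2"]:
    rob.insert(name)

rob.complete("i2")   # the youngest instruction finishes first
print(rob.retire())  # nothing retires: i0 at the head is not done yet
rob.complete("i0")
print(rob.retire())  # only i0 retires; i1 still blocks i2
rob.complete("i1")
print(rob.retire())  # i1 and i2 retire together, in program order
```

In this model the capacity bounds how many instructions can be in flight at once, which matches the remark that the ROB size limits how far ahead the hardware can look.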

chapters/3-CPU-Microarchitecture/3-3 Exploiting ILP.md

dendibakh · 2024-04-15T21:57:20Z

I cleaned up the PR. @pveentjer, please let me know if you have further comments.

pveentjer · 2024-04-16T03:13:53Z

It looks good to me. Thank you for the review.

chapters/3-CPU-Microarchitecture/3-3 Exploiting ILP.md

dendibakh · 2024-04-30T16:39:00Z

@pveentjer , thanks a lot!

pveentjer added 5 commits April 3, 2024 06:12

WIP

01df5ed

WIP

047fd95

WIP

a115ef7

WIP

2da2bb4

WIP

663f1f2

jerrinot reviewed Apr 8, 2024

pveentjer and others added 4 commits April 10, 2024 06:36

WIP

82031ba

WIP

02fd094

Denis cosmetic fix. part1

5639939

Denis cosmetic fix. part2

29c5701

pveentjer commented Apr 12, 2024

Review comments

97faeeb

pveentjer changed the title ~~[WIP] Chapter 3 - ILP - Dynamic Scheduling~~ Chapter 3 - ILP - Dynamic Scheduling Apr 16, 2024

cf-natali reviewed Apr 16, 2024

fixed review comments

77d6bbc

dendibakh merged commit dd9c74f into dendibakh:main Apr 30, 2024
1 check passed
