From d1164b3ff0751402d605793f43389bf804a1d08f Mon Sep 17 00:00:00 2001
From: Peter Veentjer <alarmnummer@gmail.com>
Date: Tue, 2 Apr 2024 08:34:11 +0300
Subject: [PATCH 1/5] Added note about duality of load/store and
 register/memory behavior of X86

---
 chapters/3-CPU-Microarchitecture/3-1 ISA.md | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/chapters/3-CPU-Microarchitecture/3-1 ISA.md b/chapters/3-CPU-Microarchitecture/3-1 ISA.md
index 2bdadab015..dc8462256f 100644
--- a/chapters/3-CPU-Microarchitecture/3-1 ISA.md	
+++ b/chapters/3-CPU-Microarchitecture/3-1 ISA.md	
@@ -4,4 +4,6 @@ The instruction set is the vocabulary used by software to communicate with the h
 
 Most modern architectures can be classified as general purpose register-based, load-store architectures where the operands are explicitly specified, and memory is accessed only using load and store instructions. In addition to providing the basic functions in the ISA such as load, store, control and scalar arithmetic operations using integers and floating-point, the widely deployed architectures continue to enhance their ISA to support new computing paradigms. These include enhanced vector processing instructions (e.g., Intel AVX2, AVX512, ARM SVE) and matrix/tensor instructions (Intel AMX). Software mapped to use these advanced instructions typically provide orders of magnitude improvement in performance. 
 
+The X86 ISA is a register-memory architecture since arithmatic can be done directly on memory operands. But after conversion to micro-instructions (uops), the X86 microarchitecture is actually a load-store architecture.
+
 Modern CPUs support 32-bit and 64-bit precision for arithmetic operations. With the fast-evolving field of machine learning and AI, the industry has a renewed interest in alternative numeric formats for variables to drive significant performance improvements. Research has shown that machine learning models perform just as good, using fewer bits to represent the variables, saving on both compute and memory bandwidth. As a result, several CPU franchises have recently added support for lower precision data types such as 8-bit integers (int8, e.g., Intel VNNI), 16-bit floating-point (fp16, bf16) in the ISA, in addition to the traditional 32-bit and 64-bit formats for arithmetic operations.
\ No newline at end of file

From f68fab2062386d2c4b4a3b4e416b03751ce5d6df Mon Sep 17 00:00:00 2001
From: Denis Bakhvalov <dendibakh@gmail.com>
Date: Wed, 10 Apr 2024 11:28:58 -0400
Subject: [PATCH 2/5] Denis cosmetic fix

---
 chapters/3-CPU-Microarchitecture/3-1 ISA.md | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/chapters/3-CPU-Microarchitecture/3-1 ISA.md b/chapters/3-CPU-Microarchitecture/3-1 ISA.md
index dc8462256f..a013da5db1 100644
--- a/chapters/3-CPU-Microarchitecture/3-1 ISA.md	
+++ b/chapters/3-CPU-Microarchitecture/3-1 ISA.md	
@@ -2,8 +2,6 @@
 
 The instruction set is the vocabulary used by software to communicate with the hardware. The instruction set architecture (ISA) defines the contract between the software and the hardware. Intel x86, ARM v8 and RISC-V are examples of current-day ISAs that are widely deployed. All of these are 64-bit architectures, i.e., all address computations use 64 bits. ISA developers and CPU architects typically ensure that software or firmware conforming to the specification will execute on any processor built using the specification. Widely deployed ISA franchises also typically ensure backward compatibility such that code written for the GenX version of a processor will continue to execute on GenX+i.
 
-Most modern architectures can be classified as general purpose register-based, load-store architectures where the operands are explicitly specified, and memory is accessed only using load and store instructions. In addition to providing the basic functions in the ISA such as load, store, control and scalar arithmetic operations using integers and floating-point, the widely deployed architectures continue to enhance their ISA to support new computing paradigms. These include enhanced vector processing instructions (e.g., Intel AVX2, AVX512, ARM SVE) and matrix/tensor instructions (Intel AMX). Software mapped to use these advanced instructions typically provide orders of magnitude improvement in performance. 
-
-The X86 ISA is a register-memory architecture since arithmatic can be done directly on memory operands. But after conversion to micro-instructions (uops), the X86 microarchitecture is actually a load-store architecture.
+Most modern architectures can be classified as general purpose register-memory architectures where operands are explicitly specified, operations can be performed on memory operands, as well as registers, and memory is accessed only using load and store instructions. In addition to providing the basic functions in the ISA such as load, store, control and scalar arithmetic operations using integers and floating-point, the widely deployed architectures continue to enhance their ISA to support new computing paradigms. These include enhanced vector processing instructions (e.g., Intel AVX2, AVX512, ARM SVE) and matrix/tensor instructions (Intel AMX). Software mapped to use these advanced instructions typically provide orders of magnitude improvement in performance. 
 
 Modern CPUs support 32-bit and 64-bit precision for arithmetic operations. With the fast-evolving field of machine learning and AI, the industry has a renewed interest in alternative numeric formats for variables to drive significant performance improvements. Research has shown that machine learning models perform just as good, using fewer bits to represent the variables, saving on both compute and memory bandwidth. As a result, several CPU franchises have recently added support for lower precision data types such as 8-bit integers (int8, e.g., Intel VNNI), 16-bit floating-point (fp16, bf16) in the ISA, in addition to the traditional 32-bit and 64-bit formats for arithmetic operations.
\ No newline at end of file

From b1f5aed5f2b63dc5f4c0f49ff769adf306638c99 Mon Sep 17 00:00:00 2001
From: Denis Bakhvalov <dendibakh@gmail.com>
Date: Thu, 11 Apr 2024 16:19:30 -0400
Subject: [PATCH 3/5] Denis cosmetic fix ++

---
 chapters/3-CPU-Microarchitecture/3-1 ISA.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/chapters/3-CPU-Microarchitecture/3-1 ISA.md b/chapters/3-CPU-Microarchitecture/3-1 ISA.md
index a013da5db1..8878cd9900 100644
--- a/chapters/3-CPU-Microarchitecture/3-1 ISA.md	
+++ b/chapters/3-CPU-Microarchitecture/3-1 ISA.md	
@@ -2,6 +2,6 @@
 
 The instruction set is the vocabulary used by software to communicate with the hardware. The instruction set architecture (ISA) defines the contract between the software and the hardware. Intel x86, ARM v8 and RISC-V are examples of current-day ISAs that are widely deployed. All of these are 64-bit architectures, i.e., all address computations use 64 bits. ISA developers and CPU architects typically ensure that software or firmware conforming to the specification will execute on any processor built using the specification. Widely deployed ISA franchises also typically ensure backward compatibility such that code written for the GenX version of a processor will continue to execute on GenX+i.
 
-Most modern architectures can be classified as general purpose register-memory architectures where operands are explicitly specified, operations can be performed on memory operands, as well as registers, and memory is accessed only using load and store instructions. In addition to providing the basic functions in the ISA such as load, store, control and scalar arithmetic operations using integers and floating-point, the widely deployed architectures continue to enhance their ISA to support new computing paradigms. These include enhanced vector processing instructions (e.g., Intel AVX2, AVX512, ARM SVE) and matrix/tensor instructions (Intel AMX). Software mapped to use these advanced instructions typically provide orders of magnitude improvement in performance. 
+Most modern architectures, such as RISC-V and ARM, can be classified as general-purpose register-based, load-store architectures, where operands are explicitly specified and memory is accessed only using load and store instructions. The x86 ISA is a register-memory architecture, where operations can be performed on memory operands as well as registers. In addition to providing the basic functions of an ISA, such as load, store, control, and scalar arithmetic operations on integers and floating-point values, the widely deployed architectures continue to enhance their ISAs to support new computing paradigms. These include enhanced vector processing instructions (e.g., Intel AVX2, AVX512, ARM SVE, RISC-V "V" vector extension) and matrix/tensor instructions (Intel AMX, ARM SME). Software mapped to use these advanced instructions typically provides orders of magnitude improvement in performance.
 
 Modern CPUs support 32-bit and 64-bit precision for arithmetic operations. With the fast-evolving field of machine learning and AI, the industry has a renewed interest in alternative numeric formats for variables to drive significant performance improvements. Research has shown that machine learning models perform just as good, using fewer bits to represent the variables, saving on both compute and memory bandwidth. As a result, several CPU franchises have recently added support for lower precision data types such as 8-bit integers (int8, e.g., Intel VNNI), 16-bit floating-point (fp16, bf16) in the ISA, in addition to the traditional 32-bit and 64-bit formats for arithmetic operations.
\ No newline at end of file

From 438b56b88beae1071cfe29e55f0a8449c22d65e0 Mon Sep 17 00:00:00 2001
From: Peter Veentjer <alarmnummer@gmail.com>
Date: Tue, 16 Apr 2024 05:17:56 +0300
Subject: [PATCH 4/5] Moved the load-store architecture improvement to uops
 chapter

---
 chapters/4-Terminology-And-Metrics/4-4 UOP.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/chapters/4-Terminology-And-Metrics/4-4 UOP.md b/chapters/4-Terminology-And-Metrics/4-4 UOP.md
index 28d76874f7..b12a4cad8d 100644
--- a/chapters/4-Terminology-And-Metrics/4-4 UOP.md	
+++ b/chapters/4-Terminology-And-Metrics/4-4 UOP.md	
@@ -4,7 +4,7 @@ typora-root-url: ..\..\img
 
 ## Micro-ops {#sec:sec_UOP}
 
-Microprocessors with the x86 architecture translate complex CISC-like instructions into simple RISC-like microoperations, abbreviated as $\mu$ops or $\mu$ops. A simple addition instruction such as `ADD rax, rbx` generates only one $\mu$op, while a more complex instruction like `ADD rax, [mem]` may generate two: one for reading from the `mem` memory location into a temporary (un-named) register, and one for adding it to the `rax` register. The instruction `ADD [mem], rax` generates three $\mu$ops: one for reading from memory, one for adding, and one for writing the result back to memory.
+Microprocessors with the x86 architecture translate complex CISC-like instructions into simple RISC-like microoperations, abbreviated as $\mu$ops or $\mu$ops. Even though x86 ISA is a register-memory architecture, after $\mu$ops conversion, it changes into a load-store architecture since every load/store to memory goes through a load/store instruction. A simple addition instruction such as `ADD rax, rbx` generates only one $\mu$op, while a more complex instruction like `ADD rax, [mem]` may generate two: one for loading from the `mem` memory location into a temporary (un-named) register, and one for adding it to the `rax` register. The instruction `ADD [mem], rax` generates three $\mu$ops: one for loading from memory, one for adding, and one for storing the result back to memory.
 
 The main advantage of splitting instructions into micro operations is that $\mu$ops can be executed:
 

From 6d5d6592df2131fc2f54d90b9db9e62adf530a10 Mon Sep 17 00:00:00 2001
From: Denis Bakhvalov <dendibakh@gmail.com>
Date: Tue, 30 Apr 2024 12:19:11 -0400
Subject: [PATCH 5/5] reorder sentences

---
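The decomposition described in the reworded paragraph can be sketched as a toy Python model. This is a minimal illustration under stated assumptions, not a description of how any real decoder works: the `Uop` class, the `decompose_add` helper, and the `tmp` register name are hypothetical, and actual $\mu$op counts vary between microarchitectures.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Uop:
    """Hypothetical micro-op: kind is 'load', 'add', or 'store'."""
    kind: str
    dst: str
    src: str


def is_mem(operand: str) -> bool:
    """Treat an operand written as [name] as a memory operand."""
    return operand.startswith("[") and operand.endswith("]")


def decompose_add(dst: str, src: str) -> List[Uop]:
    """Toy decomposition of `ADD dst, src` into load-store style uops."""
    if is_mem(dst) and is_mem(src):
        # x86 ADD does not accept two memory operands.
        raise ValueError("ADD cannot take two memory operands")
    if is_mem(src):
        # ADD reg, [mem]: load into a temporary, then add.
        return [Uop("load", "tmp", src), Uop("add", dst, "tmp")]
    if is_mem(dst):
        # ADD [mem], reg: load, add, then store the result back.
        return [
            Uop("load", "tmp", dst),
            Uop("add", "tmp", src),
            Uop("store", dst, "tmp"),
        ]
    # ADD reg, reg: a single arithmetic uop.
    return [Uop("add", dst, src)]


if __name__ == "__main__":
    for operands in [("rax", "rbx"), ("rax", "[mem]"), ("[mem]", "rax")]:
        kinds = [u.kind for u in decompose_add(*operands)]
        print("ADD {}, {} -> {}".format(operands[0], operands[1], kinds))
```

In this model only `load` and `store` uops ever name a memory operand, which is the load-store property the paragraph refers to.
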
 chapters/4-Terminology-And-Metrics/4-4 UOP.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/chapters/4-Terminology-And-Metrics/4-4 UOP.md b/chapters/4-Terminology-And-Metrics/4-4 UOP.md
index b12a4cad8d..2038adc064 100644
--- a/chapters/4-Terminology-And-Metrics/4-4 UOP.md	
+++ b/chapters/4-Terminology-And-Metrics/4-4 UOP.md	
@@ -4,7 +4,7 @@ typora-root-url: ..\..\img
 
 ## Micro-ops {#sec:sec_UOP}
 
-Microprocessors with the x86 architecture translate complex CISC-like instructions into simple RISC-like microoperations, abbreviated as $\mu$ops or $\mu$ops. Even though x86 ISA is a register-memory architecture, after $\mu$ops conversion, it changes into a load-store architecture since every load/store to memory goes through a load/store instruction. A simple addition instruction such as `ADD rax, rbx` generates only one $\mu$op, while a more complex instruction like `ADD rax, [mem]` may generate two: one for loading from the `mem` memory location into a temporary (un-named) register, and one for adding it to the `rax` register. The instruction `ADD [mem], rax` generates three $\mu$ops: one for loading from memory, one for adding, and one for storing the result back to memory.
+Microprocessors with the x86 architecture translate complex CISC-like instructions into simple RISC-like microoperations, abbreviated as $\mu$ops. A simple addition instruction such as `ADD rax, rbx` generates only one $\mu$op, while a more complex instruction like `ADD rax, [mem]` may generate two: one for loading from the `mem` memory location into a temporary (unnamed) register, and one for adding it to the `rax` register. The instruction `ADD [mem], rax` generates three $\mu$ops: one for loading from memory, one for adding, and one for storing the result back to memory. Even though the x86 ISA is a register-memory architecture, after conversion to $\mu$ops the processor internally operates as a load-store machine, since memory is accessed only via load and store $\mu$ops.
 
 The main advantage of splitting instructions into micro operations is that $\mu$ops can be executed: