Releases: dendibakh/perf-book
Second Edition
The second edition of "Performance Analysis and Tuning on Modern CPUs".
This book teaches you how to implement low-level optimizations for modern CPUs using advanced methods and tools. After reading this book you will be able to root cause performance regressions and find optimization opportunities in your code.
The second edition has been updated with comprehensive case studies and hands-on exercises. It also expands coverage of AMD and ARM-based architectures.
Buy a paper/Kindle version on Amazon.
Q2.2024
New Content
- Chapter 13 "Optimizing Multithreaded Applications" - major rewrite of the chapter. I added new sections about Thread Count Scalability and Task Scheduling, and updated the remaining parts of this chapter.
- Chapter 8 "Optimizing Memory Accesses" - added a section about field reordering in data structures. (5e80a16)
- More proofreading fixes throughout the book.
- I'm currently working on the last big piece of content for the second edition. This will be a section in chapter 12 titled "CPU-specific optimizations", where I will touch on some aspects of optimizing for a specific platform. It will cover topics such as ISA extensions, instruction latencies and throughput, and some common microarchitecture-specific issues. (#56)
Pull requests:
@pveentjer #44 #45 #47 #51
@cf-natali #46 #48
Full Changelog: Q1.2024...Q2.2024
Q1.2024
New Content
- A case study about L3 cache sensitivity, contributed by @chusAB (chapter 12, pull request #39). It shows how you can determine whether an application is sensitive to the size of the last-level cache (LLC). Using this information, you can make educated decisions when buying hardware components for your computing systems. In the same way, you can determine sensitivity to other factors, such as memory bandwidth, core count, and processor frequency.
- I wrote a section about how to measure hot code footprint (chapter 11, commit 2183eda). Applications with large amounts of hot code usually put pressure on the CPU front end (I-cache and TLBs). Knowing how many cache lines or pages of a program's code are hot can be an additional argument for investing time into machine code layout optimizations. Thanks to @aaupov for his review and comments.
- I wrote a new section about memory profiling (chapter 7, pull request #27). It discusses how to measure memory usage (VSZ and RSS), how to analyze heap allocations, and more.
- I've made some big updates to chapter 8 "Optimizing Memory Accesses". I wrote about some new data structure reorganization techniques that were not present in the first edition. I also improved two sections: one about dynamic memory allocation and one about what to do when you hit a memory bandwidth limit.
- I have fixed ~10 TODOs. There are still ~60 items left.
- I fixed many proofreading comments (thanks to Ciaran).
Full Changelog: Q4.2023...Q1.2024
Q4.2023
Release notes:
- Low-latency techniques (#33, authored by Mark Dawson)
- New section about data dependency chains
- Updated the chapter about Front-End bound optimizations (now with pretty images), expanded the section about PGO and BOLT (thanks to @aaupov).
- Major update of the PMU chapter, including performance monitoring features of AMD and ARM-based processors (WIP).
- A LOT of proofreading comments (thanks to Ciaran).
Q3.2023
Release notes:
- PRs merged:
  - Hybrid profilers, Tracy by @theWatchmen (#19)
  - Updated SIMD sections by @jan-wassenberg (#21)
  - Continuous profiling by Mark Dawson (#26)
  - A few small updates by @cf-natali (#22, #23, #25)
- Finished chapter 7 "Overview Of Performance Analysis Tools" (+AMD uProf, +Xcode Instruments, +flamegraphs)
- Many changes in chapters 1-5
  - Chapters 1 and 2: mostly cosmetics
  - Chapter 3: TLB hierarchy, store optimizations
  - Chapter 4: major updates for sections about UOPs, IPC, pipeline slots
  - Chapter 5: many updates for sections on sampling, static performance analysis (+uiCA), and compiler optimization reports
- Updated a section about FP subnormals
Q2.2023
Release notes:
- Two PRs merged:
- Major update to chapter 3 "CPU Microarchitecture"
  - DRAM rank, channels, interleaving
  - Multicore, SMT, and Hybrid CPUs
  - Branch prediction section
  - Updated section "Modern CPU design" (Skylake -> Goldencove), deep dive into CPU Front-End, Back-End, Load-Store unit, and TLB hierarchy
- Added questions and exercises throughout the book.
- Major rewrite of section 6.1 Code Instrumentation.
- Updated intro for the second part (section 9.0).
- Split the former chapter 9 BackendBound into two chapters: chapter 9 MemoryBound and chapter 10 CoreBound.
Q1 2023 release
The book has several updated chapters and sections:
- performance metrics (major update): secondary metrics, memory latency and bandwidth, case study
- overview of performance analysis tools (new chapter, WIP)
- huge pages (several places throughout the book)
- software memory prefetching
- unroll and jam
- draft of the cover image
First edition of the book
This is the first edition of the book, published in November 2020.
PDF can also be found on this page: https://book.easyperf.net/perf_book.