Lally/marker api #17

dendibakh · 2023-02-02T15:30:10Z

No description provided.

dendibakh

Thanks, Lally, good start. I posted some comments, let me know...

dendibakh · 2023-02-02T15:30:58Z

chapters/5-Power-And-Performance/5-1 Code Instrumentation.md

+system call. These atomic bundles are very useful. 
+
+You can compensate for the scheduler by requesting CPU cycles consumed (e.g.,
+`UNHALTED_CORE_CYCLES`) and comparing against wall-clock time to detect when the


Not easy since the frequency may fluctuate during that time.

dendibakh · 2023-02-02T15:33:25Z

chapters/5-Power-And-Performance/5-1 Code Instrumentation.md

+comparing CPU cycles (`UNHALTED_CORE_CYCLES`) vs the fixed-frequency reference clock
+(`UNHALTED_REFERENCE_CYCLES`).
+
+You can also correlate the CPU-utilization of your code (IPC:


CPU utilization and IPC are different things. But you probably meant "... correlate low IPC to ... LLC_MISSES".
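To make the distinction concrete, here is a minimal sketch of the per-region arithmetic, assuming the four counters named in the diff have already been read at the start and at the end of the region. The struct and function names are hypothetical (they are not part of libpfm4 or of the book's example), and "misses per thousand instructions" is only one illustrative way to set `LLC_MISSES` next to IPC.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical snapshot of four counters; field names are illustrative. */
struct counter_snapshot {
    uint64_t instructions; /* INSTRUCTIONS_RETIRED */
    uint64_t core_cycles;  /* UNHALTED_CORE_CYCLES */
    uint64_t ref_cycles;   /* UNHALTED_REFERENCE_CYCLES */
    uint64_t llc_misses;   /* LLC_MISSES */
};

struct region_metrics {
    double ipc;                       /* instructions per core cycle */
    double core_to_ref_ratio;         /* core cycles per reference cycle */
    double llc_misses_per_kilo_instr; /* LLC misses per 1000 instructions */
};

/* Derive metrics for the code that ran between two snapshots.
   A metric whose denominator is zero is reported as 0.0. */
struct region_metrics region_metrics_between(const struct counter_snapshot *begin,
                                             const struct counter_snapshot *end) {
    assert(end->instructions >= begin->instructions);
    assert(end->core_cycles >= begin->core_cycles);
    assert(end->ref_cycles >= begin->ref_cycles);
    assert(end->llc_misses >= begin->llc_misses);

    uint64_t instr = end->instructions - begin->instructions;
    uint64_t cycles = end->core_cycles - begin->core_cycles;
    uint64_t ref = end->ref_cycles - begin->ref_cycles;
    uint64_t misses = end->llc_misses - begin->llc_misses;

    struct region_metrics m = {0.0, 0.0, 0.0};
    if (cycles > 0)
        m.ipc = (double)instr / (double)cycles;
    if (ref > 0)
        m.core_to_ref_ratio = (double)cycles / (double)ref;
    if (instr > 0)
        m.llc_misses_per_kilo_instr = 1000.0 * (double)misses / (double)instr;
    return m;
}
```

A core-to-reference ratio that differs between runs suggests the core frequency changed during the region, which is why raw cycle counts are hard to compare against wall-clock time.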

dendibakh · 2023-02-02T15:36:31Z

chapters/5-Power-And-Performance/5-1 Code Instrumentation.md

+and can be very slow.
+
+
+### Using Marker APIs for Attribution


After reading your PR, I think the better place for this section would be after we discuss sampling, i.e. 5.5, not here. Sorry about that, but you can leave it here, I will relocate it later.

dendibakh · 2023-02-02T15:47:20Z

chapters/5-Power-And-Performance/5-1 Code Instrumentation.md

+adding the instrumentation.  For example, if you have a loop:
+
+
+```c


I would like to request a few things here:

I, as a reader, would like to know how to set up a marker API (showing the events I want to collect), how to decorate the code region, and how to capture the data. It can all be combined in a single example.

Can we use some cross-platform marker API tool? Modern C++ developers are not very much into reading POSIX API calls. :) For example, PAPI, libpfm4, or likwid. It must be a project that is maintained and that supports the latest AMD Zen4 and Intel Alder Lake chips. Let me know if you can find one.

This example is too large for the book. We need to reduce it somehow.

lally · 2023-03-07

Yeah that's also planned. So how about a separate tutorial section on using libpfm4 (it's the only one I've used) and shrink the example. I can swap out the error handling for assert(). If I move the "Getting the data out of process" section above, perhaps between it and the libpfm section, we can use the struct decl as a running example.

dendibakh · 2023-03-10

I think there is no need for a separate section. Let's just show the idea. People will figure out the details.
I would just focus on how to set up and use libpfm4 APIs in a real benchmark.

Here is the list of benchmarks that you can use:
https://openbenchmarking.org/test/pts/c-ray
https://openbenchmarking.org/test/pts/simdjson
https://openbenchmarking.org/test/pts/primesieve
https://openbenchmarking.org/test/pts/stockfish
https://openbenchmarking.org/test/pts/cryptopp

You can click on "View Source", there you will find instructions on how to build and run the benchmark. Choose the one you like. :)
C-ray may be the simplest one. We just need something for demonstration purposes. We can stick with a single-threaded app, or run with only one HW thread.

dendibakh · 2023-02-02T15:51:34Z

chapters/5-Power-And-Performance/5-1 Code Instrumentation.md

+
+
+% Getting the data out of process
+ - Really, a big-ass `fwrite(3)` with a big-healthy `setbuf(3)` call works pretty well.


big-ass

Ha-ha. :) please let's avoid those words.

lally · 2023-03-07

Ah, yeah these were notes to myself :-)
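For reference, the idea in that note (one large buffered write, plus a sequence number per record) could look roughly like the sketch below. It is an illustration only: `marker_record` and the three helpers are hypothetical names, and it uses `setvbuf`, the size-taking sibling of `setbuf` documented on the same `setbuf(3)` man page.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical record emitted at each marker; the layout is illustrative. */
struct marker_record {
    uint64_t seqno;        /* increases by 1 per record, so drops are visible */
    uint64_t cycles;
    uint64_t instructions;
};

/* Give the stream a large, fully buffered user-space buffer so that most
   fwrite calls only copy memory. Must be called before any other I/O on
   the stream, and buf must outlive the stream. Returns 0 on success. */
int use_large_buffer(FILE *out, char *buf, size_t buf_size) {
    assert(out != NULL && buf != NULL);
    return setvbuf(out, buf, _IOFBF, buf_size);
}

/* Write records in a single fwrite call; returns how many were written. */
size_t write_records(FILE *out, const struct marker_record *records, size_t count) {
    return fwrite(records, sizeof records[0], count, out);
}

/* Count places where consecutive sequence numbers are not adjacent. */
size_t count_seqno_gaps(const struct marker_record *records, size_t count) {
    size_t gaps = 0;
    for (size_t i = 1; i < count; ++i) {
        if (records[i].seqno != records[i - 1].seqno + 1)
            ++gaps;
    }
    return gaps;
}
```

A reader of the dump can run `count_seqno_gaps` over the records to notice when a ring buffer was overwritten before it reached disk.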

dendibakh · 2023-02-02T15:55:32Z

chapters/5-Power-And-Performance/5-1 Code Instrumentation.md

+ - Really, a big-ass `fwrite(3)` with a big-healthy `setbuf(3)` call works pretty well.
+ - Mention IPC techniques or ring-buffer snapshotting to disk to minimize overhead/jitter.  Use a seqno.
+
+% Aggregate Statistics Solutions


It would be nice to have an example of data visualization for a real-world workload.
For example:
"We used a popular benchmark A from test suite B and inserted a marker API around the top hotspot in that benchmark. Here is the summarized data that we collected using the method described earlier..."
I can help with the benchmark selection.

lally · 2023-03-07

Yeah I'll take any suggestions on that. The easier to modify/build/run the better :)

dendibakh · 2023-02-02T15:56:05Z

chapters/5-Power-And-Performance/5-1 Code Instrumentation.md

+   them (at least 3, but get more for better variety of values), you can estimate $x_1, x_2, x_3$ by the estimate $M`$` 
+   you get from the SVD pseudoinverse. 
+
+


Also, it would be nice to list possible tools that developers can use.

lally · 2023-03-08

Indeed. A small example with numpy's pinv() should help.


dendibakh · 2023-04-12T11:17:52Z

chapters/5-Power-And-Performance/5-1 Code Instrumentation.md

+   double value = uintptr_t(argv) % 50000;  // Avoid compiler precomputation
+   int event_fd;
+   perf_event_attr  perf_attr{};
+   perf_attr.size = sizeof(perf_event_attr);


I tried it on the C-ray benchmark and I had to add the following line to make it work:
memset(&perf_attr, 0, sizeof(perf_attr));
Otherwise, I was seeing some garbage numbers.

dendibakh · 2023-04-12T11:19:39Z

chapters/5-Power-And-Performance/5-1 Code Instrumentation.md

+package is useful here, as it adds both a discovery tool for identifying
+available events on your CPU's PMU, and a wrapper library around the raw `perf_event_open(2)`
+system call.  Here's an example that (poorly) benchmarks `sqrt(3)`:
+```c


Thank you, this is the example I was looking for.

dendibakh · 2023-04-12T11:20:58Z

chapters/5-Power-And-Performance/5-1 Code Instrumentation.md


 The difference in event counts over a region of code is the number of events
 that occurred during that region, in that thread. That difference divided by the
 time taken is the event rate. You can ask for up to 3 counters simultaneously in
 one file descriptor. They will available atomically under the same `read(2)`
 system call. These atomic bundles are very useful. 

+Repeatedly running the example above (you will probably have to 0 to `/proc/sys/kernel/perf_event_paranoid` to 


Also, please add that you can run it as normal, and sudo is not required.
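As a hedged aside, a program can check the current setting itself before trying to open counters. The sketch below only reads the sysctl file named in the diff; the function name is hypothetical, and how a given level maps to "works without sudo" depends on the kernel and distribution, so it deliberately does not interpret the value beyond noting that lower values are less restrictive.

```c
#include <assert.h>
#include <stdio.h>

/* Read the current perf_event_paranoid level into *level.
   Returns 0 on success, -1 if the file is missing or unreadable
   (for example, on a non-Linux system). */
int read_perf_event_paranoid(int *level) {
    assert(level != NULL);
    FILE *f = fopen("/proc/sys/kernel/perf_event_paranoid", "r");
    if (f == NULL)
        return -1;
    int parsed = (fscanf(f, "%d", level) == 1);
    fclose(f);
    return parsed ? 0 : -1;
}
```

Reading the file needs no special privileges; changing it (for example, writing 0 as the text suggests) requires root.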

dendibakh · 2023-04-12T14:42:05Z

chapters/5-Power-And-Performance/5-1 Code Instrumentation.md

-(e.g., `LLC_MISSES`).
+You can also correlate low IPC (`INSTRUCTIONS_RETIRED/UNHALTED_CORE_CYCLES`) to
+events you believe dominate it (e.g., `LLC_MISSES`).
+


@lally, I think we can stop after this line and just briefly summarize your ideas on how to aggregate data. I think readers are capable of writing that aggregation logic themselves. :) Our goal here is just to show the idea of what's possible with marker APIs, and the example above (that showcases pfm_get_os_event_encoding + perf_event_open + read) is good enough.
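For the `read` step, a small sketch of what can be done with the returned buffer may help. It assumes a two-event group opened with `PERF_FORMAT_TOTAL_TIME_ENABLED | PERF_FORMAT_TOTAL_TIME_RUNNING | PERF_FORMAT_GROUP`, for which the `perf_event_open(2)` man page documents the layout mirrored below. The helper names are hypothetical, and the example works on already-read buffers, so it makes no system calls.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { NUM_EVENTS = 2 }; /* assumption: two events in the group */

/* Buffer layout for a group read with the flags named above. */
struct read_format {
    uint64_t nr;           /* number of events in the group */
    uint64_t time_enabled; /* ns the group was enabled */
    uint64_t time_running; /* ns the group was actually on the PMU */
    uint64_t values[NUM_EVENTS];
};

/* If the kernel multiplexed the group (time_running < time_enabled),
   estimate the count it would have reached had it run the whole time. */
double scaled_count(const struct read_format *r, size_t index) {
    assert(index < NUM_EVENTS && index < r->nr);
    if (r->time_running == 0)
        return 0.0; /* the group never ran, so there is nothing to scale */
    return (double)r->values[index] * (double)r->time_enabled /
           (double)r->time_running;
}

/* Events of one kind that occurred between two reads of the same group. */
uint64_t event_delta(const struct read_format *begin,
                     const struct read_format *end, size_t index) {
    assert(index < NUM_EVENTS);
    assert(end->values[index] >= begin->values[index]);
    return end->values[index] - begin->values[index];
}
```

Because both values come from one `read(2)` call, the two counts in a buffer refer to the same interval, which is what makes ratios such as IPC meaningful.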

dendibakh · 2023-04-14T10:56:40Z

chapters/5-Power-And-Performance/5-1 Code Instrumentation.md

+   perf_attr.read_format = PERF_FORMAT_TOTAL_TIME_ENABLED | PERF_FORMAT_TOTAL_TIME_RUNNING | PERF_FORMAT_GROUP;
+
+   pfm_perf_encode_arg_t perf_setup{};
+   perf_setup.size = sizeof(pfm_perf_encode_arg_t);


BTW, the memset here is also needed :)

memset(&perf_setup, 0, sizeof(pfm_perf_encode_arg_t));

lally · 2023-04-14T12:24:25Z

Odd, the {} should zero-initialize that, and works on my machine. I wonder if it's a compiler version thing.


dendibakh · 2023-04-14T13:50:40Z

> Odd, the {} should zero-initialize that, and works on my machine. I wonder if it's a compiler version thing.

Ah, OK, I see what's going on now.
I compiled it with GCC 11.3.0, and it complained (because it's C-compilation, not C++) about {}:

c-ray-f-mod.c:247:37: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘{’ token
  247 |     pfm_perf_encode_arg_t perf_setup{};
      |                                     ^
c-ray-f-mod.c:249:5: error: ‘perf_setup’ undeclared (first use in this function)
  249 |     perf_setup.size = sizeof(pfm_perf_encode_arg_t);
      |     ^~~~~~~~~~

That's why I deleted it but didn't add memset. :)

lally · 2023-04-14T14:12:39Z

Ah. It's a c++-ism.
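To spell out the portable form: `T x{};` is C++ syntax, so a file compiled as C needs either `= {0}` or an explicit `memset`. The sketch below shows the `memset` variant on a hypothetical stand-in struct (`example_attr` is not a real kernel or libpfm4 type); the same pattern applies to `perf_event_attr` and `pfm_perf_encode_arg_t`.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a size-versioned struct such as perf_event_attr. */
struct example_attr {
    uint32_t type;
    uint32_t size;
    uint64_t config;
    uint64_t read_format;
};

/* Zero every byte, then record the struct size, in a way that compiles as
   both C and C++. In C++ `struct example_attr a{};` would also zero it;
   in C the equivalent declaration is `struct example_attr a = {0};`. */
void init_example_attr(struct example_attr *attr) {
    assert(attr != NULL);
    memset(attr, 0, sizeof *attr);
    attr->size = (uint32_t)sizeof *attr;
}
```

Skipping this step leaves the remaining fields with indeterminate stack contents, which is consistent with the garbage numbers reported earlier in this thread.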


dendibakh · 2023-05-16T14:55:28Z

Merged with #20.

Lally Singh added 2 commits January 3, 2023 07:02

Skeleton for 5.1.1

7a5b122

Partial draft of Marker API.

82f74c0

dendibakh and others added 27 commits April 7, 2023 12:06

Update new_toc.md

6ed0a59

Renamed Optimizing TLB sections

c2e2e52

Expanded on page tables, page faults, page walks

e403524

Fixed image reference

aa4f201

Continue with virtual memory

7681b02

Described huge pages

b47eadb

Improved ITLB section

94d9901

Added appendix C

32c1b4c

Working on DTLB misses section

435e066

Continue with Huge Pages

2e41237

Finished Huge Pages

6631d2c

Finished Huge Pages section

8558b7c

[NFC] Split chapter 3 one big source md file into multiple files

4c823c8

Added exercises for huge pages

1cf735a

Start working on unroll and jam

07a362a

Fixed an issue in FE ITLB section

2015bdf

Fixed section reference

00b5851

Continued unroll and jam

1d22029

Fixed mapping error

802e639

Spotted by @magras

Finished Unroll and Jam

a2c1103

Working on SW memory prefetching

6b4db30

Finished SW mem prefetching

0d94046

Fixed minor issue in SW mem prefetching

e395324

Renamed chapter on CPU FrontEnd optimizations

8286e95

Fixed minor issue

3415ddf

Reordered chapters 9-12

0b6f9c6

Fixed one TODO

a6c93cd

dbakhval and others added 21 commits April 7, 2023 12:06

Started Perf Tools chapter

31801b7

Updated moore's law image

3bd9204

Added sections for the pnp chapter

7f1bd6f

Working on Intel Vtune section...

c9e3744

Working on Intel Vtune section...

adae5e6

Finished Intel Vtune section

66c161c

Updated new TOC

fe113d2

Draft of AMD uprof section

cd508f2

written by Swarup Sahoo

Updated cover draft

9940f31

Started updating chapter 4 - metrics

c032e09

Working on chapter 4

0732633

Added a table with secondary perf metrics

821a794

Finished describing performance metrics

0e75147

Working on chapter 4 case study

8aa20fd

Small update to the new TOC

45b3431

Working on chapter 4 case study

17b96ba

Added draft watermark on the cover

4519b2c

Added regions to the Vtune timeline screenshot

bf93cd5

Continue working on case study for chapter 4

08e6eb2

Added perf_event example.

08c6085

Added compile flag comment.

bac1444

dendibakh mentioned this pull request Apr 13, 2023

Lally/marker api #20

Merged

dendibakh closed this May 16, 2023
