Lally/marker api #17

Closed
wants to merge 57 commits into from

dendibakh
Owner

No description provided.

Owner Author

@dendibakh dendibakh left a comment

Thanks, Lally, good start. I posted some comments, let me know...

system call. These atomic bundles are very useful.

You can compensate for the scheduler by requesting CPU cycles consumed (e.g.,
`UNHALTED_CORE_CYCLES`) and comparing against wall-clock time to detect when the
Owner Author

Not easy since the frequency may fluctuate during that time.

comparing CPU cycles (`UNHALTED_CORE_CYCLES`) vs the fixed-frequency reference clock
(`UNHALTED_REFERENCE_CYCLES`).

You can also correlate the CPU-utilization of your code (IPC:
Owner Author

CPU utilization and IPC are different things. But you probably meant "... correlate low IPC to ... LLC_MISSES".

and can be very slow.


### Using Marker APIs for Attribution
Owner Author

After reading your PR, I think the better place for this section would be after we discuss sampling, i.e. 5.5, not here. Sorry about that, but you can leave it here, I will relocate it later.

adding the instrumentation. For example, if you have a loop:


Owner Author

I would like to request a few things here:

  1. As a reader, I would like to know how to set up the marker API (specifying the events I want to collect), how to decorate the code region, and how to capture the data. This can all be combined in a single example.
  2. Can we use a cross-platform marker API tool? Modern C++ developers are not keen on reading POSIX API calls. :) Candidates: PAPI, libpfm4, likwid. It must be a maintained project that supports the latest AMD Zen 4 and Intel Alder Lake chips. Let me know if you can find one.
  3. This example is too large for the book. We need to reduce it somehow.


Yeah, that's also planned. So how about a separate tutorial section on using libpfm4 (it's the only one I've used) and shrinking the example? I can swap out the error handling for assert(). If I move the "Getting the data out of process" section above, perhaps between it and the libpfm section, we can use the struct decl as a running example.

Owner Author

I think there is no need for a separate section. Let's just show the idea. People will figure out the details.
I would just focus on how to set up and use libpfm4 APIs in a real benchmark.

Here is the list of benchmarks that you can use:
https://openbenchmarking.org/test/pts/c-ray
https://openbenchmarking.org/test/pts/simdjson
https://openbenchmarking.org/test/pts/primesieve
https://openbenchmarking.org/test/pts/stockfish
https://openbenchmarking.org/test/pts/cryptopp

You can click on "View Source", there you will find instructions on how to build and run the benchmark. Choose the one you like. :)
C-ray may be the simplest one. We just need something for demonstration purposes. We can stick with a single-threaded app, or run with only one HW thread.



% Getting the data out of process
- Really, a big-ass `fwrite(3)` with a big-healthy `setbuf(3)` call works pretty well.
Owner Author

big-ass

Ha-ha. :) Please, let's avoid those words.


Ah, yeah these were notes to myself :-)

- Really, a big-ass `fwrite(3)` with a big-healthy `setbuf(3)` call works pretty well.
- Mention IPC techniques or ring-buffer snapshotting to disk to minimize overhead/jitter. Use a seqno.
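
One way the ring-buffer-with-seqno note above could look, as a minimal single-threaded C sketch. Everything here (`RING_CAPACITY`, `struct record`, `ring_push`, `ring_get`) is a hypothetical name chosen for illustration, and the sketch only covers the in-memory side: the sequence number lets a consumer tell whether a record it wants has already been overwritten. Flushing to disk and any cross-thread synchronization are left out.

```c
#include <assert.h>
#include <stdint.h>

#define RING_CAPACITY 8u /* hypothetical size; must be a power of two */

struct record {
  uint64_t seqno;  /* monotonically increasing sequence number */
  uint64_t cycles; /* example payload: a counter delta for one region */
};

struct ring {
  struct record slots[RING_CAPACITY];
  uint64_t next_seqno; /* seqno the next push will use */
};

/* Append a sample, overwriting the oldest slot once the ring is full. */
void ring_push(struct ring *r, uint64_t cycles) {
  assert(r != 0);
  struct record *slot = &r->slots[r->next_seqno & (RING_CAPACITY - 1u)];
  slot->seqno = r->next_seqno++;
  slot->cycles = cycles;
}

/* Copy out the record with the given seqno.
 * Returns 1 if it is still in the ring, 0 if it was never written
 * or has already been overwritten by a newer record. */
int ring_get(const struct ring *r, uint64_t seqno, struct record *out) {
  assert(r != 0 && out != 0);
  if (seqno >= r->next_seqno) return 0;                /* not written yet */
  if (r->next_seqno - seqno > RING_CAPACITY) return 0; /* overwritten */
  *out = r->slots[seqno & (RING_CAPACITY - 1u)];
  return 1;
}
```

A consumer that remembers the last seqno it saw can detect dropped samples by checking whether `ring_get` returns 0 for the next one.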

% Aggregate Statistics Solutions
Owner Author

It would be nice to have an example of data visualization for a real-world workload.
For example:
"We used a popular benchmark A from test suite B and inserted a marker API around the top hotspot in that benchmark. Here is the summarized data that we collected using the method described earlier..."
I can help with the benchmark selection.


Yeah I'll take any suggestions on that. The easier to modify/build/run the better :)

them (at least 3, but get more for better variety of values), you can estimate $x_1, x_2, x_3$ by the estimate $M`$`
you get from the SVD pseudoinverse.


Owner Author

Also, it would be nice to list possible tools that developers can use.


Indeed. A small example with numpy's pinv() should help.
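
As a stand-in for that kind of example, here is a rough C sketch of the same estimate. It is not the SVD pseudoinverse: it solves the normal equations $(A^T A)x = A^T b$ with Gauss-Jordan elimination, which gives the same $x$ as the pseudoinverse only when the measurement matrix has full column rank, and is less numerically robust. The model is an assumption made for illustration: each measured region contributes one row of three event counts to `a` and one observed total to `b`, and `least_squares3` is a hypothetical helper name.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_UNKNOWNS 3

/* Least-squares fit of x in A x ~= b via the normal equations.
 * a: `rows` x 3 matrix in row-major order; b: `rows` observations.
 * Returns 0 on success, -1 if A^T A is (numerically) singular. */
int least_squares3(const double *a, const double *b, size_t rows,
                   double x[NUM_UNKNOWNS]) {
  assert(rows >= NUM_UNKNOWNS);
  double m[NUM_UNKNOWNS][NUM_UNKNOWNS + 1] = {{0}};

  /* Build the augmented matrix [A^T A | A^T b]. */
  for (size_t r = 0; r < rows; ++r) {
    for (int i = 0; i < NUM_UNKNOWNS; ++i) {
      for (int j = 0; j < NUM_UNKNOWNS; ++j)
        m[i][j] += a[r * NUM_UNKNOWNS + i] * a[r * NUM_UNKNOWNS + j];
      m[i][NUM_UNKNOWNS] += a[r * NUM_UNKNOWNS + i] * b[r];
    }
  }

  /* Gauss-Jordan elimination with partial pivoting. */
  for (int col = 0; col < NUM_UNKNOWNS; ++col) {
    int pivot = col;
    for (int r = col + 1; r < NUM_UNKNOWNS; ++r) {
      double cand = m[r][col] < 0 ? -m[r][col] : m[r][col];
      double best = m[pivot][col] < 0 ? -m[pivot][col] : m[pivot][col];
      if (cand > best) pivot = r;
    }
    double p = m[pivot][col];
    if ((p < 0 ? -p : p) < 1e-12) return -1; /* rank-deficient input */
    for (int k = 0; k <= NUM_UNKNOWNS; ++k) {
      double tmp = m[col][k];
      m[col][k] = m[pivot][k];
      m[pivot][k] = tmp;
    }
    for (int r = 0; r < NUM_UNKNOWNS; ++r) {
      if (r == col) continue;
      double f = m[r][col] / m[col][col];
      for (int k = col; k <= NUM_UNKNOWNS; ++k) m[r][k] -= f * m[col][k];
    }
  }

  /* The matrix is now diagonal; read off the solution. */
  for (int i = 0; i < NUM_UNKNOWNS; ++i) x[i] = m[i][NUM_UNKNOWNS] / m[i][i];
  return 0;
}
```

With more rows than unknowns the fit averages out measurement noise; when the counts are nearly collinear, a real SVD-based `pinv()` is the safer tool.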

double value = uintptr_t(argv) % 50000; // Avoid compiler precomputation
int event_fd;
perf_event_attr perf_attr{};
perf_attr.size = sizeof(perf_event_attr);
Owner Author

I tried it on the C-ray benchmark and had to add the following line to make it work:
memset(&perf_attr, 0, sizeof(perf_attr));
Otherwise, I was seeing garbage numbers.

package is useful here, as it adds both a discovery tool for identifying
available events on your CPU's PMU, and a wrapper library around the raw `perf_event_open(2)`
system call. Here's an example that (poorly) benchmarks `sqrt(3)`:
Owner Author

Thank you, this is the example I was looking for.


The difference in event counts over a region of code is the number of events
that occurred during that region, in that thread. That difference divided by the
time taken is the event rate. You can ask for up to 3 counters simultaneously in
one file descriptor. They will be available atomically under the same `read(2)`
system call. These atomic bundles are very useful.
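
To make the delta-and-rate arithmetic concrete, here is a minimal C sketch. It assumes the group was opened with `PERF_FORMAT_GROUP | PERF_FORMAT_TOTAL_TIME_ENABLED | PERF_FORMAT_TOTAL_TIME_RUNNING` and without `PERF_FORMAT_ID`, in which case one `read(2)` fills a buffer laid out like the struct below. The names `group_sample`, `event_delta`, `event_rate`, and `NUM_EVENTS` are made up for illustration, and using `time_running` as the clock is only one possible choice; a wall-clock timestamp taken next to each read would work too.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_EVENTS 3 /* assumption: three counters in the group */

/* One atomic snapshot of the whole group, as filled in by read(2). */
struct group_sample {
  uint64_t nr;           /* number of events in the group */
  uint64_t time_enabled; /* ns the group was enabled */
  uint64_t time_running; /* ns the group was actually counting */
  uint64_t values[NUM_EVENTS];
};

/* Events that occurred between two snapshots for counter `idx`. */
uint64_t event_delta(const struct group_sample *begin,
                     const struct group_sample *end, int idx) {
  assert(idx >= 0 && idx < NUM_EVENTS);
  return end->values[idx] - begin->values[idx];
}

/* Events per second over the region, measured against time_running. */
double event_rate(const struct group_sample *begin,
                  const struct group_sample *end, int idx) {
  uint64_t ns = end->time_running - begin->time_running;
  assert(ns > 0);
  return (double)event_delta(begin, end, idx) * 1e9 / (double)ns;
}
```

Take one snapshot before the region and one after it, then pass both to these helpers.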

Repeatedly running the example above (you will probably have to write 0 to `/proc/sys/kernel/perf_event_paranoid` to
Owner Author

Also, please add that you can run it as normal, and sudo is not required.

(e.g., `LLC_MISSES`).
You can also correlate low IPC (`INSTRUCTIONS_RETIRED/UNHALTED_CORE_CYCLES`) to
events you believe dominate it (e.g., `LLC_MISSES`).
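
As a rough illustration of the two quantities being compared, the C sketch below computes IPC and LLC misses per thousand instructions for one region. The function names `ipc` and `llc_mpki` are hypothetical, and the inputs are assumed to be deltas of `INSTRUCTIONS_RETIRED`, `UNHALTED_CORE_CYCLES`, and `LLC_MISSES` taken between two reads of the same counter group.

```c
#include <assert.h>
#include <stdint.h>

/* Instructions retired per core cycle over a region. */
double ipc(uint64_t instructions, uint64_t cycles) {
  assert(cycles > 0);
  return (double)instructions / (double)cycles;
}

/* Last-level cache misses per 1000 retired instructions. */
double llc_mpki(uint64_t llc_misses, uint64_t instructions) {
  assert(instructions > 0);
  return 1000.0 * (double)llc_misses / (double)instructions;
}
```

Collecting both values for many executions of the same region and plotting one against the other is a simple way to see whether low-IPC executions line up with high miss counts.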

Owner Author

@lally, I think we can stop after this line and just briefly summarize your ideas on how to aggregate data. I think readers are capable of writing that aggregation logic themselves. :) Our goal here is just to show the idea of what's possible with marker APIs, and the example above (that showcases pfm_get_os_event_encoding + perf_event_open + read) is good enough.

@dendibakh dendibakh mentioned this pull request Apr 13, 2023
perf_attr.read_format = PERF_FORMAT_TOTAL_TIME_ENABLED | PERF_FORMAT_TOTAL_TIME_RUNNING | PERF_FORMAT_GROUP;

pfm_perf_encode_arg_t perf_setup{};
perf_setup.size = sizeof(pfm_perf_encode_arg_t);
Owner Author

BTW, the memset here is also needed :)

memset(&perf_setup, 0, sizeof(pfm_perf_encode_arg_t));

@lally

lally commented Apr 14, 2023 via email

@dendibakh
Owner Author

dendibakh commented Apr 14, 2023

> Odd, the {} should zero-initialize that, and works on my machine. I wonder if it's a compiler version thing.

Ah, OK, I see what's going on now.
I compiled it with GCC 11.3.0, and it complained (because it's C-compilation, not C++) about {}:

c-ray-f-mod.c:247:37: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘{’ token
  247 |     pfm_perf_encode_arg_t perf_setup{};
      |                                     ^
c-ray-f-mod.c:249:5: error: ‘perf_setup’ undeclared (first use in this function)
  249 |     perf_setup.size = sizeof(pfm_perf_encode_arg_t);
      |     ^~~~~~~~~~

That's why I deleted it but didn't add memset. :)
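
For reference, a small C sketch of the initialization issue. `struct encode_arg` is a hypothetical stand-in with made-up fields, not the real `pfm_perf_encode_arg_t` definition, and `make_encode_arg` is an illustrative helper. The point is only that the brace form `T x{};` is C++ syntax, while `memset` (or `= {0}`) does the same job in a C translation unit.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-in for a libpfm4/perf argument struct. */
struct encode_arg {
  void *attr;
  char **fstr;
  size_t size;
  int idx;
};

/* Returns a fully zeroed struct with only `size` filled in. */
struct encode_arg make_encode_arg(void) {
  struct encode_arg arg;        /* `struct encode_arg arg{};` is C++-only */
  memset(&arg, 0, sizeof(arg)); /* valid in both C and C++ */
  arg.size = sizeof(arg);
  assert(arg.attr == NULL && arg.fstr == NULL && arg.idx == 0);
  return arg;
}
```

Writing `struct encode_arg arg = {0};` is an equivalent C spelling for zeroing the named members.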

@lally

lally commented Apr 14, 2023 via email

@dendibakh
Owner Author

Merged with #20.

@dendibakh dendibakh closed this May 16, 2023
2 participants