Lally/marker api #17
Thanks, Lally, good start. I posted some comments, let me know...
system call. These atomic bundles are very useful. | ||
|
||
You can compensate for the scheduler by requesting CPU cycles consumed (e.g., | ||
`UNHALTED_CORE_CYCLES`) and comparing against wall-clock time to detect when the |
Not easy since the frequency may fluctuate during that time.
comparing CPU cycles (`UNHALTED_CORE_CYCLES`) vs the fixed-frequency reference clock | ||
(`UNHALTED_REFERENCE_CYCLES`). | ||
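A minimal sketch of the comparison described above, assuming both counters were read before and after the region and the deltas are passed in. The function name `core_to_ref_ratio` is hypothetical and not part of the PR.

```cpp
#include <cassert>
#include <cstdint>

// Ratio of core cycles to reference cycles over a measured region.
// A value near 1.0 suggests the core ran close to the reference clock;
// a value that drifts between runs suggests the frequency fluctuated.
double core_to_ref_ratio(uint64_t core_cycles, uint64_t ref_cycles) {
  if (ref_cycles == 0) return 0.0;  // nothing measured; avoid dividing by zero
  return static_cast<double>(core_cycles) / static_cast<double>(ref_cycles);
}
```

Comparing this ratio across repeated runs is one way to spot measurements taken at different frequencies before comparing their cycle counts.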
|
||
You can also correlate the CPU-utilization of your code (IPC: |
CPU utilization and IPC are different things. But you probably meant "... correlate low IPC to ... LLC_MISSES".
and can be very slow. | ||
|
||
|
||
### Using Marker APIs for Attribution |
After reading your PR, I think the better place for this section would be after we discuss sampling, i.e. 5.5, not here. Sorry about that, but you can leave it here, I will relocate it later.
adding the instrumentation. For example, if you have a loop: | ||
|
||
|
||
```c |
I would like to request a few things here:
- I, as a reader, would like to know how to set up marker API (showing events I want to collect), how to decorate the code region, and how to capture the data. It can be combined in a single example.
- Can we use some cross-platform marker API tool? Modern C++ developers are not very much into reading POSIX API calls. :) PAPI, libpfm4, likwid. It must be a project that is maintained and that supports the latest AMD Zen4 and Intel AlderLake chips. Let me know if you can find one.
- This example is too large for the book. We need to reduce it somehow.
Yeah, that's also planned. So how about a separate tutorial section on using libpfm4 (it's the only one I've used) and shrink the example. I can swap out the error handling for `assert()`. If I move the "Getting the data out of process" section above, perhaps between it and the libpfm section, we can use the struct decl as a running example.
I think there is no need for a separate section. Let's just show the idea. People will figure out the details.
I would just focus on how to set up and use libpfm4 APIs in a real benchmark.
Here is the list of benchmarks that you can use:
https://openbenchmarking.org/test/pts/c-ray
https://openbenchmarking.org/test/pts/simdjson
https://openbenchmarking.org/test/pts/primesieve
https://openbenchmarking.org/test/pts/stockfish
https://openbenchmarking.org/test/pts/cryptopp
You can click on "View Source", there you will find instructions on how to build and run the benchmark. Choose the one you like. :)
C-ray may be the simplest one. We just need something for demonstration purposes. We can stick with a single-threaded app, or run with only one HW thread,
|
||
|
||
% Getting the data out of process | ||
- Really, a big-ass `fwrite(3)` with a big-healthy `setbuf(3)` call works pretty well. |
big-ass
Ha-ha. :) please let's avoid those words.
Ah, yeah these were notes to myself :-)
- Really, a big-ass `fwrite(3)` with a big-healthy `setbuf(3)` call works pretty well. | ||
- Mention IPC techniques or ring-buffer snapshotting to disk to minimize overhead/jitter. Use a seqno. | ||
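A minimal sketch of the two notes above, assuming samples are fixed-size records appended to a stdio stream. The `Sample` struct, its fields, and all function names are hypothetical. The sketch uses `setvbuf(3)` rather than `setbuf(3)`, because `setbuf` always uses a buffer of `BUFSIZ` bytes while `setvbuf` accepts a caller-chosen size.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <vector>

// Hypothetical fixed-size sample record; the field names are illustrative.
struct Sample {
  uint64_t seqno;  // increases by one per record, so gaps reveal lost samples
  uint64_t cycles;
  uint64_t instructions;
};

// Switch the stream to full buffering with a caller-owned buffer.
// Must be called before any other operation on the stream, and io_buffer
// must outlive the stream (keep it alive until after fclose()).
bool use_large_buffer(FILE *out, std::vector<char> &io_buffer) {
  return setvbuf(out, io_buffer.data(), _IOFBF, io_buffer.size()) == 0;
}

// Append records; with a large buffer most calls only copy into memory.
size_t append_samples(FILE *out, const std::vector<Sample> &samples) {
  return fwrite(samples.data(), sizeof(Sample), samples.size(), out);
}

// Returns true when sequence numbers are consecutive (no record was lost).
bool seqnos_are_contiguous(const std::vector<Sample> &samples) {
  for (size_t i = 1; i < samples.size(); ++i)
    if (samples[i].seqno != samples[i - 1].seqno + 1) return false;
  return true;
}
```

The sequence number costs eight bytes per record but lets the reader of the file detect dropped or reordered samples without any other bookkeeping.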
|
||
% Aggregate Statistics Solutions |
It would be nice to have an example of data visualization for a real-world workload.
For example:
"We used a popular benchmark A from test suite B and inserted a marker API around the top hotspot in that benchmark. Here is the summarized data that we collected using the method described earlier..."
I can help with the benchmark selection.
Yeah I'll take any suggestions on that. The easier to modify/build/run the better :)
them (at least 3, but get more for better variety of values), you can estimate $x_1, x_2, x_3$ by the estimate $M`$` | ||
you get from the SVD pseudoinverse. | ||
|
||
|
Also, it would be nice to list possible tools that developers can use.
Indeed. A small example with numpy's `pinv()` should help.
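A rough sketch of the estimation step, written in C++ to match the PR's example rather than with numpy. It assumes the setup is an overdetermined linear system `A x = b` with three unknowns, where each row of `A` is one measurement; that reading of the text is an assumption, and the names are hypothetical. It solves the normal equations instead of computing an SVD pseudoinverse. The two agree only when `A` has full column rank.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

using Row = std::array<double, 3>;

// Least-squares solution of A x = b for three unknowns, via the normal
// equations (A^T A) x = A^T b. Returns false if there are fewer than three
// measurements or if A^T A is (near-)singular.
bool solve_least_squares(const std::vector<Row> &A, const std::vector<double> &b,
                         std::array<double, 3> &x) {
  if (A.size() != b.size() || A.size() < 3) return false;
  double m[3][4] = {};  // augmented matrix [A^T A | A^T b]
  for (size_t r = 0; r < A.size(); ++r)
    for (int i = 0; i < 3; ++i) {
      for (int j = 0; j < 3; ++j) m[i][j] += A[r][i] * A[r][j];
      m[i][3] += A[r][i] * b[r];
    }
  // Gaussian elimination with partial pivoting.
  for (int col = 0; col < 3; ++col) {
    int pivot = col;
    for (int r = col + 1; r < 3; ++r)
      if (std::fabs(m[r][col]) > std::fabs(m[pivot][col])) pivot = r;
    if (std::fabs(m[pivot][col]) < 1e-12) return false;
    for (int j = 0; j < 4; ++j) std::swap(m[col][j], m[pivot][j]);
    for (int r = col + 1; r < 3; ++r) {
      double f = m[r][col] / m[col][col];
      for (int j = col; j < 4; ++j) m[r][j] -= f * m[col][j];
    }
  }
  // Back substitution.
  for (int i = 2; i >= 0; --i) {
    double s = m[i][3];
    for (int j = i + 1; j < 3; ++j) s -= m[i][j] * x[j];
    x[i] = s / m[i][i];
  }
  return true;
}
```

For rank-deficient or badly conditioned measurements, an SVD-based pseudoinverse such as numpy's `pinv()` is the more robust choice.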
Spotted by @magras
written by Swarup Sahoo
double value = uintptr_t(argv) % 50000; // Avoid compiler precomputation | ||
int event_fd; | ||
perf_event_attr perf_attr{}; | ||
perf_attr.size = sizeof(perf_event_attr); |
I tried it on the C-ray benchmark and I had to add the following line to make it work:
`memset(&perf_attr, 0, sizeof(perf_attr));`
Otherwise, I was seeing some garbage numbers.
package is useful here, as it adds both a discovery tool for identifying | ||
available events on your CPU's PMU, and a wrapper library around the raw `perf_event_open(2)` | ||
system call. Here's an example that (poorly) benchmarks `sqrt(3)`: | ||
```c |
Thank you, this is the example I was looking for.
|
||
The difference in event counts over a region of code is the number of events | ||
that occurred during that region, in that thread. That difference divided by the | ||
time taken is the event rate. You can ask for up to 3 counters simultaneously in | ||
one file descriptor. They will be available atomically under the same `read(2)` | ||
system call. These atomic bundles are very useful. | ||
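A minimal sketch of the delta and rate computation described above, reusing the `read_format` layout from the PR's example. It assumes two snapshots of the same group were taken with `read(2)` before and after the region. Using `time_running` (reported in nanoseconds) as the time base is an assumption; wall-clock time would work the same way. The helper names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Layout of a two-event group read with PERF_FORMAT_TOTAL_TIME_ENABLED |
// PERF_FORMAT_TOTAL_TIME_RUNNING | PERF_FORMAT_GROUP, as in the PR's example.
struct read_format { uint64_t nr, time_enabled, time_running, values[2]; };

// Events counted between two snapshots of the same group.
uint64_t event_delta(const read_format &before, const read_format &after, int idx) {
  assert(idx >= 0 && idx < 2);
  return after.values[idx] - before.values[idx];
}

// Events per second over the region, with time_running as the time base.
double event_rate_per_sec(const read_format &before, const read_format &after, int idx) {
  uint64_t ns = after.time_running - before.time_running;
  if (ns == 0) return 0.0;  // region too short to measure
  return static_cast<double>(event_delta(before, after, idx)) * 1e9 /
         static_cast<double>(ns);
}
```

Because both values come from one `read(2)` call, the two deltas cover exactly the same interval, which is what makes ratios between them meaningful.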
|
||
Repeatedly running the example above (you will probably have to write 0 to `/proc/sys/kernel/perf_event_paranoid` to |
Also, please add that you can run it as normal, and `sudo` is not required.
(e.g., `LLC_MISSES`). | ||
You can also correlate low IPC (`INSTRUCTIONS_RETIRED/UNHALTED_CORE_CYCLES`) to | ||
events you believe dominate it (e.g., `LLC_MISSES`). | ||
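A minimal sketch of the correlation described above, assuming the three counter deltas for a region are already available. The function names are hypothetical, and misses per thousand instructions is an illustrative normalization that the PR does not mention.

```cpp
#include <cassert>
#include <cstdint>

// Instructions per cycle: INSTRUCTIONS_RETIRED / UNHALTED_CORE_CYCLES.
double ipc(uint64_t instructions, uint64_t core_cycles) {
  return core_cycles ? static_cast<double>(instructions) / core_cycles : 0.0;
}

// LLC misses per thousand retired instructions, so that regions of
// different lengths can be compared.
double llc_mpki(uint64_t llc_misses, uint64_t instructions) {
  return instructions ? 1000.0 * llc_misses / instructions : 0.0;
}
```

Collecting both numbers for many executions of the same region lets you check whether the runs with low IPC are also the runs with many misses.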
|
@lally, I think we can stop after this line and just briefly summarize your ideas on how to aggregate data. I think readers are capable of writing that aggregation logic themselves. :) Our goal here is just to show the idea of what's possible with marker APIs, and the example above (that showcases `pfm_get_os_event_encoding` + `perf_event_open` + `read`) is good enough.
perf_attr.read_format = PERF_FORMAT_TOTAL_TIME_ENABLED | PERF_FORMAT_TOTAL_TIME_RUNNING | PERF_FORMAT_GROUP; | ||
|
||
pfm_perf_encode_arg_t perf_setup{}; | ||
perf_setup.size = sizeof(pfm_perf_encode_arg_t); |
BTW, the `memset` here is also needed :)
`memset(&perf_setup, 0, sizeof(pfm_perf_encode_arg_t));`
Odd, the {} should zero-initialize that, and works on my machine. I wonder
if it's a compiler version thing.
…On Fri, Apr 14, 2023, 6:56 AM Denis Bakhvalov ***@***.***> wrote:
***@***.**** commented on this pull request.
------------------------------
In chapters/5-Power-And-Performance/5-1 Code Instrumentation.md
<#17 (comment)>:
> +
+// The flags to perf_event_open() determine which fields are returned
+// from the resulting FD.
+struct read_format { uint64_t nr, time_enabled, time_running, values[2]; };
+void require_pfm(int value) { assert(value == PFM_SUCCESS); }
+void require_nonneg(int value) { assert(value >= 0); }
+
+int main(int argc, char **argv) {
+ double value = uintptr_t(argv) % 50000; // Avoid compiler precomputation
+ int event_fd;
+ perf_event_attr perf_attr{};
+ perf_attr.size = sizeof(perf_event_attr);
+ perf_attr.read_format = PERF_FORMAT_TOTAL_TIME_ENABLED | PERF_FORMAT_TOTAL_TIME_RUNNING | PERF_FORMAT_GROUP;
+
+ pfm_perf_encode_arg_t perf_setup{};
+ perf_setup.size = sizeof(pfm_perf_encode_arg_t);
BTW, the memset here is also needed :)
memset(&perf_setup, 0, sizeof(pfm_perf_encode_arg_t));
|
Ah, OK, I see what's going on now.
I compiled it with GCC 11.3.0, and it complained about `{}`:
c-ray-f-mod.c:247:37: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘{’ token
  247 | pfm_perf_encode_arg_t perf_setup{};
      |                                  ^
c-ray-f-mod.c:249:5: error: ‘perf_setup’ undeclared (first use in this function)
  249 | perf_setup.size = sizeof(pfm_perf_encode_arg_t);
      | ^~~~~~~~~~
That's why I deleted it but didn't add memset. :)
Ah. It's a c++-ism.
…On Fri, Apr 14, 2023, 9:50 AM Denis Bakhvalov ***@***.***> wrote:
Odd, the {} should zero-initialize that, and works on my machine. I wonder
if it's a compiler version thing.
Ah, OK, I see what's going on now.
I compiled it with GCC 11.3.0, and it complained about {}:
c-ray-f-mod.c:247:37: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘{’ token
247 | pfm_perf_encode_arg_t perf_setup{};
| ^
c-ray-f-mod.c:249:5: error: ‘perf_setup’ undeclared (first use in this function)
249 | perf_setup.size = sizeof(pfm_perf_encode_arg_t);
| ^~~~~~~~~~
That's why I deleted it but didn't add memset. :)
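To summarize the exchange above: `perf_event_attr perf_attr{};` zero-initializes the struct in C++ but is a syntax error in C as compiled here, and dropping the `{}` leaves the struct uninitialized. A minimal sketch of an initializer that compiles as either language, assuming a Linux system with `<linux/perf_event.h>` available; the helper name `init_perf_attr` is hypothetical.

```cpp
#include <assert.h>
#include <string.h>
#include <linux/perf_event.h>

// Zero the whole struct with memset(), which is valid in both C and C++,
// then set the fields used in the PR's example.
void init_perf_attr(struct perf_event_attr *attr) {
  memset(attr, 0, sizeof(*attr));
  attr->size = sizeof(*attr);
  attr->read_format = PERF_FORMAT_TOTAL_TIME_ENABLED |
                      PERF_FORMAT_TOTAL_TIME_RUNNING | PERF_FORMAT_GROUP;
}
```

The same pattern applies to `pfm_perf_encode_arg_t`, which the thread above also found needed an explicit `memset` when built as C.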
|
Merged with #20. |
No description provided.