hist: log2 histograms with finer granularity #2831
Conversation
I didn't look too closely at the mathy bits yet. Mostly tried to leave some high-level comments. This PR also needs tests (see https://github.com/iovisor/bpftrace/blob/master/docs/developers.md#tests for some hints).
Overall, this feature makes sense to me. I'm taking it on faith that the CI failures look legit. There should be some hints about debugging those in the developers.md doc I linked as well.
(force-pushed from 4d851af to 6d193bb)
I have fixed the json format ones and the function prototype formatting.

Two questions here (but these are orthogonal, so feel free to ignore them...):
(force-pushed from 6d193bb to 299a49a)
Added a few tests (semantic_analyser, call_hist) and one in runtime/json-output (histogram-finegrain), plus updated codegen for call_hist.

They should be fixed now.
(force-pushed from 299a49a to 561e454)
if (!check_varargs(call, 1, 2))
  return;
if (call.vargs->size() == 1) {
  call.vargs->push_back(new Integer(0, call.loc)); // default bits is 0
I don't think we should be modifying the AST in semantic analysis. I see one other occurrence in semantic analysis where we inject a cast, but I think that's bad too. Ideally mutable passes should derive from Mutator
visitor so it's explicit.
In this case, I think it'd be better to teach the other passes to use a default value of 0 where appropriate
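For concreteness, a minimal sketch of that alternative (hypothetical code, not what this PR does): instead of appending a default `Integer` node to the AST, each later pass reads the optional second argument and falls back to 0 itself:

```cpp
// Hypothetical helper (name and placement made up for illustration): later
// passes ask for the "bits" argument of hist(n, k) and default to 0 when the
// script only passed one argument, so semantic analysis never mutates the AST.
static int hist_bits(const Call &call)
{
  if (call.vargs->size() < 2)
    return 0; // default: classic power-of-2 buckets
  return static_cast<Integer *>(call.vargs->at(1))->n;
}
```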
Yeah, that would be a good addition. I filed #2841. Feel free to take it if you're interested. Otherwise I will probably pick it up at some point.
The json output happened to be easier to test. In the past we've done regexes to match the text output, but that's messy and we often got it wrong. Might be useful in this case to check in some "gold" files and do a regular text compare.

Manual testing would be ok for now. I'll file a ticket for that.
(force-pushed from 03791d7 to 3d94abe)
Anything else left to do here?

Code needs a clang-format (https://github.com/iovisor/bpftrace/blob/master/docs/developers.md#code-style). Also would be good to use the new EXPECT_JSON for the runtime test. Other than that, was waiting for enough coffee to hit the bloodstream to take a look at the algorithm again.

Thanks for the PR! Agree with @danobi that this feature looks reasonable. I just want to make sure before merging that it won't slow down histogram calculations for the default case (k=0). The LLVM IR in the tests looks unoptimised so we'd probably need to look at the generated BPF bytecode to be sure. I can have a check next week if no one gets there before me.
other than the outstanding comments (clang-format, semantic analysis ast modification, stack allocation lifetime tracking, EXPECT_JSON, and Alastair's todo), LGTM.
I'm kinda on the fence about the semantic analysis ast modification thing now. Probably fine the way it is.
Value *n_alloc = b_.CreateAllocaBPF(CreateUInt64());
b_.CreateStore(arg, n_alloc);
Value *result = b_.CreateAllocaBPF(CreateUInt64());
b_.CreateStore(b_.getInt64(0), result);
b_.CreateStore(log2_func->arg_begin(), n_alloc);
Value *k_alloc = b_.CreateAllocaBPF(CreateUInt64());
Missing b_.CreateLifetimeEnd for n_alloc and k_alloc.
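Presumably the requested change is along these lines (a sketch, assuming IRBuilderBPF's CreateLifetimeEnd is used the same way as elsewhere in codegen_llvm.cpp):

```cpp
// Sketch of the requested cleanup: once the log2 helper is done with the
// temporaries, end their stack-slot lifetimes so the BPF stack space can be
// reused by later allocations.
b_.CreateLifetimeEnd(n_alloc);
b_.CreateLifetimeEnd(k_alloc);
```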
I don't know if you have any better way to measure index computation (other than modifying the code generator to collect timestamps around it) but I suspect most of the time is spent in looking up the bpf hash entry, rather than computing the index. For measuring the time I use the following:
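(The script itself did not survive this rendering; from the PR description it is presumably this one-liner:)

```
sudo taskset -c 1 src/bpftrace -e 'i:us:1 { $t = nsecs; @A = hist($t); $t = nsecs - $t; @ = lhist($t, 0, 5000, 100);} '
```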
On my Beelink SER5 I see most entries in the [900, 1000) slot with both the old and new code. For reference, the index computation in the same (native) code in lr-cstats is in the 10-20ns range.
Allow a second optional argument in hist(n, k) to map each power of 2 into 2^k buckets, thus creating a logarithmic scale with finer granularity and modest runtime overhead (a couple of shifts and add/mask in addition to the original algorithm). Allowed values of k are 0..5, with 0 as the default for backward compatibility. The implementation follows my earlier code in https://github.com/luigirizzo/lr-cstats

Example below:

$ sudo src/bpftrace -e 'kfunc:tick_do_update_jiffies64 { @ = hist((nsecs & 0xff),2); }'
Attaching 2 probes...
@:
[0]                    4 |@                                                   |
[1]                    1 |                                                    |
[2]                    3 |@                                                   |
[3]                    2 |                                                    |
[4]                    3 |@                                                   |
[5]                    0 |                                                    |
[6]                    3 |@                                                   |
[7]                    2 |                                                    |
[8, 10)                5 |@                                                   |
[10, 12)               7 |@@                                                  |
[12, 14)               5 |@                                                   |
[14, 16)               6 |@@                                                  |
[16, 20)              11 |@@@                                                 |
[20, 24)              14 |@@@@                                                |
[24, 28)              20 |@@@@@@                                              |
[28, 32)              13 |@@@@                                                |
[32, 40)              40 |@@@@@@@@@@@@@                                       |
[40, 48)              38 |@@@@@@@@@@@@@                                       |
[48, 56)              35 |@@@@@@@@@@@                                         |
[56, 64)              29 |@@@@@@@@@                                           |
[64, 80)              72 |@@@@@@@@@@@@@@@@@@@@@@@@                            |
[80, 96)              64 |@@@@@@@@@@@@@@@@@@@@@                               |
[96, 112)             61 |@@@@@@@@@@@@@@@@@@@@                                |
[112, 128)            67 |@@@@@@@@@@@@@@@@@@@@@@                              |
[128, 160)           124 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@         |
[160, 192)           130 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@       |
[192, 224)           124 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@         |
[224, 256)           152 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|

Performance: most of the cost of hist() in bpftrace is in the bpf hash lookup, and the cost of index calculation is negligible. One way to measure the overall cost is the following:

sudo taskset -c 1 src/bpftrace -e 'i:us:1 { $t = nsecs; @A = hist($t); $t = nsecs - $t; @ = lhist($t, 0, 5000, 100);} '

On my AMD 5800 most of the samples are in the 900-1000ns range; my estimate for the index computation (from lr-cstats) is in the 10-20ns range. Remember to check `/proc/sys/kernel/perf_` to make sure that the code can generate a sufficient number of samples per second.
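To make the bucketing concrete, here is a small userspace sketch of the index computation the description refers to (my reconstruction based on the description and the lr-cstats approach; hist_index is a made-up name, and this is not the generated BPF code):

```cpp
#include <cstdint>
#include <cstdio>

// k is the number of extra "fractional" bits (0..5), matching hist(n, k).
static uint64_t hist_index(uint64_t n, unsigned k)
{
  if (n < (1ULL << k))
    return n; // small values map one-to-one onto the first buckets
  unsigned h = 63 - __builtin_clzll(n); // floor(log2(n)): highest set bit
  // The power of two selects a group of 2^k buckets; the next k bits of n
  // select the sub-bucket inside that group.
  return ((uint64_t)(h - k + 1) << k) + ((n >> (h - k)) & ((1ULL << k) - 1));
}

int main()
{
  // With k=2 these land in buckets [3], [8,10), [16,20) and [192,224),
  // matching the example output above.
  for (uint64_t n : { 3ULL, 9ULL, 17ULL, 200ULL })
    std::printf("%llu -> bucket %llu\n",
                (unsigned long long)n,
                (unsigned long long)hist_index(n, 2));
  return 0;
}
```

With k=0 this degenerates to the classic hist() bucketing (index 0 for 0, then floor(log2(n)) + 1), which is why the default stays backward compatible.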
(force-pushed from 3d94abe to 80c28f4)
I think I am done with the modifications from my side (see comment on why I believe lifetime annotations are useless in this case).

I don't see the comment anywhere - forgot to send?
This one on line 3109 of codegen_llvm.cpp? I have looked up the lifetime annotations (https://llvm.org/docs/LangRef.html#object-lifetime) and I believe that for this function it is much better to leave it to the optimizer. As an experiment I also tried to access the two call arguments directly from the registers instead of allocating them on the stack: Value *n = log2_func->arg_begin();

Hmm, I see. But in any case, I think you're right. log2() is a separate function. We can leave it to the compiler in this case.
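The experiment described would presumably look something like this (a sketch, not the merged code):

```cpp
// Sketch of the experiment: read the two log2 helper arguments directly from
// the IR function arguments instead of spilling them to BPF stack allocas,
// leaving register allocation and lifetimes entirely to LLVM.
Value *n = log2_func->arg_begin();      // first argument: value to bucket
Value *k = log2_func->arg_begin() + 1;  // second argument: number of sub-bucket bits
```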
I'm getting a build warning:
[21/86] Building CXX object src/CMakeFiles/runtime.dir/output.cpp.o
In file included from /usr/include/c++/13.2.1/cassert:44,
from /home/dxu/dev/bpftrace/src/log.h:3,
from /home/dxu/dev/bpftrace/src/output.cpp:3:
/home/dxu/dev/bpftrace/src/output.cpp: In static member function ‘static std::string bpftrace::TextOutput::hist_index_label(int, int)’:
/home/dxu/dev/bpftrace/src/output.cpp:53:16: warning: comparison of integer expressions of different signedness: ‘int’ and ‘const uint32_t’ {aka ‘const unsigned int’} [-Wsign-compare]
53 | assert(index >= n); // Smaller indexes are converted directly.
No big deal - I'll fix it up if we want to merge as is
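One way to silence that warning (a guess at the fix, not necessarily what was committed) is to do the comparison in the unsigned domain:

```cpp
// Hypothetical fix for the -Wsign-compare warning quoted above: check that
// index is non-negative, then compare it against the unsigned bound n.
assert(index >= 0 && static_cast<uint32_t>(index) >= n);
```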
@ajor did you get a chance to take a look? I suspect this will add a few more insns to the hot path. But ALU ops are quite cheap (basically free since it's all in registers) so I'm personally not too worried.
FWIW (see other comments and the updated description) I believe the hash lookup is probably 20-50 times more expensive than the index computation. This test reports 900-1000ns for most samples with both old and new code. In fact, it would be great if we could specialize the hist() map into an array map.
Sorry for the delay. LLVM does optimise the code well and the only extra instructions in the default case are some extra bit twiddling for the higher bits:
hist.is_not_zero.i: ; preds = %hist.is_not_less_than_zero.i
+ %a = icmp ugt i64 %input, 4294967295
+ %b = select i1 %a, i64 32, i64 0
+ %c = lshr i64 %input, %b
%d = icmp ugt i64 %c, 65535
%e = select i1 %d, i64 16, i64 0
%f = lshr i64 %c, %e
%g = icmp ugt i64 %f, 255
%h = select i1 %g, i64 8, i64 0
%i = lshr i64 %f, %h
%j = icmp ugt i64 %i, 15
%k = select i1 %j, i64 4, i64 0
%l = lshr i64 %i, %k
%m = icmp ugt i64 %l, 3
%n = select i1 %m, i64 2, i64 0
%o = lshr i64 %l, %n
%p = icmp ugt i64 %o, 1
%q = zext i1 %p to i64
+ %r = or i64 %b, %e
%s = or i64 %r, %h
%t = or i64 %s, %k
%u = or i64 %t, 2
%v = add nuw nsw i64 %u, %n
%w = or i64 %v, %q
br label %log2.exit
So no major performance penalty. Definitely agree that we should switch histograms over to array maps in the future!
What's the reason that this algorithm deals with more bits now though?
int old, bits = static_cast<Integer *>(call.vargs->at(1))->n;
if (r.find(call.map->ident) != r.end() &&
    (old = r[call.map->ident]) != bits)
{
  LOG(ERROR, call.loc, err_)
      << "Different bits in a single hist, had " << old << " now " << bits;
Suggested change:

int bits = static_cast<Integer *>(call.vargs->at(1))->n;
if (r.find(call.map->ident) != r.end() &&
    (r[call.map->ident]) != bits)
{
  LOG(ERROR, call.loc, err_)
      << "Different bits in a single hist, had " << r[call.map->ident] << " now " << bits;
The multiple declarations and assignment/comparison in single lines feel too clever for me. Since this is just an error path there's no need to optimise it anyway, so let's just read the map twice to simplify the code.
Do you mean what is the motivation? Power of 2 is too coarse to evaluate e.g. changes that cause less than 2x performance differences. With the finer granularity we become able to see down to 3-5% without having to compromise on the dynamic range (e.g. we can differentiate 1000ns and 1100ns and still identify tails in the ms range).
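As a rough illustration of the resolution (back-of-envelope arithmetic, assuming k=5): each power of two is split into 2^5 = 32 sub-buckets, so a bucket's width relative to the values it holds is between 1/64 and 1/32, i.e. roughly 1.6-3%. For example, 1000ns falls in [992, 1008) and 1100ns in [1088, 1120), so the two are clearly separated, while the same map still covers the whole 64-bit range.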
I mentioned in the commit msg of 006e68d that we were underutilizing the u64 by a lot (~54 bits unused IIUC). So yes, this uses more bits. Not sure if that addresses the question.

I'd prefer if we merge this soon-ish rather than let it drag out and potentially eat merge conflicts. I can follow up with the requested readability fix. I'll push the button sometime this weekend if there are no more comments.
Fixes up some follow-on comments from [0]. Also fixes a compile time warning:

```
[25/85] Building CXX object src/CMakeFiles/runtime.dir/output.cpp.o
In file included from /usr/include/c++/13.2.1/cassert:44,
                 from /home/dxu/dev/bpftrace/src/log.h:3,
                 from /home/dxu/dev/bpftrace/src/output.cpp:3:
/home/dxu/dev/bpftrace/src/output.cpp: In static member function ‘static std::string bpftrace::TextOutput::hist_index_label(int, int)’:
/home/dxu/dev/bpftrace/src/output.cpp:53:16: warning: comparison of integer expressions of different signedness: ‘int’ and ‘const uint32_t’ {aka ‘const unsigned int’} [-Wsign-compare]
   53 |   assert(index >= n); // Smaller indexes are converted directly.
      |          ~~~~~~^~~~
```

[0]: #2831 (comment)
Hi - I just found out about this feature. I've used partial-power log tables for many years to get fine grained histograms, and I see something here that seems relevant. I thought about filing a bug but this seems like more of a design decision.

The buckets here don't appear to be log-sized. Rather, they're fixed width within each 2^x - 2^(x+1) set. This gives "sizes" that change abruptly at the border: four buckets that are 32-wide, then four that are 64-wide, and so on (32, 32, 32, 32, 64).

Basically, what's the rationale behind the decision to bucket them like this and not as described in #1120? If a fast bitwise partial-power log is needed, I have one that's polished that I'm happy to offer - we use it here inside of bpftrace probes.