-
Notifications
You must be signed in to change notification settings - Fork 265
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Optimize LabelValues in metric proto #36
Optimize LabelValues in metric proto #36
Conversation
I think it's important to have an explicit unspecified value. In our metrics summit in late august we spent a lot of time discussing unspecified label values as they relate to metric handles. This was before LabelSets were introduced, but I see these as equivalent. The question boiled down to this -- should the SDK be required to inspect the labels from the distributed context any time a LabelSet/Handle is used that did not specify a value. We declared that no, the value is explicitly unspecified if the LabelSet/Handle did not declare it. I added caveats that (a) users can explicitly inject distributed context labels if they want, (b) SDKs could decide to do something automatic about distributed context labels. |
From consumption standpoint - what's the difference of non-specified and empty value? Would one want to create a stacked chart where unspecified label and empty label are two different values? If empty is that important for this scenario - perhaps one can report it with the actual word "EMPTY"? If there are some notes I can read thru to understand it - please let me know |
If we specified that empty strings were invalid values, then we could get away with using the empty string to encode an unspecified value. I have mixed feelings here, because it's so much easier to use an empty string than to distinguish unspecified values. But then some users, who have legitimate empty string values will be forced to use "EMPTY", which is also a valid value, and then they're left with ambiguity. |
I feel empty/unspecified string is a discussion that likely needs to happen in the realm of Specification. If there is a decision not to support unspecified or empty strings we can then modify the proto accordingly. For now the proto and this PR mirrors what the Spec requires. |
beea350
to
c27d863
Compare
@SergeyKanzhelev @jmacd can we move forward with this change? It does not introduce new concepts, but merely follows what spec requires today. I created a separate issue to clarify the need for "unspecified" lablels here: open-telemetry/opentelemetry-specification#345 If the conclusion eventually is that we don't need the unspecified labels I will happily modify this proto to remove them in a future PR. For now I believe this PR is good as is. |
In hindsight, I'm not sure we should have written that into the specification. It was related to the concept of required aggregation keys, which was softened in the final spec to recommended aggregation keys. If a system is expecting precisely a fixed number of dimensions and those dimensions are not included in the export format, they are effectively unspecified. The reason we discussed explicitly unspecified values was to declare that the implementation would not be required to scan through the context for distributed correlation labels to satisfy the missing values. @bogdandrutu I'd be happy to strike this from the specification. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm fine with this change, but declaring empty strings and unspecified values to be the same would be a better optimization and easier to understand
@SergeyKanzhelev I agree and if spec decides to do that I will make the corresponding change here. |
@SergeyKanzhelev I didn't say empty string--that I would reject. I said leaving out a key:value entirely (i.e., no key, no value) is effectively saying unspecified. |
Agree. That's exactly possible to do now, by omitting the key from |
I could imagine a solution where instead of having label values be a repeated strings, it would be repeated integers referring to a dictionary of string values. This would support compression -- where the same string is used more than once, and it would allow us to have an explicit value like -1 to indicate unspecified. |
How common is it to have labels with repeated values? Also compression-wise if the concern is about wire size the proto most likely is going to be gziped and that compresses repetitive strings quite efficiently. Performance-wise this elimination of redundant labels could probably be a win, but for performance I am looking into an approach that looks promising, so I'd leave that aside for a moment. Compressing the way you suggested is possible for other entities as well but so far I avoided going in that direction because i think it makes working with data slightly more complicated since you need to use an additional indirection (the way you suggested). It may be worth exploring, I can do some experimentation later if time permits. |
I am starting to agree that we should allow empty strings. Let's do that and see what trouble ensues. Should we specify that the empty string is not a valid attribute/label value? |
@jmacd giving I'm not very deep into metrics - I like the idea to try equating empty strings with unspecified. If there will be legitimate use cases for an opposite - we can always add an indexes array with the indexes saying which empty strings are empty and which are unspecified. Similar to what's proposed in this PR. |
This comment has been minimized.
This comment has been minimized.
Hard to follow this discussion. Can we summarize what is the current status? I personally prefer to first decide if we need to support unspecified or we can do what OpenMetrics does and treat unspecified as empty. |
@bogdandrutu to summarize: I believe we can move forward and merge this PR since it is a straight improvement over current state and does not change any semantics, it is simply a performance optimization without functional changes. We can separately discuss if we want to eliminate "unspecified label" concept as part of open-telemetry/opentelemetry-specification#345 that I created for that discussion. Depending on the decision resulting from that discussion I will make any necessary changes in this proto repository. |
@tigrannajaryan I think there is a difference - unspecified values on API and unspecified values on schema (it is my take). It seems like there is little objection to have "empty mean unspecified" on schema. So this PR can be improved further |
c27d863
to
c5e7aef
Compare
@SergeyKanzhelev @jmacd I removed the ability to have unspecified value from this PR. Please have another look. |
c5e7aef
to
2463a6e
Compare
2463a6e
to
81003a1
Compare
LabelValues is now a string array. Benchmarking results are below (Baseline is current `master`, Proposed is after this commit). This change reduces CPU usage up to 40% for one-data-point timeseries encoding and reduces memory consumption by about 25%. ``` ===== Encoded sizes Encoding Uncompressed Improved Compressed Improved Baseline/Metric/Histogram 13569 bytes [1.000], gziped 774 bytes [1.000] Proposed/Metric/Histogram 13159 bytes [1.031], gziped 781 bytes [0.991] Encoding Uncompressed Improved Compressed Improved Baseline/Metric/MixOne 48530 bytes [1.000], gziped 1671 bytes [1.000] Proposed/Metric/MixOne 45720 bytes [1.061], gziped 1677 bytes [0.996] Encoding Uncompressed Improved Compressed Improved Baseline/Metric/MixSeries 97867 bytes [1.000], gziped 6620 bytes [1.000] Proposed/Metric/MixSeries 95067 bytes [1.029], gziped 6587 bytes [1.005] goos: darwin goarch: amd64 pkg: github.com/tigrannajaryan/exp-otelproto/encodings BenchmarkEncode/Baseline/Metric/Int64-8 34 156113692 ns/op BenchmarkEncode/Proposed/Metric/Int64-8 67 91936291 ns/op BenchmarkEncode/Baseline/Metric/Summary-8 120 50891750 ns/op BenchmarkEncode/Proposed/Metric/Summary-8 141 42172658 ns/op BenchmarkEncode/Baseline/Metric/Histogram-8 91 64816532 ns/op BenchmarkEncode/Proposed/Metric/Histogram-8 100 56837230 ns/op BenchmarkEncode/Baseline/Metric/HistogramSeries-8 37 159643499 ns/op BenchmarkEncode/Proposed/Metric/HistogramSeries-8 40 149983727 ns/op BenchmarkEncode/Baseline/Metric/Mix-8 20 287582195 ns/op BenchmarkEncode/Proposed/Metric/Mix-8 31 196378683 ns/op BenchmarkEncode/Baseline/Metric/MixSeries-8 8 681122815 ns/op BenchmarkEncode/Proposed/Metric/MixSeries-8 10 540018386 ns/op BenchmarkDecode/Baseline/Metric/Int64-8 18 319038450 ns/op 206696040 B/op 5724000 allocs/op BenchmarkDecode/Proposed/Metric/Int64-8 22 265034276 ns/op 154696039 B/op 4724000 allocs/op BenchmarkDecode/Baseline/Metric/Summary-8 52 117972148 ns/op 79496034 B/op 2024000 allocs/op BenchmarkDecode/Proposed/Metric/Summary-8 60 104489249 ns/op 69096035 B/op 1824000 allocs/op BenchmarkDecode/Baseline/Metric/Histogram-8 36 161440606 ns/op 104296028 B/op 2624000 allocs/op BenchmarkDecode/Proposed/Metric/Histogram-8 40 147001580 ns/op 93896032 B/op 2424000 allocs/op BenchmarkDecode/Baseline/Metric/HistogramSeries-8 16 345726946 ns/op 233896053 B/op 5324000 allocs/op BenchmarkDecode/Proposed/Metric/HistogramSeries-8 16 333473715 ns/op 223496045 B/op 5124000 allocs/op BenchmarkDecode/Baseline/Metric/Mix-8 9 611361046 ns/op 391240035 B/op 10326000 allocs/op BenchmarkDecode/Proposed/Metric/Mix-8 10 531026674 ns/op 318440033 B/op 8926000 allocs/op BenchmarkDecode/Baseline/Metric/MixSeries-8 5 1179081837 ns/op 776840057 B/op 18026000 allocs/op BenchmarkDecode/Proposed/Metric/MixSeries-8 5 1077597483 ns/op 704040035 B/op 16626000 allocs/op ```
165c523
to
c91401f
Compare
@jmacd I'm merging this just to move forward. If you feel that we need to revisit this decision - please file an issue. |
LGTM |
Optimize LabelValues in metric proto
LabelValues is now a string array.
Benchmarking results are below (Baseline is current
master
, Proposed is after this commit).This change reduces CPU usage up to 40% for one-data-point timeseries encoding and reduces
memory consumption by about 25%.