Add adaptable circular buffer implementation for ExponentialCounter. #4087

Conversation

jsuereth (Contributor):

  • Add "Adapting Counter Array" which attempts to use the smallest possible amount of memory for storing counts.
  • Add "Adapting Circular Buffer" which attempts to store N count buckets in a contiguous (circular) array, leveraging the adapting counter array.
  • Expose configurable "Factory"/"Strategy" interfaces that allow configuring the following for ExponentialHistograms:
    • Starting scale
    • Backing Counter (MapCounter or AdaptingCircularBuffer)
    • Maximum # of buckets
  • Add performance benchmarks.

A few results to call out:

  • MapCounter is pretty bad in memory usage, consuming a large amount and requiring GC during perf benchmarks.
  • AdaptingCircularBufferCounter is less memory efficient than ExplicitBucketHistogram, but way more memory efficient than MapCounter.
  • AdaptingCircularBufferCounter is more performant (wall-clock time) than MapCounter (~95 ms [1 ms stdev] vs 116 ms [3 ms stdev]).

Given ExponentialHistogram is not released publicly, the idea here is to continue experimenting with the implementation and tuning parameters until we find the best available for our instrumentation package. I'd expect us to prune down the implementation a bit prior to an official ExponentialHistogram release.
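
As a rough illustration of the adapting-counter idea (a simplified sketch with made-up names, not the code in this PR; the real class steps byte => short => int => long and is wrapped by the circular buffer):

// Sketch: start with 1-byte cells and promote to a wider array only when a count overflows.
class AdaptingCounterSketch {
  private byte[] byteBacking;
  private long[] longBacking; // intermediate short/int stages omitted for brevity

  AdaptingCounterSketch(int size) {
    this.byteBacking = new byte[size]; // 1 byte per bucket to start
  }

  void increment(int idx, long delta) {
    if (longBacking != null) {
      longBacking[idx] += delta;
      return;
    }
    long result = (long) byteBacking[idx] + delta;
    if (result > Byte.MAX_VALUE) {
      // Promote: copy existing counts into the wider array, then apply the write there.
      longBacking = new long[byteBacking.length];
      for (int i = 0; i < byteBacking.length; i++) {
        longBacking[i] = byteBacking[i];
      }
      byteBacking = null;
      longBacking[idx] += delta;
    } else {
      byteBacking[idx] = (byte) result;
    }
  }

  long get(int idx) {
    return longBacking != null ? longBacking[idx] : byteBacking[idx];
  }
}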

jamesmoessis (Contributor) left a comment:

Nice, the addition of the factories makes switching between and testing different implementations cleaner. The tests give me confidence too, since all implementations are parameterised and tested side by side.

Just a few comments and questions. Also, the linter is complaining about javadoc and style by the looks of it.

/** Constructs fresh new buckets. */
DoubleExponentialHistogramBuckets newBuckets();

/** Constructs "empty" count buckets using settings from previous recorded bucket. */
Contributor:

I am trying to think if it's a safe assumption to use the scale of the previous accumulations after resetting. Using cumulative aggregation temporality, certainly it's safe. merge() will downscale it anyway, so it would save some CPU time.

However for delta temporality, could there be a use case where someone might want the highest precision possible for each export?

Contributor:

On a related note, I noticed diff / merge / zero are implemented as immutable operations but could they be mutable? Intuitively, always mutating a long array seems like the allocations would be fine (preallocated just once on initialization) and avoid the dynamic dispatch of the byte-short-int-long stuff. The idea does seem to at least require the scale to be preserved on reset though

Contributor Author:

So I've exposed it as a thing we can "toy with" and generate a strategy that does NOT do this to compare against other strategies.

However, there are two common use cases we need to optimise for:

  1. High throughput traffic in a distribution
  2. Low throughput traffic in a distribution

Given our instrumentation, we're likely to see both of these. For 2 we want smallest amount of memory consumption, for 1 we want high performance. The idea behind saving Scale (and eventually integer array size) is so that we can preserve as much performance as possible for #1 by shifting distributions from use cases #2 to #1 as we "detect" high throughput.

jamesmoessis (Contributor), Jan 16, 2022:

could they be mutable?

@anuraaga I asked this initially when writing those operations, and it seemed that accumulations need to be immutable the way the code is at the current moment. I created an issue to allow mutable accumulations: #3766

Contributor Author:

The short answer on mutable accumulation is "no". They're shared between exporters now, so mutating would (initially) lead to bugs. I'm hoping to do some more optimisation work toward that end, specifically making sure none of it lands on the hot path. IIRC, all merge/diff operations are already off the 'hot' path and only performed during collection/export.

Contributor:

If we're able to eventually make them mutable, I think we'd be able to make the histogram aggregations no-alloc which should allow removing AdaptingIntegerArray as the memory overhead would be fixed, or at least trim it down to being bimorphic which Java deals pretty well with. Will be a good area to explore as that class really just feels like a "root of all evil"

Contributor Author:

I'm not really sure how this particular optimisation is any more evil than hand-rolling Protobuf serialization, from both a maintenance and "scare" sense. Can you elaborate on primary concerns, especially given the benchmark showing performance? I'm happy to add testing around this, but I definitely don't see it as anywhere near as scary as hand-rolled serialization.

Related, I'm really not sure what you mean about mutable + optimising. Could you lay out that scenario/idea? I'd be happy to try it out.

I'm afraid, though, that in practice we need to optimise for the two scenarios I listed above. "Eventually" using a consistent amount of memory may not be good enough if the overhead per-timeseries is still too high. However, I'm not sure I understand what kind of mutation you're proposing, and I'm not sure if I've accurately described why using the adapting array plays to the current use case we have for histograms so well.

Contributor:

The protobuf optimizations are pretty deterministic - we can reason out the difference in the old and new code to know where performance will come from. The adapting arrays are based on heuristics - in particular, we would expect direct array access to always perform quite a bit better than an access through a megamorphic switch statement, so there is a clear tradeoff involved. This makes it much scarier, and microbenchmarks will never be able to give the full picture when relying on heuristics and assumptions. Gradually going up in complexity tends to at least be a bit less scary, for example MapCounter -> long[] circular buffer -> then AdaptingIntegerArray circular buffer.

Related, I'm really not sure what you mean about mutable + optimizing. Could you lay out that scenario/idea? I'd be happy to try it out.

As far as I could tell, if all the operations on the bucket counts were mutating an array, including zeroing at the end of collection, there wouldn't be a need to reallocate the bucket count storage during operations / collection. If so, it would seem to incentivize only having one type, long[] or at most two types since Java could still optimize bimorphic code paths well. If it's not possible to remove the allocations, though, then it is what it is and raises the importance of the adapting integers
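
As a minimal sketch of the mutate-in-place idea (hypothetical names; the SDK's accumulation types are not actually structured like this today):

// Hypothetical mutable bucket counts: one preallocated long[] that is recorded into,
// merged in place, and zeroed at the end of each collection, so no reallocation happens.
final class MutableBucketCounts {
  private final long[] counts;

  MutableBucketCounts(int maxBuckets) {
    this.counts = new long[maxBuckets]; // allocated once at initialization
  }

  void increment(int bucketIndex) {
    counts[bucketIndex]++;
  }

  // merge: fold another set of counts into this one without allocating.
  void mergeFrom(MutableBucketCounts other) {
    for (int i = 0; i < counts.length; i++) {
      counts[i] += other.counts[i];
    }
  }

  // zero: reset in place at the end of a collection cycle.
  void zero() {
    java.util.Arrays.fill(counts, 0);
  }
}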

Contributor Author:

Gradually going up in complexity tends to at least be a bit less scary, for example MapCounter -> long[] circular buffer -> then AdaptingIntegerArray circular buffer.

I shared (a few weeks back, when I first sent the perf benchmarks) a branch I had where I had all three of these.

The long[] counter is the clear winner for raw CPU performance. There is a minor performance degradation going to the Adapting array (as you call out, due to the switch). However, as shown in this PR, the Adapting use case saves a good bit of RAM in addition to remaining faster than MapCounter. Given the differential, and the sheer # of timeseries I see coming out of instrumentation histograms today, I think this RAM optimisation (or something like it) is worth it. Perhaps there's a different mechanism we can use to change the array implementation that would be more ideal in leading to (eventual) monomorphic optimisation?

As far as I could tell, if all the operations on the bucket counts were mutating an array, including zeroing at the end of collection, there wouldn't be a need to reallocate the bucket count storage during operations / collection. If so, it would seem to incentivize only having one type, long[] or at most two types since Java could still optimize bimorphic code paths well. If it's not possible to remove the allocations, though, then it is what it is and raises the importance of the adapting integers

We could actually do this today. These counters are mutable, we only need them immutable in the collection route. Instead of "new"-ing up the counters, we can just create copies to expose externally and then call .zero. Happy to do that in this PR or another.
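
A sketch of that copy-then-zero collection step (hypothetical names and method shapes, not this PR's actual API):

// At collection time: hand exporters an immutable copy, then reset the live counter
// in place so its backing array is reused for the next interval.
ExponentialCounter collect(ExponentialCounter live) {
  ExponentialCounter snapshot = counterFactory.copy(live); // copy is safe to share with exporters
  live.zero();                                             // live counter keeps its allocation
  return snapshot;
}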

Contributor:

I think this RAM optimisation (or something like it) is worth it.

300 buckets * 8 bytes is ~2.4 KB per histogram; 2,000 histograms in ~5 MB seems like a lot of histograms for what, in Java, is not a lot of RAM.

5 MB of allocations per cycle, on the other hand, is more cause for concern, which is where the point about mutability comes in. If reallocations can be avoided, then I really feel the case for an adapting array becomes weaker.

Anyway, I already approved since I don't care about this code that much. It's just that I've generally seen situations that jump to the most complex solution fail to scale back to a simpler one even when that would have been more appropriate; going in steps tends to work better for not getting too deep into a complexity hole.


/**
* These are extra test cases for buckets. Much of this class is already tested via more complex
* test cases at {@link DoubleExponentialHistogramAggregatorTest}.
*/
public class DoubleExponentialHistogramBucketsTest {

@Test
void testRecordSimple() {
static Stream<ExponentialBucketStrategy> bucketStrategies() {
Contributor:

TIL about @MethodSource, nice.

ExponentialCounterFactory.mapCounter(), ExponentialCounterFactory.circularBufferCounter());
}

@ParameterizedTest
jamesmoessis (Contributor), Jan 13, 2022:

There are one or two more tests we need once we have sufficient implementations of BucketMapper - since the current implementation can be inaccurate in certain cases. I can get to that in the next couple of weeks.

Contributor Author:

That'd be great. I was looking at the Go code (which I based this on), and there are some big, complex optimisations to halve the cost of the log lookup. That may be the next thing in the critical path here.

Contributor:

This is the ticket: #3842

@@ -22,14 +23,23 @@
implements Aggregator<ExponentialHistogramAccumulation> {

private final Supplier<ExemplarReservoir> reservoirSupplier;
private final ExponentialBucketStrategy bucketStrategy;

DoubleExponentialHistogramAggregator(Supplier<ExemplarReservoir> reservoirSupplier) {
Contributor:

What do you foresee the instrument / user API looking like for this? Something like new ExponentialHistogramInstrument(maxbuckets, startingScale)?

Contributor Author:

There's a discussion on the specification right now, but I'll give you my thoughts:

  1. If we make a user specify anything by default, we've failed (fundamentally) on the goal of these histograms.
  2. If users want to specify to optimise their use case, it should match their expectations of the data. Specifically, we can allow:
    • Users may give us a dynamic (base) range, e.g. I want to record between (0.0, 4000.0).
    • Users may give us the maximum # of buckets (or RAM) they want to use to store histograms.
    • The SDK would calculate the initial scale + bucket maximum based on these values (a rough sketch follows below).
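
A back-of-the-envelope sketch of how an SDK could derive the starting scale from a user-supplied range and bucket budget (illustrative only, not anything specified here; assumes base-2 exponential buckets where base = 2^(2^-scale) and a strictly positive lower bound):

static int initialScaleFor(double lower, double upper, int maxBuckets) {
  // At scale 0, the number of buckets needed to cover (lower, upper] is log2(upper / lower).
  double bucketsAtScaleZero = Math.log(upper / lower) / Math.log(2);
  // Each +1 in scale doubles the bucket count, so pick the largest scale that still fits.
  return (int) Math.floor(Math.log(maxBuckets / bucketsAtScaleZero) / Math.log(2));
}

For example, initialScaleFor(1.0, 4000.0, 320) works out to a starting scale of 4.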

result = byteBacking[idx] + count;
if (result > Byte.MAX_VALUE) {
// Resize + add
resizeToShort();
Contributor:

Thinking of an edge case where result is, for this case, higher than Short.MAX_VALUE, in which case resizeToShort() will cause overflow.

Contributor Author:

This could be optimised, but effectively it doesn't write the new value. It resizes to short THEN attempts to write again, where it'd detect the overflow. We can fix this so it resizes to the appropriate size straight away; happy to fix that in this PR if you'd like.
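
One possible shape of that fix (field and resize method names are assumed to mirror the snippet above; this is a sketch, not the merged code):

void increment(int idx, long count) {
  long result = (long) byteBacking[idx] + count;
  if (result > Byte.MAX_VALUE) {
    // Jump straight to the narrowest width that can hold the new value,
    // instead of promoting one step at a time and retrying each step.
    if (result > Integer.MAX_VALUE) {
      resizeToLong();
    } else if (result > Short.MAX_VALUE) {
      resizeToInt();
    } else {
      resizeToShort();
    }
    increment(idx, count); // retry against the wider backing array
  } else {
    byteBacking[idx] = (byte) result;
  }
}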

Contributor:

Ah, yeah I missed the recursive call, so it's correct. If you want to optimise it then go ahead, but I'm not too fussed seeing as it's a super edge case and not that much more CPU.

Contributor Author:

In reality, we're mostly bumping counts by 1, so I'm not sure any changes will make it more optimal outside of preserving integer-size once a specific histogram "bumps".

anuraaga (Contributor) left a comment:

Is there a commit missing from the PR corresponding to the "Add performance benchmarks." in the description?


*
* <p>Instances start by attempting to store one-byte per-cell in the integer array. As values grow,
* this will automatically instantiate the next-size integer array (byte => short => int => long)
* and copy over values into the larger array. This class expects most usage to remain within the
anuraaga (Contributor), Jan 14, 2022:

Is this a valid expectation? If collection interval is 1s, then that's still only 128 recordings per bucket. 128 qps is quite easy on a single app and even if not all of it will be in the same bucket, it seems reasonable to expect 128 to make it in. 32K qps is much less common though so if sticking with type-switching, I would probably remove the BYTE case and start with SHORT. Though I'm suspicious of the general real world performance benefit of type-switching even if synthetic benchmarks can find cases where it helps.

Contributor Author:

Generally, a lot of latency-based monitoring matches a normal distribution (or a sum of a few normals). Additionally, if you look at our actual client implementation, we tend to have two types of endpoints: Those infrequently used (e.g. periodic maintenance type RPCs, metadata things, infrequent user operations, etc.) + those with heavy traffic (core use case).

We need to optimise across the board, and we're creating a LOT of distributions, as we track them by HTTP endpoint right now in instrumentation. So even if a server has 128 qps, we need endpoints to also see that, and we need to optimise for high+low traffic scenarios with our defaults.

I do think it's worthwhile to put into place "detection" for these high-qps latency distributions where we can:

  • Preserve the previous scale (assuming we'll be seeing similar traffic in the next collection cycle).
  • Preserve the previous sized integer (I was going to work on that as a future optimisation).

For the purposes of this PR though, we're already seeing better performance than MapCounter and I think we have room to go further.

Member:

Not sure if this changes the conversation, but the default collection interval is 60s.

Contributor Author:

Not really.

Assume a normal distribution (a simple HTTP server usually looks like this; in practice you get sum-of-normal distributions when you look at latency across the various paths a request can take, which alter the overall distribution).

We have 320 buckets to represent counts in boundaries within this normal distribution. We know about 68% of counts will occur within one standard deviation of the mean. It's hard to say how good a job the exponential algorithm's "tuning" can do at aligning its bucket thresholds + scale against the standard deviation, but let's assume we're halfway decent, meaning 160 buckets fall within this boundary.

32K qps means we'll have 1,920,000 measurements in a minute (32K qps * 60 s). If we divide by 160 buckets, that averages about 12,000 per bucket (naively). Back-of-the-envelope (we could use the normal distribution function to refine this), I'd argue we would probably see peaks around 20k per bucket. This is still below the ~32k we get from short alone, and that's for something with relatively high QPS.

Reversing this, if we average ~100 counts per bucket to fit within the byte boundary, I'd say we can measure upwards of 400-500 qps with the byte array.

Remember that, by semantic convention, we're measuring URL Paths with individual timeseries, not just a single web endpoint. This means, e.g. in a REST application we need to allocate a histogram for every single possible endpoint (including things like GET /entity/{id}). It's very likely the <500 qps optimisation in memory usage pays off. It's entirely likely we never need to upgrade to long for real world usage.

This is all back-of-the-envelope math. I'm sure someone else could do a better calculation for a more realistic example.
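
Restating the arithmetic above as a quick sanity check (illustrative only; the numbers are the ones from this thread):

// 32K qps sustained over a 60 s collection interval:
long perMinute = 32_000L * 60;                 // 1,920,000 measurements per collection
long activeBuckets = 160;                      // optimistic: ~half of the 320 buckets see traffic
long avgPerBucket = perMinute / activeBuckets; // = 12,000 -- fits in a short, overflows a byte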

Member:

Makes sense. If we follow through with the optimization to preserve the previously sized integer, I'd expect it to be nicely adapted to a variety of use cases.

Contributor Author:

I added that optimisation in this PR.

jsuereth (Contributor Author):

@anuraaga Is there a commit missing from the PR corresponding to the "Add performance benchmarks." in the description?

Yes. We only need to add configuration to the enum values for the benchmark though, as the code/logic for generating data and running the benchmark already exists. See HistogramAggregationParam.java


codecov bot commented Jan 17, 2022

Codecov Report

Merging #4087 (bd8a251) into main (03c41ec) will increase coverage by 0.06%.
The diff coverage is 93.61%.


@@             Coverage Diff              @@
##               main    #4087      +/-   ##
============================================
+ Coverage     90.21%   90.28%   +0.06%     
- Complexity     4515     4605      +90     
============================================
  Files           533      537       +4     
  Lines         13831    14091     +260     
  Branches       1321     1347      +26     
============================================
+ Hits          12478    12722     +244     
- Misses          923      926       +3     
- Partials        430      443      +13     
Impacted Files Coverage Δ
.../aggregator/DoubleExponentialHistogramBuckets.java 88.23% <91.42%> (-1.87%) ⬇️
.../internal/state/AdaptingCircularBufferCounter.java 91.80% <91.80%> (ø)
...lemetry/sdk/metrics/internal/state/MapCounter.java 94.54% <92.30%> (-1.29%) ⬇️
...k/metrics/internal/state/AdaptingIntegerArray.java 93.75% <93.75%> (ø)
...gregator/DoubleExponentialHistogramAggregator.java 100.00% <100.00%> (ø)
...internal/aggregator/ExponentialBucketStrategy.java 100.00% <100.00%> (ø)
...rics/internal/state/ExponentialCounterFactory.java 100.00% <100.00%> (ø)
.../sdk/autoconfigure/MeterProviderConfiguration.java 53.84% <0.00%> (-10.86%) ⬇️
...ension/trace/jaeger/sampler/OkHttpGrpcService.java 75.30% <0.00%> (-6.18%) ⬇️
... and 19 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data


private AdaptingIntegerArray(int size, ArrayCellSize cellSize) {
this.cellSize = cellSize;
switch (cellSize) {
Contributor:

Can you add tests for the missed lines in this class, or if they don't have a short term need maybe remove the unused methods? There are some methods in other classes too

Contributor Author:

Will do. I think I added these in preparation for preserving cell size between collection intervals; I'll either add tests now or move that to a separate branch/PR fully.


/** Convert from byte => short backing array. */
@SuppressWarnings("NullAway")
private void resizeToShort() {
this.shortBacking = new short[this.byteBacking.length];
Contributor:

Let's build up in a local variable and assign to the field

Contributor Author:

Is there a reason, or is this a style thing?

Contributor:

Mostly style (set up, then apply). It also makes the function more future-proof: if locking or volatility changes in the future it doesn't matter, since we see this function acting atomically, so it's easier to reason about.

jack-berg (Member) left a comment:

Couple of small suggestions but looks good. 👍

import io.opentelemetry.sdk.metrics.internal.state.ExponentialCounterFactory;

/** The configuration for how to create exponential histogram buckets. */
interface ExponentialBucketStrategy {
Member:

Why have an interface here with the single implementation?

Contributor Author:

Refactoring left-overs. I first added this and gutted all the usage, then wound up only making one implementation. I can fix this up, however I'd also be happy to just remove MapCounter unless someone objects. Memory usage/allocations were pretty high, and I think a circular buffer is likely more performant for most histogram use cases.

Member:

I'd also be happy to just remove MapCounter unless someone objects.

No objections from me.


package io.opentelemetry.sdk.metrics.internal.state;

public interface ExponentialCounterFactory {
Member:

Let's make this ExponentialCounterFactory<T extends ExponentialCounter>, and have:

T newCounter(int maxSize)
T copy(T other)

That will simplify the constructor public AdaptingCircularBufferCounter(ExponentialCounter toCopy) which currently tries to accommodate other implementations of ExponentialCounter. Is there a situation where multiple implementations would be used in tandem?
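
The generic shape being suggested would look roughly like this (a sketch of the suggestion in this comment, not the merged code):

public interface ExponentialCounterFactory<T extends ExponentialCounter> {
  /** Creates a new, empty counter that can hold at most maxSize buckets. */
  T newCounter(int maxSize);

  /** Creates a deep copy of a counter previously produced by this factory. */
  T copy(T other);
}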

Contributor Author:

The DataDog histogram swaps between Dense/Sparse counter implementations. However, they're using double + partial counts for shuffling/resizing dynamically, which isn't an option for us.

I left the ability for us to mix/match dense (circular buffer) with sparse (map counter) to see if we can get any gains, but I wasn't able to find a scenario where this was performant, especially given how we expose dense-buckets in OTLP, and our most likely use case (latency monitoring) tends to follow normal-distributions.

Contributor Author:

Also, I started wiring this through, and it actually needs the type parameter to be embedded pretty far up the hierarchy, to the point where you have ExponentialHistogramAggregation<T>, where T = the counter type. It starts to feel pretty clunky and weird. Let me know if:

  1. You'd like me to continue with that change.
  2. I can just delete MapCounter and remove a lot of the factory/strategy options.
  3. We leave it as it is.

Contributor:

We can remove MapCounter in this PR if it makes things easier

Member:

Don't feel strongly about this. Is it useful to keep MapCounter around for future performance comparisons or has it officially run its course?

Can also just have public AdaptingCircularBufferCounter(ExponentialCounter toCopy) throw if !(toCopy instanceOf AdaptingCircularBufferCounter).

Contributor Author:

Sorry for AFK on this. You'll have to bear with my availability this month :(.

My opinion is that MapCounter performs poorly enough that we can oust it for now. There might be a world where a sparse vector could perform well here, but not with the currently expected defaults for exponential histograms (specifically 320 buckets). Today, though, it puts serious strain on the garbage collector, as we're duplicating a large map of atomic integers frequently.

Given @anuraaga's general concerns, I've still been spending time thinking through alternative implementations and mechanisms here. Specifically, I want to prioritize a few optimisations in metrics:

  1. Figuring out a way to avoid as many allocations in the collection path. As @anuraaga called out, allocating 5MB of RAM every 1 minute (or less) is unacceptable.
  2. Trying to leverage "dumb Java code" that HotSpot optimises well, as much as possible. While I still think the layered integer is necessary for RAM savings at our current bucket defaults, I would still love a simpler implementation if it didn't come with 2-4x overhead in RAM.

I'll try to shore up this PR's remaining comments. Right now it looks like I should be able to attend the SIG this week, so perhaps we can discuss future work in person.

Contributor Author:

Actually, I had some ideas where having MapCounter around (for the collection path) may be a decent optimisation. Specifically, for low-frequency histogram timeseries, if we detect that only a few buckets have values (e.g. fewer than 5, or the count is 0), we can optimise the collection path by NOT allocating a new huge integer array and just returning the sparse (or empty) counter implementation. However, I'll want to benchmark that to make sure it's actually worth it.

I'd like to shore up other comments in this PR (for toCopy, it's actually important it works on all ExponentialCounter implementations), and then I'll work on a new PR that measures performance during collection we can use to identify and tune allocations in that path.


jkwatson (Contributor):

Any reason not to go ahead and merge this?

- Fix a bug in equality where it forced ExponentialCounter to have the same offset, even if it had stored 0 counts in all buckets. This interacted negatively with merge/diff tests, where a freshly created exponential bucket would have a different indexStart than one produced by diff-ing another.
- Modify the default exponential bucket counter to be the adapting circular buffer.
- Remove some not-well-thought-out methods (like zeroOf, zeroFrom) in favor of a "clear" method on ExponentialCounter.
- Modify ExponentialBucketStrategy to be an actual implementation.
jsuereth force-pushed the jsuereth/wip-expontential-benchmark-and-comparisons branch from b38a8f0 to d31a643 on January 26, 2022 18:13
jsuereth (Contributor Author):

@jkwatson Give me another few hours (maybe 24) to do some cleanup. Given the discussion, I'm removing some unnecessary interfaces and improving some testing that was a bit too tied to the MapCounter implementation.

jsuereth (Contributor Author):

Ok, this should be ready. Further cleanups needed, but I need to flesh out a collection-time benchmark first so we can measure performance on that path. Right now it's pretty much a huge memory hog.

anuraaga (Contributor) left a comment:

Left a few places to double-check; if those are fine, this should be good to merge.

* </ul>
*/
private boolean sameBucketCounts(DoubleExponentialHistogramBuckets other) {
int min = Math.min(this.counts.getIndexStart(), other.counts.getIndexStart());
Contributor:

Can you confirm this logic? Shouldn't it be something like a = this.start, b = other.start, a <= this.end && b <= other.end; a++ && b++)?

It looks like we could refer to an index that is below start for one of the two buckets right now

Contributor Author:

Yes, actually that's the point.

What can happen right now (with diff) is that we have a bucket, e.g. at index 2 which "diffs" to 0. In the unit tests, we instantiate the histogram-buckets by saying "You should see count 5 at index 3", which means we have one bucket with index start @ 3 and another with index start @ 2. They're semantically equivalent counts.

The previous MapCounter implementation actually did GC so any count that reached zero was removed from the Map. Unfortunately, that's not really feasible in the circular buffer implementation.
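
The comparison being described — scan the union of both index ranges and treat indices outside a counter's populated window as zero — looks roughly like this (a simplified reconstruction for illustration, not the exact merged code):

private boolean sameBucketCounts(DoubleExponentialHistogramBuckets other) {
  int min = Math.min(this.counts.getIndexStart(), other.counts.getIndexStart());
  int max = Math.max(this.counts.getIndexEnd(), other.counts.getIndexEnd());
  for (int idx = min; idx <= max; idx++) {
    // get() returns 0 for indices outside a counter's window, so two bucket sets with
    // different indexStart but the same non-zero counts compare as equal.
    if (this.counts.get(idx) != other.counts.get(idx)) {
      return false;
    }
  }
  return true;
}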

jamesmoessis (Contributor), Jan 30, 2022:

This seems to be what you are saying:

bucket counts: 
0,0,5,1,2,0,0, == 5,1,2

Makes sense to me.

Contributor:

Ah I didn't notice the change from throwing to returning 0 for those indices. Can you confirm there is a unit test for this case?

Contributor Author:

added

*/
private boolean sameBucketCounts(DoubleExponentialHistogramBuckets other) {
int min = Math.min(this.counts.getIndexStart(), other.counts.getIndexStart());
int max = Math.max(this.counts.getIndexEnd(), other.counts.getIndexStart());
Contributor:

Should this be referencing other.getIndexEnd()?

Contributor Author:

ouch, good catch. I got lucky in the tests based on how it orders equals, will fix.

Contributor Author:

fixed

Contributor:

Can we add the unlucky unit test for this?

Contributor Author:

I'd rather find a way to make the unit tests NOT care about equality here, but yeah, I'll see if I can get a better unit test (see my comment to James on alternatives here).

Contributor Author:

Updated the unit test. I think we may be able to do some refactoring of ExponentialHistogramBuckets at some point to clean up this code, and hopefully ensure we're not relying on hashCode/equals in production.

// If toCopy is an AdaptingCircularBuffer, just do a copy of the underlying array
// and baseIndex.
if (toCopy instanceof AdaptingCircularBufferCounter) {
this.backing = new AdaptingIntegerArray(((AdaptingCircularBufferCounter) toCopy).backing);
Contributor:

toCopy indicates we're supposed to copy but this doesn't appear to copy, is it safe?

Contributor Author:

It does copy; it's using the convention of passing the thing to copy into the constructor, so new AdaptingIntegerArray(otherIntegerArray) performs a copy. Is there a different convention I should be using for that?

Contributor:

I think the issue is that the else statement does the copy in this code while the if doesn't. Both should have consistent semantics to prevent confusion like mine, is it possible to copy the backing here instead of relying on the constructor?

jsuereth (Contributor Author), Jan 31, 2022:

They actually both copy, just via different mechanism.

Edited: Just noticed your comment. Do you mean you want me to add a .copy method to AdaptingIntegerArray instead of the constructor variant, to make it more obvious that it's copying? I guess my original question is asking whether new Collection(otherCollection) isn't already the Java convention for copying things. I'm happy either way, it just feels unidiomatic.

Contributor:

I think having a .copy method would be cleaner, or otherwise even a comment would be fine too. Currently it's not obvious that this code block is copying, due to the significant difference in pattern between the two if / else branches.
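
One possible shape for that (hypothetical sketch of the suggestion, not the merged code):

/** On AdaptingIntegerArray: returns a deep copy, preserving the current cell size. */
AdaptingIntegerArray copy() {
  return new AdaptingIntegerArray(this); // the copy constructor performs the deep copy
}

// ...which lets the caller read as an explicit copy:
this.backing = ((AdaptingCircularBufferCounter) toCopy).backing.copy();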

Contributor Author:

Is the comment on line 31 not enough here?

I switched to using a direct copy method, but I'm confused a bit here, as I had already added those comments a while ago to help keep this section clear.

Contributor:

This reads much better, thanks. The issue with the comment is it says "do a copy" without the logic for the copy being there. If there was a comment "Constructor will copy the passed-in array" a bit below it, that could have been fine, but the copy method seems good.

return getBucketCounts().equals(other.getBucketCounts())
&& this.getOffset() == other.getOffset()
&& this.scale == other.scale;
return this.scale == other.scale && sameBucketCounts(other);
jamesmoessis (Contributor), Jan 30, 2022:

Why have you omitted the offset from hashCode() and equals()? I would think offset needs to be the same for two bucket sets to be strictly equal.

edit: oh I see, you are comparing across an extended window of both buckets. I suppose two bucket sets can have different offset but represent the same measurements. The buckets with the lower offset must have all zeroes up until the first non-zero bucket on the other bucket set in order for this to be the case.

Contributor Author:

exactly the case. Additionally the other bucket needs to have ONLY zeros after the first bucket set ends.

We could fix this in the diff method with a "normalize" type method that ensures the first index is populated. I'm fine going that route if you think it's cleaner.

Contributor:

Definitely worth a comment in here, either way.

Contributor:

Yeah agreed, could use a comment.

Contributor Author:

Great point. I had documented the other method, but in-context, it wasn't very good. Added here.

jack-berg merged commit 5c1bd6c into open-telemetry:main on Feb 4, 2022
jsuereth deleted the jsuereth/wip-expontential-benchmark-and-comparisons branch on February 4, 2022 19:27