Adding two level hashing in metrics hashmap #1564
Conversation
Codecov Report
Attention: Patch coverage is
Additional details and impacted files:

@@           Coverage Diff           @@
##            main   #1564     +/-  ##
=======================================
+ Coverage   69.3%   69.6%    +0.3%
=======================================
  Files        136     136
  Lines      19637   19946     +309
=======================================
+ Hits       13610   13894     +284
- Misses      6027    6052      +25
Thanks! This is amazing! Let's sit on this for 1-2 days to see if there are any concerns before proceeding to do the same for other instrument types. Also, let's do this after the pending release.
@lalitb love the performance gains! Were you able to test this impl against some of the single-level hashmap alternative implementations?
@cijothomas If it's truly for discussion, a week seems like a more reasonable timeframe.
@hdost Do you have any alternative implementations in mind? I can test them. I tested dashmap:
As tested separately, dashmap was much faster for concurrent read-only operations, but our scenario is concurrent updates. Also, we were using dashmap in the old metrics implementation, but it was removed in the new PR (I didn't find the reason in the history). In general, I'm reluctant to take a dependency on any external crate unless it is widely used :)
From what I can see there's no effect on the public interface, so let's take the win for now and see if we can make further improvements later 👍
} else {
    // TBD - Update total_count ??
    values
        .entry(STREAM_OVERFLOW_ATTRIBUTE_SET.clone())
Do we need to store the overflow attribute in each bucket, or only once?
The overflow attribute is handled separately and is stored only once.
I am not sure how this is handled; it looks to me like we'll store the overflow entry in each of the hashmaps.
Good point, this was indeed overflowing in every hashmap. I have fixed it now, and will also add tests to validate this.
I see you've got some further improvements; how's the performance looking for you?
The additional improvements come from (optionally) using fast hashing with hashbrown/ahash. I will update the numbers later from the same machine I used as the baseline earlier. As of now, I am replicating the same changes for the other aggregations.
Updated the latest throughputs in the PR description. The perf boost from hashbrown/ahash is marginal.
As discussed during the community meeting, I have added tests for overflow and concurrent measurement recording.
    .get_finished_metrics()
    .expect("metrics are expected to be exported.");
// Every collect cycle produces a new ResourceMetrics (even if no data is collected).
// TBD - This needs to be fixed, and then the assert below should validate for one entry
Will create an issue for this.
    .init();

// sleep for a random ~5 millis to avoid recording during the first collect cycle
// (TBD: need to fix PeriodicReader to NOT collect data immediately after start)
Will create an issue for this.
let bucket_guard = bucket_mutex.lock().unwrap();

let is_new_entry = if let Some(bucket) = &*bucket_guard {
    !bucket.contains_key(&attrs)
In the common path where the attributes already exist, we now have to acquire the lock once, do the lookup, release the lock, then re-acquire the lock and do the lookup + update.
Apart from the perf hit, this loses the atomicity of the update. It is possible that, between the time we release the lock and re-acquire it, other entries might have been added and the limit hit, so this attribute should be going into overflow.
We need to ensure atomicity and avoid the two-step lock-release-re-lock.
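For illustration only, here is a minimal sketch of the single-lock lookup-or-update pattern this comment is asking for (hypothetical types, with a plain u64 sum standing in for the real aggregation), where the existence check and the write happen under one lock acquisition:

```rust
use std::collections::HashMap;
use std::sync::Mutex;

// Hypothetical attribute-set key type; the real code uses AttributeSet.
type Attrs = Vec<(String, String)>;

fn record(bucket: &Mutex<HashMap<Attrs, u64>>, attrs: Attrs, value: u64) {
    // One lock acquisition covers both the existence check and the write, so no
    // other writer can change this bucket between the lookup and the update.
    let mut guard = bucket.lock().unwrap();
    *guard.entry(attrs).or_insert(0) += value;
}
```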
Good observation; I made the required changes. All operations should be (theoretically) atomic now, and perf should be fast for the common scenario where the attributes already exist.
let (bucket_index, attrs) = if under_limit {
    (bucket_index, attrs) // the index remains the same
} else {
    // TBD - Should we log, as this can flood the logs ?
We shouldn't log; the existing code was doing this incorrectly.
};

// Lock and update the relevant bucket
let mut final_bucket_guard = self.buckets[bucket_index].lock().unwrap();
This won't work, because by the time we reach and acquire this lock, collect() may have been triggered and wiped the hashmap clean, and other updates could have occurred as well, making our assumption that this attribute is not an overflow invalid.
let under_limit = self.try_increment();
It is entirely possible that this thread loses its CPU right after the above statement is executed. When it gets a chance to execute again, under_limit could have changed already. We need to do this whole thing atomically.
There are two changes in the design:
- We increment before the actual insert.
- The increment, i.e. try_increment(), is atomic.
There is a narrow window where a measurement with a new set of attributes goes to the overflow index even when there is an empty slot (this can happen during collect), which should be fine. However, there won't be a scenario where the number of data points in the map exceeds the cardinality limit.
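As an illustration of the second point, here is a minimal sketch of what an atomic, limit-respecting try_increment() could look like (hypothetical names and signature, not necessarily the PR's exact code): a compare-exchange loop only advances the counter while it is below the limit, so concurrent callers can never push it past the cardinality limit.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Returns true if a slot was reserved (the count stays <= cardinality_limit),
// false if the caller should fall back to the overflow entry.
fn try_increment(unique_count: &AtomicUsize, cardinality_limit: usize) -> bool {
    let mut current = unique_count.load(Ordering::Acquire);
    loop {
        if current >= cardinality_limit {
            return false; // limit reached: do not advance the counter
        }
        match unique_count.compare_exchange_weak(
            current,
            current + 1,
            Ordering::AcqRel,
            Ordering::Acquire,
        ) {
            Ok(_) => return true,            // reservation succeeded
            Err(actual) => current = actual, // lost a race: retry with the fresh value
        }
    }
}
```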
And thanks for the thorough review of this PR, thinking of all possible scenarios and edge conditions :)
Take this example:
thread1 is executing update() with attributes (foo=bar); the current total_unique is 100, with the limit at 2000.
Say bucket_index is 4.
Take the lock.
(foo=bar) is not present.
Drop the lock.
try_increment() is called, which increments total_unique to 101. That is within the limit, so this thread will go on to update bucket_index 4.
==== thread1 lost its CPU ====
// in other threads
collect() occurred and it reset total_unique to 0.
Other update()s occurred and took total_unique to 2000 (limit hit). None of these update()s had (foo=bar) as an attribute.
At this stage (foo=bar) should be going into overflow.
==== thread1 got back its CPU ====
It takes the lock to update bucket_index 4 with (foo=bar). But this is not correct; (foo=bar) should be going to overflow only.
> collect() occurred and it reset total_unique to 0
It won't reset to 0; it will only adjust (decrement) according to the number of entries read/drained.
collect() reads and drains the entire hashmaps. Won't it hit 0 then?
In the above scenario, when the number of entries in the hashmap is 100 while unique_count is 101 (because of foo=bar), if collect happens before foo=bar is inserted, the count will get reduced to 1 (not 0), and then the insert happens during/after the collect.
Take the example where 5 threads try to update with (foo=bar), which does not currently exist. The current unique_count is 1999, and the limit is 2000.
All 5 threads see that the entry is not found, so they all attempt to increment unique_count.
The first thread succeeds and attempts to store into the correct bucket, but the remaining 4 threads will attempt to put foo=bar into overflow. I believe this is the part which you said might be okay?
However, consider the same scenario, but with the current unique_count at just 10.
All 5 threads see that the entry is not found, so they all attempt to increment unique_count.
All threads would succeed, so unique_count reaches 15 (though it should actually be 11).
One thread would then insert into the hashmap and the rest would update it.
collect() runs and drains the map, but when it drains (foo=bar), it'll reduce unique_count by 1 (from 15 to 14). So we now have 4 wasted entries. No matter how many collect() runs happen afterwards, we'll never reclaim those wasted slots... If we run the stress test long enough (maybe weeks), we might see all entries going to overflow, even though there is plenty of space.
Please confirm if the above is correct.
Just an update: I am still working on it, particularly on mitigating the challenges around concurrency and atomicity. The main issue is making sure that checking whether we're under the cardinality limit and then acting on it (like inserting a new entry) happen together, without any race conditions or inconsistencies due to concurrent updates. This turned out to be more complex than initially anticipated, especially when trying to do it without compromising performance. Still working on it; hoping to have a solution during this week :)
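For context, one possible shape of such an atomic check-and-insert, sketched with hypothetical names (this is not necessarily what the PR ended up doing): the decision is made while holding the bucket lock, and the shared counter is only advanced when an insert actually happens, so a failed reservation is rolled back rather than leaking a slot.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Mutex;

// Hypothetical attribute-set key type; the real code uses AttributeSet.
type Attrs = Vec<(String, String)>;

fn record(
    bucket: &Mutex<HashMap<Attrs, u64>>,
    unique_count: &AtomicUsize,
    cardinality_limit: usize,
    attrs: Attrs,
    value: u64,
) {
    let mut guard = bucket.lock().unwrap();

    // Common path: the attributes are already tracked, so the shared counter is untouched.
    if let Some(agg) = guard.get_mut(&attrs) {
        *agg += value;
        return;
    }

    // New attributes: reserve a slot while still holding the bucket lock, so the
    // "not present" observation and the insert cannot be separated by another writer.
    if unique_count.fetch_add(1, Ordering::AcqRel) < cardinality_limit {
        guard.insert(attrs, value); // counter and map stay in sync
    } else {
        // Roll back the failed reservation so the slot is not leaked; this thread
        // would record into the overflow entry instead (omitted in this sketch).
        unique_count.fetch_sub(1, Ordering::AcqRel);
    }
}
```

With this shape, concurrently failing reservations can still transiently push a few measurements into overflow early (the "narrow window" mentioned above), but every failed reservation is rolled back, so slots are never permanently wasted.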
Let us know if this is ready for another review. If not, we can mark as draft again. @hdost I am waiting for confirmation that this is ready for review, as we have open conversations that are not marked resolved.
Sorry for the delay. Will revisit it now. If it takes more time, will move it to draft.
There have been substantial perf improvements in metrics instrumentation with #1833. Will revisit this PR after the logs beta release. Closing this for now to keep the PR list minimal.
Changes
This PR is opened just for discussion of one of the possible changes to improve the performance of concurrent recording/updating of measurements in the hot path. In today's (Feb 20th, 2024) community meeting, we discussed multiple approaches:
- dashmap - a sharded hashmap.
- flurry - uses fine-grained atomic locks in its hashmap.
- sharded-slab - building a hashmap using the concurrent data structures provided by this crate.
- thread-local hashmaps in the hot path to record measurements, merged during collection. The collection would be a performance-heavy process, and the SDK doesn't control the life cycle of threads created by the application, so a thread exiting would also remove all of its aggregated data.
The approach in this PR is to modify the ValueMap to store values with two-level hashing to minimize lock contention. It's basically a simpler form of the sharding provided by dashmap.
Existing: (diagram in the original PR description)
PR: (diagram in the original PR description)
The first level of hashing distributes the values across a fixed set of 256 buckets. Each bucket is guarded by its own mutex and contains a second-level hash map storing the AttributeSet-to-aggregation mapping.
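Roughly, the layout described above looks like the following sketch (illustrative names only, not the actual ValueMap fields; the real code keys on AttributeSet and stores per-attribute aggregations):

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};
use std::sync::Mutex;

const BUCKET_COUNT: usize = 256;

// First level: a fixed set of 256 independently locked shards.
// Second level: each shard holds a plain HashMap from key to value.
struct TwoLevelMap<K, V> {
    buckets: Vec<Mutex<HashMap<K, V>>>,
}

impl<K: Hash + Eq, V> TwoLevelMap<K, V> {
    fn new() -> Self {
        Self {
            buckets: (0..BUCKET_COUNT)
                .map(|_| Mutex::new(HashMap::new()))
                .collect(),
        }
    }

    // The first-level hash only selects a bucket, so lock contention is limited
    // to threads that happen to land on the same shard.
    fn bucket_index(&self, key: &K) -> usize {
        let mut hasher = DefaultHasher::new();
        key.hash(&mut hasher);
        (hasher.finish() as usize) % BUCKET_COUNT
    }

    fn record(&self, key: K, value: V) {
        let idx = self.bucket_index(&key);
        // Only this shard is locked; updates to other shards proceed in parallel.
        self.buckets[idx].lock().unwrap().insert(key, value);
    }
}
```

The actual PR may differ in details (e.g. how buckets are initialized or the exact bucket count), but the essential idea is that contention is per bucket rather than on one global lock.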
Please note that we lose locality of reference during the collection cycle, as the OS can allocate these buckets in different segments of memory, so they may not be cached together. All of the above are limitations of any sharded hashmap (including dashmap).
This is the result of our metrics stress test on my machine. However, I don't think the results are really consistent across machines, so it would be good if someone else can test it too:
Based on the above results, with this PR the perf seems to increase significantly with the number of threads. On the main branch, performance increases with threads up to a threshold and then starts degrading.
Benchmark results: this PR won't improve performance there, as those are single-threaded tests. There is a slight regression from the two-level indirection, which is compensated (in fact improved) by hashbrown/ahash:
Hashbrown/ahash seem to be widely adopted crates in terms of the number of downloads, and also secure enough against external DoS attacks - https://github.com/tkaitchuck/aHash/wiki/How-aHash-is-resists-DOS-attacks
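For reference, switching the hasher is a small, type-level change; a minimal sketch assuming the ahash crate is added as a dependency (hashbrown's HashMap exposes an equivalent API):

```rust
// Illustrative only: swapping std's default SipHash hasher for ahash's
// RandomState on the map type. Assumes `ahash` is listed in Cargo.toml.
use ahash::RandomState;
use std::collections::HashMap;

type FastMap<K, V> = HashMap<K, V, RandomState>;

fn main() {
    let mut map: FastMap<&'static str, u64> = FastMap::default();
    map.insert("requests", 1);
    *map.entry("requests").or_insert(0) += 1;
    assert_eq!(map.get("requests"), Some(&2));
}
```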
Merge requirement checklist
- CHANGELOG.md files updated for non-trivial, user-facing changes