feat: add storage processing time metrics #1177
Conversation
A few concerns, but I'm fine with going with this for now (since it's better than not having anything).
Also, I didn't look at the grafana config, since I know nothing about it.
@@ -233,6 +238,7 @@ impl HistoryStorage {
            return Err(err);
What happens with the timer when we exit early? Is the time automatically recorded when it goes out of scope, or is the entire event skipped?
I think it would be useful for our metric to have an additional param (together with protocol and storageFunction) to indicate whether the operation was successful.
The code might not be elegant (we would have to manually catch all usages of the ? operator), but maybe we can refactor it and have the metric measurements in a single function.
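For illustration, something like this is what I mean (rough sketch only; the is_success label name and this call site are hypothetical, not part of this PR):

use prometheus::{histogram_opts, register_histogram_vec_with_registry, Registry};

fn register_storage_metric(registry: &Registry) -> anyhow::Result<prometheus::HistogramVec> {
    // Hypothetical: add an "is_success" label next to the existing ones.
    let storage_process_time = register_histogram_vec_with_registry!(
        histogram_opts!(
            "trin_storage_process_time",
            "Time taken by storage function calls"
        ),
        &["protocol", "function", "is_success"],
        registry
    )?;
    Ok(storage_process_time)
}

// A call site would then record against the matching label set, e.g.:
// storage_process_time
//     .with_label_values(&["history", "total_entry_count", "true"])
//     .observe(elapsed_secs);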
If something fails nothing is recorded. We have logs for that.
@@ -182,11 +182,15 @@ impl HistoryStorage {
    }

    fn total_entry_count(&self) -> Result<u64, ContentStoreError> {
        let timer = self
            .metrics
            .start_storage_process_timer("total_entry_count");
        let conn = self.sql_connection_pool.get()?;
The metric would capture the 'self.sql_connection_pool.get()' call.
My understanding is that here we are waiting to get a connection to the db. Is there a limit on the number of connections, so that we sometimes wait here, or is it usually instant?
I'm not sure this should qualify as time spent as part of this storage operation (in case db is blocked by other queries).
There is a read-write lock over the storage instance. If self.sql_connection_pool.get() is blocking to the point that it matters, something is wrong with our code, and that will be visible in the graphs, which is what we want. If it is blocking, that is important information.
We want to know if these calls are slow, and self.sql_connection_pool.get() is a part of that.
Doesn't a read-write lock allow multiple (unlimited?) reads at the same time? If that's the case, and sql_connection_pool has a limit, then we could be blocked on one request due to other requests taking a long time.
Whether that's desired depends on how you interpret the graphs. If there is congestion, all query metrics might go up, while in fact the problem could be with only one type of query.
If you time the requests without getting a connection from the pool, you know how long each request takes, and if there is congestion you will see exactly what is causing it. It's true that in this case you don't know how affected the other queries are, so there is a trade-off.
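For concreteness, timing only the query would just mean reordering the two calls from the hunk above, something like this (sketch; the rest of the function stays unchanged):

fn total_entry_count(&self) -> Result<u64, ContentStoreError> {
    // Acquire the connection first so pool wait time is not measured...
    let conn = self.sql_connection_pool.get()?;
    // ...and only start timing once we actually have a connection.
    let timer = self
        .metrics
        .start_storage_process_timer("total_entry_count");
    // (rest of the function unchanged)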
Either way, I'm fine with whatever you decide.
https://www.sqlite.org/whentouse.html#:~:text=SQLite%20supports%20an%20unlimited%20number,than%20a%20few%20dozen%20milliseconds.
SQLite supports an unlimited number of simultaneous readers
Apparently there are unlimited readers, but only one writer, in SQLite.
The r2d2 crate that we use has a max number of connections.
Maybe we can report a new metric on how many connections are being used in a future PR. If we are hitting the max number of connections and that is causing delays, I think there is a problem with our code. My goal with these stats is to see how long it takes to execute the storage functions. Getting a connection to the database is included in that, as that is what the code calling the storage functions has to deal with.
We need to think about how we can best solidify our grafana json template workflow. Here are the places (that I can think of right now) where we use the trin-metrics grafana template:
- trin-bench
- kurtosis
- running metrics locally
If we're adding a graph / making changes to the dashboard, we want to make sure that the graph is being updated / deployed in all 3 situations, while maintaining the appropriate data source ids.
trin-metrics/src/storage.rs
Outdated
@@ -21,6 +23,14 @@ const BYTES_IN_MB_F64: f64 = 1000.0 * 1000.0;

impl StorageMetrics {
    pub fn new(registry: &Registry) -> anyhow::Result<Self> {
        let storage_process_time = register_histogram_vec_with_registry!(
            histogram_opts!(
                "trin_storage_process_time",
iirc, it's convention to have a suffix of the unit of measurement; I'd go for something like storage_process_timer_secs. Also, timer seems a bit more intuitive than time imo.
Aren't Histograms only able to be in seconds? I'm fine with adding this, let me know.
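(For reference: Histogram::observe itself accepts any f64, but the start_timer() helper records the elapsed time in seconds, so a _secs suffix would describe what is actually observed. A minimal sketch, with an illustrative metric name:)

use prometheus::{Histogram, HistogramOpts, Registry};

fn example() -> anyhow::Result<()> {
    let registry = Registry::new();
    let histogram = Histogram::with_opts(HistogramOpts::new(
        "trin_storage_process_time_secs",
        "Storage call duration in seconds",
    ))?;
    registry.register(Box::new(histogram.clone()))?;

    let timer = histogram.start_timer();
    // ... the storage work being measured goes here ...
    timer.stop_and_record(); // observes the elapsed duration, in seconds
    Ok(())
}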
-    /// Public method for looking up a content value by its content id
-    pub fn lookup_content_value(&self, id: [u8; 32]) -> anyhow::Result<Option<Vec<u8>>> {
+    /// Internal method for looking up a content value by its content id
+    fn lookup_content_value(&self, id: [u8; 32]) -> anyhow::Result<Option<Vec<u8>>> {
If something fails nothing is recorded. We have logs for that.
Can you elaborate on that here? If lookup_content_value() fails, then it's not obvious to me where we're emitting these logs. Also, just to clarify, if the process timer is never stopped, then nothing is recorded?
This error then gets forwarded to the overlay, which then reports the error.
This timer can be stopped and observed at most once, either automatically (when it goes out of scope) or manually. Alternatively, it can be manually stopped and discarded in order to not record its value.
@KolbyML My understanding from the docs is that if it goes out of scope (e.g. what happens in this case when lookup_content_value()? fails) then it will record the time when it is dropped and count it as an observation. Seems like we probably need to handle these cases and use stop_and_discard() on failures? Otherwise we'll end up with false observations skewing the histogram?
Good catch, I should have read the documentation a bit more 😆.
I just pushed a fix for this issue by implementing a CustomHistogramTimer which on drop does stop_and_discard() instead of stop_and_record().
The reason I chose to fix the problem this way is that, after a lot of thinking, I came to the conclusion that this was the cleanest way to achieve our goal while staying within the structural design of our project in relation to how we handle metrics.
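Roughly, the shape of it is the following (simplified sketch, not the exact code; the observed flag, constructor, and observe_duration here are illustrative, while the method signatures follow the fragments further down):

use std::time::Instant;
use prometheus::Histogram;

#[must_use = "Timer should be kept in a variable otherwise it cannot observe duration"]
#[derive(Debug)]
pub struct CustomHistogramTimer {
    /// The histogram the elapsed time is recorded into.
    histogram: Histogram,
    /// Set once the timer has been stopped (recorded or discarded).
    observed: bool,
    /// When the timer was started.
    start: Instant,
}

impl CustomHistogramTimer {
    pub fn new(histogram: Histogram) -> Self {
        Self {
            histogram,
            observed: false,
            start: Instant::now(),
        }
    }

    /// Stop the timer and record the elapsed time in the histogram.
    pub fn stop_and_record(mut self) -> f64 {
        self.observe(true)
    }

    /// Stop the timer and throw the measurement away.
    pub fn stop_and_discard(mut self) -> f64 {
        self.observe(false)
    }

    /// Observe and record the timer duration (in seconds), consuming the timer.
    pub fn observe_duration(self) {
        self.stop_and_record();
    }

    fn observe(&mut self, record: bool) -> f64 {
        let elapsed = Instant::now().saturating_duration_since(self.start);
        let seconds = elapsed.as_secs_f64();
        self.observed = true;
        if record {
            self.histogram.observe(seconds);
        }
        seconds
    }
}

impl Drop for CustomHistogramTimer {
    fn drop(&mut self) {
        // Unlike prometheus' HistogramTimer, dropping without an explicit
        // stop_and_record() discards the measurement instead of recording it.
        if !self.observed {
            self.observe(false);
        }
    }
}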
I am not exactly answering this question, but I think this point should be made clear. Currently we use the same dashboard for a single-node view and a multi-node view. I think we should instead support two: we already have a multi-node dashboard, but we currently don't have a dashboard optimized to get an in-depth look at one node.
Ready for another look @njgheorghita
@njgheorghita I am ready for another review. The cargo-fmt CI job broke because of an upstream change to the nightly docker executor in CI; here is a PR for that: #1179
trin-metrics/src/utils.rs
Outdated
            self.observe(false);
        }
    }
}
Now that we're implementing custom logic, this seems worth adding some unit tests to validate the new functionality
Ok, I added a test. I took the pre-existing test prometheus has for this and modified it to work for our CustomHistogramTimer:
https://docs.rs/prometheus/0.13.3/src/prometheus/histogram.rs.html#1217-1260
I also added 3 comments which describe the functionality that is special, i.e. not counting on drop.
@@ -268,7 +273,7 @@ impl HistoryStorage {
                "Capacity reached, deleting farthest: {}",
                hex_encode(id_to_remove)
            );
-            if let Err(err) = self.evict(id_to_remove) {
+            if let Err(err) = self.db_remove(&id_to_remove) {
It looks like the debug log below is inaccurate.
debug!("Error removing content ID {id_to_remove:?} from db: {err:?}");
trin-metrics/src/utils.rs
Outdated
@@ -0,0 +1,135 @@
use std::time::Instant;
nitpick: this feels like more than a simple util imo, and deserves its own named module, something like trin-metrics/src/timer.rs
trin-metrics/src/utils.rs
Outdated
/// This timer can be stopped and observed at most once manually.
/// Alternatively, if it isn't manually stopped it will be discarded in order to not record its
/// value.
#[must_use = "Timer should be kept in a variable otherwise it cannot observe duration"]
Nice! Didn't know about this
trin-metrics/src/utils.rs
Outdated
/// value.
#[must_use = "Timer should be kept in a variable otherwise it cannot observe duration"]
#[derive(Debug)]
pub struct CustomHistogramTimer {
This name isn't great imo. The Custom prefix doesn't really give any information about the actual type; all it really communicates is that this is a "custom" implementation. I think a better name would answer why we are using a "custom" implementation. But... this is a tricky one. I'm not coming up with any better names tbh: DropableHistogramTimer (well, dropable isn't a word, and the regular histogram timer is also "dropable") / UnobservedDropHistogramTimer / HistogramTimerWithDiscardedDrop / HistogramTimerWithUnobservedDrops...
Idk, the last one is probably my favorite, but I'll leave this one up to you; at the very least the docstring above is very useful for understanding the purpose of the custom timer. Ideally, the name would also communicate the purpose of the custom timer, but there is a tradeoff here against the wordiness of the name.
suggestion: DiscardOnDropHistogramTimer
Dropable is totally a word. If it isn't, how did you just say it? xD
trin-metrics/src/utils.rs
Outdated
    }

    fn observe(&mut self, record: bool) -> f64 {
        let v = Instant::now().saturating_duration_since(self.start);
We should avoid single letter variables
trin-metrics/src/utils.rs
Outdated
        });
        assert!(handler.join().is_ok());

        let mut mfs = histogram.collect();
Hahah, all I can think of here is motherf***ers... Maybe that's just me? Imo it's worthwhile to update it to metric_families, and the following mf -> metric_family, m -> metric... It makes the code more readable and avoids single-letter variables.
trin-metrics/src/utils.rs
Outdated

    /// Test taken from https://docs.rs/prometheus/0.13.3/src/prometheus/histogram.rs.html#1217-1260
    /// Modified to work with CustomHistogramTimer
    #[test]
I'm not sure about this. The point of testing our implementation of CustomHistogramTimer is not to make sure that the wrapped HistogramTimer functions properly (which is what this test is testing, iiuc), but to make sure that our custom API is working as expected. I'd like to see a test that tests CustomHistogramTimer::observe_duration, CustomHistogramTimer::stop_and_record, CustomHistogramTimer::stop_and_discard, and CustomHistogramTimer::observe (although this last one will be tested indirectly through the other fns).
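e.g. something along these lines (sketch only; I'm assuming a constructor like the one sketched earlier, the real API may differ):

use prometheus::{Histogram, HistogramOpts};

#[test]
fn test_stop_and_record_and_stop_and_discard() {
    let histogram = Histogram::with_opts(HistogramOpts::new(
        "test_custom_timer_secs",
        "test help",
    ))
    .unwrap();

    // stop_and_record() should add exactly one observation.
    let timer = CustomHistogramTimer::new(histogram.clone());
    timer.stop_and_record();
    assert_eq!(histogram.get_sample_count(), 1);

    // stop_and_discard() should not add an observation.
    let timer = CustomHistogramTimer::new(histogram.clone());
    timer.stop_and_discard();
    assert_eq!(histogram.get_sample_count(), 1);

    // Dropping the timer without stopping it should also discard the sample.
    let timer = CustomHistogramTimer::new(histogram.clone());
    drop(timer);
    assert_eq!(histogram.get_sample_count(), 1);
}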
I think I did it?
@njgheorghita ready for another look
🚀
trin-metrics/src/timer.rs
Outdated
    }

    #[test]
    fn test_discard_through_explict_drop() {
explicit (also below in the histogram name), and implicit for the same cases in the tests below
What was wrong?
We didn't have a clear way to see whether storage calls were performing well or not. This allowed a major performance regression to slip in.
How was it fixed?
By adding grafana metrics for how long different storage calls take. I think this is the first step toward getting well-rounded performance metrics throughout our codebase.
In a future PR I would want to make a graph which says which network is using the most I/O and by how much.
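For reference, the pattern each storage function follows now looks roughly like this (sketch; the table name and query are illustrative, only start_storage_process_timer appears in the diff above):

fn total_entry_count(&self) -> Result<u64, ContentStoreError> {
    // Start the per-function timer; the label identifies which storage call this is.
    let timer = self
        .metrics
        .start_storage_process_timer("total_entry_count");
    let conn = self.sql_connection_pool.get()?;
    // Hypothetical query, standing in for the real one.
    let count: u64 = conn.query_row("SELECT COUNT(*) FROM content_data", [], |row| row.get(0))?;
    // Only reached on success; an early return via `?` drops the timer,
    // which discards the measurement instead of recording it.
    timer.stop_and_record();
    Ok(count)
}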