feat: add storage processing time metrics #1177
Conversation
A few concerns, but I'm fine with going with this for now (since it's better than not having anything).
Also, I didn't look at the grafana config, since I know nothing about it.
@@ -233,6 +238,7 @@ impl HistoryStorage {
            return Err(err);
What happens with the timer when we exit early? Is the time automatically recorded when it goes out of scope, or is the entire event skipped?
I think it would be useful for our metric to have an additional param (together with protocol and storageFunction) to indicate whether the operation was successful.
The code might not be elegant (we would have to manually catch all usages of the ? operator), but maybe we can refactor it and have the metric measurements in a single function.
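For illustration, something like this is what I mean (rough sketch only; the is_success label name and this call site are hypothetical, not part of this PR):

use prometheus::{histogram_opts, register_histogram_vec_with_registry, Registry};

fn register_storage_metric(registry: &Registry) -> anyhow::Result<prometheus::HistogramVec> {
    // Hypothetical: add an "is_success" label next to the existing ones.
    let storage_process_time = register_histogram_vec_with_registry!(
        histogram_opts!(
            "trin_storage_process_time",
            "Time taken by storage function calls"
        ),
        &["protocol", "function", "is_success"],
        registry
    )?;
    Ok(storage_process_time)
}

// A call site would then record against the matching label set, e.g.:
// storage_process_time
//     .with_label_values(&["history", "total_entry_count", "true"])
//     .observe(elapsed_secs);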
If something fails nothing is recorded. We have logs for that.
@@ -182,11 +182,15 @@ impl HistoryStorage {
    }

    fn total_entry_count(&self) -> Result<u64, ContentStoreError> {
        let timer = self
            .metrics
            .start_storage_process_timer("total_entry_count");
        let conn = self.sql_connection_pool.get()?;
The metric would capture the 'self.sql_connection_pool.get()' call.
My understanding is that here we are waiting to get a connection to the db. Is there a limit on the number of connections, so that we sometimes wait here, or is it usually instant?
I'm not sure this should qualify as time spent as part of this storage operation (in case db is blocked by other queries).
There is a read-write lock over the storage instance. If self.sql_connection_pool.get() is blocking to the point that it matters, something is wrong with our code, and that will be visible in the graphs, which is what we want. If it is blocking, that is important information.
We want to know if these calls are slow, and self.sql_connection_pool.get() is a part of that.
Doesn't a read-write lock allow multiple (unlimited?) reads at the same time? If that's the case, and sql_connection_pool has a limit, then we could be blocked on one request due to other requests taking a long time.
Whether that's desired depends on how you interpret the graphs. If there is congestion, all query metrics might go up, while in fact the problem could be with only one type of query.
If you time the requests without getting a connection from the pool, you know how long each request takes, and if there is congestion you will see exactly what is causing it. It's true that in this case you don't know how affected the other queries are, so there is a trade-off.
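For concreteness, timing only the query would just mean reordering the two calls from the hunk above, something like this (sketch; the rest of the function stays unchanged):

fn total_entry_count(&self) -> Result<u64, ContentStoreError> {
    // Acquire the connection first so pool wait time is not measured...
    let conn = self.sql_connection_pool.get()?;
    // ...and only start timing once we actually have a connection.
    let timer = self
        .metrics
        .start_storage_process_timer("total_entry_count");
    // (rest of the function unchanged)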
Either way, I'm fine with whatever you decide.
https://www.sqlite.org/whentouse.html#:~:text=SQLite%20supports%20an%20unlimited%20number,than%20a%20few%20dozen%20milliseconds.
SQLite supports an unlimited number of simultaneous readers
Apparently there are unlimited readers, but only one writer, in SQLite.
The r2d2 crate that we use has a max number of connections.
Maybe we can report a new metric on how many connections are being used in a future PR. If we are hitting the max number of connections and that is causing delays, I think there is a problem with our code. My goal with these stats is to see how long it takes to execute the storage functions. Getting a connection to the database is included in that, as that is what the code calling the storage functions has to deal with.
We need to think about how we can best solidify our grafana json template workflow. Here are the places (that I can think of right now) where we use the trin-metrics grafana template:
- trin-bench
- kurtosis
- running metrics locally
If we're adding a graph / making changes to the dashboard, we want to make sure that the graph is being updated / deployed in all 3 situations, while maintaining the appropriate data source ids.
trin-metrics/src/storage.rs
Outdated
@@ -21,6 +23,14 @@ const BYTES_IN_MB_F64: f64 = 1000.0 * 1000.0;

impl StorageMetrics {
    pub fn new(registry: &Registry) -> anyhow::Result<Self> {
        let storage_process_time = register_histogram_vec_with_registry!(
            histogram_opts!(
                "trin_storage_process_time",
iirc, it's convention to have a suffix of the unit of measurement; I'd go for something like storage_process_timer_secs. Also, timer seems a bit more intuitive than time imo.
Aren't Histograms only able to be in seconds? I'm fine with adding this, let me know.
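(For reference: Histogram::observe itself accepts any f64, but the start_timer() helper records the elapsed time in seconds, so a _secs suffix would describe what is actually observed. A minimal sketch, with an illustrative metric name:)

use prometheus::{Histogram, HistogramOpts, Registry};

fn example() -> anyhow::Result<()> {
    let registry = Registry::new();
    let histogram = Histogram::with_opts(HistogramOpts::new(
        "trin_storage_process_time_secs",
        "Storage call duration in seconds",
    ))?;
    registry.register(Box::new(histogram.clone()))?;

    let timer = histogram.start_timer();
    // ... the storage work being measured goes here ...
    timer.stop_and_record(); // observes the elapsed duration, in seconds
    Ok(())
}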
-    /// Public method for looking up a content value by its content id
-    pub fn lookup_content_value(&self, id: [u8; 32]) -> anyhow::Result<Option<Vec<u8>>> {
+    /// Internal method for looking up a content value by its content id
+    fn lookup_content_value(&self, id: [u8; 32]) -> anyhow::Result<Option<Vec<u8>>> {
If something fails nothing is recorded. We have logs for that.
Can you elaborate on that here? If lookup_content_value() fails, then it's not obvious to me where we're emitting these logs. Also, just to clarify, if the process timer is never stopped, then nothing is recorded?
This error then gets forwarded to the overlay, which then reports the error.
This timer can be stopped and observed at most once, either automatically (when it goes out of scope) or manually. Alternatively, it can be manually stopped and discarded in order to not record its value.
@KolbyML My understanding from the docs is that if it goes out of scope (e.g. what happens in this case when lookup_content_value()? fails) then it will record the time when it is dropped and count it as an observation. Seems like we probably need to handle these cases and use stop_and_discard() on failures? Otherwise we'll end up with false observations skewing the histogram?
Good catch, I should have read the documentation a bit more 😆.
I just pushed a fix for this issue by implementing a CustomHistogramTimer which on drop does stop_and_discard() instead of stop_and_record().
The reason I chose to fix the problem this way is that, after a lot of thinking, I came to the conclusion that this was the cleanest way to achieve our goal while staying within the structural design of our project in relation to how we handle metrics.
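Roughly, the shape of it is the following (simplified sketch, not the exact code; the observed flag, constructor, and observe_duration here are illustrative, while the method signatures follow the fragments further down):

use std::time::Instant;
use prometheus::Histogram;

#[must_use = "Timer should be kept in a variable otherwise it cannot observe duration"]
#[derive(Debug)]
pub struct CustomHistogramTimer {
    /// The histogram the elapsed time is recorded into.
    histogram: Histogram,
    /// Set once the timer has been stopped (recorded or discarded).
    observed: bool,
    /// When the timer was started.
    start: Instant,
}

impl CustomHistogramTimer {
    pub fn new(histogram: Histogram) -> Self {
        Self {
            histogram,
            observed: false,
            start: Instant::now(),
        }
    }

    /// Stop the timer and record the elapsed time in the histogram.
    pub fn stop_and_record(mut self) -> f64 {
        self.observe(true)
    }

    /// Stop the timer and throw the measurement away.
    pub fn stop_and_discard(mut self) -> f64 {
        self.observe(false)
    }

    /// Observe and record the timer duration (in seconds), consuming the timer.
    pub fn observe_duration(self) {
        self.stop_and_record();
    }

    fn observe(&mut self, record: bool) -> f64 {
        let elapsed = Instant::now().saturating_duration_since(self.start);
        let seconds = elapsed.as_secs_f64();
        self.observed = true;
        if record {
            self.histogram.observe(seconds);
        }
        seconds
    }
}

impl Drop for CustomHistogramTimer {
    fn drop(&mut self) {
        // Unlike prometheus' HistogramTimer, dropping without an explicit
        // stop_and_record() discards the measurement instead of recording it.
        if !self.observed {
            self.observe(false);
        }
    }
}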
I am not exactly answering this question, but I think this point should be made clear. Currently we use the same dashboard for a single-node view and a multi-node view. I think we should instead support two: we already have a multi-node dashboard, but we currently don't have a dashboard optimized to get an in-depth look at one node.
Ready for another look @njgheorghita
@njgheorghita I am ready for another review. The cargo-fmt CI job broke because of an upstream change to the nightly docker executor in CI; here is a PR for that: #1179
trin-metrics/src/utils.rs
Outdated
            self.observe(false);
        }
    }
}
Now that we're implementing custom logic, this seems worth adding some unit tests to validate the new functionality
Ok, I added a test. I took the pre-existing test prometheus has for this and modified it to work for our CustomHistogramTimer:
https://docs.rs/prometheus/0.13.3/src/prometheus/histogram.rs.html#1217-1260
I also added 3 comments which describe the functionality that is special, i.e. not counting on drop.
@@ -268,7 +273,7 @@ impl HistoryStorage {
                "Capacity reached, deleting farthest: {}",
                hex_encode(id_to_remove)
            );
-            if let Err(err) = self.evict(id_to_remove) {
+            if let Err(err) = self.db_remove(&id_to_remove) {
It looks like the debug log below is inaccurate.
debug!("Error removing content ID {id_to_remove:?} from db: {err:?}");
trin-metrics/src/utils.rs
Outdated
@@ -0,0 +1,135 @@
use std::time::Instant;
nitpick: this feels like more than a simple util imo, and deserves its own named module, something like trin-metrics/src/timer.rs
trin-metrics/src/utils.rs
Outdated
/// This timer can be stopped and observed at most once manually.
/// Alternatively, if it isn't manually stopped it will be discarded in order to not record its
/// value.
#[must_use = "Timer should be kept in a variable otherwise it cannot observe duration"]
Nice! Didn't know about this
trin-metrics/src/utils.rs
Outdated
/// value.
#[must_use = "Timer should be kept in a variable otherwise it cannot observe duration"]
#[derive(Debug)]
pub struct CustomHistogramTimer {
This name isn't great imo. The Custom prefix doesn't really give any information about the actual type; all it really communicates is that this is a "custom" implementation. I think a better name would answer why we are using a "custom" implementation. But... this is a tricky one. I'm not coming up with any better names tbh: DropableHistogramTimer (well, dropable isn't a word, and the regular histogram timer is also "dropable") / UnobservedDropHistogramTimer / HistogramTimerWithDiscardedDrop / HistogramTimerWithUnobservedDrops...
Idk, the last one is probably my favorite, but I'll leave this one up to you; at the very least the docstring above is very useful for understanding the purpose of the custom timer. Ideally, the name would also communicate the purpose of the custom timer, but there is a tradeoff here against the wordiness of the name.
suggestion: DiscardOnDropHistogramTimer
Dropable is totally a word. If it isn't, how did you just say it? xD
trin-metrics/src/utils.rs
Outdated
    }

    fn observe(&mut self, record: bool) -> f64 {
        let v = Instant::now().saturating_duration_since(self.start);
We should avoid single letter variables
trin-metrics/src/utils.rs
Outdated
        });
        assert!(handler.join().is_ok());

        let mut mfs = histogram.collect();
Hahah, all I can think of here is motherf***ers... Maybe that's just me? Imo it's worthwhile to update it to metric_families, and the following mf -> metric_family, m -> metric... It makes the code more readable and avoids single-letter variables.
trin-metrics/src/utils.rs
Outdated

    /// Test taken from https://docs.rs/prometheus/0.13.3/src/prometheus/histogram.rs.html#1217-1260
    /// Modified to work with CustomHistogramTimer
    #[test]
I'm not sure about this. The point of testing our implementation of CustomHistogramTimer is not to make sure that the wrapped HistogramTimer functions properly (which is what this test is testing, iiuc), but to make sure that our custom API is working as expected. I'd like to see a test that tests CustomHistogramTimer::observe_duration, CustomHistogramTimer::stop_and_record, CustomHistogramTimer::stop_and_discard, and CustomHistogramTimer::observe (although this last one will be tested indirectly through the other fns).
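e.g. something along these lines (sketch only; I'm assuming a constructor like the one sketched earlier, the real API may differ):

use prometheus::{Histogram, HistogramOpts};

#[test]
fn test_stop_and_record_and_stop_and_discard() {
    let histogram = Histogram::with_opts(HistogramOpts::new(
        "test_custom_timer_secs",
        "test help",
    ))
    .unwrap();

    // stop_and_record() should add exactly one observation.
    let timer = CustomHistogramTimer::new(histogram.clone());
    timer.stop_and_record();
    assert_eq!(histogram.get_sample_count(), 1);

    // stop_and_discard() should not add an observation.
    let timer = CustomHistogramTimer::new(histogram.clone());
    timer.stop_and_discard();
    assert_eq!(histogram.get_sample_count(), 1);

    // Dropping the timer without stopping it should also discard the sample.
    let timer = CustomHistogramTimer::new(histogram.clone());
    drop(timer);
    assert_eq!(histogram.get_sample_count(), 1);
}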
I think I did it?
@njgheorghita ready for another look
🚀
trin-metrics/src/timer.rs
Outdated
    }

    #[test]
    fn test_discard_through_explict_drop() {
explicit (also below in the histogram name), and implicit for the same cases in the tests below
What was wrong?
We didn't have a clear way to see whether storage calls were performing well or not. This allowed a major performance regression to slip in.
How was it fixed?
By adding grafana metrics for how long different storage calls take. I think this is the first step toward getting well-rounded performance metrics throughout our codebase.
In a future PR I would want to make a graph which says which network is using the most I/O and by how much.
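For reference, the pattern each storage function follows now looks roughly like this (sketch; the table name and query are illustrative, only start_storage_process_timer appears in the diff above):

fn total_entry_count(&self) -> Result<u64, ContentStoreError> {
    // Start the per-function timer; the label identifies which storage call this is.
    let timer = self
        .metrics
        .start_storage_process_timer("total_entry_count");
    let conn = self.sql_connection_pool.get()?;
    // Hypothetical query, standing in for the real one.
    let count: u64 = conn.query_row("SELECT COUNT(*) FROM content_data", [], |row| row.get(0))?;
    // Only reached on success; an early return via `?` drops the timer,
    // which discards the measurement instead of recording it.
    timer.stop_and_record();
    Ok(count)
}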