
feat: add storage processing time metrics #1177

Merged: 7 commits merged into ethereum:master on Feb 21, 2024

Conversation

KolbyML (Member) commented Feb 19, 2024

What was wrong?

We didn't have a clear way to see whether storage calls were performing well or not. This allowed us to introduce a major performance regression.

How was it fixed?

By adding grafana metrics on how long different storage calls take. I think this is the first step toward well-rounded performance metrics throughout our codebase.

[image]
In a future PR I want to add a graph showing which network is using the most I/O and by how much.

@KolbyML KolbyML self-assigned this Feb 19, 2024
@KolbyML KolbyML changed the title feat(storage): add storage processing time metrics feat: add storage processing time metrics Feb 19, 2024
morph-dev (Collaborator) left a comment:

A few concerns, but I'm fine with going with this for now (since it's better than not having anything).

Also, I didn't look at grafana config since I know nothing about it.

@@ -233,6 +238,7 @@ impl HistoryStorage {
return Err(err);
Collaborator:

What happens with the timer when we exit early? Is the time automatically recorded when it goes out of scope, or is the entire event skipped, or something else?

I think it would be useful for our metric to have an additional param (together with protocol and storageFunction) to indicate whether the operation was successful.
The code might not be elegant (we would have to manually catch all usages of the ? operator), but maybe we can refactor it so the metric measurement lives in a single function.
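
A minimal sketch of that single-function idea, assuming a prometheus HistogramVec with a success label; the helper and label names here are illustrative, not the project's actual API:

```rust
use std::time::Instant;

use prometheus::HistogramVec;

/// Hypothetical helper: time a fallible storage call and record the result with a
/// `success` label, so that early returns via `?` are still observed.
fn timed_storage_call<T, E>(
    histogram: &HistogramVec,
    protocol: &str,
    function: &str,
    call: impl FnOnce() -> Result<T, E>,
) -> Result<T, E> {
    let start = Instant::now();
    let result = call();
    let success = if result.is_ok() { "true" } else { "false" };
    histogram
        .with_label_values(&[protocol, function, success])
        .observe(start.elapsed().as_secs_f64());
    result
}
```

A call site would then wrap the body of each storage function in a closure passed to this helper, instead of starting and stopping a timer manually.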

KolbyML (Member, author):

If something fails nothing is recorded. We have logs for that.

@@ -182,11 +182,15 @@ impl HistoryStorage {
}

fn total_entry_count(&self) -> Result<u64, ContentStoreError> {
let timer = self
.metrics
.start_storage_process_timer("total_entry_count");
let conn = self.sql_connection_pool.get()?;
Collaborator:

The metric would capture the 'self.sql_connection_pool.get()' call.
My understanding is that here we are waiting to get a connection to the db. Is there a limit on the number of connections, so that we sometimes wait here, or is it usually instant?

I'm not sure this should count as time spent on this storage operation (in case the db is blocked by other queries).

KolbyML (Member, author) commented Feb 19, 2024:

There is a read-write lock over the storage instance. If self.sql_connection_pool.get() blocks to the point that it matters, something is wrong with our code, and that will be visible in the graphs, which is what we want. If it is blocking, that is important information.

We want to know if these calls are slow, and self.sql_connection_pool.get() is part of that.

Collaborator:

Doesn't a read-write lock allow multiple (unlimited?) reads at the same time? If that's the case, and sql_connection_pool has a limit, then we could be blocked on one request because other requests are taking a long time.

And whether that's desired depends on how you interpret the graphs. If there is congestion, all query metrics might go up, while in fact the problem could be with only one type of query.

If you time the requests without getting the connection from the pool, you know how long each request takes, and if there is congestion you will see exactly what is causing it. It's true that in this case you don't know how affected the other queries are, so there is a trade-off.

Either way, I'm fine with whatever you decide.

KolbyML (Member, author) commented Feb 19, 2024:

https://www.sqlite.org/whentouse.html#:~:text=SQLite%20supports%20an%20unlimited%20number,than%20a%20few%20dozen%20milliseconds.
SQLite supports an unlimited number of simultaneous readers

Apparently there are unlimited readers, but only one writer, in SQLite.

Collaborator:

The r2d2 crate that we use has a max number of connections.

KolbyML (Member, author):

Maybe we can report a new metric on how many connections are being used, then, in a future PR. If we are hitting the max number of connections and that is causing delays, I think there is a problem with our code. My goal with these stats is to see how long it takes to execute the storage functions. Getting a connection to the database is included in that, since that is what the code calling the storage functions has to deal with.
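
Only a sketch of what that future metric might look like, assuming r2d2's Pool::state(); the gauge names are hypothetical and not part of trin-metrics today:

```rust
use prometheus::IntGauge;
use r2d2::{ManageConnection, Pool};

/// Hypothetical follow-up metric: sample the connection pool so that pool
/// exhaustion (hitting the max number of connections) would show up in grafana.
fn report_pool_usage<M: ManageConnection>(
    pool: &Pool<M>,
    active_connections: &IntGauge,
    idle_connections: &IntGauge,
) {
    let state = pool.state();
    // r2d2 exposes total and idle connection counts; active = total - idle.
    active_connections.set((state.connections - state.idle_connections) as i64);
    idle_connections.set(state.idle_connections as i64);
}
```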

njgheorghita (Collaborator) left a comment:

We need to think about how we can best solidify our grafana json template workflow. Here are the places (that I can think of right now) where we use the trin-metrics grafana template:

  • trin-bench
  • kurtosis
  • running metrics locally

If we're adding a graph / making changes to the dashboard, we want to make sure that the graph is being updated / deployed in all 3 situations, while maintaining the appropriate data source ids.

@@ -21,6 +23,14 @@ const BYTES_IN_MB_F64: f64 = 1000.0 * 1000.0;

impl StorageMetrics {
pub fn new(registry: &Registry) -> anyhow::Result<Self> {
let storage_process_time = register_histogram_vec_with_registry!(
histogram_opts!(
"trin_storage_process_time",
Collaborator:

iirc, it's convention to have a suffix with the unit of measurement; I'd go for something like storage_process_timer_secs. Also, timer seems a bit more intuitive than time imo.

KolbyML (Member, author) commented Feb 20, 2024:

Aren't Histograms only able to be in seconds? I'm fine with adding this; let me know.

trin-metrics/src/storage.rs: 4 outdated review comments (resolved)
/// Public method for looking up a content value by its content id
pub fn lookup_content_value(&self, id: [u8; 32]) -> anyhow::Result<Option<Vec<u8>>> {
/// Internal method for looking up a content value by its content id
fn lookup_content_value(&self, id: [u8; 32]) -> anyhow::Result<Option<Vec<u8>>> {
Collaborator:

If something fails nothing is recorded. We have logs for that.

Can you elaborate on that here? If lookup_content_value() fails, then it's not obvious to me where we're emitting these logs. Also, just to clarify, if the process timer is never stopped, then nothing is recorded?

KolbyML (Member, author):

[image]

Also, just to clarify, if the process timer is never stopped, then nothing is recorded?

To my knowledge, yes.

KolbyML (Member, author):

This error then gets forwarded to the overlay, which then reports the error.

njgheorghita (Collaborator) commented Feb 20, 2024:

This timer can be stopped and observed at most once, either automatically (when it goes out of scope) or manually. Alternatively, it can be manually stopped and discarded in order to not record its value.

@KolbyML My understanding from the docs is that if it goes out of scope (e.g. what happens in this case when lookup_content_value()? fails), then it will record the time when it is dropped and count it as an observation. Seems like we probably need to handle these cases and use stop_and_discard() on failures? Otherwise we'll end up with false observations skewing the histogram.

KolbyML (Member, author):

Good catch, I should have read the documentation a bit more 😆.

I just pushed a fix for this issue by implementing a CustomHistogramTimer which, on drop, does stop_and_discard() instead of stop_and_record().

The reason I chose to fix the problem this way is that, after a lot of thinking, I concluded this was the cleanest way to achieve our goal while staying within the structural design of our project with respect to how we handle metrics.
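
Roughly, the shape of the fix looks like this (a simplified sketch; the actual PR code may differ in details):

```rust
use std::time::Instant;

use prometheus::Histogram;

/// Sketch of a timer that only records if it is explicitly stopped; dropping it
/// without stopping discards the sample instead of observing it.
pub struct CustomHistogramTimer {
    histogram: Histogram,
    start: Instant,
    observed: bool,
}

impl CustomHistogramTimer {
    pub fn new(histogram: Histogram) -> Self {
        Self {
            histogram,
            start: Instant::now(),
            observed: false,
        }
    }

    pub fn stop_and_record(mut self) -> f64 {
        self.observe(true)
    }

    pub fn stop_and_discard(mut self) -> f64 {
        self.observe(false)
    }

    fn observe(&mut self, record: bool) -> f64 {
        let elapsed = Instant::now().saturating_duration_since(self.start).as_secs_f64();
        if record {
            self.histogram.observe(elapsed);
        }
        self.observed = true;
        elapsed
    }
}

impl Drop for CustomHistogramTimer {
    fn drop(&mut self) {
        if !self.observed {
            // Unlike prometheus' HistogramTimer, an un-stopped timer is discarded,
            // not recorded, when it goes out of scope.
            self.observe(false);
        }
    }
}
```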

KolbyML (Member, author) commented Feb 20, 2024:

We need to think about how we can best solidify our grafana json template workflow. Here are the places (that I can think of right now) where we use the trin-metrics grafana template:

  • trin-bench
  • kurtosis
  • running metrics locally

If we're adding a graph / making changes to the dashboard, we want to make sure that the graph is being updated / deployed in all 3 situations, while maintaining the appropriate data source ids.

I am not exactly answering this question, but I think this point should be made clear.

Currently we use the same dashboard for a single-node view and a multi-node view. I think we should instead support two:
A single-node dashboard: more in-depth, node-specific stats.
A multi-node dashboard: an aggregated overview.

We already have a multi-node dashboard, but we currently don't have a dashboard optimized for an in-depth look at one node.

@KolbyML KolbyML requested a review from njgheorghita February 20, 2024 18:03
KolbyML (Member, author) commented Feb 20, 2024:

Ready for another look @njgheorghita

KolbyML (Member, author) commented Feb 20, 2024:

@njgheorghita I am ready for another review. The cargo-fmt CI job broke because of an upstream change to the nightly Docker executor in CI; here is a PR for that: #1179

self.observe(false);
}
}
}
Collaborator:

Now that we're implementing custom logic, this seems worth adding some unit tests to validate the new functionality

KolbyML (Member, author):

Ok, I added a test. I took the pre-existing test Prometheus has for this and modified it to work with our CustomHistogramTimer:
https://docs.rs/prometheus/0.13.3/src/prometheus/histogram.rs.html#1217-1260

I also added 3 comments which describe the functionality that is special, i.e. not counting on drop.

@KolbyML KolbyML requested a review from njgheorghita February 21, 2024 00:09
@@ -268,7 +273,7 @@ impl HistoryStorage {
"Capacity reached, deleting farthest: {}",
hex_encode(id_to_remove)
);
if let Err(err) = self.evict(id_to_remove) {
if let Err(err) = self.db_remove(&id_to_remove) {
Collaborator:

It looks like the debug log below is inaccurate.

debug!("Error removing content ID {id_to_remove:?} from db: {err:?}");

@@ -0,0 +1,135 @@
use std::time::Instant;
Collaborator:

nitpick: this feels like more than a simple util imo, and deserves its own named module, something like trin-metrics/src/timer.rs

/// This timer can be stopped and observed at most once manually.
/// Alternatively, if it isn't manually stopped it will be discarded in order to not record its
/// value.
#[must_use = "Timer should be kept in a variable otherwise it cannot observe duration"]
Collaborator:

Nice! Didn't know about this

/// value.
#[must_use = "Timer should be kept in a variable otherwise it cannot observe duration"]
#[derive(Debug)]
pub struct CustomHistogramTimer {
Collaborator:

This name isn't great imo. The Custom prefix doesn't really give any information about the actual type. aka. All it really communicates is that this is a "custom" implementation. I think a better name would answer why are we using a "custom" implementation. But... this is a tricky one. I'm not coming up with any better names tbh. DropableHistogramTimer (well dropable isn't a word, and the regular histogram timer is also "dropable") / UnobservedDropHistogramTimer / HistogramTimerWithDiscardedDrop / HistogramTimerWithUnobservedDrops....

Idk, the last one is probably my favorite, but I'll leave this one up to you, at the very least the docstring above is very useful for understanding the purpose of the custom timer. Ideally, the name would also communicate the purpose of the custom timer, but there is a tradeoff here against the wordiness of the name.

Collaborator:

suggestion: DiscardOnDropHistogramTimer

KolbyML (Member, author):

Dropable is totally a word. If it isn't, how did you say it? xD

}

fn observe(&mut self, record: bool) -> f64 {
let v = Instant::now().saturating_duration_since(self.start);
Collaborator:

We should avoid single letter variables

});
assert!(handler.join().is_ok());

let mut mfs = histogram.collect();
Collaborator:

Hahah, all I can think of here is motherf***ers... Maybe that's just me? Imo it's worthwhile to update it to metric_families and the following mf -> metric_family, m -> metric... It makes the code more readable and avoids single letter variables


/// Test taken from https://docs.rs/prometheus/0.13.3/src/prometheus/histogram.rs.html#1217-1260
/// Modified to work with CustomHistogramTimer
#[test]
Collaborator:

I'm not sure about this. The point of testing our implementation of CustomHistogramTimer is not to make sure that the wrapped HistogramTimer functions properly (which is what this test is testing, iiuc), but to make sure that our custom api is working as expected. I'd like to see a test that covers CustomHistogramTimer::observe_duration, CustomHistogramTimer::stop_and_record, CustomHistogramTimer::stop_and_discard, and CustomHistogramTimer::observe (although this last one will be tested indirectly through the other fns).
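
A rough sketch of the kind of test I mean (the CustomHistogramTimer::new constructor used here is hypothetical; it just mirrors however the timer is built in this PR):

```rust
use prometheus::{Histogram, HistogramOpts};

#[test]
fn explicit_stop_records_but_drop_discards() {
    let histogram =
        Histogram::with_opts(HistogramOpts::new("test_timer", "timer test help")).unwrap();

    // An explicitly stopped timer records exactly one observation.
    let timer = CustomHistogramTimer::new(histogram.clone());
    timer.stop_and_record();
    assert_eq!(histogram.get_sample_count(), 1);

    // A timer dropped without being stopped records nothing.
    let timer = CustomHistogramTimer::new(histogram.clone());
    drop(timer);
    assert_eq!(histogram.get_sample_count(), 1);

    // An explicitly discarded timer also records nothing.
    let timer = CustomHistogramTimer::new(histogram.clone());
    timer.stop_and_discard();
    assert_eq!(histogram.get_sample_count(), 1);
}
```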

KolbyML (Member, author):

I think I did it?

KolbyML (Member, author) commented Feb 21, 2024:

@njgheorghita ready for another look

njgheorghita (Collaborator) left a comment:

🚀

}

#[test]
fn test_discard_through_explict_drop() {
Collaborator:

Typo: explicit (here, and below in the histogram name); and implicit for the same cases in the tests below.

@KolbyML KolbyML merged commit e74f817 into ethereum:master Feb 21, 2024
8 checks passed
@KolbyML KolbyML deleted the add-metrics branch January 22, 2025 07:52