
Refactored stats code and display median as well as 95% percentile response times in web UI's charts #549

Merged: heyman merged 18 commits into master from total-stats-entry on Sep 19, 2017

Conversation

@heyman (Member) commented on Feb 22, 2017

I've refactored the stats code to track the total request stats in its own separate StatsEntry instance under RequestStats.total, instead of dynamically calculating the total stats from all other stats entries.

I believe this makes the code slightly clearer. The total stats are now accessible through RequestStats.total, and there is no longer a need for RequestStats.aggregated_stats(). StatsEntry.extend() is now only used for merging stats from slave reports, so the full_request_history argument is no longer needed.
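
For illustration, here's a minimal sketch of the new shape (simplified, with hypothetical signatures; not the actual locust/stats.py code):

    class StatsEntry(object):
        def __init__(self, name, method=None):
            self.name = name
            self.method = method
            self.num_requests = 0
            self.response_times = {}  # response time (ms) -> occurrence count

        def log(self, response_time):
            self.num_requests += 1
            self.response_times[response_time] = self.response_times.get(response_time, 0) + 1

    class RequestStats(object):
        def __init__(self):
            self.entries = {}
            self.total = StatsEntry("Total")  # the "total" row is now tracked directly

        def log_request(self, method, name, response_time):
            # update both the per-endpoint entry and the total entry, so there's
            # no need to aggregate all entries on the fly later
            key = (name, method)
            if key not in self.entries:
                self.entries[key] = StatsEntry(name, method)
            self.entries[key].log(response_time)
            self.total.log(response_time)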

The background for this is that I'm looking to display the 95% percentile response time in the response time graph (as mentioned in #539 (comment)). When I started to look into the best way of implementing this, I realized that the way we currently store stats data doesn't allow us to retrieve percentile response times for the last couple of seconds, only for the whole test run (since we use the StatsEntry.response_times dict to calculate it). To allow this, we need to store a snapshot of this dict every second, for a number of seconds back. We probably don't want to do this for every single stats entry, but it should be okay to do it for the total stats entry. Now that the total stats are tracked in their own StatsEntry, rather than calculated dynamically on the fly, it'll be easier to do that.
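
Roughly, the snapshot mechanism could look like this (a sketch under the assumptions above; the dict-based cache and helper names are illustrative, not necessarily the PR's):

    import time
    from copy import copy

    class CachedResponseTimes(object):
        # a frozen copy of the response_times dict plus the request count at snapshot time
        def __init__(self, response_times, num_requests):
            self.response_times = response_times
            self.num_requests = num_requests

    def cache_response_times(entry, cache, window=20):
        # called once per second for entries that have the cache enabled;
        # stores a snapshot keyed by whole-second timestamp and trims old ones
        now = int(time.time())
        cache[now] = CachedResponseTimes(copy(entry.response_times), entry.num_requests)
        for ts in list(cache):
            if ts < now - window:
                del cache[ts]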

…parate StatsEntry instead of dynamically calculating the total stats from all other stats entries
… it’s no longer needed when we have a separate StatsEntry instance to track the total stats
…cond, for the last 20 seconds. This allows us to calculate the current response times for certain percentiles. We turn on this cache for the total StatsEntry.
@heyman changed the title from "Refactored stats code" to "Refactored stats code and display median as well as 95% percentile response times in web UI's charts" on Mar 13, 2017
…the response times for only the last 10 seconds, to calculate current response time percentiles (before, we used the *total* response times like it was 10 seconds ago, which obviously was wrong)
…lizing/resetting a StatsEntry with use_response_times_cache set to True
…on the master node when running distributed. This makes the current response time percentiles work when running Locust distributed.
@heyman (Member, Author) commented on Sep 16, 2017

I think this is pretty much ready to be merged. I'd love it if someone could take a look and review it.

These are the main changes:

RequestStats now holds a total attribute of type StatsEntry that represents the "total" row in the statistics. Previously, the total was calculated on the fly by aggregating the StatsEntry instances for each entry (row) in the statistics. This also allowed me to simplify StatsEntry.extend() by removing the full_request_history argument.

StatsEntry has gained a cache that holds copies of the response_times dict for the last 20 seconds. The cache is turned off by default, but it's turned on for RequestStats.total. This cache is used to calculate the current response time percentiles, and the 50th and 95th percentiles are now graphed in the web UI.
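
The "current" percentile then falls out as the percentile of the difference between the latest response_times dict and the cached snapshot from the start of the window. A sketch of that delta (helper name is mine, not necessarily the PR's):

    def diff_response_time_dicts(latest, old):
        # requests observed during the window = counts now minus counts at snapshot time;
        # the usual response-times percentile function is then applied to this diff
        diff = {}
        for t, count in latest.items():
            delta = count - old.get(t, 0)
            if delta > 0:
                diff[t] = delta
        return diff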

locust/stats.py Outdated
processed_count = 0
for response_time in sorted(six.iterkeys(response_times), reverse=True):
    processed_count += response_times[response_time]
    if((num_requests - processed_count) <= num_of_request):
@aldenpeterson-wf (Contributor): nit: don't need extra () around this

@heyman (Member, Author), Sep 19, 2017: Fixed!
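
For readers skimming the thread, here's the quoted logic as a self-contained function (my adaptation, with six dropped; the real implementation lives in locust/stats.py):

    def calculate_response_time_percentile(response_times, num_requests, percent):
        # the number of requests expected to have finished within the percentile value
        num_of_request = int(num_requests * percent)
        processed_count = 0
        # walk response times from slowest to fastest until the remaining (faster)
        # requests account for the requested percentile
        for response_time in sorted(response_times, reverse=True):
            processed_count += response_times[response_time]
            if num_requests - processed_count <= num_of_request:
                return response_time
        return 0

    # e.g. calculate_response_time_percentile({100: 8, 300: 1, 1200: 1}, 10, 0.95) == 1200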

locust/stats.py Outdated
def get_current_response_time_percentile(self, percent):
    """
    Calculate the *current* response time for a certain percentile. We use a sliding
    window of (approximately) the last 10 seconds when calculating this.
@aldenpeterson-wf (Contributor): would it be useful to make this 10 seconds customizable?

could make it a module variable here, I guess, so people can adjust it directly. I'm not sure it'd be used enough to make it a parameter

@heyman (Member, Author): Yeah, a module variable should be good. Right now we only expose this to users as the "current" response time percentile, which lets us get away with it being approximately the last 10 seconds. We also only use it in the very ephemeral web UI charts. If we were to expose it as an option, people might expect it to be more exact.

Will fix!
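
That would amount to something like this (CURRENT_RESPONSE_TIME_PERCENTILE_WINDOW matches the name used in the later commits; the user-side tweak is illustrative):

    # locust/stats.py (sketch)
    CURRENT_RESPONSE_TIME_PERCENTILE_WINDOW = 10  # seconds; "current" percentiles use roughly this window

    # a script that wants a longer window could adjust it directly:
    # import locust.stats
    # locust.stats.CURRENT_RESPONSE_TIME_PERCENTILE_WINDOW = 30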

locust/stats.py Outdated
    response_times=copy(self.response_times),
    num_requests=self.num_requests,
)
if len(self.response_times_cache) > 20:
@aldenpeterson-wf (Contributor): this seems rather arbitrary, 200 seconds seems like a lot to cache (though probably still negligible)

@heyman (Member, Author): The cache might not hold response times for every second, so we scan through the cache at time() - CURRENT_RESPONSE_TIME_PERCENTILE_WINDOW +/- 10 seconds to find a cache entry, which we'll use to calculate the delta that gives us the "current" response time percentile.

I've changed the cache size to be CURRENT_RESPONSE_TIME_PERCENTILE_WINDOW + 10.

I guess we could make the number of seconds we scan configurable as well, but I really don't see a use case where one would need to change this.
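
In other words, something like this sketch (illustrative helper; the real code sits in StatsEntry.get_current_response_time_percentile):

    import time

    def find_snapshot(cache, window):
        # look for the snapshot taken closest to (now - window); the cache may
        # be missing some seconds, so scan outward up to +/- 10 seconds
        target = int(time.time()) - window
        for offset in range(11):
            for ts in (target - offset, target + offset):
                if ts in cache:
                    return cache[ts]
        return None  # no usable snapshot yet (e.g. the test just started)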

locust/stats.py Outdated
use_response_times_cache = False
"""
If set to True, a copy of the response_times dict will be stored in response_times_cache
every second, and kept for 20 seconds. We can use this dict to calculate the *current*
@aldenpeterson-wf (Contributor): similar comment to my question about the 10 seconds sliding window time, I can see use cases for adjusting this (for example if your operations take a long time you may end up with wonky results here if the cache is only kept/reported for 20 seconds)

@heyman (Member, Author): I don't think it'll be more wonky than the other stats. Since the stats get reported once a request is complete, if you have endpoints that take really long (e.g. 60 seconds), the stats will feel kind of delayed.

locust/stats.py Outdated
if global_stats.max_requests is not None and (global_stats.num_requests + global_stats.num_failures) >= global_stats.max_requests:
    raise StopLocust("Maximum number of requests reached")

def on_request_failure(request_type, name, response_time, exception):
    global_stats.get(name, request_type).log_error(exception)
    #global_stats.get(name, request_type).log_error(exception)
@aldenpeterson-wf (Contributor): should this be deleted instead of commented?

locust/stats.py Outdated
@@ -439,18 +539,21 @@ def median_from_dict(total, count):
"""

def on_request_success(request_type, name, response_time, response_length):
    global_stats.get(name, request_type).log(response_time, response_length)
    #global_stats.get(name, request_type).log(response_time, response_length)
@aldenpeterson-wf (Contributor): should this be deleted instead of commented?

@heyman (Member, Author): Ah, yes :)!

…ile configurable (through CURRENT_RESPONSE_TIME_PERCENTILE_WINDOW).

Changed so that the size of the response times cache is CURRENT_RESPONSE_TIME_PERCENTILE_WINDOW + 10.
@aldenpeterson-wf (Contributor): Looks good to me!

@heyman merged commit eab8429 into master on Sep 19, 2017
@heyman (Member, Author) commented on Sep 19, 2017

Okay, great, thanks!

@mbeacom deleted the total-stats-entry branch on October 18, 2019