Move User selection responsibility from worker to master in order to fix unbalanced distribution of users and uneven ramp-up #1621
Conversation
Codecov Report

```diff
@@            Coverage Diff             @@
##           master    #1621      +/-   ##
==========================================
- Coverage   80.58%   80.46%   -0.12%
==========================================
  Files          30       32       +2
  Lines        2683     2851     +168
  Branches      412      462      +50
==========================================
+ Hits         2162     2294     +132
- Misses        427      455      +28
- Partials       94      102       +8
```

Continue to review the full report at Codecov.
One of the tests is flaky, I'll look into it tomorrow.
Phew. That's a lot of changes and a lot of new code. Is there any way you could simplify it a little bit? I don't feel I can review it in its current state.

Does it really need to be recursive? Is there really any benefit to using the "postfix" style for-loops? Imho, they make things harder to read.

I don't like the changes to input handling. Having two ways of handling input makes no sense. Try to stick as closely as possible to the current solution to avoid changing something that really isn't related to the distribution of users.
Fixed.
Build is still failing (intermittently?)
To be frank, I'm still a little skeptical of such a big change and I don't think I'll have time to review it any time soon. Maybe @heyman has some input?
The new integration test
I don't know what to tell you, the previous implementation had flaws when running in distributed mode as well as with the load test shape, the latter misbehaving in both modes. The problems are well explained in the two issues this PR is trying to solve.

The two-year-old issue #896 is labelled as low prio, but I personally think this is a bug. Say you have 100 workers and want to hatch 1000 users at a rate of 10; with the previous implementation, you'd have an effective rate of 1000/s ten times in a row, which does not make any sense and can screw up your load test. Now, imagine with even more workers... it's clear the implementation was not scalable and flawed (a rough sketch of this arithmetic follows at the end of this comment). In my opinion, and after a considerable amount of thinking and studying the runners codebase, the most reliable, and probably most precise, way to handle this is to have the master manage the state of users and their dispatch, as this PR does.

Regarding the alternative of having the master send the user count along with a delay so that the worker waits before spawning: I don't think this would scale nor work very well because of the unpredictable nature of the network, with its variable latencies and response times. Furthermore, each worker might take an unpredictable amount of time to process each received spawn message (because it is busy doing another task, for example). Given these aspects, it would be very difficult for the master to compute this delay accurately.

I think a good place to start is by looking at the tests in

I think the most challenging changes to make sense of are in the runners module. So, I'll try to summarize the changes here:
Another note is that
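To make the #896 arithmetic above concrete, here is a rough sketch (illustrative only, not actual Locust code) of how applying the user-specified spawn rate on every worker multiplies the effective cluster-wide ramp-up:

```python
# Illustrative sketch only: if every worker applies the user-specified spawn
# rate locally, the effective cluster-wide rate is multiplied by the number
# of workers.
def effective_cluster_spawn_rate(num_workers: int, per_worker_rate: float) -> float:
    return num_workers * per_worker_rate


# The example from the comment above: 100 workers and a requested rate of
# 10 users/s end up ramping at roughly 1000 users/s cluster-wide.
print(effective_cluster_spawn_rate(100, 10))  # 1000
```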
I agree that this is an area that needs fixes, it is just that I don't have time to review it. I have spent way too much time on the project in the last six months anyway :)

Low prio is just my way to mark tickets as non-critical (and while this is definitely something that we want to fix, it is not critical) (edit: I renamed "low prio" to "non-critical" to make that more clear)

If @heyman, or maybe even one of the other main contributors (doesn't have to be someone with merge permissions), does a first sweep of feedback I can give it a last look and help with the actual clicking of the merge-button :)

Maybe @max-rocket-internet is interested in this kind of feature?
Not really. As I understand it, this PR is about the distribution of user classes, not Locust users, right? In all our load tests I think only 1 or 2 are using more than a single
By this logic you could say the same about ramping up.
I see that you showed some interest in solving the distributed hatch rate issue #896, which is addressed by this PR.
It also addresses the issue of fair dispatch of the users in distributed mode.
I think this works well for most use cases, especially when each task is more or less "equal" in terms of runtime. However, I found that when dealing with unbalanced tasks, i.e. some tasks take only a few seconds or less whereas others can take several minutes, this approach did not work very well. The problem is that, eventually, a large number of users get "stuck" on these long-running tasks even if they have a lower weight compared to the short-running tasks. By separating the tasks into different user classes, it is possible to get much more control over how frequently each task is executed (see the sketch at the end of this comment).
Absolutely, and I think people should do so. As I see it, the spawn rate is mainly there as a convenience. For instance, it's useful when one only wishes to ramp up linearly to a certain steady-state number of users.
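As a hedged illustration of the user-class split described above (the class names, endpoints, and weights are made up for this example, not taken from any real test):

```python
from locust import HttpUser, task, between

# Splitting short- and long-running work into separate user classes means the
# class weights, rather than task weights, control how many users can get tied
# up in the long-running work.

class QuickUser(HttpUser):
    weight = 9                      # 9 out of every 10 users
    wait_time = between(1, 2)

    @task
    def quick_call(self):
        self.client.get("/status")  # returns in well under a second


class SlowReportUser(HttpUser):
    weight = 1                      # only 1 out of every 10 users
    wait_time = between(1, 2)

    @task
    def slow_report(self):
        # a long-running request that can take minutes
        self.client.get("/heavy-report", timeout=600)
```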
I'm liking the progress on this. I have run a few basic tests myself again, and so far everything looks stable. A few more thoughts:
When the master sends a non-null `host` to the workers, set this `host` for the users.
Other logging-related code was also refactored to be cleaner and/or more exact.
I published
I was not even aware the default weights were set to 10 😛 I went back in the history to when
Looks good! Is there a better word than "Updating test with X users" though? Maybe "Target set to X users" or something. "Updating" (to me at least) implies that there was already some relevant state even before (which isn't really the case when running headless).
Hmm. @heyman, do you know the reason for the 10 default weight? I can worry about changing this (worst case it might be something for 3.x :) )

I think we're good to go soon, and having a "flat" structure of only PRs to master is good for the changelog etc, so I'm leaning towards no "v2" branch (possibly one or more 2.0bX prereleases, but that's all). I will also make a 1.6.x branch just before we merge this PR, in case we need to build a release on the old code.
I think that
Perhaps our interpretations differ, but there is always a prior state. When this is the first spawn, right after the test has started, the prior state is "0 users".
Just my 2 cents, but for the new code, the weight value doesn't matter. The weights could be small
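A quick sketch of that point, i.e. that only the ratios between weights matter for how users are distributed; the helper below is illustrative and not the actual `locust.distribution` API:

```python
def normalized_weights(weights: dict) -> dict:
    # Only the ratios between the weights matter, not their absolute values.
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}


# A default weight of 10 gives the same distribution as a default weight of 1,
# as long as the ratios between the user classes are unchanged.
assert normalized_weights({"UserA": 10, "UserB": 30}) == normalized_weights({"UserA": 1, "UserB": 3})
```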
locust/runners.py (Outdated)

```python
# can easily do 200/s. However, 200/s with 50 workers and 20 user classes will likely make the dispatch very
# slow because of the required computations. I (@mboutet) doubt that many Locust's users are spawning
# that rapidly. If so, then they'll likely open issues on GitHub in which case I'll (@mboutet) take a look.
if spawn_rate > 100:
```
Sounds like you are worried about a completely different issue (high load on master) than what this code was trying to warn about (high load on workers due to lots of Users/HTTP sessions being started at the same time)?
I added this warning because people (actually mostly people who had misunderstood the load model) kept filing bugs/SO issues about poor performance on workers during fast ramp-up, so I'd like to keep that. If you want to add a separate warning about high total spawn rate that is fine.
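For what it's worth, a separate master-side warning could look roughly like the sketch below; the thresholds, function name, and `worker_count` parameter are assumptions for illustration, not the actual runner internals:

```python
import logging

logger = logging.getLogger(__name__)

# Hypothetical sketch only: keep the existing per-worker warning about starting
# many Users/HTTP sessions at once, and add a second, master-side warning about
# a high total spawn rate making the dispatch computation slow.
PER_WORKER_SPAWN_RATE_WARNING = 100  # threshold from the snippet above
TOTAL_SPAWN_RATE_WARNING = 500       # made-up threshold for the master-side check


def warn_about_spawn_rates(total_spawn_rate: float, worker_count: int) -> None:
    per_worker_rate = total_spawn_rate / max(worker_count, 1)
    if per_worker_rate > PER_WORKER_SPAWN_RATE_WARNING:
        logger.warning(
            "A spawn rate of %.1f users/s per worker is very high; workers may "
            "struggle to start that many Users/HTTP sessions at once.",
            per_worker_rate,
        )
    if total_spawn_rate > TOTAL_SPAWN_RATE_WARNING:
        logger.warning(
            "A total spawn rate of %.1f users/s is very high; computing the user "
            "dispatch on the master may become slow.",
            total_spawn_rate,
        )
```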
Probably just a language thing. But as a person who is just starting a run (at least in my mind), the previous state is not that there are "zero users running", it is that there is not even a test running at all (so the user count is undefined/impossible, not zero). Maybe "Ramping to X users..."?
👍 I've added a PR and will merge it after this.
Good for
I published
and reword the warning content.
👍 Looks really good now. I'll try to get my colleague to try this out (check the Slack thread, @DennisKrone) next week, and if he finds no issues I'll merge and make a first prerelease build (probably 2.0b0). I'll merge the other breaking changes after that and make new prerelease builds that include those. Also, I have looked into how to make prerelease builds for each commit, so once 2.0 (the release version) is done I'll add that.
Great! Thanks for your valuable feedback these past weeks. And also thank you @domik82 and @dannyfreak for your extensive tests. I've uploaded
I'm ready to merge this now. There's nothing more that needs to happen first, right @mboutet? I have also prepared a branch that adds version checking between master and worker, which I intend to merge before the 2.0 release.
Bombs away!
@domik82 & @dannyfreak, @cyberw and I discussed by DM the performance issue on the master for the kind of scale you have. I think this could be addressed in a separate PR, as I don't have a quick fix for you at the moment. 160-300 workers with 5-10 user classes and 25 000 total users is, in my opinion, next-level in terms of difficulty. It's not easy to ensure all the constraints (distribution, balanced users, round-robin, etc.) are respected while keeping a low computational footprint. I've profiled the code using line_profiler to find the hotspots and optimized a few of them, but at the end of the day, it's Python. So, let's find a way to handle such a large scale in a separate PR. Hopefully, you can get familiar with the code and contribute to improving this use case.
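For anyone who wants to reproduce that kind of profiling, a minimal line_profiler session looks roughly like the sketch below; the call arguments are placeholders and need to be adapted to the actual signature of `locust.dispatch.dispatch_users` in your checkout:

```python
from line_profiler import LineProfiler

from locust.dispatch import dispatch_users  # the dispatch function referenced in this PR

profiler = LineProfiler()
profiler.add_function(dispatch_users)

# Run the scenario you want to measure under the profiler. The arguments below
# are placeholders; replace them with a real worker list, user distribution and
# spawn rate before uncommenting.
# profiler.runcall(dispatch_users, worker_nodes, user_class_occurrences, spawn_rate)

profiler.print_stats()  # per-line timings for the profiled function
```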
Fixes #1618
Fixes #896
Fixes #1635
There are quite a lot of changes in this PR, but they were required to solve these issues.
Changes
- User distribution logic in `locust/distribution.py`.
- User dispatch logic in `locust/dispatch.py`. The master is now responsible for computing the distribution of users and then dispatching a portion of this distribution to each worker. The spawn rate is also respected in distributed mode.
- The `spawning_complete` event now takes `user_class_occurrences` instead of `user_count`.
- Ramp down is no longer driven by `spawn_rate`. See the docstring of `locust.dispatch.dispatch_users` for the rationale behind this decision. The correct way to have a ramp down is to use the load test shape feature.

Demo
Example of a real distributed load test shape with 5 workers and a stop timeout of 30s:
Load test shape profile for this run:
The ramp-up of each stage takes around 4-5s, which correlates with the spawn rate of 1. Also, we can see that the users are gracefully stopped (or killed after the stop timeout of 30s). That's why the ramp-down has small staircases instead of one single drop.
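For readers who have not used the load test shape feature referenced above, a minimal staged shape looks something like the sketch below; the stage durations, user counts, and spawn rates are made-up values, not the exact profile from this run:

```python
from locust import LoadTestShape


class StagesShape(LoadTestShape):
    # (end time in seconds, target user count, spawn rate) -- illustrative values only
    stages = [
        (60, 10, 1),
        (180, 20, 1),
        (240, 5, 1),
    ]

    def tick(self):
        run_time = self.get_run_time()
        for end_time, target_users, spawn_rate in self.stages:
            if run_time < end_time:
                return target_users, spawn_rate
        return None  # returning None ends the test
```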
Other
I hope I did not break anything. All the tests are passing, but while running some real load tests (such as the one above), it seems like the users are not always properly stopped. I added an integration test in `test_runners.TestMasterWorkerRunners.test_distributed_shape_with_stop_timeout` and I can't seem to reproduce the issue, so I don't think this problem is caused by my changes. Let me know if you think it might be. Otherwise, I will open a new issue with an idea I have that might help with this.