
Distribution of user classes is not respected and some user classes are just never spawned #1618

Closed
mboutet opened this issue Nov 7, 2020 · 19 comments · Fixed by #1621

@mboutet
Contributor

mboutet commented Nov 7, 2020

Describe the bug

When using LoadTestShape in distributed mode and with multiple user classes having different weights, the distribution of users does not respect the weights (even within some tolerance). Furthermore, users with the smaller weights are often never picked up. The problem appears especially when the LoadTestShape specifies stages with small increments (e.g. 5 users at a time).

Expected behavior

The distribution of users should take into account the overall distribution of users across all workers.

Actual behavior

Some user classes with lower weights are never picked up and the distribution of users does not respect the weights (even within a certain tolerance).

Steps to reproduce

I think the problem is that each worker is responsible for spawning its own users. Consider the following setup:

  • 5 workers
  • A load test shape that increases the user count by 5 (at a spawn rate of 1/s) every minute, up to 100 users
  • 3 user classes with weights [35, 55, 10]

Once the test starts, the master will instruct each worker to spawn 1 user every minute. However, the weight_users function will always return the user class with weight 55.
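
To make the failure mode concrete, here is a minimal sketch (my own simplified proportional allocation, not the actual weight_users code) showing how each worker, asked for a single user in isolation, always picks the heaviest class:

# Simplified sketch, NOT the actual weight_users implementation: allocate
# `amount` users to the classes proportionally to their weights, giving any
# leftover slots to the heaviest classes first.
weights = {"Test1": 35, "Test2": 55, "Test3": 10}

def allocate(amount, weights):
    total = sum(weights.values())
    counts = {name: amount * w // total for name, w in weights.items()}
    leftover = amount - sum(counts.values())
    for name in sorted(weights, key=weights.get, reverse=True)[:leftover]:
        counts[name] += 1
    return counts

# Each of the 5 workers is independently told to spawn 1 user,
# so the heaviest class wins the single slot on every worker:
for worker in range(1, 6):
    print(f"worker {worker}: {allocate(1, weights)}")
# worker 1: {'Test1': 0, 'Test2': 1, 'Test3': 0}
# ... identical on every worker; Test1 and Test3 are never spawned.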

Possible solutions

I see two aspects that need to be implemented:

  1. I think that when running in distributed mode, the master runner should tell each worker how many users of each class to spawn instead of only a total number of users. The worker runner would then spawn the specified users as-is instead of computing the buckets itself.
  2. The master runner should keep track of all the running users and their classes so that it can spawn the appropriate classes and preserve the distribution as much as possible. This state could also help solve #896 ("Hatch rate in distributed mode spawns users in batches equal to number of slaves").

I'm not super familiar with the codebase, but would that make sense? Is there some technical limitation I'm not aware of?

Environment

  • OS: Ubuntu 18.04.5 LTS (GNU/Linux 5.4.0-1031-azure x86_64)
  • Python version: Python 3.7.9
  • Locust version: locust==1.3.1
  • Locust command line that you ran:

Master:

  "${python_exec}" -m locust \
    --locustfile load_tests/locustfile_generated.py \
    --master \
    --master-bind-host "${master_host}" \
    --master-bind-port "${master_port}" \
    --expect-workers "${number_of_workers}" \
    --stop-timeout 900 \
    --csv="${results_path}/results" \
    --logfile "${results_path}/logs.log" \
    --loglevel "DEBUG"

Workers:

  "${python_exec}" -m locust \
    --locustfile load_tests/locustfile_generated.py \
    --worker \
    --master-host "${master_host}" \
    --master-port "${master_port}" \
    --csv="${results_path}/results" \
    --logfile "${results_path}/logs.log" \
    --loglevel "DEBUG"
  • Locust file contents (anonymized if necessary):
    The content of each test has been omitted. Also, this file is rendered from a template, so that is why the classes and tasks have generic names.
import json
import os
import random
import uuid
from pathlib import Path

import locust.stats
from essential_generators import DocumentGenerator
from locust import (
    HttpUser,
    LoadTestShape,
    between,
    task,
)

from load_tests.api.common import (
    create_user,
    delete_user,
    get_password,
)

locust.stats.CSV_STATS_INTERVAL_SEC = 2

current_path = Path(os.path.split(__file__)[0])

host = os.environ['API_HOST']

cached_data_file_path = os.environ['CACHED_DATA_FILE_PATH']

gen = DocumentGenerator()

random.seed()


class Test1(HttpUser):

    wait_time = between(5, 10)
    weight = 35
    host = host

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.user_id = None
        self.username = None
        self.password = None
        with open(cached_data_file_path, "rt") as file:
            self.cached_data = json.load(file)

    def on_start(self):
        self.username = f"test-user-{uuid.uuid4()}"
        self.password = get_password()
        self.user_id, user = create_user(self.client, self.username, self.password)

    def on_stop(self):
        if self.user_id is not None:
            delete_user(self.client, self.user_id)
        self.user_id = None
        self.username = None
        self.password = None

    @task(8)
    def test1(self):
        # omitted
        pass

    @task(8)
    def test2(self):
        # omitted
        pass

    @task(8)
    def test3(self):
        # omitted
        pass

    @task(8)
    def test4(self):
        # omitted
        pass

    @task(8)
    def test5(self):
        # omitted
        pass

    @task(8)
    def test6(self):
        # omitted
        pass

    @task(8)
    def test7(self):
        # omitted
        pass

    @task(8)
    def test8(self):
        # omitted
        pass

    @task(8)
    def test9(self):
        # omitted
        pass

    @task(8)
    def test10(self):
        # omitted
        pass

    @task(8)
    def test11(self):
        # omitted
        pass

    @task(12)
    def test12(self):
        # omitted
        pass


class Test2(HttpUser):

    wait_time = between(0, 0.5)
    weight = 55
    host = host

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.user_id = None
        self.username = None
        self.password = None
        with open(cached_data_file_path, "rt") as file:
            self.cached_data = json.load(file)

    def on_start(self):
        self.username = f"test-user-{uuid.uuid4()}"
        self.password = get_password()
        self.user_id, user = create_user(self.client, self.username, self.password)

    def on_stop(self):
        if self.user_id is not None:
            delete_user(self.client, self.user_id)
        self.user_id = None
        self.username = None
        self.password = None

    @task(50)
    def test1(self):
        # omitted
        pass

    @task(50)
    def test2(self):
        # omitted
        pass


class Test3(HttpUser):

    wait_time = between(5, 10)
    weight = 10
    host = host

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.user_id = None
        self.username = None
        self.password = None
        with open(cached_data_file_path, "rt") as file:
            self.cached_data = json.load(file)

    def on_start(self):
        self.username = f"test-user-{uuid.uuid4()}"
        self.password = get_password()
        self.user_id, user = create_user(self.client, self.username, self.password)

    def on_stop(self):
        if self.user_id is not None:
            delete_user(self.client, self.user_id)
        self.user_id = None
        self.username = None
        self.password = None

    @task(100)
    def test4(self):
        # omitted
        pass


class StagesShape(LoadTestShape):
    """
    A simple load test shape class that has a different user count and spawn_rate at
    different stages.
    Keyword arguments:
        stages -- A list of dicts, each representing a stage with the following keys:
            duration -- When this many seconds pass the test is advanced to the next stage
            users -- Total user count
            spawn_rate -- Number of users to start/stop per second
            stop -- A boolean that can stop that test at a specific stage
        stop_at_end -- Can be set to stop once all stages have run.
    """
    stages = [
        {"duration": 300, "users": 5, "spawn_rate": 1},
        {"duration": 600, "users": 25, "spawn_rate": 1},
        {"duration": 900, "users": 50, "spawn_rate": 1},
        {"duration": 4500, "users": 100, "spawn_rate": 1},
        {"duration": 5400, "users": 1, "spawn_rate": 1},
        {"duration": 6300, "users": 50, "spawn_rate": 1},
    ]

    for previous_stage, stage in zip(stages[:-1], stages[1:]):
        assert stage["duration"] > previous_stage["duration"]

    def tick(self):
        run_time = self.get_run_time()

        for stage in self.stages:
            if run_time < stage["duration"]:
                tick_data = (stage["users"], stage["spawn_rate"])
                return tick_data

        return None
@mboutet mboutet added the bug label Nov 7, 2020
@cyberw
Collaborator

cyberw commented Nov 7, 2020

Thanks for this report! I actually think this is an issue with normal user distribution as well (though it is exacerbated by Shapes, because we are spawning few new users at a time).

Even in "normal" ramp-up, if you have 10 workers and say you want 10 users, you will only ever get one User type (the heaviest one, if I remember correctly).

I don't think I'll have time to look at this myself any time soon, but my suggested solution is for every worker to "shift" which users it starts with in weight_users (https://github.com/locustio/locust/blob/master/locust/runners.py#L133).

The math in that function is (overly?) complicated, but I think it can be fixed.

Perhaps it is easiest if we first fix #1601, so there is a consistent ID for each worker (and not just a GUID).

@mboutet
Contributor Author

mboutet commented Nov 8, 2020

I created a draft PR #1621 with some changes that I think would help alleviate the problem without having to change or refactor too much.

@cyberw
Collaborator

cyberw commented Nov 8, 2020

Unfortunately that approach is not good enough.

With a large number of users it might be, but imagine running two user types, launching two users, and sometimes getting 2xUser1, sometimes 1xUser1 + 1xUser2, and sometimes 2xUser2.

Whatever approach we choose must be deterministic, at least in most cases.

@mboutet
Contributor Author

mboutet commented Nov 8, 2020

For the determinism, I think that initializing the random seed to a constant is sufficient to give repeatable runs.

I don't think that the users of locust expect absolute determinism regarding the spawned users. Even in the doc it says:

If you wish to simulate more users of a certain type you can set a weight attribute on those classes. Say for example, web users are three times more likely than mobile users:

the key word being "likely", implying a probability.

However, we should guarantee that each user class is spawned at least once if the number of users is greater than or equal to the number of user classes.

I think the solution is for the weight_users function to compute the desired users based on two constraints:

  1. at least one user of each user class must be running at all times (if the number of users is >= the number of user classes)
  2. the distance between the desired and actual distribution of each user class must be minimal, i.e. math.sqrt(math.fsum(map(lambda x: x**2, (user_class.actual_percentage - user_class.desired_percentage for user_class in user_classes)))). To compute this optimal distribution, we can use a small brute-force algorithm that performs a given number of trials and keeps the users that minimize this distance. Each trial would compute the users using the random.choices function.

I wrote some code to test this idea on my side and it seems to give good results. Of course, it is not completely deterministic (unless the seed is fixed), but at least it does not have the problems of the current implementation. The most difficult part will be to refactor the master and worker runners to have the master decide which users to spawn on each worker.
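
For illustration, here is a rough self-contained sketch of that brute-force idea (the function names and structure are mine, not the code from the gist or the PR):

import math
import random

def distance(counts, weights):
    """Euclidean distance between the actual and desired user distributions."""
    total_users = sum(counts.values())
    total_weight = sum(weights.values())
    return math.sqrt(math.fsum(
        (counts[name] / total_users - weights[name] / total_weight) ** 2
        for name in weights
    ))

def pick_users(amount, weights, trials=100):
    """Brute force: run random trials and keep the candidate set closest to the weights."""
    names = list(weights)
    best, best_dist = None, float("inf")
    for _ in range(trials):
        chosen = random.choices(names, weights=[weights[n] for n in names], k=amount)
        counts = {n: chosen.count(n) for n in names}
        # Constraint 1: every class present at least once when the user count allows it.
        # (A real implementation would need a fallback if no trial satisfies this.)
        if amount >= len(names) and min(counts.values()) == 0:
            continue
        d = distance(counts, weights)
        if d < best_dist:
            best, best_dist = counts, d
    return best

random.seed(0)  # a fixed seed makes the runs repeatable
print(pick_users(5, {"Test1": 35, "Test2": 55, "Test3": 10}))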

@cyberw
Collaborator

cyberw commented Nov 8, 2020

A constant random seed could be OK, but you'd need to distribute it to all the worker nodes; determinism is super important.

I didn't have time to think about your math there, but I guess it might be good :) I'm not entirely sold on a brute-force algorithm either.

Also, it feels a bit heavy-handed to just rewrite everything; I think a fair amount of thought has been put into the current implementation (I didn't write it, though).

@mboutet
Contributor Author

mboutet commented Nov 9, 2020

The brute-force algorithm might not be the smartest approach, although according to my preliminary tests it converges to a solution quickly. The alternative is to add a dependency such as SciKit and use a real optimization algorithm, and I don't know if we want that.

Right now, the idea would be to have the master compute the optimal user distribution at each iteration, then dispatch the users to each worker. The dispatch could take into account the users already running on each worker so that the disruption on each one is minimized (i.e. prevent excessively stopping users). The dispatch will also try to balance the users across the workers as well as respect the spawn rate (this will fix #896).

I think it's clear the current implementation is somewhat broken and needs to change, especially regarding the new load test shape feature.

I tested some stuff if you want to run it on your machine: https://gist.github.com/mboutet/5047465a315868ac7e7290a354c24bb4.
It's not clean and far from complete, but it shows my idea. You can run it and it will print the results to the terminal. You can also run it multiple times to see that the distribution is accurate and fairly repeatable, and that each user class is guaranteed to be present at least once whether the number of users is small or large.

I'm willing to contribute and make all these changes.

@cyberw
Collaborator

cyberw commented Nov 9, 2020

Ok, I'm slightly warming up to your idea :)

But what if we could do something much simpler, like always choosing to start/stop the user type with the biggest diff between current and desired distribution (and picking the first one if there are multiple ones with the same diff), and then looping? Or is that not possible for some reason?

@cyberw
Collaborator

cyberw commented Nov 9, 2020

Another desirable trait of the algorithm should be to spawn a similar distribution of users on all workers (so one worker doesn't get a radically different user distribution than another, if it can be avoided).

@mboutet
Contributor Author

mboutet commented Nov 9, 2020

Another desirable trait of the algorithm should be to spawn a similar distribution of users on all workers (so one worker doesn't get a radically different user distribution than another, if it can be avoided).

Yes, I agree. And this is something I actually need in my own use cases, where some user classes perform quite intensive computation to generate random texts whereas others are low-CPU users. This will be implemented where the current dispatch logic lives:

locust/locust/runners.py

Lines 546 to 559 in 30756ce

for client in self.clients.ready + self.clients.running + self.clients.spawning:
    data = {
        "spawn_rate": worker_spawn_rate,
        "num_users": worker_num_users,
        "host": self.environment.host,
        "stop_timeout": self.environment.stop_timeout,
    }
    if remaining > 0:
        data["num_users"] += 1
        remaining -= 1
    logger.debug("Sending spawn message to client %s" % (client.id))
    self.server.send_to_client(Message("spawn", data, client.id))

This is also where the logic for handling the distributed spawn rate will take place.
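
As a rough sketch of what such a dispatch could look like (my own illustration, not the message format that ended up in #1621), the master would spread explicit per-class counts over the workers so that every worker receives a similar mix:

import itertools

def dispatch(aggregate_counts, worker_ids):
    """Spread the master's aggregate per-class counts over the workers one user
    at a time (round-robin), so no worker gets a radically different mix."""
    plan = {worker: {name: 0 for name in aggregate_counts} for worker in worker_ids}
    workers = itertools.cycle(worker_ids)
    for name, count in aggregate_counts.items():
        for _ in range(count):
            plan[next(workers)][name] += 1
    return plan

# 100 users with weights [35, 55, 10] spread over 5 workers:
for worker, counts in dispatch(
    {"Test1": 35, "Test2": 55, "Test3": 10},
    [f"worker-{i}" for i in range(1, 6)],
).items():
    print(worker, counts)  # every worker ends up with {'Test1': 7, 'Test2': 11, 'Test3': 2}

The spawn message sent to each worker would then carry these per-class counts instead of a bare num_users.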

But what if we could do something much simpler, like always choosing to start/stop the user type with the biggest diff between current and desired distribution (and picking the first one if there are multiple ones with the same diff), and then looping? Or is that not possible for some reason?

I'm not sure I understand your idea, but I think it is similar to what I'm proposing. In my small PoC, I always consider the current set of users in the calculation of the new set, so I'm not computing a completely new set of users each time weight_users is called.

Having the master runner control the distribution will also open the door to presenting stats on the user distribution. This could also be shown in a new tab of the web UI, similar to the one that shows the workers.

I'm aware that all of this is quite a big and complicated refactor, but I'll do my best to keep things simple and not disrupt the existing codebase too much. I believe that if this can be successfully implemented, it will bring much more value to locust.

@cyberw
Collaborator

cyberw commented Nov 9, 2020

What I mean is: instead of choosing a random user and then checking what size of diff we get, we try adding User1, calculate the diff, try adding User2, calculate the diff, etc., and then just pick the one with the minimal diff. So there would be no random selection of users the way you are currently doing in get_random_weighted_users().

I'm not exactly sure how to scale this for spawning lots of users (we probably don't want to do the above procedure once for each user to spawn), but there should be a way without resorting to randomness.
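
A minimal sketch of that greedy, deterministic approach (again just an illustration, reusing the Euclidean-distance measure discussed above; the function names are made up):

import math

def distance(counts, weights):
    """Distance between the actual and desired user distributions."""
    total_users = sum(counts.values())
    total_weight = sum(weights.values())
    return math.sqrt(math.fsum(
        (counts[name] / total_users - weights[name] / total_weight) ** 2
        for name in weights
    ))

def spawn_plan(amount, weights):
    """Grow the population one user at a time, always adding the class that
    leaves the distribution closest to the weights; ties go to the first class
    in declaration order, so the result is fully deterministic."""
    counts = {name: 0 for name in weights}
    for _ in range(amount):
        best = min(
            weights,
            key=lambda name: distance({**counts, name: counts[name] + 1}, weights),
        )
        counts[best] += 1
    return counts

print(spawn_plan(5, {"Test1": 35, "Test2": 55, "Test3": 10}))

Note that this by itself does not enforce the "at least one user of each class" constraint mentioned earlier; that would have to be layered on top.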

@mboutet
Copy link
Contributor Author

mboutet commented Nov 17, 2020

@cyberw, I went with your approach, so everything is deterministic. I still have some work to do on my PR, but once it is ready for review, I will remove the "Draft" status.

@cyberw
Collaborator

cyberw commented Nov 17, 2020

Cool! I left a few comments on the current version and will be back for more later, once you are finished!

@TomaszSlawski-TomTom

I don't know if it is important, but today I ran into the same issue, though only for FastHttpUser. HttpUser works correctly.
Locust version: 1.4.1

@mboutet
Contributor Author

mboutet commented Dec 3, 2020

From my understanding of the codebase, FastHttpUser and HttpUser are not treated differently with regard to the distribution/spawning logic. So it might be a different problem in your case.

@TomaszSlawski-TomTom

I was a bit surprised about this, but it is reproducible. If I use HttpUser, then it works correctly, but with FastHttpUser only users of the first user class are spawned.

@github-actions

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 10 days.

@github-actions github-actions bot added the stale label Apr 21, 2021
@mboutet
Contributor Author

mboutet commented Apr 21, 2021

/remove-lifecycle stale

@github-actions github-actions bot removed the stale label Apr 22, 2021
@github-actions

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 10 days.

@github-actions github-actions bot added the stale label Jun 21, 2021
@mboutet
Contributor Author

mboutet commented Jun 21, 2021

/remove-lifecycle stale

@github-actions github-actions bot removed the stale label Jun 22, 2021