
Response time graph does not work for distributed configuration #1984

Closed
hmliu6 opened this issue Jan 27, 2022 · 4 comments

hmliu6 commented Jan 27, 2022

Describe the bug

The response time graph only shows data for the first few seconds and then shows no data afterward.
The statistics page shows that all response time data are normal.
The JSON returned from the master shows that the current_response_time_xxxx fields are null.
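
For context, a minimal sketch of how those fields can be inspected (assuming the master's web UI is on the default port 8089; /stats/requests is the JSON endpoint the UI chart polls):

```python
# Sketch: poll the master's stats JSON and print the current_response_time_*
# fields, which stay null while the chart shows no data.
# Assumes the web UI is reachable on localhost:8089.
import requests

data = requests.get("http://localhost:8089/stats/requests").json()
for key, value in sorted(data.items()):
    if key.startswith("current_response_time"):
        print(key, "=", value)
```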

The symptom looks exactly the same as old issue #1182.

Expected behavior

The response time graph should keep updating instead of showing no data.

Actual behavior

[Screenshots: locust-chart, locust-stat, locust-stat-requests]

No explicit error or exception was found in the Locust master log or in the browser console.
[Screenshots: locust-master-log, browser-console]

Steps to reproduce

Update Locust to the latest version
Deploy in a master/worker configuration on Kubernetes (roughly the commands sketched below)
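
A rough sketch of the startup commands behind that deployment (the locustfile name, the worker count, and the locust-master host are placeholders; on Kubernetes the master address is normally a Service name):

```bash
# Master pod: serves the web UI on port 8089 by default
locust -f locustfile.py --master --expect-workers 4

# Worker pods: connect back to the master ("locust-master" is a placeholder
# for the master Service name or address)
locust -f locustfile.py --worker --master-host locust-master
```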

Environment

  • OS: CentOS 7
  • Python version: Python 3.6.8
  • Locust version: 2.6.0
hmliu6 added the bug label Jan 27, 2022

cyberw (Collaborator) commented Jan 27, 2022

The last log line indicates that your workers are overloaded. Can you share your locustfile? Are you maybe doing some heavy calculations or using a bad (not gevent-friendly) client?

hmliu6 (Author) commented Jan 27, 2022

@cyberw, thanks for your reply, and sorry that I cannot simply share the locustfile.
The logic inside is that each user downloads an XML file, parses it, and then downloads another file referenced inside the XML (roughly like the sketch below).
I believe the workers are overloaded because a large number of users are spawned.
But this UI graph issue only appeared when we upgraded from 1.0.3 to 2.6.0 (latest); the rest of the locustfile did not change.
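
For illustration only, the locustfile has roughly this shape (the endpoint paths and XML element names here are made up, not the real ones):

```python
# Rough shape of the locustfile, for illustration only; paths and the XML
# structure are placeholders.
import xml.etree.ElementTree as ET

from locust import HttpUser, between, task


class DownloadUser(HttpUser):
    wait_time = between(1, 3)

    @task
    def download_chain(self):
        # 1. Download an XML manifest
        resp = self.client.get("/files/manifest.xml")
        # 2. Parse it in-process (the CPU-heavy part)
        root = ET.fromstring(resp.content)
        # 3. Download the file referenced inside the XML
        file_path = root.findtext("file")
        if file_path:
            self.client.get(file_path, name="/files/[referenced]")
```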

cyberw (Collaborator) commented Jan 27, 2022

XML parsing may be a little too heavy to do in-process. See if you can load the XML without blocking the worker, or maybe switch to a faster XML parser. Maybe Locust has gotten more sensitive to worker overload, but without more info it is impossible to know. Sorry, but you are on your own here...
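
One possible way to act on that advice, as a sketch only (lxml is an extra dependency, and the element name is a placeholder): use lxml's C-based parser, keep the extraction narrow, and optionally push the parse onto gevent's thread pool so the greenlet yields while it runs.

```python
# Sketch: lighten the XML handling on the worker.
# "file" is a placeholder element name; pip install lxml is required.
import gevent
from lxml import etree


def extract_file_path(xml_bytes):
    # Narrow extraction with lxml's fast C parser instead of building and
    # searching a full stdlib ElementTree.
    return etree.fromstring(xml_bytes).findtext("file")


def extract_file_path_offloaded(xml_bytes):
    # Optional: run the parse on gevent's thread pool so the current greenlet
    # yields to other users while it runs (this only helps if the parser
    # releases the GIL, which lxml generally does while parsing).
    return gevent.get_hub().threadpool.apply(extract_file_path, (xml_bytes,))
```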

hmliu6 (Author) commented Mar 19, 2022

Just wanted to update this issue with the root cause we found earlier and then close it.
It turns out to be similar to this Stack Overflow question:
https://stackoverflow.com/questions/63287818/load-generated-by-distributed-worker-processes-is-not-equivalent-generated-by-si

I believe the problem was mainly with the master VM: for reasons that are still unclear, some of the communication from the workers was being dropped, so the statistics were not accurate.
We used the same set of scripts in many other setups (other on-prem clusters and cloud clusters) with no statistics issues; the problem only appeared on this one odd on-prem machine.

The fix, in the end, was that someone upgraded the BIOS on this on-prem cluster and restarted all machines in the cluster, after which everything became normal...

hmliu6 closed this as completed Mar 19, 2022
cyberw added the wontfix label Mar 28, 2022