6.0.24 request routing causes massive performance regression #2579

johnmadden-sp opened this issue Dec 17, 2024 · 15 comments

johnmadden-sp commented Dec 17, 2024

Issue report

Are you sure this is a bug in Passenger?

Not necessarily a bug, but the implementation change (we think) caused a severe increase in latency, leading to outages. The 6.0.24 changelog mentions: "We changed the way we route requests. Instead of picking the least-busy process, we now prioritize new processes first." Reverting to 6.0.23 fixes the problem.

Please try with the newest version of Passenger to avoid issues that have already been fixed
New issue in 6.0.24

Question 1: What is the problem?

  • What is the expected behavior? No change in performance from 6.0.23
  • What is the actual behavior? An unexplained performance degradation leads to huge latency issues and outages. With a timeout of 2 minutes, Passengers are being restarted automatically, leading to increased CPU use on startup. The new behavior of routing requests to new Passengers is the culprit; I think the issue is that these haven't been warmed up yet, leading to a lot of latency and that Passenger being killed off again. The proc manager kills the child Passenger, sending a 502 to nginx, etc.
  • How can we reproduce it? Deploy new [Rails] containers on 6.0.24 and send traffic to them. We did nothing special here; the impact was immediate and obvious. These are just Docker containers, no rolling restarts. I think we should instead provide a restart configuration option, or something to allow the user to pick the algorithm.

Question 2: Passenger version and integration mode:

  • Enterprise, nginx, libnginx-mod-http-passenger-enterprise

Question 3: OS or Linux distro, platform (including version):

  • Docker image phusion/passenger-ruby32:3.0.7 amd64.

Question 4: Passenger installation method:

Your answer:

  • RubyGems + Gemfile
  • RubyGems, no Gemfile
  • Phusion APT repo
  • Phusion YUM repo
  • OS X Homebrew
  • source tarball
  • Other, please specify:

Phusion-supplied Docker images; the 6.0.24 upgrade comes in via an apt-get install of libnginx-mod-http-passenger-enterprise.

Question 5: Your app's programming language (including any version managers) and framework (including versions):

  • Ruby 3.2.x, Rails 7.x

Question 6: Are you using a PaaS and/or containerization? If so which one?

  • We're building containers off of phusion/passenger-ruby32:3.0.7

CamJN (Member) commented Dec 17, 2024

No rolling restarts are involved, so the difference between 6.0.23 and 6.0.24 is that older processes are prioritized over less-busy processes. The newness bias only comes into play during a rolling restart (i.e. when processes from both before and after the restart began coexist).

So what you are actually seeing is that older, more warmed-up processes are responding more slowly than less-busy processes, which is surprising. It's possible that this is due to another change in 6.0.24, or it might just be that the warmed-up processes are slower; we'll have to gather more data.
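
For readers following along, here is a minimal sketch of the two selection strategies being contrasted. The Proc struct and function names are hypothetical illustrations, not Passenger's actual SessionManagement code:

#include <cstdio>
#include <vector>

// Hypothetical stand-in for a Passenger application process; not the real class.
struct Proc {
    int spawnOrder;     // lower = spawned earlier (older)
    int sessions;       // requests currently being handled
    int maxConcurrency; // 1 when single-threaded, passenger_thread_count when threaded
};

// 6.0.23-style routing: the least-busy process with spare capacity wins.
static const Proc* pickLeastBusy(const std::vector<Proc>& procs) {
    const Proc* best = nullptr;
    for (const Proc& p : procs) {
        if (p.sessions >= p.maxConcurrency) continue;   // full, skip
        if (!best || p.sessions < best->sessions) best = &p;
    }
    return best;
}

// 6.0.24-style routing as described above: the oldest process with spare
// capacity wins, even if newer processes are completely idle.
static const Proc* pickOldestWithCapacity(const std::vector<Proc>& procs) {
    const Proc* best = nullptr;
    for (const Proc& p : procs) {
        if (p.sessions >= p.maxConcurrency) continue;   // full, skip
        if (!best || p.spawnOrder < best->spawnOrder) best = &p;
    }
    return best;
}

int main() {
    // Two threaded processes (concurrency 10): an older one already handling
    // 7 requests and a newer, idle one.
    std::vector<Proc> procs = { {1, 7, 10}, {2, 0, 10} };
    std::printf("least-busy picks process %d, oldest-with-capacity picks process %d\n",
                pickLeastBusy(procs)->spawnOrder,
                pickOldestWithCapacity(procs)->spawnOrder);
    return 0;
}

In this example the 6.0.23-style picker chooses the idle newer process, while the oldest-with-capacity picker keeps routing to the older, already-loaded one.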

Can you provide a bit more information:

Are you using multi-threading?
How many app processes are you having Passenger create?
Are you using Passenger's pre-start option?
Can you capture the latencies of the first several requests to a container with 6.0.23 and 6.0.24?

Can you be more detailed with regards to this bit:

With a timeout of 2 minutes, Passengers are being restarted automatically, […] The proc manager kills the child Passenger, sending a 502 to Nginx, etc.

Specifically, is the Passenger process being killed by a watchdog or are the Passenger-managed app processes being killed by Passenger?

CamJN (Member) commented Dec 17, 2024

Just a note for myself: I can diff these tags with the following to get the most suspect changes:

git diff release-6.0.23 release-6.0.24 -- src ':!src/cxx_supportlib/vendor-modified/boost'

johnmadden-sp (Author) commented:

  • Yes to multi-threading
  • 12 procs (in this app group)
  • Further config:
passenger_root /usr/lib/ruby/vendor_ruby/phusion_passenger/locations.ini;
passenger_ruby /usr/bin/ruby;
passenger_instance_registry_dir /var/passenger-instreg;
passenger_core_file_descriptor_ulimit 16384;
passenger_start_timeout 180;
passenger_pre_start http://localhost/websocket;
passenger_show_version_in_header off;
passenger_intercept_errors on;
passenger_rolling_restarts on;
passenger_pool_idle_time 3600;
passenger_max_request_queue_size 1000;
passenger_max_request_time 120;
passenger_concurrency_model thread;
passenger_max_pool_size 15;
passenger_thread_count 10;
passenger_request_queue_overflow_status_code 504;
  • Latencies... No, there'd be no way of capturing the first few. But after 30 minutes of runtime:
    p95 on 6.0.23 ~390ms
    p95 on 6.0.24 ~1800ms
  • As for what's doing the killing,
InternalUtils.cpp:80 ]: [Client 14-3168] Sending 502 response: application did not send a complete response
CheckoutSession.cpp:449 ]: [Client 3-3129] Maximum request time of 120 seconds reached, killing process 851

CamJN (Member) commented Dec 17, 2024

OK, so the oldest process is likely being loaded with 10 requests before the next process is chosen, and then Passenger is killing the app process because some requests aren't done within 120s. @FooBarWidget thoughts?
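
To make the suspected failure mode concrete: with passenger_thread_count 10, an oldest-first router stacks up to 10 concurrent requests on process 1 before process 2 sees any traffic. Below is a small, self-contained simulation of that dispatch order; it assumes requests arrive faster than they finish, and it is illustrative rather than Passenger's code:

#include <cstdio>
#include <vector>

int main() {
    const int processes = 12;          // 12 procs in this app group, per the report
    const int threadsPerProcess = 10;  // passenger_thread_count 10
    const int concurrentRequests = 25; // arbitrary burst of in-flight requests

    std::vector<int> inFlight(processes, 0);
    for (int r = 1; r <= concurrentRequests; r++) {
        // Oldest-first routing: the lowest-numbered process with spare capacity wins.
        for (int p = 0; p < processes; p++) {
            if (inFlight[p] < threadsPerProcess) {
                inFlight[p]++;
                std::printf("request %2d -> process %d (now %d in flight)\n",
                            r, p + 1, inFlight[p]);
                break;
            }
        }
    }
    // The output shows requests 1-10 all landing on process 1 and 11-20 on
    // process 2. If process 1 is still warming up, those stacked requests can
    // exceed passenger_max_request_time (120s) and the process gets killed,
    // matching the 502s and kills in the logs above.
    return 0;
}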

johnmadden-sp (Author) commented:

We did indeed see high thread use and busyness. (I assumed this was more tied to latency and thus just regular use-the-capacity behavior.) Can we maybe round-robin or otherwise better distribute to new Passengers, or maybe only start sending to new ones once a threshold of them is ready? (But again, we aren't doing rolling restarts; this is happening in new containers on deployment.)

CamJN (Member) commented Dec 24, 2024

Can you tell us how you decided on the 12 processes and 10 threads configuration?

CamJN (Member) commented Dec 27, 2024

I'm having trouble reproducing this issue. I'm able to load nginx + Passenger Enterprise, set to 12 procs and 10 threads, with 120 concurrent requests and 50,000 total requests, and the worst time I get is 1.6 seconds to finish processing a request and reply, with a mean of 208 ms and a median of 185 ms. And that's with a 6-core machine, so it's actually a bit overloaded in that configuration.

joelngwt commented Jan 2, 2025

I've recently been testing an upgrade from 6.0.18 to 6.0.24, and have also noticed an increase in request times. I'm not sure if this is the exact issue (I haven't gotten around to trying out 6.0.23 yet), but there have been no changes other than updating Passenger and nginx (nginx stayed at 1.18, only security updates applied).

I have Locust load-testing results below. In 6.0.24, the requests per second have dropped, there is a lot more jitter in the response times, and the average p95 has increased. There is no change in the median response times.

[Screenshot: Locust load-test results]

Edit: I'm not using Enterprise, if that matters. I just noticed the changelog for the routing changes has an [Enterprise] tag on it.

CamJN (Member) commented Jan 2, 2025

The swapped priority between oldest-process and least-busy-process routing also applies to Passenger OSS; the Enterprise bit was the prioritization of new processes during a rolling restart, which isn't relevant to this issue.

Are you also using the docker image like @johnmadden-sp?

joelngwt commented Jan 3, 2025

Not using Docker; I'm running it on Ubuntu 22 on EC2 instances, installed via apt (https://www.phusionpassenger.com/docs/tutorials/deploy_to_production/installations/oss/aws/ruby/nginx/).

Other details:

  • Ruby 3.2.5 with YJIT
  • Rails 7.1.4
  • passenger_pre_start is used
  • passenger_min_instances 15;
  • passenger_max_pool_size 30;

CamJN (Member) commented Jan 7, 2025

Actually, come to think of it, @joelngwt, if you're using OSS Passenger then the concurrency per process is set to 1 by default, since you cannot use multithreading, so the new routing would behave exactly the same as before, right @FooBarWidget?
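
A quick sketch of why that would be the case (made-up structures, assuming the selection logic described earlier in this thread): when every process advertises a concurrency of 1, a process is either idle or full, so both strategies can only choose among the idle processes.

#include <cstdio>
#include <vector>

struct Proc { int spawnOrder; int sessions; int maxConcurrency; };

int main() {
    // OSS without multithreading: every process has maxConcurrency == 1.
    std::vector<Proc> procs = { {1, 1, 1}, {2, 0, 1}, {3, 0, 1}, {4, 1, 1} };

    const Proc *leastBusy = nullptr, *oldestFree = nullptr;
    for (const Proc& p : procs) {
        if (p.sessions >= p.maxConcurrency) continue;  // busy processes are full, never eligible
        if (!leastBusy || p.sessions < leastBusy->sessions) leastBusy = &p;
        if (!oldestFree || p.spawnOrder < oldestFree->spawnOrder) oldestFree = &p;
    }
    // Every eligible process has sessions == 0, so both strategies pick from
    // the same idle set; here both end up on process 2.
    std::printf("least-busy -> %d, oldest-with-capacity -> %d\n",
                leastBusy->spawnOrder, oldestFree->spawnOrder);
    return 0;
}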

johnmadden-sp (Author) commented:

@CamJN 12 procs as a result of the CPU count, and threads to get better concurrency during I/O waits. (I can't go into a great level of detail publicly, of course.)

CamJN (Member) commented Jan 13, 2025

@johnmadden-sp are you using sticky sessions by any chance?

CamJN (Member) commented Jan 13, 2025

@FooBarWidget I had a thought: since joelngwt should be seeing the same process selection as in 6.0.23, due to OSS not having multithreading, could it be that I messed up the changes to SessionManagement.cpp? Could you please take a look and check whether I got the logic right there? And if the changes there are fine, can you suggest an approach for getting the rolling-restart state into the session management methods, so I can limit the new routing to rolling restarts?

johnmadden-sp (Author) commented:

@CamJN no, no sticky sessions.
