504 in the frontend facia app #10392

Closed
3 of 4 tasks
alinaboghiu opened this issue Jan 29, 2024 · 6 comments · Fixed by #10450

alinaboghiu (Member) commented Jan 29, 2024

Tasks

  1. cemms1
  2. cemms1
  3. cemms1
cemms1 (Contributor) commented Jan 31, 2024

The volume of ELB 5xx on the rendering app drops significantly when directing fronts-based traffic to the new facia-rendering app

Graph of the rendering app showing the requests vs errors and latency charts for the period before, during and after changing the app doing the fronts rendering:
[image]

cemms1 (Contributor) commented Jan 31, 2024

When we change from an ELB to an ALB, load balancer 504 responses turn into 502 responses

HTTP 504 errors on the facia app for DCR requests:
[screenshot]

HTTP 502 errors on the facia app for DCR requests:
[screenshot]

cemms1 (Contributor) commented Jan 31, 2024

The current theory is the following, from this AWS troubleshooting page:

The target closed the connection with a TCP RST or a TCP FIN while the load balancer had an outstanding request to the target

The load balancer receives a request and forwards it to the target. The target receives the request and starts to process it, but closes the connection to the load balancer too early. This usually occurs when the duration of the keep-alive timeout for the target is shorter than the idle timeout value of the load balancer. Make sure that the duration of the keep-alive timeout is greater than the idle timeout value.

Check the values for the request_processing_time, target_processing_time and response_processing_time fields.

See the following example access log entry:

http 2022-04-15T16:52:50.757968Z app/my-loadbalancer/50dc6c495c0c9188 192.168.131.39:2817 10.0.0.1:80 0.001 4.205 -1 502 - 94 326 "GET http://example.com:80 HTTP/1.1" "curl/7.51.0" - - arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/my-targets/73e2d6bc24d8a067 "Root=1-58337262-36d228ad5d99923122bbe354"

Note: In this access log entry, the request_processing_time is 0.001, the target_processing_time is 4.205, and the response_processing_time is -1.
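To see the mismatch being described here: Node's HTTP server closes idle keep-alive connections after 5 seconds by default, while an ALB's default idle timeout is 60 seconds. A minimal sketch (not from this app, just plain node:http) that prints the target-side default:

    // Sketch only: shows Node's default keep-alive timeout, which is shorter than
    // the ALB's default 60-second idle timeout, i.e. the mismatch described above.
    import { createServer } from 'node:http';

    const server = createServer((req, res) => {
      res.end('ok');
    });

    server.listen(3000, () => {
      // Prints 5000 (ms) unless overridden.
      console.log(`keepAliveTimeout: ${server.keepAliveTimeout}ms`);
    });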

Useful links:

cemms1 (Contributor) commented Feb 1, 2024

I am fairly sure the 502s, and previously the 504s, were being caused by the target server's keep-alive timeout being shorter than the load balancer's connection idle timeout.

The options here are:

  • increase the keep alive timeout on the node application
    e.g. in server.prod.ts:

    const server = app.listen(port);
    server.keepAliveTimeout = 90 * 1000; // ensure this is higher than the default LB idle timeout of 60 seconds
    
  • decrease the idle timeout on the load balancer
    e.g. in the screenshot of the LB settings below, ensure the timeout is lower than Node's default keep-alive timeout of 5 seconds (a rough CDK sketch of this change follows the list)

    [screenshot of the load balancer idle timeout setting]
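For option 2, a rough sketch of what the change could look like if the load balancer is defined with AWS CDK (construct names are hypothetical; the real infrastructure for this app may be defined differently):

    // Sketch only, assuming aws-cdk-lib and an existing ApplicationLoadBalancer construct.
    import { ApplicationLoadBalancer } from 'aws-cdk-lib/aws-elasticloadbalancingv2';

    declare const loadBalancer: ApplicationLoadBalancer; // hypothetical existing construct

    // Keep the load balancer's idle timeout below Node's default 5s keepAliveTimeout,
    // so the target never closes a connection the load balancer still considers open.
    loadBalancer.setAttribute('idle_timeout.timeout_seconds', '4');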

There's a blog post about this issue here, which is worth a quick read

alinaboghiu (Member, Author) commented:

This write-up is fantastic, thank you Charlotte, brilliant 🕵️ work.

cemms1 (Contributor) commented Feb 1, 2024

With guidance from DevX, we decided to go with option 2 and decrease the idle timeout on the load balancer. The PR is here.
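One way to sanity-check the change once deployed (a sketch using the AWS SDK for JavaScript v3; the region and load balancer ARN below are placeholders, not taken from the PR):

    // Sketch only: read back the ALB's idle timeout attribute after the change.
    import {
      ElasticLoadBalancingV2Client,
      DescribeLoadBalancerAttributesCommand,
    } from '@aws-sdk/client-elastic-load-balancing-v2';

    const client = new ElasticLoadBalancingV2Client({ region: 'eu-west-1' }); // placeholder region

    const { Attributes } = await client.send(
      new DescribeLoadBalancerAttributesCommand({
        LoadBalancerArn: 'arn:aws:elasticloadbalancing:...', // placeholder ARN
      }),
    );

    const idleTimeout = Attributes?.find((a) => a.Key === 'idle_timeout.timeout_seconds');
    console.log('idle_timeout.timeout_seconds =', idleTimeout?.Value);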
