Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Keep Alive option is not working on https.agent #47137

Open
yenfryherrerafeliz opened this issue Mar 17, 2023 · 7 comments
Open

Keep Alive option is not working on https.agent #47137

yenfryherrerafeliz opened this issue Mar 17, 2023 · 7 comments
Labels
https Issues or PRs related to the https subsystem. tls Issues and PRs related to the tls subsystem.

Comments

@yenfryherrerafeliz
Copy link

Version

v18.15.0

Platform

mac

Subsystem

No response

What steps will reproduce the bug?

  • Create a web server that handle requests through https protocol.
    Note: this step is just if you do not have a web server already set up
    • In my case I used an ec2 instance, with a public Ip, and python along with flask as follow:
      • Set up an ec2 install with linux
      • Make sure the security group attached to the instance allows http and https traffic.
      • Install python3 if not installed
      • Create a self signed ssl:
        • openssl req -x509 -newkey rsa:4096 -nodes -out cert.pem -keyout key.pem -days 365
      • Install the python3 needed dependencies:
        • pip3 install flask
        • pip3 install pyopenssl
        • create a python file touch server.py and paste the following code:
        from flask import Flask
        import time
        import json
        
        app = Flask(__name__)
        
        @app.route('/sleep/<int:secs>')
        def sleep(secs):
            time.sleep(secs)
        
            return ({
               'message': 'Function sleep for ' + str(secs) + ' seconds!'
            })
        
        if __name__ == '__main__':
            app.run(port=443, host='0.0.0.0', ssl_context=('cert.pem', 'key.pem')
      • Then, start the server python3 server.py
  • Then, at client side install wireshark to monitor network traffic. In my case I use brew from a mac brew install --cask wireshark
  • Then, set up a nodejs environment, if does not have nodejs make sure you install it. In my case I use nvm, but also need to install nvm before hands.
  • Then, create a js file touch index.js
  • Then, copy and paste the code below:
import { Agent as HAgent, request as HRequest } from "https";
const serverPort = 443;
const serverHost = 'YOUR WEB SERVER HOST, IN MY CASE I USED MY EC2 INSTANCE PUBLIC IP';
const agent = new HAgent({
    keepAlive: true,
    keepAliveMsecs: 10 * 1000,
    maxSockets: 100,
    maxTotalSockets: 100,
    maxFreeSockets: 256,
    timeout: 2000,
    scheduling: "fifo",
    port: serverPort,
    host: serverHost,

});
const handleRequest = (response) => {
    let data = '';
    response.on('data', (chunk) => {
        data += Buffer.from(chunk).toString('utf-8');
    })
    response.on('end', () => {
        console.log('Response: ', data);
    })
}
const reqFunc = HRequest({
    protocol: 'https:',
    headers: {},
    host: serverHost,
    method: 'GET',
    agent: agent,
    path: 'sleep/100',
    port: serverPort,
}, (res) => handleRequest(res));
reqFunc.on('error', (err) => {
    console.log('Error: ', err)
} )
reqFunc.end()
  • Then, open wireshark and set a filter to monitor the interface the host machine you are on is using. If you do not know which host machine is yours you could do route get 10.10.10.10 and you will see the interface there:
    Screenshot 2023-03-17 at 6 41 33 AM

  • Then, set another filter that just shows the traffic where the destination IP is the IP of your web server.

  • Then, just run the nodejs script to start having some traffic.

  • And, based on the configuration of my agent, we should have a keep alive packet going out every 10 seconds, but it will never happen:
    Screenshot 2023-03-17 at 6 47 55 AM(1)

  • But If instead of using a https agent I use http then, I can see the keep alive packets going:
    Untitled Diagram drawio

How often does it reproduce? Is there a required condition?

It is always reproducible as soon as you use https.Agent

What is the expected behavior? Why is that the expected behavior?

When keep alive is set to true then it should be honored and keep-alive packets should be going out.

What do you see instead?

Not keep-alive packets going out.

Additional information

No response

@bnoordhuis bnoordhuis added tls Issues and PRs related to the tls subsystem. https Issues or PRs related to the https subsystem. labels Mar 17, 2023
@bnoordhuis
Copy link
Member

I think I figured out the cause.

  1. https.Agent calls tls.connect()

  2. tls.connect() returns a tls.TLSSocket object

  3. tls.TLSSocket is backed by an internal TCPWrap C++ object, not a net.Socket

  4. tls.TLSSocket.prototype.setKeepAlive() therefore calls TCPWrap::SetKeepAlive(), not net.Socket.prototype.setKeepAlive()

  5. TCPWrap::SetKeepAlive() expects the delay in seconds, not milliseconds

Ergo, it starts TCP keep-alive with an almost 3 hour delay (10,000 seconds), not 10 seconds.


I think the easiest fix would be to move the milliseconds -> seconds conversion from net.Socket.prototype.setKeepAlive() to TCPWrap::SetKeepAlive():

diff --git a/lib/net.js b/lib/net.js
index 654190551e4..72a33246012 100644
--- a/lib/net.js
+++ b/lib/net.js
@@ -458,7 +458,7 @@ function Socket(options) {

   this[kSetNoDelay] = Boolean(options.noDelay);
   this[kSetKeepAlive] = Boolean(options.keepAlive);
-  this[kSetKeepAliveInitialDelay] = ~~(options.keepAliveInitialDelay / 1000);
+  this[kSetKeepAliveInitialDelay] = options.keepAliveInitialDelay;

   // Shut down the socket when we're finished with it.
   this.on('end', onReadableStreamEnd);
@@ -601,11 +601,10 @@ Socket.prototype.setNoDelay = function(enable) {

 Socket.prototype.setKeepAlive = function(enable, initialDelayMsecs) {
   enable = Boolean(enable);
-  const initialDelay = ~~(initialDelayMsecs / 1000);

   if (!this._handle) {
     this[kSetKeepAlive] = enable;
-    this[kSetKeepAliveInitialDelay] = initialDelay;
+    this[kSetKeepAliveInitialDelay] = initialDelayMsecs;
     return this;
   }

@@ -616,12 +615,12 @@ Socket.prototype.setKeepAlive = function(enable, initialDelayMsecs) {
   if (enable !== this[kSetKeepAlive] ||
       (
         enable &&
-        this[kSetKeepAliveInitialDelay] !== initialDelay
+        this[kSetKeepAliveInitialDelay] !== initialDelayMsecs
       )
   ) {
     this[kSetKeepAlive] = enable;
-    this[kSetKeepAliveInitialDelay] = initialDelay;
-    this._handle.setKeepAlive(enable, initialDelay);
+    this[kSetKeepAliveInitialDelay] = initialDelayMsecs;
+    this._handle.setKeepAlive(enable, initialDelayMsecs);
   }

   return this;
@@ -2047,8 +2046,8 @@ function onconnection(err, clientHandle) {
   }
   if (self.keepAlive && clientHandle.setKeepAlive) {
     socket[kSetKeepAlive] = true;
-    socket[kSetKeepAliveInitialDelay] = self.keepAliveInitialDelay;
-    clientHandle.setKeepAlive(true, self.keepAliveInitialDelay);
+    socket[kSetKeepAliveInitialDelay] = 1e3 * self.keepAliveInitialDelay;
+    clientHandle.setKeepAlive(true, 1e3 * self.keepAliveInitialDelay);
   }

   self._connections++;
diff --git a/src/tcp_wrap.cc b/src/tcp_wrap.cc
index baa7618dfa0..d9919fddcfd 100644
--- a/src/tcp_wrap.cc
+++ b/src/tcp_wrap.cc
@@ -202,7 +202,7 @@ void TCPWrap::SetKeepAlive(const FunctionCallbackInfo<Value>& args) {
   int enable;
   if (!args[0]->Int32Value(env->context()).To(&enable)) return;
   unsigned int delay = static_cast<unsigned int>(args[1].As<Uint32>()->Value());
-  int err = uv_tcp_keepalive(&wrap->handle_, enable, delay);
+  int err = uv_tcp_keepalive(&wrap->handle_, enable, delay / 1000);
   args.GetReturnValue().Set(err);
 }

Pull request welcome. The patch is untested but that's the general thrust of the change.

@Alexander-Ludwig
Copy link

Alexander-Ludwig commented Mar 18, 2023

I think I figured out the cause.

  1. https.Agent calls tls.connect()
  2. tls.connect() returns a tls.TLSSocket object
  3. tls.TLSSocket is backed by an internal TCPWrap C++ object, not a net.Socket
  4. tls.TLSSocket.prototype.setKeepAlive() therefore calls TCPWrap::SetKeepAlive(), not net.Socket.prototype.setKeepAlive()
  5. TCPWrap::SetKeepAlive() expects the delay in seconds, not milliseconds

Ergo, it starts TCP keep-alive with an almost 3 hour delay (10,000 seconds), not 10 seconds.

I think the easiest fix would be to move the milliseconds -> seconds conversion from net.Socket.prototype.setKeepAlive() to TCPWrap::SetKeepAlive():

Pull request welcome. The patch is untested but that's the general thrust of the change.

Hey @bnoordhuis,

Following this suspicion I tried my replication scenario with an agent configured with keepAliveMsecs: 1, . This did not change the underlying issue behavior.

Just as a note, this behavior has been highlighted in this issue thread #41965 , which notes that all happens likely due to the agent not actually keeping the socket alive and sending TCP probes until an HTTP response has been received on a particular connection. I have replicated this very behavior, by first sending a short-lived request on a connection, I can confirm that TCP keepalive probes are then sent. This allows subsequent long-running requests to not have connections closed due to TCP idle timeouts.

I believe this is the gist of the issue: TCP keepalive probes are not sent for the first request made by an https.agent on a given socket. My colleague @yenfryherrerafeliz noted in his testing that this issue does not happen with http.agent.

@bnoordhuis
Copy link
Member

TCP keepalive probes are not sent for the first request made by an https.agent on a given socket

The keepAliveMsecs option isn't applied until the socket is added to the connection pool, i.e., after the first request. That's by design, as far as I'm aware.

I'm 95% sure http.Agent works the same way because https.Agent inherits that behavior from http.Agent.

@Alexander-Ludwig
Copy link

TCP keepalive probes are not sent for the first request made by an https.agent on a given socket

The keepAliveMsecs option isn't applied until the socket is added to the connection pool, i.e., after the first request. That's by design, as far as I'm aware.

I'm 95% sure http.Agent works the same way because https.Agent inherits that behavior from http.Agent.

It would be good to confirm this concretely. Just for some context, this particular behavior of not setting tcp keepalive on initial requests causes issues when Node.js applications operate in an environment subject to connection timeouts.

For example as per [1], if a Node.js application has traffic going through AWS NAT Gateway idle connections can be closed are 350 seconds. The recommendation to avoid this happening is to ensure that TCP Keepalive is used. Since the https.agent appears to not be doing this for the first request on a socket, this means that any long duration requests unexpectedly have a leg of the connection dropped.

I am not too sure about the best way to implement TCP Keep Alive from a systems perspective, but HTTP Agents for Python and Java do not seem to have this issue.

References:

[1] https://docs.aws.amazon.com/vpc/latest/userguide/nat-gateway-troubleshooting.html#nat-gateway-troubleshooting-timeout

@bnoordhuis
Copy link
Member

It would be good to confirm this concretely.

The relevant code is in lib/_http_agent.js and lib/https.js but the latter is oblivious to keepAliveMsecs, it merely extends http.Agent to use tls.connect() instead of net.connect().

http.Agent.prototype.keepSocketAlive is the method you're most likely looking for.

@yenfryherrerafeliz
Copy link
Author

@bnoordhuis, @Alexander-Ludwig thanks for looking into this. After doing different tests I think I found a way to get around this issue, and is by adding a listener to the request for when a socket is created, and then there we enable keep alive directly from the socket. For example:
This is the piece of code I added:

req.addListener('socket', (socket) => {
    socket.setKeepAlive(agent?.keepAlive || false, agent?.keepAliveMsecs || 0);
});

Here is my full working example:

import { Agent as HAgent, request as HRequest } from "https";
const serverPort = 443;
const serverHost = 'YOUR WEB SERVER HOST, IN MY CASE I USED MY EC2 INSTANCE PUBLIC IP';
const agent = new HAgent({
    keepAlive: true,
    keepAliveMsecs: 10 * 1000,
    maxSockets: 100,
    maxTotalSockets: 100,
    maxFreeSockets: 256,
    timeout: 2000,
    scheduling: "fifo",
    port: serverPort,
    host: serverHost,

});
const handleRequest = (response) => {
    let data = '';
    response.on('data', (chunk) => {
        data += Buffer.from(chunk).toString('utf-8');
    })
    response.on('end', () => {
        console.log('Response: ', data);
    })
}
const reqFunc = HRequest({
    protocol: 'https:',
    headers: {},
    host: serverHost,
    method: 'GET',
    agent: agent,
    path: 'sleep/100',
    port: serverPort,
}, (res) => handleRequest(res));
reqFunc.on('error', (err) => {
    console.log('Error: ', err)
} )
req.addListener('socket', (socket) => {
    socket.setKeepAlive(nodeHttpsOptions?.agent?.keepAlive || false, nodeHttpsOptions?.agent?.keepAliveMsecs || 0)
});
reqFunc.end()

However, this is a temporary workaround but, it would be really nice if this can be fixed at nodejs side so we do not have to add extra logic to our code.

Thanks!

@bnoordhuis
Copy link
Member

The discussion here wandered a little but I think it identified two issues/bugs:

  1. TCP keep-alive not being applied to TLS sockets: Keep Alive option is not working on https.agent  #47137 (comment))

  2. keepAliveMsecs not being applied until after the first request: Keep Alive option is not working on https.agent  #47137 (comment)

(1) is an obvious bug for which a pull request is welcome.

(2) is a policy change. Best way forward is to open a pull request and see how it's received.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
https Issues or PRs related to the https subsystem. tls Issues and PRs related to the tls subsystem.
Projects
None yet
Development

No branches or pull requests

3 participants