SslConnection.DecryptedEnpoint.flush eternal busy loop #4217

Closed
baldersheim opened this issue Oct 17, 2019 · 8 comments

@baldersheim

This morning one of our servers was caught red-handed with a Jetty thread in a very tight busy loop in org.eclipse.jetty.io.ssl.SslConnection.DecryptedEndPoint.flush.

OS: RHEL 6.10 Docker container running on a RHEL 7.7 host
JVM:
$ java -version
openjdk version "11.0.4" 2019-07-16
OpenJDK Runtime Environment AdoptOpenJDK (build 11.0.4+11)
OpenJDK 64-Bit Server VM AdoptOpenJDK (build 11.0.4+11, mixed mode)
Jetty: 9.4.20.v20190813

Here is an expanded view of the interesting call graph, captured with the 'perf' tool on Linux.

  • 16.10% 2.37% ... Lorg/eclipse/jetty/io/ssl/SslConnection$DecryptedEndPoint;.flush
  • 85.31% Lorg/eclipse/jetty/io/ssl/SslConnection$DecryptedEndPoint;.flush
    • 98.10% Lsun/security/ssl/SSLEngineImpl;.wrap
      • 98.74% Lsun/security/ssl/SSLEngineImpl;.writeRecord
        • 78.41% _complete_monitor_locking_Java
        • 15.99% Lsun/security/ssl/TransportContext;.getHandshakeStatus
        • 2.44% 0xffffffff9658defa
        • 2.25% SharedRuntime::complete_monitor_unlocking_C
      • 1.12% Lsun/security/ssl/TransportContext;.getHandshakeStatus
    • 1.22% Lsun/security/ssl/TransportContext;.getHandshakeStatus
      0.55% Lsun/security/ssl/SSLEngineImpl;.writeRecord

We took a heap dump, analysed it alongside the code, and observed the following:
1. The connection is in handshake: org.eclipse.jetty.io.ssl.SslConnection$Handshake = INITIAL
2. The _encryptedOutput buffer is empty, with a capacity of 17408.
3. _sslEngine.conContext.conSession.acceptLargeFragments = true

These values lead to
891 wrapResult = _sslEngine.wrap(appOuts, _encryptedOutput);
returning a status of BUFFER_OVERFLOW, because _encryptedOutput does not have room for at least getPacketBufferSize() bytes.
The reason is that getPacketBufferSize() returns 33093 when acceptLargeFragments is true, as opposed to the usual 16709, unless maxPacketSize has been set or negotiation has completed and a maxNegotiatedPacketSize has been reached; neither was the case in this heap dump. With _encryptedOutput's capacity of only 17408, the wrap() call therefore keeps reporting BUFFER_OVERFLOW.
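
For reference, the standard JSSE pattern for driving SSLEngine.wrap() handles exactly this case: on BUFFER_OVERFLOW the destination buffer is re-sized to the current getPacketBufferSize() and the wrap is retried, because that value can grow mid-handshake. A minimal, self-contained sketch (generic SSLEngine usage, not Jetty code; method and variable names are illustrative only):

    import java.nio.ByteBuffer;
    import javax.net.ssl.SSLEngine;
    import javax.net.ssl.SSLEngineResult;

    class WrapSketch
    {
        // Wrap application data, growing the network buffer whenever the engine
        // reports BUFFER_OVERFLOW. getPacketBufferSize() is re-queried on every
        // overflow because it can change during the handshake (e.g. 33093 with
        // large fragments vs 16709 otherwise). Handshake-status handling
        // (NEED_UNWRAP, NEED_TASK) is omitted for brevity.
        static ByteBuffer wrapFully(SSLEngine engine, ByteBuffer appData, ByteBuffer netBuffer) throws Exception
        {
            while (appData.hasRemaining())
            {
                SSLEngineResult result = engine.wrap(appData, netBuffer);
                switch (result.getStatus())
                {
                    case OK:
                        break;
                    case BUFFER_OVERFLOW:
                        // Allocate a buffer sized to the *current* packet size and
                        // carry over anything already written.
                        ByteBuffer larger = ByteBuffer.allocate(
                            netBuffer.position() + engine.getSession().getPacketBufferSize());
                        netBuffer.flip();
                        larger.put(netBuffer);
                        netBuffer = larger;
                        break;
                    case CLOSED:
                        return netBuffer;
                    default:
                        throw new IllegalStateException("Unexpected status: " + result.getStatus());
                }
            }
            return netBuffer;
        }
    }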

This means that once you end up in this state, with an empty _encryptedOutput buffer that is too small, you get a tight loop:
852 while (true)
......
931 case BUFFER_OVERFLOW:
932 if (!flushed)
933 return result = false;
934 -------> continue;

Since flushed is true (because _encryptedOutput is empty), the continue statement on line 934 seals the tight loop.

The key point here is that SSLEngine.getSession().getPacketBufferSize() can change during the handshake phase of the connection.

We made the same mistake in our own use of the SSLEngine and discovered that getPacketBufferSize() can change during the handshake phase.

We suggest adding a call to releaseEncryptedOutputBuffer() before the 'continue' statement on line 934. That should trigger allocation of a new buffer with the correct size, as sketched below.
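
For concreteness, a rough sketch of what that change would look like in the case arm quoted above (paraphrased from the excerpt, not an actual Jetty patch; the surrounding flush() body is omitted):

    case BUFFER_OVERFLOW:
        if (!flushed)
            return result = false;
        // Release the under-sized buffer so the next loop iteration re-acquires
        // one sized to the current (possibly larger) getPacketBufferSize().
        releaseEncryptedOutputBuffer();
        continue;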

I hope there is enough information here for you to figure out how the connection could end up in this state, and how it can be reproduced. My knowledge of Jetty and SSL is limited, and since this has only been observed once, I am unsure how to get it back into this state.

Feel free to ask for more details. I hope you have something to go by. I have a heap dump and a server in a bad state, but the server will soon be restarted and put back into operation.


sbordet commented Oct 17, 2019

Thanks for the detailed analysis. I'm looking into this.

sbordet self-assigned this Oct 17, 2019

sbordet commented Oct 17, 2019

@baldersheim I cannot see how to test the real case, where a TLS read causes the buffer expansion, and a subsequent wrap() causes BUFFER_OVERFLOW.

However, I have a test that verifies that we release the buffer in case of BUFFER_OVERFLOW as you suggest.

Have you been able to test your suggested solution and verified that it works?

Out of curiosity, what is the client that uses large TLS fragments?


sbordet commented Oct 17, 2019

PR #4218.

@baldersheim can you test the fix (build the PR branch and try it out)?

@baldersheim

Yes, I will do that. But probably not until Monday.

I think the client in this case is an nginx server running as a proxy. We recently rebuilt it to use OpenSSL 1.1.1 instead of the older 1.0.1e. I wonder if that might be part of the equation.
Will keep you posted with our findings.

sbordet added a commit that referenced this issue Oct 19, 2019
Releasing the decrypted input buffer so that it can be re-acquired
with an expanded capacity.
Looping around only if the buffer size has changed.

Signed-off-by: Simone Bordet <[email protected]>
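
A generic illustration of the guard that commit message describes (not the actual diff; releaseBuffer() and acquireBuffer() are hypothetical helpers standing in for Jetty's buffer pooling):

    int oldCapacity = buffer.capacity();
    releaseBuffer(buffer);            // hypothetical: return the too-small buffer to the pool
    buffer = acquireBuffer();         // hypothetical: re-acquire at the current session buffer size
    if (buffer.capacity() > oldCapacity)
        continue;                     // the buffer actually grew, so retrying can make progress
    return false;                     // no growth: fail the operation instead of busy-looping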
sbordet added a commit that referenced this issue Oct 19, 2019
Updates after review.
Added test case.

Signed-off-by: Simone Bordet <[email protected]>
sbordet added a commit that referenced this issue Oct 21, 2019
…er_overflow_busy_loop

Fixes #4217 - SslConnection.DecryptedEnpoint.flush eternal busy loop.
joakime added a commit that referenced this issue Oct 24, 2019
joakime added a commit that referenced this issue Oct 25, 2019
+ Small cleanup of duplicate if statements

Signed-off-by: Joakim Erdfelt <[email protected]>

joakime commented Oct 29, 2019

@baldersheim were you able to confirm that it was nginx compiled against OpenSSL 1.1.1 that was the source of the problematic SSL/TLS client?

@baldersheim

No, the source has not been found yet. I was not able to reproduce it in our production system.
The suspicion is that this was our internal "paranoids" security team doing port scans and testing various vulnerabilities. That is a background activity outside of our control that runs very frequently. They have caused issues before on protocols that were less robust, but since enabling TLS everywhere this is the first incident.


joakime commented Oct 29, 2019

@baldersheim out of curiosity, since you mentioned a custom nginx: is your nginx compiled with the dynamic TLS record patch from Cloudflare?

https://blog.cloudflare.com/optimizing-tls-over-tcp-to-reduce-latency/

If your nginx configuration has options like ssl_dyn_rec_size_hi, then you probably have it compiled in.

joakime added a commit that referenced this issue Oct 30, 2019
+ Cleanup from review

Signed-off-by: Joakim Erdfelt <[email protected]>
joakime added a commit that referenced this issue Oct 30, 2019
joakime added a commit that referenced this issue Oct 30, 2019
+ Flush on BUFFER_OVERFLOW

Signed-off-by: Joakim Erdfelt <[email protected]>
joakime added a commit that referenced this issue Oct 30, 2019
joakime added a commit that referenced this issue Oct 30, 2019
joakime added a commit that referenced this issue Oct 31, 2019
Issue #4217 - (9.2.x) SslConnection DecryptedEndpoint flush eternal busy loop
joakime added a commit that referenced this issue Oct 31, 2019
…flush-loop

Issue #4217 - (9.3.x) SslConnection DecryptedEndpoint flush eternal busy loop

joakime commented Nov 6, 2019

This fix has been backported to jetty-9.2.x and jetty-9.3.x as well.
Releases are planned for both of those versions of Jetty.
