
Segfault in CRT HTTP engine connection shutdown #413

Closed
aajtodd opened this issue Nov 11, 2021 · 6 comments · Fixed by #497
Labels: bug


aajtodd commented Nov 11, 2021

Describe the bug

One of our async stress tests for the CRT HTTP engine is segfaulting on Windows.

2021-11-10T16:50:20.6269236Z [DEBUG] [2021-11-10T16:48:43Z] [000010d4] [channel-bootstrap] - id=000001AEABDCDA20: channel 000001AEABC0D800 shutdown with error 0.
2021-11-10T16:50:20.6270298Z [INFO] [2021-11-10T16:48:43Z] [000010d4] [http-connection] - 000001AEAD1C45B0: Client shutdown completed with error 0 (AWS_ERROR_SUCCESS).
2021-11-10T16:50:20.6271351Z [DEBUG] [2021-11-10T16:48:43Z] [000010d4] [connection-manager] - id=000001AEABDD7370: shutdown received for connection (id=000001AEAD1C45B0)
2021-11-10T16:50:20.6272476Z [DEBUG] [2021-11-10T16:48:43Z] [000010d4] [connection-manager] - id=000001AEABDD7370: snapshot - state=1, idle_connection_count=0, pending_acquire_count=695, pending_connect_count=1, vended_connection_count=15, open_connection_count=15, ref_count=1
2021-11-10T16:50:20.6273464Z [DEBUG] [2021-11-10T16:48:43Z] [000010d4] [channel] - id=000001AEABC0D800: destroying channel.
2021-11-10T16:50:20.6274148Z [DEBUG] [2021-11-10T16:48:43Z] [000010d4] [socket] - id=000001AEAC5CEB90, handle=FFFFFFFFFFFFFFFF: cleaning up socket.
2021-11-10T16:50:20.6274854Z [DEBUG] [2021-11-10T16:48:43Z] [000010d4] [socket] - id=000001AEAC5CEB90 handle=FFFFFFFFFFFFFFFF: closing
2021-11-10T16:50:20.6275597Z [DEBUG] [2021-11-10T16:48:43Z] [000010d4] [channel-bootstrap] - id=000001AEABDCDA20: releasing bootstrap reference
2021-11-10T16:50:20.6276397Z [DEBUG] [2021-11-10T16:48:43Z] [000010d4] [task-scheduler] - id=000001AEAD56DE70: Running on_channel_setup_complete task with <Running> status
2021-11-10T16:50:20.6277154Z [DEBUG] [2021-11-10T16:48:43Z] [000010d4] [channel] - id=000001AEABC0B400: setup complete, notifying caller.
2021-11-10T16:50:20.6278187Z [DEBUG] [2021-11-10T16:48:43Z] [000010d4] [channel] - id=000001AEABC0B400: message pool 000001AEADBDCDB0 found in event-loop local storage: using it.
2021-11-10T16:50:20.6279287Z [DEBUG] [2021-11-10T16:48:43Z] [000010d4] [channel-bootstrap] - id=000001AEABDCDA20: channel 000001AEABC0B400 setup succeeded: bootstrapping.
2021-11-10T16:50:20.6280343Z [TRACE] [2021-11-10T16:48:43Z] [000010d4] [channel] - id=000001AEABC0B400: creating new slot 000001AEAB4F3610.
2021-11-10T16:50:20.6281291Z [DEBUG] [2021-11-10T16:48:43Z] [000010d4] [socket-handler] - id=000001AEABE15510: Socket handler created with max_read_size of 262144
2021-11-10T16:50:20.6282185Z [TRACE] [2021-11-10T16:48:43Z] [000010d4] [socket] - id=000001AEAC5CC970 handle=00000000000008B0: subscribing to readable event
2021-11-10T16:50:20.6283430Z [TRACE] [2021-11-10T16:48:43Z] [000010d4] [channel-bootstrap] - id=000001AEABDCDA20: Setting up socket handler on channel 000001AEABC0B400 with handler 000001AEABE15510 on slot 000001AEAB4F3610.
2021-11-10T16:50:20.6284338Z [TRACE] [2021-11-10T16:48:43Z] [000010d4] [channel] - id=000001AEABC0B400: scheduling task with wrapper task id 000001AEABC0B668.
2021-11-10T16:50:20.6285093Z [TRACE] [2021-11-10T16:48:43Z] [000010d4] [event-loop] - id=000001AEABE064B0: scheduling task 000001AEABC0B668 in-thread for timestamp 0
2021-11-10T16:50:20.6285933Z [DEBUG] [2021-11-10T16:48:43Z] [000010d4] [task-scheduler] - id=000001AEABC0B668: Scheduling window update task task for immediate execution
2021-11-10T16:50:20.6286867Z [TRACE] [2021-11-10T16:48:43Z] [000010d4] [http-connection] - static: Socket connected, creating client connection object.
2021-11-10T16:50:20.6287651Z [TRACE] [2021-11-10T16:48:43Z] [000010d4] [channel] - id=000001AEABC0B400: creating new slot 000001AEAB4F3A90.
2021-11-10T16:50:20.6288593Z [INFO] [2021-11-10T16:48:43Z] [000010d4] [http-connection] - id=000001AEAD1C0FF0: HTTP/1.1 client connection established.
2021-11-10T16:50:20.6289497Z [DEBUG] [2021-11-10T16:48:43Z] [000010d4] [connection-manager] - id=000001AEABDD7370: Received new connection (id=000001AEAD1C0FF0) from http layer
2021-11-10T16:50:20.6290403Z [DEBUG] [2021-11-10T16:48:43Z] [000010d4] [connection-manager] - id=000001AEABDD7370: Grabbing pooled connection (000001AEAD1C0FF0)
2021-11-10T16:50:20.6291706Z [DEBUG] [2021-11-10T16:48:43Z] [000010d4] [connection-manager] - id=000001AEABDD7370: snapshot - state=1, idle_connection_count=0, pending_acquire_count=694, pending_connect_count=0, vended_connection_count=16, open_connection_count=16, ref_count=1
2021-11-10T16:50:20.6292957Z [DEBUG] [2021-11-10T16:48:43Z] [000010d4] [connection-manager] - id=000001AEABDD7370: Successfully completed connection acquisition with connection id=000001AEAD1C0FF0
2021-11-10T16:50:20.6293988Z [DEBUG] [2021-11-10T16:48:43Z] [000010d4] [http-connection] - ConnManager Acquired Conn: conn: 000001AEAD1C0FF0, manager: 000001AEABDD7370, err_code: 0,  err_str: Success.
2021-11-10T16:50:20.6294933Z [TRACE] [2021-11-10T16:48:43Z] [000010d4] [event-loop] - id=000001AEABE064B0: detected more scheduled tasks with the next occurring at 0, using timeout of 0.
2021-11-10T16:50:20.6295695Z [TRACE] [2021-11-10T16:48:43Z] [000010d4] [event-loop] - id=000001AEABE064B0: waiting for a maximum of 0 ms
2021-11-10T16:50:20.6296497Z [TRACE] [2021-11-10T16:48:43Z] [000010d4] [event-loop] - id=000001AEABE064B0: running scheduled tasks.
2021-11-10T16:50:20.6297358Z [DEBUG] [2021-11-10T16:48:43Z] [000010d4] [task-scheduler] - id=000001AEABC0B668: Running window update task task with <Running> status
2021-11-10T16:50:20.6297358Z # A fatal error has been detected by the Java Runtime Environment:
2021-11-10T16:50:20.6298104Z #
2021-11-10T16:50:20.6298599Z #  EXCEPTION_ACCESS_VIOLATION (0xc0000005) at pc=0x00007ffccbc0da13, pid=436, tid=5812
2021-11-10T16:50:20.6299071Z #
2021-11-10T16:50:20.6299557Z # JRE version: OpenJDK Runtime Environment Temurin-11.0.13+8 (11.0.13+8) (build 11.0.13+8)
2021-11-10T16:50:20.6300736Z # Java VM: OpenJDK 64-Bit Server VM Temurin-11.0.13+8 (11.0.13+8, mixed mode, tiered, compressed oops, g1 gc, windows-amd64)
2021-11-10T16:50:20.6301502Z # Problematic frame:
2021-11-10T16:50:20.6301953Z # C  [AWSCRT_16365629109837760066350924725629aws-crt-jni.dll+0xda13]
2021-11-10T16:50:20.6302330Z #
2021-11-10T16:50:20.6303015Z # Core dump will be written. Default location: D:\a\aws-sdk-kotlin\aws-sdk-kotlin\aws-runtime\http-client-engine-crt\hs_err_pid436.mdmp
2021-11-10T16:50:20.6303887Z #
2021-11-10T16:50:20.6304270Z # An error report file with more information is saved as:
2021-11-10T16:50:20.6305026Z # D:\a\aws-sdk-kotlin\aws-sdk-kotlin\aws-runtime\http-client-engine-crt\hs_err_pid436.log
2021-11-10T16:50:20.6306498Z Compiled method (nm)   16764 6397     n 0       software.amazon.awssdk.crt.http.HttpClientConnection::httpClientConnectionShutdown (native)
2021-11-10T16:50:20.6308172Z  total in heap  [0x000001ae9fd04a90,0x000001ae9fd04e68] = 984
2021-11-10T16:50:20.6308811Z  relocation     [0x000001ae9fd04c08,0x000001ae9fd04c38] = 48
2021-11-10T16:50:20.6309567Z  main code      [0x000001ae9fd04c40,0x000001ae9fd04e60] = 544
2021-11-10T16:50:20.6310179Z  oops           [0x000001ae9fd04e60,0x000001ae9fd04e68] = 8
2021-11-10T16:50:20.6310582Z #
2021-11-10T16:50:20.6310958Z # If you would like to submit a bug report, please visit:
2021-11-10T16:50:20.6311587Z #   https://github.com/adoptium/adoptium-support/issues
2021-11-10T16:50:20.6312481Z # The crash happened outside the Java Virtual Machine in native code.
2021-11-10T16:50:20.6313096Z # See problematic frame for where to report the bug.
2021-11-10T16:50:20.6313482Z #
2021-11-10T16:50:20.6315302Z [TRACE] [2021-11-10T16:48:43Z] [000010d4] [socket-handler] - id=000001AEABE15510: increment read window message received, scheduling task for another read operation.
2021-11-10T16:50:20.6316234Z [TRACE] [2021-11-10T16:48:43Z] [000010d4] [channel] - id=000001AEABC0B400: scheduling task with wrapper task id 000001AEABE15548.
2021-11-10T16:50:20.6317045Z [TRACE] [2021-11-10T16:48:43Z] [000010d4] [event-loop] - id=000001AEABE064B0: scheduling task 000001AEABE15548 in-thread for timestamp 0
2021-11-10T16:50:20.6318064Z [DEBUG] [2021-11-10T16:48:43Z] [000010d4] [task-scheduler] - id=000001AEABE15548: Scheduling socket_handler_read_on_window_increment task for immediate execution
2021-11-10T16:50:20.6319063Z [TRACE] [2021-11-10T16:48:43Z] [000010d4] [event-loop] - id=000001AEABE064B0: detected more scheduled tasks with the next occurring at 0, using timeout of 0.
2021-11-10T16:50:20.6319863Z [TRACE] [2021-11-10T16:48:43Z] [000010d4] [event-loop] - id=000001AEABE064B0: waiting for a maximum of 0 ms
2021-11-10T16:50:20.6320513Z [TRACE] [2021-11-10T16:48:43Z] [000010d4] [event-loop] - id=000001AEABE064B0: running scheduled tasks.
2021-11-10T16:50:20.6321859Z [DEBUG] [2021-11-10T16:48:43Z] [000010d4] [task-scheduler] - id=000001AEABE15548: Running socket_handler_read_on_window_increment task with <Running> status
2021-11-10T16:50:20.6322785Z [TRACE] [2021-11-10T16:48:43Z] [000010d4] [socket-handler] - id=000001AEABE15510: invoking read. Downstream window 262144, max_to_read 262144
2021-11-10T16:50:20.6332389Z [TRACE] [2021-11-10T16:48:43Z] [000010d4] [channel] - id=000001AEABC0B400: acquired message 000001AEB3D52740 of capacity 262144 from pool 000001AEADBDCDB0. Requested size was 262144
2021-11-10T16:50:20.6333367Z [TRACE] [2021-11-10T16:48:43Z] [000010d4] [socket] - id=000001AEAC5CC970 handle=00000000000008B0: reading from socket
2021-11-10T16:50:20.6334396Z [TRACE] [2021-11-10T16:48:43Z] [000010d4] [socket] - id=000001AEAC5CC970 handle=00000000000008B0: read would block, scheduling 0 byte read and returning
2021-11-10T16:50:20.6335476Z [TRACE] [2021-11-10T16:48:43Z] [000010d4] [socket-handler] - id=000001AEAB4F3630: total read on this tick 0
2021-11-10T16:50:20.6336438Z [TRACE] [2021-11-10T16:48:43Z] [000010d4] [socket-handler] - id=000001AEABE15510: out of data to read on socket. Waiting on event-loop notification.
2021-11-10T16:50:20.6337436Z [TRACE] [2021-11-10T16:48:43Z] [000010d4] [event-loop] - id=000001AEABE064B0: detected more scheduled tasks with the next occurring at 859251931100, using timeout of 493.
2021-11-10T16:50:20.6338377Z [TRACE] [2021-11-10T16:48:43Z] [000010d4] [event-loop] - id=000001AEABE064B0: waiting for a maximum of 493 ms
2021-11-10T16:50:20.6338769Z 

Steps to Reproduce

I've only been able to reproduce this in CI so far. Here is a failing CI run with full logs and the JVM dump as an artifact:
https://github.com/awslabs/aws-sdk-kotlin/actions/runs/1446034029

Possible Solution

The CRT team is currently investigating.

Context

aajtodd added the bug label Nov 11, 2021

stiost commented Dec 9, 2021

I'm experiencing the same issue on Linux with OpenJDK 11 and 17. It fails when I launch 20 coroutines concurrently but works with 5. I'm using the SDK to upload about 2000 files of 100 KB–2 MB to S3.


aajtodd commented Dec 9, 2021

@stiost Thanks for reporting this. Do you have some example code you can share that triggers this on Linux? We have still only been able to reproduce this in CI on Windows. Any other details about your environment would help as well.


kiiadi commented Jan 5, 2022

The same thing is happening on macOS:

// Note: `gate` and `bucketName` were not defined in the original snippet; a
// kotlinx-coroutines Semaphore and a placeholder bucket name are assumed below
// so the snippet compiles. SDK import paths may vary slightly by version.
import aws.sdk.kotlin.services.s3.*
import aws.smithy.kotlin.runtime.content.ByteStream
import kotlinx.coroutines.*
import kotlinx.coroutines.sync.Semaphore
import kotlinx.coroutines.sync.withPermit

val bucketName = "my-test-bucket"   // placeholder; substitute a real bucket
val gate = Semaphore(permits = 20)  // cap concurrent in-flight requests

fun main() = runBlocking {
    val s3 = S3Client.fromEnvironment()
    repeat(10000) {
        launch {
            val number = it.toString().padStart(8, '0')

            setOf("foo", "bar", "baz").map {
                async {
                    gate.withPermit {
                        s3.putObject {
                            bucket = bucketName
                            this.key = "$number/$it"
                            body = ByteStream.fromString("boop")
                        }
                    }
                }
            }.awaitAll()
        }
    }
}


aajtodd commented Jan 7, 2022

Status update:

I've added a bunch of debugging output to crt-java to track close/shutdown calls. It looks like the Kotlin SDK is in fact invoking shutdown() after close(). It does so because, as it turns out, it handles the same exception twice.

The coroutine that gets cancelled causes an exception to be handled here. The second time is triggered from the completion handler registered here (which ends up in this function).

cancelling
roundTrip exception: id=3 throwable=java.util.concurrent.CancellationException: Parent job is Cancelling; conn=aws.sdk.kotlin.crt.http.HttpClientConnectionJVM@7d7758be jniConn=software.amazon.awssdk.crt.http.HttpClientConnection@2bdd8394
[INFO] [2022-01-07T21:39:44Z] [00007000053f1000] [Unknown] - software.amazon.awssdk.crt.http.HttpClientConnection@2bdd8394::releaseNativeHandle(); nativeHandle=140482209647888;
        stack:java.lang.RuntimeException
        at software.amazon.awssdk.crt.http.HttpClientConnection.releaseNativeHandle(HttpClientConnection.java:74)
        at software.amazon.awssdk.crt.CrtResource.release(CrtResource.java:218)
        at software.amazon.awssdk.crt.CrtResource.decRef(CrtResource.java:294)
        at software.amazon.awssdk.crt.CrtResource.close(CrtResource.java:276)
        at software.amazon.awssdk.crt.http.HttpClientConnectionManager.releaseConnection(HttpClientConnectionManager.java:162)
        at aws.sdk.kotlin.crt.http.HttpClientConnectionManager.releaseConnection(HttpClientConnectionManagerJVM.kt:42)
        at aws.sdk.kotlin.runtime.http.engine.crt.CrtHttpEngine.roundTrip(CrtHttpEngine.kt:108)
        at aws.sdk.kotlin.runtime.http.engine.crt.CrtHttpEngine$roundTrip$1.invokeSuspend(CrtHttpEngine.kt)
        at kotlin.coroutines.jvm.internal.BaseContinuationImpl.resumeWith(ContinuationImpl.kt:33)
        at kotlinx.coroutines.DispatchedTask.run(DispatchedTask.kt:104)
        at kotlinx.coroutines.EventLoopImplBase.processNextEvent(EventLoop.common.kt:274)
        at kotlinx.coroutines.BlockingCoroutine.joinBlocking(Builders.kt:85)
        at kotlinx.coroutines.BuildersKt__BuildersKt.runBlocking(Builders.kt:59)
        at kotlinx.coroutines.BuildersKt.runBlocking(Unknown Source)
        at kotlinx.coroutines.BuildersKt__BuildersKt.runBlocking$default(Builders.kt:38)
        at kotlinx.coroutines.BuildersKt.runBlocking$default(Unknown Source)
        at aws.sdk.kotlin.example.MainKt.main(Main.kt:33)
        at aws.sdk.kotlin.example.MainKt.main(Main.kt)

call complete: id=3, handler=aws.sdk.kotlin.runtime.http.engine.crt.SdkStreamResponseHandler@68746f22; throwable=java.util.concurrent.CancellationException: Parent job is Cancelling; conn=aws.sdk.kotlin.crt.http.HttpClientConnectionJVM@7d7758be jniConn=software.amazon.awssdk.crt.http.HttpClientConnection@2bdd8394
forcing closed! handler=aws.sdk.kotlin.runtime.http.engine.crt.SdkStreamResponseHandler@68746f22; forceClose=true; conn=aws.sdk.kotlin.crt.http.HttpClientConnectionJVM@7d7758be jniConn=software.amazon.awssdk.crt.http.HttpClientConnection@2bdd8394; stream=null
[INFO] [2022-01-07T21:39:44Z] [00007000053f1000] [Unknown] - software.amazon.awssdk.crt.http.HttpClientConnection@2bdd8394::shutdown(); nativeHandle=0
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x0000000135684d84, pid=6686, tid=5891
#
# JRE version: OpenJDK Runtime Environment Corretto-11.0.10.9.1 (11.0.10+9) (build 11.0.10+9-LTS)
# Java VM: OpenJDK 64-Bit Server VM Corretto-11.0.10.9.1 (11.0.10+9-LTS, mixed mode, tiered, compressed oops, g1 gc, bsd-amd64)
# Problematic frame:
# C  [AWSCRT_16415915833975983087226254389958libaws-crt-jni.dylib+0x9d84]  Java_software_amazon_awssdk_crt_http_HttpClientConnection_httpClientConnectionShutdown+0x4
#
# No core dump will be written. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /Users/todaaron/sandbox/IdeaProjects/sdk-sandbox/hs_err_pid6686.log
#
# If you would like to submit a bug report, please visit:
#   https://github.com/corretto/corretto-11/issues/
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
#
./run.sh: line 6:  6686 Abort trap: 6           java -Daws.crt.log.level=Info -Daws.crt.debugnative=false -Dorg.slf4j.simpleLogger.defaultLogLevel=DEBUG -Dorg.slf4j.simpleLogger.showDateTime=true -Dfile.encoding=UTF-8 -Daws.bucket=aaron22-throwaway -cp ./build/libs/sdk-sandbox-1.0-SNAPSHOT-all.jar aws.sdk.kotlin.example.MainKt

These two handlers were meant to handle exceptions at different points in the execution of a request, to ensure the connection is always returned to the pool. It turns out a single exception can trigger both paths (since one handler is registered on the coroutine job). Depending on the state of the connection, we can end up calling shutdown() on an already released connection.
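The failure mode, and the shape of a fix, can be illustrated with a small self-contained sketch (hypothetical code, not the actual SDK internals): a single cancellation triggers both a catch block and a job completion handler, and an atomic guard makes close idempotent so the second path becomes a no-op instead of a double release.

```kotlin
import java.util.concurrent.atomic.AtomicBoolean
import java.util.concurrent.atomic.AtomicInteger
import kotlinx.coroutines.*

// Hypothetical stand-in for the native CRT connection: releasing it twice
// is what produces the segfault described in this issue.
class FakeConnection {
    val releaseCount = AtomicInteger(0)
    private val closed = AtomicBoolean(false)

    fun close() {
        // Idempotency guard: only the first caller actually releases the
        // underlying resource; later callers are no-ops.
        if (closed.compareAndSet(false, true)) {
            releaseCount.incrementAndGet()
        }
    }
}

fun main() = runBlocking {
    val conn = FakeConnection()
    val job = launch {
        try {
            awaitCancellation()          // simulate an in-flight request
        } catch (e: CancellationException) {
            conn.close()                 // path 1: the exception handler
            throw e
        }
    }
    job.invokeOnCompletion { conn.close() }  // path 2: the completion handler
    job.cancelAndJoin()
    // Both paths ran, but the guard ensured exactly one real release.
    check(conn.releaseCount.get() == 1)
}
```

Without the `compareAndSet` guard, both handlers would release the connection, which is the double shutdown()-after-close() seen in the logs above.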

aajtodd added a commit that referenced this issue Jan 10, 2022
fixes: #413

Fixes the segfault that can happen when an exception is handled twice,
leading to a connection being closed after it has been freed. This
change refactors the connection close logic so it is handled in a
single place, regardless of why the connection is being closed.
aajtodd self-assigned this Jan 10, 2022
@github-actions

⚠️COMMENT VISIBILITY WARNING⚠️

Comments on closed issues are hard for our team to see.
If you need more assistance, please either tag a team member or open a new issue that references this one.
If you wish to keep having a conversation with other community members under this issue feel free to do so.


aajtodd commented Jan 13, 2022

This should be fixed in 0.10.1-beta, which is now available. If you still have any problems, please open a new issue.
