MQTT5 PKCS#11 timeout with multiple connections when using CA #512

Closed
mqllber opened this issue Sep 13, 2023 · 14 comments
mqllber commented Sep 13, 2023

Describe the bug

When using MQTT5 with PKCS#11 and a CA certificate, the second connection raises TimeoutError. Everything works as expected when no CA certificate is used.
If the line client = None is commented out, the loop instead fails on round 1024, possibly related to the tpm2-pkcs11 session limit.

Library versions used:
aws-iot-device-sdk-python-v2: 1.19.0
aws-crt-python: 0.19.1
aws-c-mqtt: 0.9.5
aws-c-io: 0.13.32
aws-c-common: 0.9.3

Expected Behavior

Loop should run indefinitely without exceptions

Current Behavior

TimeoutError

Reproduction Steps

https://gist.github.com/mqllber/d8fe772668aaf163f89e5b8b848a62fd
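For reference, a minimal sketch of the kind of loop described (a hypothetical reconstruction, not the gist itself: the endpoint, file paths, PIN, and PKCS#11 labels are placeholders):

# Hypothetical reconstruction of the reproduction loop; all paths, labels,
# the PIN, and the endpoint below are placeholders.
from awscrt import io
from awsiot import mqtt5_client_builder

pkcs11_lib = io.Pkcs11Lib(file="/usr/lib/pkcs11/libtpm2_pkcs11.so")

while True:
    client = mqtt5_client_builder.mtls_with_pkcs11(
        pkcs11_lib=pkcs11_lib,
        user_pin="1234",
        token_label="my-token",
        private_key_label="my-key",
        cert_filepath="/path/to/device.pem.crt",
        ca_filepath="/path/to/AmazonRootCA1.pem",  # removing this avoids the timeout
        endpoint="<endpoint>-ats.iot.eu-north-1.amazonaws.com",
        client_id="test-client",
    )
    client.start()
    ...  # wait for the connection, then stop
    client.stop()
    client = None  # commenting this out makes the loop fail on round 1024 instead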

Possible Solution

No response

Additional Information/Context

No response

SDK version used

1.19.0

Environment details (OS name and version, etc.)

Yocto 4.0.9

@mqllber mqllber added bug This issue is a bug. needs-triage This issue or PR still needs to be triaged. labels Sep 13, 2023
@yasminetalby yasminetalby self-assigned this Sep 13, 2023
@yasminetalby
Contributor

Hello @mqllber ,

Thank you very much for your submission.
Could you please provide the Trace logs associated with both use cases to help us better investigate the behavior you are experiencing?

(Reminder to make sure to remove any sensitive information before submitting the logs as this is a public platform)
Best regards,

Yasmine

@yasminetalby yasminetalby added response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 2 days. and removed needs-triage This issue or PR still needs to be triaged. labels Sep 13, 2023
@xiazhvera
Contributor

You can enable logging by adding the following lines to your source:

from awscrt import io

# io.LogLevel.Trace is also available for more verbose output
io.init_logging(io.LogLevel.Debug, <log_file_path>)

@github-actions github-actions bot removed the response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 2 days. label Sep 13, 2023
@mqllber
Author

mqllber commented Sep 14, 2023

@xiazhvera
Contributor

xiazhvera commented Sep 14, 2023

Hi @mqllber, thank you for the logs! I think there is possibly a race condition here.
Looking at the logs, the library was trying to close the previous PKCS11 session while the new session had already been initialized.
As the two sessions were using the same token/slot, I suspect that closing the previous session affects the new session.

// creation of session 72057594037927938, using the same slot as the previous session
[DEBUG] [2023-09-14T05:22:49Z] [0000ffff953bd020] [pkcs11] - id=0xaaaad96b8390 session=72057594037927938: Session opened on slot 1
[DEBUG] [2023-09-14T05:22:49Z] [0000ffff953bd020] [pkcs11] - id=0xaaaad96b8390 session=72057594037927938: User was already logged in
// close session 72057594037927937 (the old session)
[DEBUG] [2023-09-14T05:22:49Z] [0000ffff90edf120] [pkcs11] - id=0xaaaad96b8390 session=72057594037927937: Session closed

For the second log, where we didn't destroy the client at the end of the loop, the old client was terminated at a later point in time; the old session was therefore also closed later, after the new client had finished initialization, so the new client was able to connect. (According to the log, the connection failed afterwards for a different reason, which is not relevant to the PKCS11 session issue here.)

[DEBUG] [2023-09-14T05:41:40Z] [0000ffff8d162020] [pkcs11] - id=0xaaaaf2f05c80 session=72057594037927937: Session opened on slot 1
[DEBUG] [2023-09-14T05:41:40Z] [0000ffff8d162020] [pkcs11] - id=0xaaaaf2f05c80 session=72057594037927937: User was already logged in

// START TO SETUP THE NEW CLIENT
...
[DEBUG] [2023-09-14T05:41:40Z] [0000ffff8d162020] [channel-bootstrap] - id=0xaaaaf3025420: acquiring bootstrap reference
[DEBUG] [2023-09-14T05:41:40Z] [0000ffff8d162020] [mqtt5-general] - id=0xaaaaf3157780: aws_mqtt5_client_options_storage host name set to <redacted>-ats.iot.eu-north-1.amazonaws.com
[DEBUG] [2023-09-14T05:41:40Z] [0000ffff8d162020] [mqtt5-general] - id=0xaaaaf3157780: aws_mqtt5_client_options_storage port set to 8883
...
// SKIPPED THE CLIENT SETUP LOGS....

[DEBUG] [2023-09-14T05:41:40Z] [0000ffff7bfff120] [pkcs11] - id=0xaaaaf2f05c80 session=72057594037927938: Session closed

I think closing the previous session could affect the new session, as they are using the same token and slot. However, I would need to dig deeper into the PKCS11 library to find out the details.

As a temporary workaround, could you try putting the program to sleep for a couple of seconds after destroying the client, so that the PKCS11 session is closed before the next loop iteration starts?

...
while True:
    ...

    client = None
    time.sleep(1)  # give the PKCS11 session time to close; requires `import time`

@mqllber
Author

mqllber commented Sep 15, 2023

Hi @xiazhvera ,

I tested with the sleep added and it is working beyond the previous 1024 limit, so both cases work with the delay added. With the delay it doesn't matter whether the client = None line is commented out or not.
Note that without a CA certificate both cases work without the delay, and the presence of the client = None line doesn't affect it.

@xiazhvera
Contributor

Hi @mqllber, I did more investigation into the issue.
I checked the custom CA file, and the CA files are loaded correctly. My guess is that with a custom CA file, the SDK has to read/process the file. That extra I/O might affect performance on the target platform, which leads to the race condition on PKCS11.

As we found in the logs, it is a race condition where the old pkcs11 session is closed after the new session has opened, which eventually closes the new session as well (as the two sessions share the same token/slot). The behavior of C_CloseSession depends on the 3rd-party PKCS11 library, so we cannot do much about it.

The workaround, while not ideal, could serve as a temporary compromise for now.

@xiazhvera
Contributor

Additionally, may I ask what the use case for this sample is? Given that the loop is not a typical use case we usually encounter, we could probably provide another workaround for it.

@xiazhvera
Contributor

I did more investigation into the tpm2 library. It seems that C_CloseSession would not log out the token if another session is open, yet we failed on C_SignInit. I opened an issue with the tpm2 library to see if they have further context on it.
Meanwhile, we did not see the issue while testing against the softhsm2 library. Would you like to try softhsm2 so that we can narrow down whether the issue is related to the 3rd-party library?

@mqllber
Author

mqllber commented Sep 27, 2023

Hi @xiazhvera,

The use case for this sample is to torture-test the library; it doesn't represent a real-world use case.
Testing with softhsm2 isn't possible right now; I may revisit this later. I think the issue is related to the tpm2 library.

When torture-testing the library with a use case closer to the real world, I found another instance where the library misbehaves. The use case is:

  • Connect and subscribe, then publish every second
  • All Future.result() timeouts are set to 5 seconds. If we hit any of those, disconnect and start over with a new connection (see the sketch after this list)
  • Simultaneously, the internet connection is toggled up/down with a variable 0-15 second delay

After approx. 2500-2700 connections I get RuntimeError: 34 (AWS_ERROR_INVALID_ARGUMENT): An invalid argument was passed to a function. from mqtt5_client_builder.mtls_with_pkcs11(). Here is the debug log from the library: https://gist.github.com/mqllber/908e2273c8f481769ac5003edd8fee62
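For reference, a rough sketch of the test pattern described above (a hypothetical reconstruction; the topic name, packets, and callback wiring are placeholders, not taken from the actual test code):

# Hypothetical reconstruction of the torture test; builder kwargs and the
# topic name are placeholders.
import threading
import time
from concurrent.futures import TimeoutError as FutureTimeoutError
from awscrt import mqtt5
from awsiot import mqtt5_client_builder

while True:
    connected = threading.Event()
    client = mqtt5_client_builder.mtls_with_pkcs11(
        # ... same PKCS#11/CA/endpoint kwargs as in the reproduction sketch ...
        on_lifecycle_connection_success=lambda data, ev=connected: ev.set(),
    )
    try:
        client.start()
        if not connected.wait(timeout=5):
            raise FutureTimeoutError()
        client.subscribe(mqtt5.SubscribePacket(subscriptions=[
            mqtt5.Subscription(topic_filter="test/topic", qos=mqtt5.QoS.AT_LEAST_ONCE),
        ])).result(timeout=5)
        while True:
            client.publish(mqtt5.PublishPacket(
                topic="test/topic", payload=b"ping", qos=mqtt5.QoS.AT_LEAST_ONCE,
            )).result(timeout=5)
            time.sleep(1)
    except FutureTimeoutError:
        client.stop()  # on any 5-second timeout, disconnect and start over
        client = None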

@xiazhvera
Contributor

Hi @mqllber,

Thanks for reporting the issue. At first glance at the log, it seems that pkcs11 has reached the maximum session count.

[ERROR] [2023-09-18T02:40:42Z] [0000ffffb7c18020] [pkcs11] - id=0xaaaafccaedf0: C_OpenSession() failed. PKCS#11 error: CKR_SESSION_COUNT (0x000000B1). AWS error: AWS_ERROR_PKCS11_CKR_SESSION_COUNT

A PKCS11 session is not closed until its client gets terminated. However, I didn't see the clients get TERMINATED in the log.
If a client is terminated, you would see something like the following:

[INFO] [2023-09-14T05:22:49Z] [0000ffff90edf120] [mqtt5-client] - id=0xaaaad9935b30: changing desired client state from STOPPED to TERMINATED

I guess the delay in client termination eventually causes pkcs11 to reach the session limit.
Is it possible not to start over with a new connection? You can directly start a new session with the same client after you disconnect; the client will start a new mqtt session. In that case you would not need to worry about multiple pkcs11 sessions, as the single client keeps using the same pkcs11 session.

client.start()
client.stop()

# start again
client.start()
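A slightly fuller sketch of this single-client pattern (again a hypothetical reconstruction; the builder kwargs are the same placeholders as in the reproduction sketch):

# One client, many MQTT sessions: the PKCS11 session is opened once and reused.
client = mqtt5_client_builder.mtls_with_pkcs11(
    # ... same PKCS#11/CA/endpoint kwargs as in the reproduction sketch ...
)

while True:
    client.start()
    # ... connect, subscribe, publish ...
    client.stop()
    # no `client = None` here: the same client (and pkcs11 session) is reused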

I will take a deeper look into this after I finish my current task (it will probably be 1-2 weeks).

@mqllber
Author

mqllber commented Sep 28, 2023

Hi @xiazhvera,

I think you're on the right track with clients not getting terminated: on one test run, there were 1566 occurrences of changing desired client state from STOPPED to TERMINATED and it failed on the 2590th connection. 2590 - 1566 = 1024, which is the session limit in tpm2-pkcs11.
Other test runs had similar numbers: the connection failed after 1024 clients that didn't get terminated.

This leads to a question: how is it that only some clients get terminated?

@xiazhvera
Contributor

The clients should all be terminated, but I think there are delays in the termination process.

The client termination process does the following: 1. disconnect from the server, 2. clean up all pending operations (the failed publishes in this case). As you mentioned that the internet connection is toggled up/down, the client is possibly waiting for an operation timeout, or it failed to send the disconnect packet, which eventually causes the delay in the client termination. I would need to test it out and locate the issue there.

Another thing to consider is that the termination process starts when the client is destroyed by Python garbage collection. It is possible that Python delays the destruction of the clients. If that is the case, we cannot do much here. (Of course, in my experience, Python usually does a good job.) Still, I need to run some tests to confirm whether that is the case here.
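If garbage-collection timing is the suspect, one way to take it out of the equation (a generic Python sketch, not SDK-specific guidance) is to drop the reference explicitly and force a collection pass before the next iteration:

import gc

client.stop()
client = None  # drop the last reference so the client can be finalized
gc.collect()   # force a collection pass so termination starts promptly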

@xiazhvera
Contributor

Hi @mqllber,

Though I was not able to reproduce your situation locally, I ran some tests over unsuccessful publish operations and Python garbage collection. Here is what I found.

  1. Python garbage collection does not show a clear delay in client destruction. The client usually gets destroyed immediately after the client reference is cleared. (Of course, Python behavior might vary between versions and architectures, but at least in my environment garbage collection is not the blocker. I'm testing with Python 3.11 on macOS 13.)

  2. When I have an unstable network and unsuccessful publishes, the termination does get deferred a little. That is because the client needs to clean up the incomplete operations before it is terminated, and the clean-up can take some time. If that is the case, waiting for on_lifecycle_stopped before you create the new client might help (see the sketch below), since when the client stops, it cleans up the unsuccessful operations.
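A minimal sketch of waiting for on_lifecycle_stopped before creating the next client (the callback name comes from the SDK samples; the 10-second wait is an arbitrary placeholder):

import threading

stopped = threading.Event()

def on_lifecycle_stopped(lifecycle_stopped_data):
    stopped.set()  # fires once the client has fully stopped

client = mqtt5_client_builder.mtls_with_pkcs11(
    # ... same PKCS#11/CA/endpoint kwargs as in the reproduction sketch ...
    on_lifecycle_stopped=on_lifecycle_stopped,
)
client.start()
# ... use the client ...
client.stop()
stopped.wait(timeout=10)  # don't create the next client until this one has stopped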

As I was not able to reproduce the issue locally, it would also be helpful if you could share your full log so that I can see the details of client termination and publishes in your tests. (The current log does not show any termination- or publish-related info.)
Let me know if this helps.

@jmklix jmklix added response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 2 days. p2 This is a standard priority issue labels Oct 27, 2023
@github-actions

Greetings! It looks like this issue hasn’t been active in longer than a week. We encourage you to check if this is still an issue in the latest release. Because it has been longer than a week since the last update on this, and in the absence of more information, we will be closing this issue soon. If you find that this is still a problem, please feel free to provide a comment or add an upvote to prevent automatic closure, or if the issue is already closed, please feel free to open a new one.

@github-actions github-actions bot added closing-soon This issue will automatically close in 5 days unless further comments are made. closed-for-staleness and removed closing-soon This issue will automatically close in 5 days unless further comments are made. labels Oct 27, 2023
@github-actions github-actions bot closed this as completed Nov 1, 2023