MQTT5 PKCS#11 timeout with multiple connections when using CA #512
Comments
Hello @mqllber, Thank you very much for your submission. (Reminder: make sure to remove any sensitive information before submitting the logs, as this is a public platform.) Yasmine
You can enable logging by adding the following lines to your source:

```python
from awscrt import io

# <log_file_path> is a placeholder: pass a file path, or 'stdout'/'stderr'
io.init_logging(io.LogLevel.Debug, <log_file_path>)
```
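For reference, `io.LogLevel` ranges from `NoLogs` up to `Trace`; `Debug` should be verbose enough to show the PKCS#11 session activity discussed below.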
Hi @yasminetalby, Here are the debug logs for both cases:
Hi @mqllber, thank you for the logs! I think this is possibly a race condition.

For the second log, where we didn't destroy the client at the end of the loop, the old client was terminated at a later point in time; therefore the old session was closed later, and the new client had finished initialization by then, so it was able to connect. (According to the log, the connection failed afterwards for a different reason, but that is not relevant to the PKCS#11 session issue here.)

I think closing the previous session could affect the new session, as they use the same token and slot. However, I would need to dig deeper into the PKCS#11 library to find out the details. As a temporary workaround, could you try putting the program to sleep for a couple of seconds after the client is destroyed, so that we can make sure the old PKCS#11 session is closed before we start the next loop?
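A minimal sketch of that workaround, assuming the builder entry points from the SDK's PKCS#11 sample (`awscrt.io.Pkcs11Lib`, `mqtt5_client_builder.mtls_with_pkcs11`); the endpoint, pin, labels, and file paths are placeholders, and the 2-second sleep is the suggested stopgap rather than a tuned value:

```python
import time
from concurrent.futures import Future

from awscrt import io
from awsiot import mqtt5_client_builder

# Illustrative placeholders throughout; substitute your own values.
pkcs11_lib = io.Pkcs11Lib(file="/usr/lib/pkcs11/libtpm2_pkcs11.so")

for i in range(2048):
    connected = Future()
    stopped = Future()

    client = mqtt5_client_builder.mtls_with_pkcs11(
        pkcs11_lib=pkcs11_lib,
        user_pin="1234",
        token_label="my-token",
        private_key_label="my-key",
        cert_filepath="/path/to/device-cert.pem",
        ca_filepath="/path/to/AmazonRootCA1.pem",
        endpoint="<your-endpoint>-ats.iot.<region>.amazonaws.com",
        client_id=f"pkcs11-loop-{i}",
        on_lifecycle_connection_success=lambda event, f=connected: f.set_result(event),
        on_lifecycle_stopped=lambda event, f=stopped: f.set_result(event),
    )

    client.start()
    connected.result(timeout=30)   # wait for a successful CONNACK

    client.stop()
    stopped.result(timeout=30)     # wait for the clean shutdown to finish

    client = None   # drop the last reference so the client can be destroyed
    time.sleep(2)   # workaround: let the old PKCS#11 session close first
```

The `time.sleep(2)` at the end of each iteration is what gives the old session time to close before the next `mtls_with_pkcs11` call opens a new one against the same token/slot.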
Hi @xiazhvera, I tested with the sleep added and it works beyond the previous 1024 limit, so both cases work with the delay added. With the delay it doesn't matter whether the `client = None` line is commented out.
Hi @mqllber, I did more investigation into the issue. As we found in the logs, it is a race condition: the old PKCS#11 session is closed after the new session is opened, which eventually closes the new session as well (as the two sessions share the same token/slot). The workaround, while not ideal, could serve as a temporary compromise for now.

Additionally, may I ask about the use case of this sample? Given that the loop is not a typical use case we usually encounter, we could probably provide another workaround for this case.
I did more investigation into the tpm2 library. It seems that …
Hi @xiazhvera, The use case for this sample is to torture test the library; it doesn't represent a real-world use case. When torture testing the library with a use case closer to the real world, I found another instance where the library misbehaves. The use case is: …

After approx. 2500-2700 connections I get …
Hi @mqllber, Thanks for reporting the issue. At first glance at the log, it seems that PKCS#11 reaches the maximum session count.

A PKCS#11 session is not closed until its client gets terminated; however, I didn't see the client reach TERMINATED in the log. I suspect the delay in client termination eventually causes PKCS#11 to hit the session limit.

I will take a deeper look into this after I finish my current task (it will probably take 1-2 weeks).
Hi @xiazhvera, I think you're on the right track with clients not getting terminated: on one test run, there were 1566 occurrences of … This leads to a question: how is it that only some clients get terminated?
The clients should all be terminated, but I think there are delays in the termination process. Client termination does the following: 1. disconnect from the server; 2. clean up all pending operations (the failed publish in this case). As you mentioned that the internet connection is toggled up/down there, the client is possibly waiting for an operation timeout, or it failed to send the DISCONNECT packet, which eventually causes the delay in client termination. I would need to test it out and locate the issue there.

Another thing to consider is that the termination process only starts when the client is destroyed by Python garbage collection. It is possible that Python delays destroying the clients; if that is the case, there is not much we can do here. (Of course, in my experience Python usually does a good job.) Still, I need to run some tests to confirm whether that is the case here; a forced collection pass, as in the sketch below, can help rule garbage collection out.
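A hedged sketch of that check, assuming it runs at the tail of each loop iteration where `client` is the MQTT5 client built earlier in the iteration; `gc.collect()` forces the collection pass instead of waiting for Python's next cycle:

```python
import gc
import time

# End of each reproduction-loop iteration, after the client has stopped.
client = None   # release the last strong reference to the mqtt5 client
gc.collect()    # force a GC pass so client destruction, and with it the
                # native termination and PKCS#11 session close, starts now
time.sleep(2)   # the earlier workaround: give the session time to close
```

If the session-limit failures disappear with the forced collection, the delay was on the Python side; if they persist, termination is stalling inside the native client (for example, waiting on an operation timeout while the connection is down).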
Hi @mqllber, Though I was not able to reproduce your situation locally, I ran some tests on unsuccessful publish operations and Python garbage collection; here is what I found: …

As I was not able to reproduce the issue locally, it would also be helpful if you could share your full log so that I can see the details of client termination and publish in your tests. (The current log does not show any termination- or publish-related info.)
Greetings! It looks like this issue hasn’t been active in longer than a week. We encourage you to check if this is still an issue in the latest release. Because it has been longer than a week since the last update on this, and in the absence of more information, we will be closing this issue soon. If you find that this is still a problem, please feel free to provide a comment or add an upvote to prevent automatic closure, or if the issue is already closed, please feel free to open a new one.
Describe the bug
When using MQTT5 with PKCS#11 and a CA certificate, the second connection attempt raises a TimeoutError. It works as expected when not using a CA certificate.
If the line `client = None` is commented out, the loop fails on round 1024, possibly related to the tpm2-pkcs11 session limit.

Library versions used:
aws-iot-device-sdk-python-v2: 1.19.0
aws-crt-python: 0.19.1
aws-c-mqtt: 0.9.5
aws-c-io: 0.13.32
aws-c-common: 0.9.3
Expected Behavior
The loop should run indefinitely without exceptions.
Current Behavior
TimeoutError
Reproduction Steps
https://gist.github.com/mqllber/d8fe772668aaf163f89e5b8b848a62fd
Possible Solution
No response
Additional Information/Context
No response
SDK version used
1.19.0
Environment details (OS name and version, etc.)
Yocto 4.0.9