-
Notifications
You must be signed in to change notification settings - Fork 634
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Disconnects and handling within library and samples #33
Comments
Hi @tim-nordell-nimbelink, The second problem that you mention will not occur. Yes, We will update the issue as soon as we have figured out a solution. These are really helpful suggestions. If you discover anything else while you are going through the code please do let us know. Thank you for using AWS IoT. Rahul |
Rahul - I discovered this while trying to port the library to an embedded environment running a STM32L4xx series microcontroller, and verified the behavior was the same in the shadow example I had ported. The second problem is a problem according to the documented interfaces, and would be unanticipated behavior by someone porting the system. (I hit this when porting it.) The iot_tls_init(..) documentation states: * Perform any initialization required by the TLS layer.
* Connects the interface to implementation by setting up
* the network layer function pointers to platform implementations. And iot_tls_destroy(...) states: /**
* @brief Perform any tear-down or cleanup of TLS layer
*
* Called to cleanup any resources required for the TLS layer. The functions feel like they're paired up as:
Since the iot_tls_init(...) call is akin to a constructor, and the iot_tls_destroy(...) call is akin to a destructor, one shouldn't expect any of the other calls to the module to be called after iot_tls_destroy(...). I certainly didn't, and it caused a null dereference in my porting attempt. I see now that the sample platform port behaves differently from the anticipated behavior from purely looking at network_interface.h. If this is expected behavior from a given port, please add a note in the documentation for the iot_tls_destroy(...) function that one can expect iot_tls_connect(...) to be called after without another iot_tls_init(...) call. Going back to the reconnect issue, the codebase is inconsistent with how it returns the error codes from iot_tls_read(...) and iot_tls_write(...). 3 of the invocations of iot_tls_read(...) will return back out the error code returned by that function call out to the callees of the parent functions - one always returns FAILURE whenever iot_tls_read(...) returns a failure of any sort. The only invocation of iot_tls_write(...) returns FAILURE from its function when an error had occurred. This inconsistency probably doesn't help with the disconnect being reported from within the iot_tls_write(...) or iot_tls_read(...) calls. As feedback, it'd be a good idea to document the expected return values from iot_tls_xxx() calls, and under what circumstances each error should be reported. Side note: The sample platform implementation doesn't return a NETWORK_DISCONNECTED_ERROR from the iot_tls_write/iot_tls_read calls upon a disconnect, but it seems like some of the codebase will attempt to react to this if they do return this, but not all of the code paths. -- Tim |
Further note about the interfaces for iot_tls: If I had been someone just looking at the interface by itself, I would have assumed that the following is a valid usage: iot_tls_init(...);
iot_tls_connect(...);
...
iot_tls_read(...);
iot_tls_write(...);
...
iot_tls_disconnect(...);
iot_tls_connect(...);
...
iot_tls_read(...);
iot_tls_write(...);
...
iot_tls_disconnect(...);
iot_tls_destroy(...); E.g. being able to disconnect and reconnect without invoking destroy(...), and without leaking. The sample implementation will probably leak with the above. -- Tim |
I've created a pull request for fixing this. I haven't thoroughly tested it, although I have verified the existing auto reconnect unit test still passes. I did not create a test case for what this fixes in the unit tests (although that'd be a good idea to do). Each commit is changing a single chunk of logic in the code so that it's hopefully easier to follow - it shouldn't create any commit by itself that won't work. |
Hi Tim, Please feel free to point out any further issues you notice. This is extremely helpful information. Rahul |
Fixed in v2.1.1 of the SDK. Closing. Rahul |
I've been testing the disconnect path by pointing to a custom MQTT broker and killing the broker on the server while it's connected. With the shadow_sample, this results in the sample exiting without reconnecting despite the reconnect flag being set within the sample. (It calls
aws_iot_shadow_set_autoreconnect_status()). The output from the sample is as follows:
The code path inside _aws_iot_mqtt_internal_yield() needs to handle the NETWORK_DISCONNECTED_ERROR error for the aws_iot_mqtt_internal_cycle_read() call, too, which is why the example application does not reconnect. The suggested change for this point is to do:
which would pass down the NETWORK_DISCONNECTED_ERROR down from the aws_iot_mqtt_internal_cycle_read() call into the handling block for this.
After making this change, however, this exposes that the code below doesn't state transition to disconnected before the client set state call just below. To fix this, I've changed the above to:
I'm still digging through the issues in the codebase around the reconnects to understand how the codebase is supposed to fit together, but thought I'd open this issue as it's quite evident with the shadow_sample and closing the connection from the server side.
The above doesn't work by itself as it exposes the following issue with the codebase, too, relating to reconnecting:
The function call path:
is problematic since the only corresponding iot_tls_init() call is within the aws_iot_mqtt_init() call. An example problem point is within _aws_iot_mqtt_keep_alive() which invokes this path, and returns expecting the caller to attempt to reconnect, but technically the caller will be operating on a destroy'd handle.)
The text was updated successfully, but these errors were encountered: