-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
SystemLayerImplSelect: fix Request/ClearCallbackOnPendingXXX() for… #21135
SystemLayerImplSelect: fix Request/ClearCallbackOnPendingXXX() for… #21135
Conversation
PR #21135: Size comparison from e58147e to 3b3b1b1 Increases (2 builds for cc13x2_26x2, cyw30739)
Decreases (15 builds for cc13x2_26x2, linux, nrfconnect, telink)
Full report (43 builds for bl602, cc13x2_26x2, cyw30739, efr32, esp32, k32w, linux, mbed, nrfconnect, p6, telink)
|
@plan44 Could you look into the test failures, please? |
3b3b1b1
to
683c97d
Compare
PR #21135: Size comparison from b647dfb to 683c97d Increases (3 builds for bl602, telink)
Decreases (3 builds for cyw30739, k32w, nrfconnect)
Full report (32 builds for bl602, cc13x2_26x2, cyw30739, efr32, esp32, k32w, linux, mbed, nrfconnect, p6, telink)
|
…spatch, clean transport For CHIP_SYSTEM_CONFIG_USE_DISPATCH (Darwin) case, Request/ClearCallbackOnPendingXXX() were not working (no loop calling select to evaluate watches), but were called in TCP and UDP endpoint implementations, despite dispatch-specific extra code for the same purpose in the endpoints. This changeset now moves all CHIP_SYSTEM_CONFIG_USE_DISPATCH specific socket watching code into SystemLayerImplSelect, by making Request/ClearCallbackOnPendingXXX() actually working with dispatch. The issue surfaced in my (a bit exotic) use case when I needed to get callbacks for events on a plain socket not wrapped as TCPEndPoint. Doing this, I realized Request/ClearCallbackOnPendingXXX() were non-functional due to missing select main loop. Fixing this revealed the similar dispatch code in the endpoints that was needed to get those events handled despite Request/ClearCallbackOnPendingXXX() not working. That extra endpoint code now became redundant (delivering events twice), so had to be removed. I am aware that this might be a too deep intervention for coming from a CHIP newbie like me (not newbie in code deep diving, though), but I think it cleans up an inconsistency in the IP endpoints.
683c97d
to
fd85659
Compare
PR #21135: Size comparison from b647dfb to fd85659 Increases (3 builds for bl602, telink)
Decreases (1 build for cyw30739)
Full report (32 builds for bl602, cc13x2_26x2, cyw30739, efr32, esp32, k32w, linux, mbed, nrfconnect, p6, telink)
|
Fast tracking platform changes |
…spatch, clean transport (#21135) For CHIP_SYSTEM_CONFIG_USE_DISPATCH (Darwin) case, Request/ClearCallbackOnPendingXXX() were not working (no loop calling select to evaluate watches), but were called in TCP and UDP endpoint implementations, despite dispatch-specific extra code for the same purpose in the endpoints. This changeset now moves all CHIP_SYSTEM_CONFIG_USE_DISPATCH specific socket watching code into SystemLayerImplSelect, by making Request/ClearCallbackOnPendingXXX() actually working with dispatch. The issue surfaced in my (a bit exotic) use case when I needed to get callbacks for events on a plain socket not wrapped as TCPEndPoint. Doing this, I realized Request/ClearCallbackOnPendingXXX() were non-functional due to missing select main loop. Fixing this revealed the similar dispatch code in the endpoints that was needed to get those events handled despite Request/ClearCallbackOnPendingXXX() not working. That extra endpoint code now became redundant (delivering events twice), so had to be removed. I am aware that this might be a too deep intervention for coming from a CHIP newbie like me (not newbie in code deep diving, though), but I think it cleans up an inconsistency in the IP endpoints.
…spatch, clean transport (#21135) (#21186) For CHIP_SYSTEM_CONFIG_USE_DISPATCH (Darwin) case, Request/ClearCallbackOnPendingXXX() were not working (no loop calling select to evaluate watches), but were called in TCP and UDP endpoint implementations, despite dispatch-specific extra code for the same purpose in the endpoints. This changeset now moves all CHIP_SYSTEM_CONFIG_USE_DISPATCH specific socket watching code into SystemLayerImplSelect, by making Request/ClearCallbackOnPendingXXX() actually working with dispatch. The issue surfaced in my (a bit exotic) use case when I needed to get callbacks for events on a plain socket not wrapped as TCPEndPoint. Doing this, I realized Request/ClearCallbackOnPendingXXX() were non-functional due to missing select main loop. Fixing this revealed the similar dispatch code in the endpoints that was needed to get those events handled despite Request/ClearCallbackOnPendingXXX() not working. That extra endpoint code now became redundant (delivering events twice), so had to be removed. I am aware that this might be a too deep intervention for coming from a CHIP newbie like me (not newbie in code deep diving, though), but I think it cleans up an inconsistency in the IP endpoints. Co-authored-by: Lukas Zeller <[email protected]>
#if CHIP_SYSTEM_CONFIG_USE_DISPATCH | ||
if (watch->mWrSource) | ||
{ | ||
dispatch_resume(watch->mWrSource); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this is right... For example, TCP will call RequestCallbackOnPendingWrite
for every single packet, without doing any ClearCallbackOnPendingWrite
in between. So we will have multiple dispatch_resume
calls when the dispatch source is already resumed. You're not allowed to do that, per the API documentation.
I have not carefully checked for other failure modes in TCP and UDP here, but fundamentally assuming that their calls to RequestCallback* and ClearCallback* will be ordered in a sane way is not reasonable: those calls have an API designed around a single boolean flag, not a counter. So we need to keep track with a boolean flag, and ensure our counter is only ever 0 or 1.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, I will update the change accordingly.
@plan44 Can you please fix the issue in #21135 (review) ? |
…learCallback As pointed out in project-chip#21135, the previous version was not safe when calls to RequestCallbackXX and ClearCallbackXX were not balanced, and there's no reason to believe API clients would balance them. This version does no longer try to suspend the source (because that could be problematic when done from within callbacks) but now follows the original semantics of just changing mPendingIO flags to enable/disable callback, with the exception of requesting the callback for the first time (which triggers creation of the dispatch source). WIP: more detailed testing shows: - TCPtest from src/transport/raw/tests never bothers to set up a dispatch queue in the first place, so fails. - when NetworkTestHelpers.cpp is modified to setup a dispatch queue, TCPtest succeeds, crashes or fails randomly - TestSystemWakeEvent from src/system/tests succeeds despite not setting up a dispatch queue. Apparently, the test result does not depend on the handlers set up actually returning events. - all other tests (run on macos, `ninja -C out/host check`) pass As making use of Request/ClearCallbackXXX without having a dispatch queue installed points to a problem (most likely only occurring in tests with only partial system init), I added a ChipLogError to make this visible.
@bzbarsky-apple Turns out to be much more difficult than I thought... I tried to fix and test the issue you pointed out. I tried several variants, but did not get TCPtest from src/transport/raw/tests producing reliable results (random crash/success/failure on every invocation). It does so in original form (where it does not bother to set up a dispatch queue at all) and also when I patched NetworkTestHelpers.cpp to install a dispatch queue. All other tests are fine (run on macOS). First I just made sure suspend/resume remains balanced in SystemLayerImplSelect as you suggested. This is as near as I could get to the original semantics where Request/ClearCallback just change flags - only the first RequestCallback installs the dispatch source. As I do not want to trigger the whole PR machinery with this WIP, this is on a separate branch in my fork for now: https://github.com/plan44/connectedhomeip/tree/PR/fix_socket_event_watching_WIP The failures in TCPtest seem more like a test artifact than a real problem to me, but I definitely don't understand the stack well enough yet to be sure. |
@plan44 When running the unit tests, there is no dispatch queue (long story). So the Basically, when there is no dispatch queue that code is supposed to fall back on the non-dispatch "just do sockets" code.... |
I can imagine ;-) Trying to debug I realized that the PlatformManager would be needed to properly set it up, firing up half of the stack for a test that should be simple.
Ok, thanks. That helped me to understand the problem: In my previous attempt I intentionally only set the flag when the source could actually be installed. This broke the select()-based DriveIO() fallback. The new version now just always sets the flag, and then tries to install the source, but when no dispatch is available it just outputs the warning. In non-test cases, the warning should definitely never appear, in tests it may, depending on the setup (and I changed the wording to make that clear). See new PR #21349 |
Yep, and then the rabbit hole gets deep. :) #21349 looks great, thank you! |
…spatch, clean transport (project-chip#21135) For CHIP_SYSTEM_CONFIG_USE_DISPATCH (Darwin) case, Request/ClearCallbackOnPendingXXX() were not working (no loop calling select to evaluate watches), but were called in TCP and UDP endpoint implementations, despite dispatch-specific extra code for the same purpose in the endpoints. This changeset now moves all CHIP_SYSTEM_CONFIG_USE_DISPATCH specific socket watching code into SystemLayerImplSelect, by making Request/ClearCallbackOnPendingXXX() actually working with dispatch. The issue surfaced in my (a bit exotic) use case when I needed to get callbacks for events on a plain socket not wrapped as TCPEndPoint. Doing this, I realized Request/ClearCallbackOnPendingXXX() were non-functional due to missing select main loop. Fixing this revealed the similar dispatch code in the endpoints that was needed to get those events handled despite Request/ClearCallbackOnPendingXXX() not working. That extra endpoint code now became redundant (delivering events twice), so had to be removed. I am aware that this might be a too deep intervention for coming from a CHIP newbie like me (not newbie in code deep diving, though), but I think it cleans up an inconsistency in the IP endpoints.
…dispatch case, clean transport
Problem
For CHIP_SYSTEM_CONFIG_USE_DISPATCH (Darwin) case, Request/ClearCallbackOnPendingXXX() were not working (no loop calling select to evaluate watches).
But still, these methods were called in TCP and UDP endpoint implementations, despite dispatch-specific extra code for the same purpose in the endpoints.
Change overview
This changeset now moves all CHIP_SYSTEM_CONFIG_USE_DISPATCH specific socket watching code into SystemLayerImplSelect, by making Request/ClearCallbackOnPendingXXX() actually working with dispatch.
The issue surfaced in my (a bit exotic) use case when I needed to get callbacks for events on a plain socket not wrapped as TCPEndPoint.
Doing this, I realized Request/ClearCallbackOnPendingXXX() were non-functional due to the missing select main loop.
Fixing this revealed the similar dispatch code in the endpoints that was needed to get those events handled despite Request/ClearCallbackOnPendingXXX() not working. That dispatch specific endpoint code now became redundant (delivering events twice), so had to be removed.
Testing
I am aware that this might be a too deep intervention and too limited testing for coming from a CHIP newbie like me (not newbie in code deep diving, though), but I think it cleans up an inconsistency in the IP endpoints, and maybe can be of use after review by CHIP experts...