Stop dropping reads on the floor if more than one happens at once. #15338
Merged: yunhanw-google merged 1 commit into project-chip:master from bzbarsky-apple:stop-dropping-reads on Feb 18, 2022
Conversation
bzbarsky-apple requested review from msandstedt, mrjerryjohns and yunhanw-google on February 18, 2022 05:51
pullapprove bot requested review from andy31415, anush-apple, austinh0, balducci-apple, Byungjoo-Lee, carol-apple, chrisdecenzo, chshu, chulspro, Damian-Nordic, dhrishi, electrocucaracha, emargolis, franck-apple, gjc13, hawk248, holbrookt, harsha-rajendran, isiu-apple, jelderton, jepenven-silabs, jmartinez-silabs, kghost and lazarkov on February 18, 2022 05:52
pullapprove bot requested review from LuDuda, lzgrablic02, sagar-apple, saurabhst, selissia, tecimovic, turon, vijs, vivien-apple, wbschiller, woody-apple, xylophone21 and yufengwangca on February 18, 2022 05:52
PR #15338: Size comparison from d7bcb65 to 1154ed4 Increases (37 builds for cyw30739, efr32, esp32, k32w, linux, mbed, nrfconnect, p6, qpg, telink)
Full report (43 builds for cyw30739, efr32, esp32, k32w, linux, mbed, nrfconnect, p6, qpg, telink)
yunhanw-google approved these changes on Feb 18, 2022
jmartinez-silabs approved these changes on Feb 18, 2022
Fixes #15304
The fix for #15304 is the change to the while-loop condition in
Engine::Run. Before that change, we compared numReadHandled to the
current count of allocated read handlers. But processing a read
handler deallocates it, so we would only end up processing half of the
read handlers that were outstanding on entry to Run: partway through,
numReadHandled would exceed the shrinking number still allocated and
the loop would exit early.
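To make the early exit concrete, here is a minimal standalone simulation (not the actual Engine::Run code; the container and the "fixed" variant are illustrative only, and the real new loop condition may differ):

```cpp
#include <cstdio>
#include <vector>

int main()
{
    // Six pending read handlers; "processing" one deallocates it.
    std::vector<int> readHandlers = { 1, 2, 3, 4, 5, 6 };
    size_t numReadHandled         = 0;

    // Old-style condition: compare the running counter against the *current*
    // allocated count, which shrinks as handlers are processed.
    while (!readHandlers.empty() && numReadHandled < readHandlers.size())
    {
        readHandlers.erase(readHandlers.begin());
        numReadHandled++;
    }
    std::printf("old condition handled %u of 6\n", static_cast<unsigned>(numReadHandled)); // 3

    // One possible fix: snapshot the number outstanding on entry instead.
    readHandlers             = { 1, 2, 3, 4, 5, 6 };
    const size_t numToHandle = readHandlers.size();
    numReadHandled           = 0;
    while (!readHandlers.empty() && numReadHandled < numToHandle)
    {
        readHandlers.erase(readHandlers.begin());
        numReadHandled++;
    }
    std::printf("fixed condition handled %u of 6\n", static_cast<unsigned>(numReadHandled)); // 6
    return 0;
}
```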
The change in InteractionModelEngine::OnDone and the management of
mRunningReadHandler handle a slightly more complicated situation I ran
into while writing tests for this code. Suppose at least two
subscriptions, followed by some number of reads, are pending when
Engine::Run is entered. When we processed the first read, it would be
deallocated, mCurReadHandlerIdx would get reset to 0 and then
incremented by 1, and we would end up walking all but the first
subscription again. That means numReadHandled would increase to the
point where the loop terminated before we had gotten to all the read
handlers: with N subscriptions we would miss N-1 read handlers. Those
could then get stuck in limbo indefinitely, until a subscription
heartbeat happened or someone else issued a read.
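This second effect can also be simulated in isolation. In the toy model below, 'S' stands for a subscription and 'R' for a pending read; the walk index resets to 0 when a read is deallocated and is then incremented, so every subscription except the first gets walked again and the handled counter runs out with reads still queued. This only illustrates the behavior described above and is not the engine's real iteration logic:

```cpp
#include <cstdio>
#include <string>

int main()
{
    std::string handlers  = "SSRRR"; // 2 subscriptions followed by 3 pending reads
    const size_t toHandle = handlers.size();
    size_t numHandled     = 0;
    size_t idx            = 0; // stand-in for mCurReadHandlerIdx

    while (numHandled < toHandle && idx < handlers.size())
    {
        if (handlers[idx] == 'R')
        {
            handlers.erase(idx, 1); // read processed -> handler deallocated
            idx = 0;                // index reset on deallocation...
        }
        numHandled++;
        idx++;                      // ...then incremented, so only slot 0 is skipped
    }

    size_t readsStuck = 0;
    for (char c : handlers)
    {
        if (c == 'R')
        {
            readsStuck++;
        }
    }
    // With N = 2 subscriptions, N - 1 = 1 read is left stuck.
    std::printf("reads still stuck after the run: %u\n", static_cast<unsigned>(readsStuck));
}
```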
The change in Engine::OnReportConfirm handles the case where more than
CHIP_IM_MAX_REPORTS_IN_FLIGHT subscriptions all need reporting: we
would fire off the first CHIP_IM_MAX_REPORTS_IN_FLIGHT of them and
then never call ScheduleRun() again, so all the other subscriptions
got stuck.
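One plausible shape of that change, as a hedged sketch (the class and member names below are illustrative stand-ins, not the real reporting engine API; only CHIP_IM_MAX_REPORTS_IN_FLIGHT corresponds to an actual constant, mimicked here by a local one): when a confirm frees an in-flight slot and we had previously been at the cap, schedule another engine run so the deferred subscriptions eventually get serviced.

```cpp
#include <cstdint>
#include <cstdio>

class SketchReportingEngine
{
public:
    // Local stand-in for CHIP_IM_MAX_REPORTS_IN_FLIGHT.
    static constexpr uint32_t kMaxReportsInFlight = 4;

    void OnReportSent() { mNumReportsInFlight++; }

    void OnReportConfirm()
    {
        // A report just completed, freeing one "in flight" slot.
        mNumReportsInFlight--;

        // If we were at the cap before this confirm, some subscriptions may
        // have been skipped earlier; without an explicit ScheduleRun() here,
        // nothing would ever come back for them.
        if (mNumReportsInFlight == kMaxReportsInFlight - 1)
        {
            ScheduleRun();
        }
    }

private:
    void ScheduleRun() { std::puts("queued another Engine::Run() pass"); }

    uint32_t mNumReportsInFlight = 0;
};

int main()
{
    SketchReportingEngine engine;
    for (uint32_t i = 0; i < SketchReportingEngine::kMaxReportsInFlight; i++)
    {
        engine.OnReportSent(); // reach the in-flight cap
    }
    engine.OnReportConfirm();  // prints: queued another Engine::Run() pass
}
```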
The issue with CHIP_IM_MAX_REPORTS_IN_FLIGHT was not being caught by
the existing tests because those tests manually called Run() on the
reporting engine (in a loop, in fact). That was needed because we
could end up in a situation where DrainAndServiceIO() had processed
all the pending messages, but a task queued by ScheduleRun() was not a
message and would never run, so Engine::Run was not getting called via
the "normal" codepath in the test. This was fixed by removing all the
manual Run() calls and making DrainAndServiceIO() do a better job of
handling async work queued by message reception.
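The DrainAndServiceIO() point is easiest to see as a queue-draining loop. The sketch below is purely illustrative (the real helper lives in the CHIP test support code and drives an actual network and event loop): the important property is that it keeps alternating between pending messages and queued non-message tasks until both are empty, so work queued via ScheduleRun() during message handling is not left behind.

```cpp
#include <functional>
#include <queue>

// Toy stand-in for the test I/O context described above.
struct FakeIOContext
{
    std::queue<std::function<void()>> messages; // simulated inbound messages
    std::queue<std::function<void()>> tasks;    // simulated event-loop tasks (e.g. from ScheduleRun)

    void DrainAndServiceIO()
    {
        // Servicing a message can enqueue a task and vice versa, so keep
        // going until *both* queues are empty, not just the message queue.
        while (!messages.empty() || !tasks.empty())
        {
            if (!messages.empty())
            {
                auto work = std::move(messages.front());
                messages.pop();
                work();
            }
            else
            {
                auto work = std::move(tasks.front());
                tasks.pop();
                work();
            }
        }
    }
};

int main()
{
    FakeIOContext ctx;
    // A "message" whose handler queues a non-message task, like ScheduleRun() would.
    ctx.messages.push([&ctx] { ctx.tasks.push([] { /* deferred Engine::Run() */ }); });
    ctx.DrainAndServiceIO(); // drains the message *and* the task it queued
}
```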
TestReadAttributeTimeout had to be modified slightly: in the new setup
we could no longer rely on the DrainAndServiceIO() call after sending
the reads leaving the reports untriggered, which previously gave us a
chance to tear down the session before any reports went out. So
instead of first expiring the client-to-server session, we expire the
server-to-client one before doing DrainAndServiceIO. This ensures that
we never get replies to our reads; we can then expire the
client-to-server session, which gets treated as a timeout.
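The reordered test flow, as a compilable sketch; the types and helper names below are hypothetical stand-ins that only mirror the steps described above, not the actual test harness API.

```cpp
#include <cstdio>

// Minimal stand-in types so the sketch compiles; the real test uses the CHIP
// test messaging context and its session-expiry helpers (exact names differ).
struct TestContext
{
    void ExpireServerToClientSession() { std::puts("server->client session expired"); }
    void ExpireClientToServerSession() { std::puts("client->server session expired"); }
    void DrainAndServiceIO()           { std::puts("drained pending messages and tasks"); }
};

static void SendReadRequests(TestContext &) { std::puts("reads sent"); }

// Sketch of the reordered TestReadAttributeTimeout flow described above.
void TestReadAttributeTimeoutSketch(TestContext & ctx)
{
    SendReadRequests(ctx);

    // Expire the server-to-client session *before* draining, so the reports
    // the server generates can never be delivered back to the client.
    ctx.ExpireServerToClientSession();
    ctx.DrainAndServiceIO();

    // With no replies possible, expiring the client-to-server session now
    // causes the outstanding reads to be treated as timeouts.
    ctx.ExpireClientToServerSession();
    ctx.DrainAndServiceIO();
}

int main()
{
    TestContext ctx;
    TestReadAttributeTimeoutSketch(ctx);
}
```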
Problem
See above
Change overview
See above
Testing
Yes, that's what took up most of the time for this PR.