Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CI]: Fix random segmentation fault during event handling #9010

Merged
merged 2 commits into from
Aug 18, 2021
Merged

[CI]: Fix random segmentation fault during event handling #9010

merged 2 commits into from
Aug 18, 2021

Conversation

yufengwangca
Copy link
Contributor

@yufengwangca yufengwangca commented Aug 13, 2021

Problem

What is being fixed? Examples:

  • We have refactored system event handling mechanism recently, and observed the following seg fault in CI check.
python3 ../../third_party/pigweed/repo/pw_build/py/pw_build/python_runner.py --gn-root ../../ --current-path ../../src/transport/raw/tests --default-toolchain=//build/toolchain/host:linux_x64_gcc --current-toolchain=//build/toolchain/host:linux_x64_gcc --touch gen/src/transport/raw/tests/TestTCP_run.pw_pystamp --capture-output -- ../../third_party/pigweed/repo/pw_unit_test/py/pw_unit_test/test_runner.py --runner ../../third_party/pigweed/repo/targets/host/run_test --test tests/TestTCP
INF Test 1/1: [ RUN] TestTCP
ERR ../../third_party/pigweed/repo/targets/host/run_test exited with status 139
OUT [11339]
'#0:','Test-CHIP-Tcp'
'#2:','Setup                     ','PASSED'
'#3:','Simple Init Test IPV4     ','PASSED'
[1628696150.253394][11341:11341] CHIP:IN: TransportMgr initialized
CHIP node ready to service events
[1628696150.254463][11341:11341] CHIP:IN: Freeing connection: connection closed by peer
Segmentation fault (core dumped)
INF Test 1/1: [FAIL] TestTCP

The root cause is the WatchableSocket could be changed by InvokeCallback() which might close the socket and disassociate its file descriptor.

    for (WatchableSocket * watchable = mAttachedSockets; watchable != nullptr; watchable = watchable->mAttachedNext)
    {
        if (watchable->mPendingIO.HasAny())
        {
            watchable->InvokeCallback();
        }
    }

Log should the watchable->mAttachedNext contain garbage value after InvokeCallback. We should not use its value to access the next element in the watchableSocket list.

[1628957118.857855][12770:12770] CHIP:IN: HandleEvents:watchable:140447941584200:364
[1628957118.857858][12770:12770] CHIP:IN: HandleEvents:watchable->mAttachedNext:140447941585480:364
[1628957118.857862][12770:12770] CHIP:IN: InvokeCallback
[1628957118.858559][12770:12770] CHIP:IN: Freeing connection: connection closed by peer
[1628957118.858569][12770:12770] CHIP:IN: ActiveConnectionState:free
[1628957118.858596][12770:12770] CHIP:IN: WatchableSocket::OnRelease
[1628957118.858807][12770:12770] CHIP:IN: HandleEvents:watchable:140447941584200:365
[1628957118.858814][12770:12770] CHIP:IN: HandleEvents:watchable->mAttachedNext:-2314885530818453537:365

Change overview

Move to next watchable before we do InvokeCallback() in case InvokeCallback() close watchable object.

Testing

How was this tested? (at least one bullet point required)

  • Verified by CI

@andy31415
Copy link
Contributor

Please update the summary - I think it preserves some template text (says "Frobnozzle is leaky" and such)

@andy31415
Copy link
Contributor

Could you also update the summary as to 'why does this segfault without the change'?

@bzbarsky-apple
Copy link
Contributor

But also, I'd like to understand the crash scenario. We called InvokeCallback. That closed watchable. But then we move on to the next watchable... so why the crash? Is it that the next watchable got deleted but we still have a stale pointer to it? Something else?

@yufengwangca
Copy link
Contributor Author

But also, I'd like to understand the crash scenario. We called InvokeCallback. That closed watchable. But then we move on to the next watchable... so why the crash? Is it that the next watchable got deleted but we still have a stale pointer to it? Something else?

Yes, see more detail in the conversation above.

1 similar comment
@yufengwangca
Copy link
Contributor Author

But also, I'd like to understand the crash scenario. We called InvokeCallback. That closed watchable. But then we move on to the next watchable... so why the crash? Is it that the next watchable got deleted but we still have a stale pointer to it? Something else?

Yes, see more detail in the conversation above.

@github-actions
Copy link

Size increase report for "esp32-example-build" from f6c861e

File Section File VM
chip-bridge-app.elf .flash.text -48 -48
chip-ipv6only-app.elf .flash.text 172 172
chip-temperature-measurement-app.elf .flash.text -60 -60
Full report output
BLOAT REPORT

Files found only in the build output:
    report.csv

Comparing ./master_artifact/chip-bridge-app.elf and ./pull_artifact/chip-bridge-app.elf:

sections,vmsize,filesize
[Unmapped],0,48
.flash.text,-48,-48

Comparing ./master_artifact/chip-persistent-storage.elf and ./pull_artifact/chip-persistent-storage.elf:

sections,vmsize,filesize

Comparing ./master_artifact/chip-ipv6only-app.elf and ./pull_artifact/chip-ipv6only-app.elf:

sections,vmsize,filesize
.flash.text,172,172
[Unmapped],0,-172

Comparing ./master_artifact/chip-temperature-measurement-app.elf and ./pull_artifact/chip-temperature-measurement-app.elf:

sections,vmsize,filesize
[Unmapped],0,60
.flash.text,-60,-60

Comparing ./master_artifact/chip-shell.elf and ./pull_artifact/chip-shell.elf:

sections,vmsize,filesize

Comparing ./master_artifact/chip-all-clusters-app.elf and ./pull_artifact/chip-all-clusters-app.elf:

sections,vmsize,filesize

Comparing ./master_artifact/chip-pigweed-app.elf and ./pull_artifact/chip-pigweed-app.elf:

sections,vmsize,filesize

Comparing ./master_artifact/chip-lock-app.elf and ./pull_artifact/chip-lock-app.elf:

sections,vmsize,filesize


@woody-apple woody-apple merged commit 26c0e35 into project-chip:master Aug 18, 2021
@yufengwangca yufengwangca deleted the pr/system/event branch August 18, 2021 01:04
nikita-s-wrk pushed a commit to nikita-s-wrk/connectedhomeip that referenced this pull request Sep 23, 2021
…ip#9010)

* Fix random segmentation fault during event handling

* Update for review comment
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants