Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Data race in UDPEndpoint on lwIP and OpenThread #20478

Closed
gjc13 opened this issue Jul 8, 2022 · 4 comments · Fixed by #20536
Closed

Data race in UDPEndpoint on lwIP and OpenThread #20478

gjc13 opened this issue Jul 8, 2022 · 4 comments · Fixed by #20536

Comments

@gjc13
Copy link
Contributor

gjc13 commented Jul 8, 2022

Problem

One of our vendors reported random crashes in the light app. The app crashes at line UDPEndPointImplLwIP::LwIPReceiveUDPMessage. The ep->Retain(); line reports invalid reference count and aborted.

The current imlementaion of this function will suffer from data race. ep->Retain() will be executed in the lwIP event loop while ep->Release() is executed in the Matter event loop. The following patch has been suggested to the vendor:

--- a/src/inet/UDPEndPointImplLwIP.cpp
+++ b/src/inet/UDPEndPointImplLwIP.cpp
@@ -371,8 +371,8 @@ void UDPEndPointImplLwIP::LwIPReceiveUDPMessage(void * arg, struct udp_pcb * pcb
         pktInfo->DestPort    = pcb->local_port;
     }
 
-    ep->Retain();
     CHIP_ERROR err = ep->GetSystemLayer().ScheduleLambda([ep, p = System::LwIPPacketBufferView::UnsafeGetLwIPpbuf(buf)] {
+        ep->Retain();
         ep->HandleDataReceived(System::PacketBufferHandle::Adopt(p));
         ep->Release();
     });
@@ -381,10 +381,6 @@ void UDPEndPointImplLwIP::LwIPReceiveUDPMessage(void * arg, struct udp_pcb * pcb
         // If ScheduleLambda() succeeded, it has ownership of the buffer, so we need to release it (without freeing it).
         static_cast<void>(std::move(buf).UnsafeRelease());
     }
-    else
-    {
-        ep->Release();
-    }
 }

This is not correct but will resolve the data race issue. After applying the patch the crash no longer happens. Similar logic is found in the OpenThread UDP endpoint implementation so there will be potential bugs as well.

Proposed solution

One of the following solutions can be adpoted:

  • Add thread-safe implementation for reference counting.
  • Use the lwIP socket API rather than the raw lwIP API so that the received data will be handled in the Matter event loop.
  • Simply remove the Retain and Release logic since in practice nobody is releasing the endpoint.
@gjc13
Copy link
Contributor Author

gjc13 commented Jul 8, 2022

@kpschoedel How do you think of this issue?

@kpschoedel
Copy link
Contributor

Personally, I would remove Retain/Release now and file a bug to add thread-safe reference counting when there aren't higher priorities. (I would probably add a separate thread-safe version since ReferenceCounted is usually used entirely within the Matter thread.)

@kghost
Copy link
Contributor

kghost commented Jul 8, 2022

I do think the correct solution is expand thread lock into a global matter lock

Use the lwIP socket API rather than the raw lwIP API so that the received data will be handled in the Matter event loop.

This also works, but I wonder if it is able to catch 1.0 timeline

The app crashes at line UDPEndPointImplLwIP::LwIPReceiveUDPMessage. The ep->Retain(); line reports invalid reference count and aborted.

This is the evidence that there is a bug inside thread implementation. It won't crash inside ep->Retain() even if there is racing here.

@gjc13
Copy link
Contributor Author

gjc13 commented Jul 11, 2022

This is the evidence that there is a bug inside thread implementation. It won't crash inside ep->Retain() even if there is racing here.

Retain can crash if the incremented counter in a previous Retain is overriden by the decremented value in Release because of data race. It aborts at VerifyOrDie(!kInitRefCount || mRefCount > 0); and we have observed the mRefCount to be zero.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants