Crash occurs in closePeerConnection #995

Closed
liutuzhao opened this issue Dec 8, 2020 · 15 comments
Labels
question Further information is requested

Comments

@liutuzhao

liutuzhao commented Dec 8, 2020

Logging
crash-in-closePeerConnection.zip

Describe the bug
We run two livestreaming sessions at the same time. When one session is detected as broken in the "onConnectionStateChange" callback, its "terminateFlag" is set to true. Another thread checks the status of each session and frees the broken one; the crash occurs in the SDK function "closePeerConnection". The backtrace is as follows:

(gdb) where
#0 0x00504fa4 in pthread_mutex_lock ()
#1 0x00319500 in socketConnectionClosed ()
#2 0x0030f92c in connectionListenerRemoveAllConnection ()
#3 0x003108a0 in iceAgentShutdown ()
#4 0x002d163c in closePeerConnection ()
#5 0x0005a1ac in freeSampleStreamingSession ()
#6 0x00048a10 in CWebRTCClientMaster::SessionCleanupCheck(CQVMessageT*, unsigned int, unsigned int&) ()
#7 0x002ab82c in CQVThreadWorker::OnPolling(unsigned int&) ()
#8 0x002ac954 in CQVThreadWorker::OnThread() ()
#9 0x002aba58 in CQVThread::ThreadProc(void*) ()
#10 0x00503904 in start_thread ()
#11 0x0051cd20 in clone ()

SDK version number
V1.4.0

Open source building
default config in SDK
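
The teardown flow described in this report (a callback sets a terminate flag, and a separate thread later frees the session) can be sketched as below. This is a minimal illustration only, assuming that one possible cause of such a crash is a session being torn down twice or freed while another thread still touches it; the thread does not confirm that. The names DemoSession, demoSessionInit, demoOnConnectionFailed and demoCleanupIfTerminated are hypothetical and are not part of the SDK or its samples.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical session record; NOT the SDK's sample session structure. */
typedef struct {
    atomic_bool terminateFlag; /* set by the connection-state callback */
    atomic_bool tornDown;      /* guards against a second teardown */
    int closeCount;            /* counts teardown calls, for illustration only */
} DemoSession;

/* Put the session into a known state before any thread uses it. */
void demoSessionInit(DemoSession* s) {
    assert(s != NULL);
    atomic_init(&s->terminateFlag, false);
    atomic_init(&s->tornDown, false);
    s->closeCount = 0;
}

/* Callback thread: only signal that the session is broken, never free here. */
void demoOnConnectionFailed(DemoSession* s) {
    atomic_store(&s->terminateFlag, true);
}

/* Cleanup thread: returns true only for the single call that does the teardown. */
bool demoCleanupIfTerminated(DemoSession* s) {
    if (!atomic_load(&s->terminateFlag)) {
        return false; /* session is still healthy */
    }
    bool expected = false;
    if (!atomic_compare_exchange_strong(&s->tornDown, &expected, true)) {
        return false; /* another caller already tore this session down */
    }
    /* The real close and free calls would go here (assumption). */
    s->closeCount++;
    return true;
}
```

The compare-and-exchange makes the teardown idempotent: however many times the polling thread revisits a broken session, the close path runs once. It does not protect against other threads that still hold a pointer to the session; that would need reference counting or a join on those threads.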

@liutuzhao liutuzhao added the bug Something isn't working label Dec 8, 2020
@MushMal
Contributor

MushMal commented Dec 8, 2020

@liutuzhao this is not the stock application and the issue is not actionable without debug symbols and local variables. We will only look at crashes with stock samples.

Please debug this further on your own. Please pull us in if you can pinpoint the actual crash in the SDK or the stock sample applications.

As the stack trace does not correspond to the sample application, I am not sure what's causing the crash.

Removing "bug" tags.

@MushMal MushMal added question Further information is requested and removed bug Something isn't working labels Dec 8, 2020
@MushMal MushMal changed the title [BUG] Crash occurs in closePeerConnection Crash occurs in closePeerConnection Dec 8, 2020
@MushMal
Contributor

MushMal commented Dec 9, 2020

Any updates? Have you been able to reproduce this with stock samples?

@liutuzhao
Author

Any updates? Have you been able to reproduce this with stock samples?

We're trying Alexa's pull request #996 and your pull request #1001 on our camera.
We have not seen a crash so far. We will keep testing for several days, and if no related crash occurs again, we can close this issue.

@MushMal
Contributor

MushMal commented Dec 10, 2020

Sounds good. I am not sure if any of this will fix a crash. Try to get the stock applications running in parallel on your platform to get wider coverage. Try running under gdb and have the symbols ready to be loaded if a crash happens.

@liutuzhao
Author

We encountered another similar crash. @MushMal @codingspirit

(gdb) thread apply 1 bt

Thread 1 (LWP 6105):
#0 0x00505284 in pthread_mutex_lock ()
#1 0x00322f70 in lwsCompleteSync ()
#2 0x00323578 in getIceConfigLws ()
#3 0x002dd128 in getIceConfig ()
#4 0x002de2a0 in executeGetIceConfigSignalingState ()
#5 0x002f3fc0 in stepStateMachine ()
#6 0x002ddfe0 in stepSignalingStateMachine ()
#7 0x00320564 in reconnectHandler ()
#8 0x00503be4 in start_thread ()
#9 0x0051d000 in clone ()
#10 0x0051d000 in clone ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

@MushMal
Contributor

MushMal commented Dec 12, 2020

Do you have the symbols?

@liutuzhao
Author

I am uploading the gdb binary, the core dump, and the symbol files together with the executable from before stripping.
The core-xxx file is the core dump. Sofia1 is the executable before stripping, Sofia is the executable after stripping, and the .symbols file holds the stripped symbols. gdb1 is an x86 build of gdb that targets our camera platform. The attachment is larger than 10 MB, so I split it into four parts. Because GitHub only accepts .zip files, I changed the file extensions to .zip. Please download the parts and rename them to bak8-crash-lwsCompleteSync.zip.001, bak8-crash-lwsCompleteSync.zip.002, bak8-crash-lwsCompleteSync.zip.003, and bak8-crash-lwsCompleteSync.zip.004 before uncompressing them.
bak8-crash-lwsCompleteSync.004.zip

bak8-crash-lwsCompleteSync.003.zip

bak8-crash-lwsCompleteSync.002.zip

bak8-crash-lwsCompleteSync.001.zip

@codingspirit
Member

MUTEX_LOCK(pCallInfo->pSignalingClient->lwsSerializerLock);
Regarding the line above, I couldn't find any scenario in which pCallInfo->pSignalingClient->lwsSerializerLock is NULL while pCallInfo->pSignalingClient is not. Hi @MushMal, any clue from your side?

@MushMal
Contributor

MushMal commented Dec 17, 2020

I couldn't

None that I can think of. If you are within LwsApiCalls.c, then the entire signaling client object should already have been created successfully.

Perhaps a stale public header file is being used with the latest codebase, which could have shifted the internal structure fields?

Sorry, I haven't had any time to look at the attached log files.
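
The stale-header hypothesis above can be illustrated with a small sketch. It assumes, purely for illustration, that the application was compiled against an older struct definition while the library was built with a newer one that has an extra field in front of the lock. Both struct types, their fields other than the name lwsSerializerLock, and the uint64_t type chosen for the lock are hypothetical stand-ins, not the SDK's real definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical "old header" layout the application was compiled against. */
typedef struct {
    uint32_t version;
    uint64_t lwsSerializerLock;
} DemoClientOld;

/* Hypothetical "new library" layout: one field inserted before the lock. */
typedef struct {
    uint32_t version;
    uint64_t newField;
    uint64_t lwsSerializerLock;
} DemoClientNew;

/* The newer layout is strictly larger, so the two builds cannot agree. */
static_assert(sizeof(DemoClientNew) > sizeof(DemoClientOld),
              "new layout is expected to be larger than the old one");

/* Offset at which code built with the old header looks for the lock. */
size_t demoOldLockOffset(void) {
    return offsetof(DemoClientOld, lwsSerializerLock);
}

/* Offset at which the library built with the new header stores the lock. */
size_t demoNewLockOffset(void) {
    return offsetof(DemoClientNew, lwsSerializerLock);
}

/* Nonzero when the two builds disagree about where the lock lives. */
int demoLayoutsDisagree(void) {
    return demoOldLockOffset() != demoNewLockOffset();
}
```

When the offsets disagree, code built with the old header reads whatever bytes happen to sit at the old offset and passes them on as the lock, which would be consistent with a fault inside pthread_mutex_lock. Rebuilding everything against one set of headers rules this case out.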

@Nomidia
Contributor

Nomidia commented Dec 29, 2020

Similar issue in the same position:

#0 0x003318e4 in lws_callback_on_writable ()
#1 0x00323810 in wakeLwsServiceEventLoop ()
#2 0x00323d18 in lwsCompleteSync ()
#3 0x0032434c in getIceConfigLws ()
#4 0x002dceb4 in getIceConfig ()
#5 0x002de02c in executeGetIceConfigSignalingState ()
#6 0x002f4bd8 in stepStateMachine ()
#7 0x002ddd6c in stepSignalingStateMachine ()
#8 0x002daac4 in refreshIceConfigurationCallback ()
#9 0x002fab04 in timerQueueExecutor ()
#10 0x005049b4 in start_thread ()
#11 0x0051ddd0 in clone ()
#12 0x0051ddd0 in clone ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

@hassanctech
Contributor

Can you please try the latest commit on master and see if it resolves your issue?
If there is still a crash, please include symbols with the crash stack so we can better help.

@suggestedfixes
Contributor

@hassanctech Still reproducible on Windows. It would be nice if someone on the AWS side could replicate the Windows scenario.

@MushMal
Contributor

MushMal commented Jan 14, 2021

@suggestedfixes this thread is getting stale very quickly. I have requested a dump with symbols, plus information on whether you've made any changes. We do have Windows runs in Travis CI which don't crash. It's hard for us to try to reproduce something that we have no understanding of.

  • Please include a detailed description of the assets in use, and whether there have been any modifications to the samples that are being run.
  • Include a detailed description of how the crash happens.
  • Include details on the platform, both hardware and software, with their versions.
  • Provide symbols for ALL of the threads in the crash dump.

@MushMal
Contributor

MushMal commented Jan 19, 2021

Updates please?

@MushMal
Contributor

MushMal commented Jan 20, 2021

I am resolving this as we have no symbolic info and there is nothing actionable here.

For the crash stack involving ICE refresh in signaling, please use the latest commit, which removes the automatic ICE refresh. There is very little to work with on the other crash stack, the one related to the connection removal.

6 participants