Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Attempt to fix the infamous DTLS decrypt alert error (issues #136 and #316) #394

Merged
merged 1 commit into from
Dec 2, 2015

Conversation

lminiero
Copy link
Member

This is an attempt to fix the infamous issues #136 and #316, that is the "tlsv1 alert decrypt error" randomly happening when doing handshakes that force Janus to be restarted (issue affecting OpenSSL but NOT BoringSSL, apparently). The cause might be related to race conditions, and in fact OpenSSL docs state that:

    "OpenSSL can safely be used in multi-threaded applications
    provided that at least two callback functions are set,
    locking_function and threadid_func."

The fix proposed here is heavily derived from a discussion related to RTPEngine, where it was mentioned the issue was fixed in this commit, which does indeed implement the callbacks the OpenSSL docs suggest.

Please test this estensively with OpenSSL (not BoringSSL) and let us know if this does fix it for you. Mentioning those who experienced the issue in the past in particular (at least those who discussed it on those issues pages), that is @ancorgs @ploxiln @LetsVape @Will-I4M @jacobhub @frk2 @amnonbb @Computician @jing3018 @tuijldert @saghul

@ploxiln
Copy link
Contributor

ploxiln commented Nov 27, 2015

It'll be a while before I get an appropriate testing version out and have it stewing for long enough to see for sure, but I think this is the right approach (and this specific implementation looks good to me). Much appreciated 😄

@LetsVape
Copy link

Awesome, good find on the discussion, thanks for working on this Lorenzo
I've compiled this right away and started testing with 30+ users

@ancorgs
Copy link
Contributor

ancorgs commented Nov 30, 2015

I have just deployed it. We are planning a heavy load this week. I will let you know how it behaves.

@lminiero lminiero mentioned this pull request Dec 1, 2015
@LetsVape
Copy link

LetsVape commented Dec 2, 2015

I have been running this for 110 hours and haven't seen the issue, so it's safe to say things improved, as I couldn't run it for much longer than 2-24 hours until handshakes started locking up with OpenSSL before this pull request

Connections seem to fail less in general, though that might be placebo, people had to retry camming up sometimes but that seems less often now

@lminiero
Copy link
Member Author

lminiero commented Dec 2, 2015

@LetsVape thanks for the quick feedback, and really excited about the results!
I'll merge the PR then, it's easy enough to get rid of for people who don't want to use it.

lminiero added a commit that referenced this pull request Dec 2, 2015
Attempt to fix the infamous DTLS decrypt alert error (issues #136 and #316)
@lminiero lminiero merged commit cfc5421 into master Dec 2, 2015
@LetsVape
Copy link

LetsVape commented Dec 2, 2015

👍 Thanks for that! Same here, I hope we can keep this process going and beat the BoringSSL uptime :)

@ancorgs
Copy link
Contributor

ancorgs commented Dec 2, 2015

After three days of usage (heavier usage than usual) I have noticed no slow down, I have experienced no crash, I have not needed to restart the service in order for people to connect and there is no trace of the DTLS decrypt error in the logs.

I would call this a huge success. Thanks a lot!!!

@lminiero
Copy link
Member Author

lminiero commented Dec 2, 2015

Cool, thanks for the update, good to have other feedback that confirms this works!

@HouzuoGuo
Copy link

congratulations on the resolution!

@LetsVape
Copy link

I got 29 days and 23 hours uptime out of this, I then got the following core dump, hope it tells you something useful @lminiero , otherwise I'll try again with a newer commit later, maybe the upcoming reference counters will improve things a bit 👍

https://gist.github.com/LetsVape/80880c1bdc92411e5018

Thanks!

@LetsVape
Copy link

No big CPU or memory jumps, dark blue is cached memory used, but that has been high throughout the uptime, the OS should free that up when needed as far as I'm aware
screen shot 2015-12-27 at 22 44 10

@lminiero
Copy link
Member Author

@LetsVape looks like some internal Glib stuff, not really sure how I can debug this (won't happen before a couple of weeks anyway). If it is some kind of race condition, then yes, a polished and stable version of the reference counters branch may help on that.

~30 days without an issue seems like a good improvement though 😄

@LetsVape
Copy link

Thanks for the reply, no worries at all, just to make you aware in case this backtrace was useful, as running it in AddressSanitizer for 30 days might be a bit much.
It's indeed a great improvement already, I can deal with a minute downtime per 30 days.
I'll keep an eye on the commits and try those out later so we can try for another 30+ days 👍

Cheers and happy new year

@lminiero lminiero deleted the dtls-alert-fix branch June 7, 2016 10:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants