I have written a websocket server using sockjs-node (which I realize is faye-websocket at its core) that works beautifully and scales to well over 1MM connections. My problem is that, on occasion, the CPU gets pegged at 100%. Further investigation with strace reveals an infinite loop of read() syscalls against a file descriptor that has gone into the "can't identify protocol" state.
After analyzing many straces, the missing piece is a close() syscall for the affected file descriptor. I have no idea why it's missing for these particular failure cases. The other successful connections properly have close() and sometimes shutdown() syscalls. I can trigger the problem by sending a large chunk of data from server to client (see #4 below) immediately on connect, but the CPU spike only happens sometimes. Here's the flow:
Client requests some data (details don't matter here). Everything back and forth is serialized JSON.
Server retrieves data from memCache.
Server sends data over the websocket. This data is a bit larger in size - around 55K. I realize it's broken into separate "frames". I can't get this to fail in my own testing.
At some point during this process, a legitimate websocket close is detected and the close event fires for sockjs. I can't tell if it finishes sending the initial 55K of data. Edit: This data doesn't seem to matter. Another strace shows the problem happening many seconds after this initial data. The problem seems to depend on the relative timing of a read() and the disconnect.
syscall close() DOES NOT HAPPEN for the affected file descriptor. I believe this is the root of the problem?
File descriptor ends up in "can't identify protocol". node 9116 web 1365u sock 0,5 1203852860 can't identify protocol
Kernel continues to try and read() from that FD, causing an infinite loop and 100% CPU.
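For anyone hitting the same symptom, here is a minimal sketch of how the lsof line quoted in the flow above could be picked out of a larger lsof dump. The helper name `findOrphanedSockets` is hypothetical, and the parsing assumes only that the first four whitespace-separated columns are command, PID, user, and FD, as in the sample line.

```javascript
// Sketch only: flag lsof lines whose trailing text is "can't identify protocol".
// Assumes the first four columns are COMMAND, PID, USER, FD (as in the sample).
function findOrphanedSockets(lsofOutput) {
  const marker = "can't identify protocol";
  return lsofOutput
    .split('\n')
    .filter((line) => line.includes(marker))
    .map((line) => {
      const [command, pid, user, fd] = line.trim().split(/\s+/);
      // parseInt drops the access-mode suffix, e.g. "1365u" becomes 1365.
      return { command, pid: Number(pid), user, fd: parseInt(fd, 10) };
    });
}

// The line reported in this issue.
const sample = "node 9116 web 1365u sock 0,5 1203852860 can't identify protocol";
console.log(findOrphanedSockets(sample));
```

Feeding it the output of lsof for the node process gives a list of descriptor numbers that can then be matched against the strace.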
I'm at my wits' end here. I'm not a node beginner and I can't figure this out. I've read many threads that are similar, but they have either been fixed or don't apply to my case exactly:
It should be noted that when I don't send the initial data chunk to the client, it works all the time from what I can tell. Even when I send the data, it only occasionally fails, so I think it must have to do with timing and how the client is disconnecting, yet I can't see the client side of the problem.
Here is the last part of the strace. You can see the disconnect followed by infinite read() syscalls to FD 321. Let me know if you need the full strace. THANK YOU!
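The check described here (a descriptor that keeps getting read() but never sees a close()) can be automated over a saved strace log. The following is a rough sketch, not the tooling used in this issue: `findUnclosedFds` and its `minReads` threshold are hypothetical, and the sample lines are invented for illustration, reusing only the FD numbers 321 and 447 mentioned in this report.

```javascript
// Sketch only: list fds that were read() at least `minReads` times
// since their last close(). A close() resets the count, so a reused
// descriptor number starts fresh.
function findUnclosedFds(straceText, minReads = 3) {
  const reads = new Map(); // fd -> number of read() calls since last close()
  const pattern = /\b(read|close)\((\d+)/;
  for (const line of straceText.split('\n')) {
    const match = pattern.exec(line);
    if (!match) continue;
    const fd = Number(match[2]);
    if (match[1] === 'close') reads.delete(fd);
    else reads.set(fd, (reads.get(fd) || 0) + 1);
  }
  return [...reads].filter(([, count]) => count >= minReads).map(([fd]) => fd);
}

// Invented sample lines (NOT the real strace from this issue).
const straceSample = [
  'read(447, "", 65536) = 0',
  'close(447) = 0',
  'epoll_wait(5, {}, 1024, 0) = 0',
  'read(321, "", 65536) = 0',
  'epoll_wait(5, {}, 1024, 0) = 0',
  'read(321, "", 65536) = 0',
  'epoll_wait(5, {}, 1024, 0) = 0',
  'read(321, "", 65536) = 0',
].join('\n');

console.log(findUnclosedFds(straceSample));
```

A real log would need a higher threshold, since healthy long-lived connections also accumulate reads before they close.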
darinspivey changed the title from "Webscoket FD not closed() properly. Causes infinite loop and 100% CPU." to "Websocket FD not closed() properly. Causes infinite loop and 100% CPU." on Jul 3, 2014
FIXED: It is with a great deal of humility that I post my own solution. I've made a rookie mistake of not upgrading node sooner. It wasn't until I virtually removed all my code and the problem persisted that I thought perhaps it could actually be the OS or node.js itself. I was naive to think that our version v0.10.5 wasn't 'that far' out of date. Wrong. By over a year. An upgrade to v0.10.29 fixed the issue.
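Since the fix came down to the node version, a startup check along these lines might catch a stale runtime earlier. This is only a sketch: `parseVersion` and `isOlderThan` are hypothetical helpers, and it assumes plain `vMAJOR.MINOR.PATCH` strings with no pre-release suffix. The comparison is numeric per component because a plain string comparison would wrongly rank 0.10.5 above 0.10.29.

```javascript
// Sketch only: compare two "vMAJOR.MINOR.PATCH" version strings numerically.
function parseVersion(version) {
  // "v0.10.5" -> [0, 10, 5]
  return version.replace(/^v/, '').split('.').map(Number);
}

function isOlderThan(current, minimum) {
  const a = parseVersion(current);
  const b = parseVersion(minimum);
  for (let i = 0; i < 3; i += 1) {
    if (a[i] !== b[i]) return a[i] < b[i];
  }
  return false; // identical versions are not "older"
}

// The two versions named in this issue.
console.log(isOlderThan('v0.10.5', 'v0.10.29')); // true

// At startup, warn if the running node predates the version that fixed it here.
if (isOlderThan(process.version, 'v0.10.29')) {
  console.warn('node ' + process.version + ' is older than v0.10.29');
}
```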
At least I learned a ton about strace and my code :) Keep up the good work here - the sockjs package is the best that I've worked with in terms of fallbacks and ease of use.
Similar threads:
https://idea.popcount.org/2012-12-09-lsof-cant-identify-protocol/
#99
https://groups.google.com/forum/#!topic/sockjs/noR9wD7-esg
nodejs/node-v0.x-archive#3613
Details:
node.js v0.10.5
sockjs-0.3.9
Linux, x86_64 x86_64 x86_64 GNU/Linux
Here is the last part of the strace. You can see the disconnect followed by infinite read() syscalls to FD 321. Let me know if you need the full strace. THANK YOU!
[ INFINITE LOOP of clock, epoll, read() ]
What a healthy disconnect should look like. Notice the close(447) and shutdown(447) calls.