Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Websocket FD not closed() properly. Causes infinite loop and 100% CPU. #161

Closed
darinspivey opened this issue Jul 2, 2014 · 1 comment
Closed

Comments

@darinspivey
Copy link

I have written a websocket server using sockjs-node (which I realize is faye-websockets at its core) that is working beautifully and is scalable to well over 1MM connections. My problem is that, on occasion, the CPU will get pegged at 100%. Further investigation with strace reveals an infinite loop related to the read() syscall trying to communicate with a file descriptor that has gone into the "can't identify protocol" state.

After analyzing many straces, the missing piece is a close() syscall for the affected file descriptor. I have no idea why it's missing for these particular failure cases. The other successful connections properly have close() and sometimes shutdown() syscalls. I can trigger the problem by sending a large chunk of data from server to client (see #4 below) immediately on connect, but the CPU spike only happens sometimes. Here's the flow:

  1. Client connects, websocket upgrade (if applicable) succeeds.
  2. Client requests some data (details don't matter here). Everything back and forth is serialized JSON.
  3. Server retrieves data from memCache.
  4. Server sends data over the websocket. This data is a bit larger in size - around 55K. I realize it's broken into separate "frames". I can't get this to fail in my own testing.
  5. At some point during this process, a legitimate websocket close is detected and the close event fires for sockjs. I can't tell if it finishes sending the initial 55K of data.
    Edit: This data doesn't seem to matter. Another strace shows the problem happening many seconds after this initial data. The problem seems to happen within timing of a read() and disconnect operations.
  6. syscall close() DOES NOT HAPPEN for the affected file descriptor. ** I believe this is the root of the problem?
  7. File descriptor ends up in "can't identify protocol".
    node 9116 web 1365u sock 0,5 1203852860 can't identify protocol
  8. Kernel continutes to try and read() from that FD, causing an infinite loop and 100% CPU.

I'm at my wits end here. I'm not a node beginner and I can't figure this out. I've read many threads that are similar, but have either been fixed or don't apply to my case exactly:

https://idea.popcount.org/2012-12-09-lsof-cant-identify-protocol/
#99
https://groups.google.com/forum/#!topic/sockjs/noR9wD7-esg
nodejs/node-v0.x-archive#3613

It should be noted that when I don't send the initial data chunk to the client, it works all the time from what I can tell. Even when I send the data, it only occasionally fails, so I think it must have to do with the timing and however the client is disconnecting, yet I can't see the client side of the problem.

Details:
node.js v0.10.5
sockjs-0.3.9
Linux, x86_64 x86_64 x86_64 GNU/Linux

Here is the last part of the strace. You can see the disconnect followed by infinite read() syscalls to FD 321. Let me know if you need the full strace. THANK YOU!

19790 1404328492.878870 read(1365, "\210\200\0052_Z", 65536) = 6 <0.000017>
19790 1404328492.878939 futex(0x830b098, FUTEX_WAKE, 1) = 1 <0.000076>
19791 1404328492.879038 <... futex resumed> ) = 0 <0.013393>
19791 1404328492.879080 nanosleep({0, 900000},  <unfinished ...>
19790 1404328492.879280 write(1, "7/2/2014 15:14:52  [SockJSManager 19790] -info- Disconnected (137c1d12-98c9-432a-9ac5-f9b6ee70dd10)\n", 100) = 100 <0.000016>
19790 1404328492.879983 write(1, "7/2/2014 15:14:52  [Handler 19790] !!!! sockClose Garbage collection done.  Socket closed.  137c1d12-98c9-432a-9ac5-f9b6ee70dd10  FD: undefined\n", 144) = 144 <0.000015>
...
19790 1404328493.998149 read(1365, "", 65536) = 0 <0.000022>
...
19790 1404328495.397777 clock_gettime(CLOCK_MONOTONIC, {11459665, 858517651}) = 0 <0.000017>
19790 1404328495.397841 read(1365, "", 65536) = 0 <0.000015>
19790 1404328495.397908 futex(0x830b098, FUTEX_WAKE, 1 <unfinished ...>
19791 1404328495.397935 <... futex resumed> ) = 0 <0.002306>
19790 1404328495.397950 <... futex resumed> ) = 1 <0.000031>
19791 1404328495.397967 nanosleep({0, 900000},  <unfinished ...>
19790 1404328495.398026 clock_gettime(CLOCK_MONOTONIC, {11459665, 858760651}) = 0 <0.000010>
19790 1404328495.398072 epoll_ctl(6, EPOLL_CTL_MOD, 1365, {EPOLLOUT, {u32=1365, u64=1365}}) = 0 <0.000020>
19790 1404328495.398136 epoll_wait(6, {{EPOLLERR|EPOLLHUP, {u32=1365, u64=1365}}}, 1024, 13) = 1 <0.000024>
19790 1404328495.398191 clock_gettime(CLOCK_MONOTONIC, {11459665, 858922651}) = 0 <0.000010>
19790 1404328495.398234 read(1365, "", 65536) = 0 <0.000011>
19790 1404328495.398293 clock_gettime(CLOCK_MONOTONIC, {11459665, 859025651}) = 0 <0.000010>
19790 1404328495.398337 epoll_ctl(6, EPOLL_CTL_MOD, 1365, {EPOLLOUT, {u32=1365, u64=1365}}) = 0 <0.000012>
19790 1404328495.398391 epoll_wait(6, {{EPOLLERR|EPOLLHUP, {u32=1365, u64=1365}}}, 1024, 12) = 1 <0.000054>
19790 1404328495.398596 clock_gettime(CLOCK_MONOTONIC, {11459665, 859328651}) = 0 <0.000020>
19790 1404328495.398649 read(1365, "", 65536) = 0 <0.000012>

[ INFINITE LOOP of clock, epoll, read() ]

What this should look like. Notice the close(447) and shutdown(447) calls.

19790 1404327434.794877 read(447, "\210\200u\312\363\201", 65536) = 6 <0.000018>
19790 1404327434.794940 futex(0x830b098, FUTEX_WAKE, 1 <unfinished ...>
19791 1404327434.794965 <... futex resumed> ) = 0 <0.057214>
19790 1404327434.794978 <... futex resumed> ) = 1 <0.000029>
19791 1404327434.794993 nanosleep({0, 900000},  <unfinished ...>
19790 1404327434.795091 write(447, "\210\0", 2) = 2 <0.000021>
19790 1404327434.795307 write(1, "7/2/2014 14:57:14  [SockJSManager 19790] -info- Disconnected (8ff2cf25-d063-4813-a0dc-ed5205ada37f)\n", 100) = 100 <0.000020>
19790 1404327434.795968 write(1, "7/2/2014 14:57:14  [Handler 19790] !!!! sockClose Garbage collection done.  Socket closed.  8ff2cf25-d063-4813-a0dc-ed5205ada37f  FD: undefined\n", 144) = 144 <0.000020>
19790 1404327434.796063 clock_gettime(CLOCK_MONOTONIC, {11458605, 256795651}) = 0 <0.000013>
19790 1404327434.796119 shutdown(447, 1 /* send */) = 0 <0.000018>
19790 1404327434.796189 epoll_ctl(6, EPOLL_CTL_MOD, 447, {EPOLLIN, {u32=447, u64=447}}) = 0 <0.000020>
19790 1404327434.796273 epoll_wait(6, {{EPOLLIN|EPOLLHUP, {u32=447, u64=447}}}, 1024, 204) = 1 <0.000019>
19790 1404327434.796325 clock_gettime(CLOCK_MONOTONIC, {11458605, 257057651}) = 0 <0.000013>
19790 1404327434.796374 read(447, "", 65536) = 0 <0.000014>
19790 1404327434.796470 close(447)      = 0 <0.000023>
@darinspivey darinspivey changed the title Webscoket FD not closed() properly. Causes infinite loop and 100% CPU. Websocket FD not closed() properly. Causes infinite loop and 100% CPU. Jul 3, 2014
@darinspivey
Copy link
Author

FIXED: It is with a great deal of humility that I post my own solution. I've made a rookie mistake of not upgrading node sooner. It wasn't until I virtually removed all my code and the problem persisted that I though perhaps it would actually be the OS or node.js itself. I was naive to think that our version v0.10.5 wasn't 'that far' out of date. Wrong. By over a year. An upgrade to v0.10.29 fixed the issue.

At least I learned a ton about strace and my code :) Keep up the good work here - the sockjs package is the best that I've worked with in terms of fallbacks and ease of use.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant