Web server and network interrupt debugging
songjiguo edited this page Nov 18, 2013 · 2 revisions
This is a log of most of the things I tried while debugging the performance issue of the web server in Composite. Because of my schedule, I will probably come back to this problem in February.
Motivation: The web server throughput is normally around 5000 reqs/sec. We expect it to be much higher (Gabe used to get about 10000 reqs/sec on a Pentium 4, and we now have a 2.4GHz i7).
Configuration:
client (Linux) <---> switch <---> server (Composite)
Make sure the control flow is turned off (in Makefile)
Server settings:
1) ifconfig eth0 192.168.1.10 up
2) set Composite as the highest priority
//#define LINUX_ON_IDLE
#define LINUX_HIGHEST_PRIORITY 1
3) in cnet_user.c
#define IPADDR "192.168.1.22"
#define P2PPEER "10.0.2.8"
Client settings:
0) use "ethtool -s eth0 speed 100 duplex full" to set NIC speed
1) ifconfig eth0 192.168.1.20 up
2) route add -net 10.0.2.0 netmask 255.255.255.0 gw 192.168.1.10 eth0
3) use ab or httperf (a patched ab is needed)
-- ab -n 20000 -c 20 10.0.2.8:200/test
-- httperf --server 10.0.2.8 --port 200 --uri /test --num-conn 20 --num-call 1000
Stage 1: observation
In Composite, when printing scheduler information every second, we found
that the utilization of the idle thread is high (between 60% and
70%). The other two main threads are the tcp thread (tcp - conn - http -
ramfs) and the network thread (tcp - ip - if), where the network thread has
the highest priority of all (except the timer thread). We also print
the HTTP connections/requests made every second, and the rate is around
4000~5000 reqs/sec.
There is something unusual about the idle thread: its utilization is
high while the network thread utilization is low. The possible reasons are:
1) bug in ramfs/https, connection_mgr, or tcp/ip/if
2) bug in evt component
3) bug in interrupt handling (e.g. long latency)
4) bug in lwip network stack (e.g. backlog size too small)
5) bug in scheduler (e.g. incorrect decision about when to switch to the network thread)
6) bug somewhere else
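The per-second utilization figures above come down to each thread's share of the cycles in one reporting window. A minimal sketch of that arithmetic follows; `struct thd_sample` and `thd_util_pct` are hypothetical names for illustration, not Composite's actual scheduler structures or API.

```c
#include <stdint.h>

/* Hypothetical per-thread sample: cycles a thread ran during one
 * reporting window. Not Composite's actual scheduler structure. */
struct thd_sample {
	const char *name;
	uint64_t    cycles;
};

/* Utilization of thread i as a percentage of all cycles in the window. */
static double
thd_util_pct(const struct thd_sample *s, int n, int i)
{
	uint64_t total = 0;
	int k;

	for (k = 0; k < n; k++) total += s[k].cycles;
	if (total == 0) return 0.0;

	return 100.0 * (double)s[i].cycles / (double)total;
}
```

Comparing the idle thread's share against the network thread's share in the same window makes the imbalance described above easy to spot over time.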
Stage 2: Set up Apache and Nginx and measure the performance as the baseline
1) Apache's throughput is about 13000~16000 req/sec
2) Nginx's throughput is about 22000~25000 req/sec
3) So we expect Composite to achieve at least a similar number to Apache
Stage 3: investigating the potential reasons above
1) bug in ramfs/https, connection_mgr, or tcp/ip/if
-- cache the first response on conn_mgr
-- use the cached response for all the following requests
-- assume that all responses are the same
-- throughput does not change
-- rule out the issue from ramfs/https
2) bug in evt component
-- prioritize the evt in the data structure
-- always add the net connection evt to the head in the list
-- throughput does not change
-- rule out the issue in evt
3) bug in interrupt handling (e.g. long latency)
a) number of interrupts
-- use unidirectional UDP to measure the number of interrupts that
can be received in Composite (about 500000 per sec)
-- with TCP, this number is lower, at 30000~50000 per sec (TCP is
bidirectional and needs responses)
b) latency
the latency is measured from the time a packet is put into the ring
buffer in the kernel to the time the packet is retrieved in the IF
component by the network thread. The average is about 30000 cycles. The
min is about 5000 cycles. However, the max is really high (200000 to
400000 cycles)
c) it is unclear whether these numbers really indicate something, since the
max is so high and the stddev is also high. We may need to measure the
latency in Linux as well
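The min/average/max/stddev figures for the ring-buffer latency can be collected without storing every sample. The sketch below uses Welford's online algorithm; the names (`lat_stats`, `lat_record`, and so on) are hypothetical, and it assumes each packet carries an enqueue timestamp in cycles that can be compared with a dequeue timestamp taken in the IF component. `cycles_to_us` assumes the cycle counter ticks at the nominal clock rate (2400 MHz for the i7 mentioned in the motivation).

```c
#include <stdint.h>

/* Running latency statistics in cycles (Welford's online algorithm). */
struct lat_stats {
	uint64_t n;
	uint64_t min, max;
	double   mean;
	double   m2;   /* sum of squared deviations from the running mean */
};

static void
lat_init(struct lat_stats *s)
{
	s->n    = 0;
	s->min  = UINT64_MAX;
	s->max  = 0;
	s->mean = 0.0;
	s->m2   = 0.0;
}

/* Record one sample: cycles between enqueue and dequeue timestamps. */
static void
lat_record(struct lat_stats *s, uint64_t enq_tsc, uint64_t deq_tsc)
{
	uint64_t lat = deq_tsc - enq_tsc;
	double   delta;

	s->n++;
	if (lat < s->min) s->min = lat;
	if (lat > s->max) s->max = lat;
	delta    = (double)lat - s->mean;
	s->mean += delta / (double)s->n;
	s->m2   += delta * ((double)lat - s->mean);
}

/* Sample variance in cycles^2; the stddev is its square root. */
static double
lat_variance(const struct lat_stats *s)
{
	return s->n > 1 ? s->m2 / (double)(s->n - 1) : 0.0;
}

/* Convert cycles to microseconds for a given clock rate in MHz. */
static double
cycles_to_us(uint64_t cycles, double cpu_mhz)
{
	return (double)cycles / cpu_mhz;
}
```

Under the 2.4GHz assumption, the 30000-cycle average is about 12.5 microseconds and the 400000-cycle maximum is about 167 microseconds. Running the same bookkeeping on the Linux side would give a directly comparable baseline.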
4) bug in lwip network stack (e.g. backlog size too small)
-- confirmed that the backlog in lwip is a u8_t, which implies a max size of 255
-- tried changing this to 8192; it improves the throughput only
when the NIC is at 100M, which results in a consistent 10000~12000 reqs/sec
-- However, this does not seem really relevant (typically the backlog is small)
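The 255 limit follows directly from the 8-bit type: a larger value stored into a `u8_t` keeps only its low 8 bits. The sketch below illustrates this with a hypothetical stand-in struct (`listen_pcb_sketch` is not lwIP's real PCB layout), which suggests that raising the backlog to 8192 also requires widening the field and every parameter that carries it.

```c
#include <stdint.h>

typedef uint8_t u8_t; /* same width as lwIP's u8_t */

/* Hypothetical stand-in for a listening PCB; only the backlog
 * field matters here. Not lwIP's actual structure. */
struct listen_pcb_sketch {
	u8_t backlog;
};

/* Store a requested backlog the way an 8-bit field would and
 * return what actually ends up in the field. */
static unsigned
stored_backlog(unsigned requested)
{
	struct listen_pcb_sketch pcb;

	pcb.backlog = (u8_t)requested; /* keeps only the low 8 bits */

	return pcb.backlog;
}
```

For example, `stored_backlog(8192)` yields 0 because 8192 is a multiple of 256, so simply passing a bigger constant without changing the type would silently shrink the backlog rather than grow it.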
5) bug in scheduler (e.g. incorrect decision about when to switch to the network thread)
-- There is a possibility that when the interrupt occurs
(e.g. whenever there are pending events), the scheduler does not
choose the network thread to run
a) measure the actual running time of the idle thread. Why is it
smaller than what Composite reports?
b) In the idle function, check whether there are still pending requests.
It actually happened 2 or 3 times that there were pending
requests while the idle thread was running
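The check in the idle function can be kept cheap by counting how often the idle loop runs while work is still queued. The sketch below assumes a simple head/tail receive ring; `rx_ring_sketch`, `idle_check`, and both functions are hypothetical names, not Composite's actual data structures.

```c
#include <stdint.h>

/* Hypothetical view of a receive ring: the producer advances head,
 * the consumer (network thread) advances tail. */
struct rx_ring_sketch {
	volatile uint32_t head;
	volatile uint32_t tail;
};

struct idle_check {
	uint64_t iterations;   /* idle-loop passes observed */
	uint64_t pending_seen; /* passes where work was still queued */
};

/* Non-zero if the ring still holds unconsumed entries. */
static int
rx_pending(const struct rx_ring_sketch *r)
{
	return r->head != r->tail;
}

/* Call once per idle-loop iteration. */
static void
idle_check_once(struct idle_check *c, const struct rx_ring_sketch *r)
{
	c->iterations++;
	if (rx_pending(r)) c->pending_seen++;
}
```

A non-zero `pending_seen` means the idle thread was chosen while the network thread had work available, which is the scheduling anomaly this item is looking for.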
6) bug somewhere else
-- Need to continue debugging in Feb