systemtap-toolkit
This is my systemtap toolkit to online analyze the complicated problem on production with heavy load. All tools are based on my favorite linux tracing/probing tool systemtap.
Any guys which want to know what the hell it is in the user space and kernel space should be to learn systemtap which is awesome tool:)
- NAME
- Description
- Requirements
- Contribute
- Thanks
- tcp-passive-syn-ack-time
- tcp-active-syn-ack-time
- tcp-retrans
- who-open-file
- who-ctxswitch-process
- syscall-connect
- sample-bt
- watch-var
- tcp-trace-packet
- ngx-req-watch
- stracelike
- redis-watch-req
- libcurl-watch-req
- pdomysql-watch-query
- io-process-top
- net-process-top
- phpredis-watch-req
- nssdns-watch-question
- phpfpm-watch-req
- swoole-redis-watch
We need systemtap and dwarf. some scripts are working on kernel space and other is working on the user space.
For kernel space, we need kernel debuginfo like kernel-debuginfo-3.10.0-327.28.3.el7.x86_64
.
For user space, we need user application debuginfo like redis-debuginfo-2.8.19-2.el7.x86_64
.
For redhat* linux version, we can install as the following:
yum install yum-utils #for debuginfo-install
yum install systemtap
yum install kernelname-devel-version
debuginfo-install kernelname-version
You can choose the ways as the following to help this project.
- To contribute to this project, clone this repo locally and commit your code on a separate branch.
- Create Github issues.
- You can reach me [email protected].
Special thanks to @brendangregg、 @agentzh and @fche. All I have learn for systemtap is from their amazing blog posts and projects:)
It's used to measure the time of syn packet to ack packet on the server side in the tcp-3-shakehands(tcpguide).
[root@localhost tmp]# ./tcp-passive-syn-ack-timee -p 80 -t 5000
Collecting tcp dport (80)...syn-ack time
interval min:197us, max:858us avg:519us, cnt:3
value |-------------------------------------------------- count
32 | 0
64 | 0
128 |@ 1
256 |@ 1
512 |@ 1
1024 | 0
2048 | 0
It's used to measure the time of syn packet to ack packet on the client side in the tcp-3-shakehands(tcpguide).
[root@localhost systemtap-toolkit]# ./tcp-active-syn-ack-time -p 80 -t 5000
Collecting tcp dport (80)...syn-ack time
dport:80 min:417us, max:542us avg:460us, cnt:3
value |-------------------------------------------------- count
64 | 0
128 | 0
256 |@@ 2
512 |@ 1
1024 | 0
2048 | 0
It's used to collecting which tcp packet being retransmit
[root@localhost systemtap-toolkit]# ./tcp-retrans
Printing tcp retransmission
10.0.2.15:49896 -> 172.17.9.41:80 state:TCP_SYN_SENT rto:0 -> 1000 ms
10.0.2.15:49896 -> 172.17.9.41:80 state:TCP_SYN_SENT rto:1000 -> 2000 ms
10.0.2.15:49896 -> 172.17.9.41:80 state:TCP_SYN_SENT rto:2000 -> 4000 ms
It's used to find who is opening the specified file
[root@localhost systemtap-toolkit]# ./who-open-file -f 123 -t 10000
Collecting who is opening filename 123
cat(13740) is opening the filename: "123"
cat(13741) is opening the filename: "123"
Tracing context switch for specified process.
[root@localhost systemtap-toolkit]# ./who-ctxswitch-process -p 6354
Collecting who is context switch 6354
[0] swapper/0 ( 0)<R> => nginx ( 6354)<R>
[0] nginx ( 6354)<S> => nginx ( 6355)<R>
[0] nginx ( 6355)<D> => nginx ( 6354)<R>
[0] nginx ( 6354)<S> => rcu_sched ( 10)<R>
[0] nginx ( 6355)<D> => nginx ( 6354)<R>
It's used to tracing syscall.connect
telnet(8062) is connecting to [email protected]:1800
telnet(8063) is connecting to [email protected]:1800
telnet(8064) is connecting to [email protected]:1800
telnet(8065) is connecting to [email protected]:1800
telnet(8066) is connecting to [email protected]:1800
telnet(8067) is connecting to [email protected]:1800
telnet(8068) is connecting to [email protected]:1800
telnet(8069) is connecting to [email protected]:1800
telnet(8070) is connecting to [email protected]:1800
It's from agentzh and be used to sampling the backtrace in the user space and kernel space.
$ ./sample-bt -p 8736 -t 5 -u > a.bt
WARNING: Tracing 8736 (/opt/nginx/sbin/nginx) in user-space only...
WARNING: Missing unwind data for module, rerun with 'stap -d stap_df60590ce8827444bfebaf5ea938b5a_11577'
WARNING: Time's up. Quitting now...(it may take a while)
WARNING: Number of errors: 0, skipped probes: 24
It's used to monitor function param changing.
[root@localhost systemtap-toolkit]# ./watch-var -f syscall.open -v filename -p 25849
WARNING: Tracing vars syscall.open filename in 25849...
a.out[25849] kernel.function("SyS_open@fs/open.c:1036").call filename: "" => ""./test""
Like tcpdump, it's used to tracing tcp packet with more detail include tcp flag.
[root@localhost systemtap-toolkit]# ./tcp-trace-packet
WARNING: tracking 0 tcp packet
1478067249998698 10.0.2.15:22 => 10.0.2.2:50627 len:92 SYN:0 ACK:1 FIN:0 RST:0 PSH:1 URG:0
1478067249998955 10.0.2.2:50627 <= 10.0.2.15:22 len:40 SYN:0 ACK:1 FIN:0 RST:0 PSH:0 URG:0
1478067250199252 10.0.2.15:22 => 10.0.2.2:50627 len:172 SYN:0 ACK:1 FIN:0 RST:0 PSH:1 URG:0
1478067250199559 10.0.2.2:50627 <= 10.0.2.15:22 len:40 SYN:0 ACK:1 FIN:0 RST:0 PSH:0 URG:0
1478067250399756 10.0.2.15:22 => 10.0.2.2:50627 len:100 SYN:0 ACK:1 FIN:0 RST:0 PSH:1 URG:0
1478067250399963 10.0.2.2:50627 <= 10.0.2.15:22 len:40 SYN:0 ACK:1 FIN:0 RST:0 PSH:0 URG:0
It tracing the userland, which can watch and filter by specified condition nginx request in real time
[root@localhost systemtap-toolkit]# ./ngx-req-watch -p 5614
WARNING: watching /opt/tengine/sbin/nginx(8521 8522 8523 8524) requests
nginx(8523) GET URI:/123?a=123 HOST:127.0.0.1 STATUS:200 FROM 127.0.0.1 FD:16 RT: 0ms
nginx(8523) GET URI:/123?a=123 HOST:127.0.0.1 STATUS:200 FROM 127.0.0.1 FD:16 RT: 0ms
nginx(8523) GET URI:/123?a=123&b=123 HOST:127.0.0.1 STATUS:200 FROM 127.0.0.1 FD:16 RT: 0ms
nginx(8523) GET URI:/123?w HOST:127.0.0.1 STATUS:200 FROM 127.0.0.1 FD:16 RT: 0ms
nginx(8523) GET URI:/123?w HOST:test STATUS:200 FROM 127.0.0.1 FD:16 RT: 0ms
nginx(8523) GET URI:/123?w=a HOST:test STATUS:200 FROM 127.0.0.1 FD:16 RT: 0ms
Like strace. But it's based on the systemtap
[root@localhost systemtap-toolkit]# ./stracelike -p 4580 -t 20000
WARNING: stracing syscall
Sat Oct 29 12:46:19 2016.094410 epoll_wait(16, 0x1e17b40, 512, 100) = 0 <0.100334>
Sat Oct 29 12:46:19 2016.194756 epoll_wait(16, 0x1e17b40, 512, 100) = 0 <0.100227>
Sat Oct 29 12:46:19 2016.295006 epoll_wait(16, 0x1e17b40, 512, 100) = 0 <0.101086>
It tracing the userland, which can watch and filter by specified condition redis request in real time
[root@localhost systemtap-toolkit]# ./redis-watch-req -p 23261
WARNING: watching /usr/bin/redis-server(23261) requests
redis-server(23261) RT:30(us) REQ: id:2 fd:5 ==> get a #-1 RES: #9
redis-server(23261) RT:23(us) REQ: id:2 fd:5 ==> set a #12 RES: #5
redis-server(23261) RT:16(us) REQ: id:2 fd:5 ==> get foo #-1 RES: #5
It traceing the userland, which can watch and filter by specified condition request for softawre which are based on the libcurl like curl
and php
.
[root@localhost systemtap-toolkit]# ./libcurl-watch-req
WARNING: Tracing libcurl (0) ...
curl(23759) URL:http://www.google.com RT:448(ms) RTCODE:0
curl(23767) URL:http://www.facebook.com/asdfasdf RT:596(ms) RTCODE:0
curl(23769) URL:https://www.facebook.com/asdfasdf RT:902(ms) RTCODE:0
It traceing the userland, which can watch and filter by specified condition request for php's pdo mysql driver.
[root@localhost systemtap-toolkit]# ./pdomysql-watch-query -l /usr/lib64/php/modules/pdo_mysql.so
Tracing pdo-mysql (0)
php-fpm(12896) 172.17.10.196:3306@root: SELECT * from person RT:0(ms) RTCODE:1
php-fpm(12896) 172.17.10.196:3306@root: SELECT * from person RT:8(ms) RTCODE:1
php-fpm(12896)172.17.10.196:3306@root: SELECT sleep(5) RT:5012(ms) RTCODE:1
It traceing the userland, which can trace the php redis request
[root@localhost systemtap-toolkit]# ./phpredis-watch-req -l /usr/lib64/php/modules/redis.s
Tracing phpredis (/usr/lib64/php/modules/redis.so)
php(17226)<zim_Redis___construct[22us]>
php(17226)<zim_Redis_connect[113us]>
php(17226)<zim_Redis_get[157us]>:*2 $3 GET $3 key
php(17226)<zim_Redis_hGet[563us]>:*3 $4 HGET $3 key $6 ffffff
php(17226)<zim_Redis_set[617us]>:*3 $3 SET $3 key $4 abcd
php(17226)<zim_Redis___destruct[12us]>
It traceing io Read|Write with the view of process(pid).
[root@localhost systemtap-toolkit]# ./io-process-top -t 1000
WARNING: Collecting IO Process Top 10 with interval of 1000ms
Process Name Read(KB) Write(KB)
redis-server(4510) 3 0
stapio(28280) 2 0
systemd-journal(442) 0 0
systemd(1) 0 0
sshd(19948) 0 0
in:imjournal(595) 0 0
It traceing net Send|Recv with the view of process(pid).
[root@localhost systemtap-toolkit]# ./net-process-top -t 5000
WARNING: Collecting Net Process Top 10 with interval of 5000ms
Process( 0) dev Send(PK) Recv(PK) Send(KB) Recv(KB)
nginx( 7266) lo 446203 0 144471 0
wrk(27496) lo 156601 0 15599 0
rcu_sched( 10) eth0 0 1 0 0
sshd( 6890) eth0 1 0 0 0
It tracing the libnss_dns for dns query.
[root@localhost systemtap-toolkit]# ./nssdns-watch-question -l /usr/lib64/libnss_dns.so. -t 100000
WARNING: Tracing libnss_dns(/usr/lib64/libnss_dns.so.2) for pid:0
curl(11786): www.google.com 57994us
curl(11788): www.facebook.com 57406us
curl(11790): www.github.com 4203477us
It tracing phpfpm request
# ./phpfpm-watch-req -l /opt/php/sbin/php-fpm
WARNING: Tracing php-fpm for pid(0)
php-fpm(9665) GET /index.php?&123123=123&f=q (208us)
php-fpm(9665) GET /index.php?&123123=123&f=q (172us)
php-fpm(9665) GET /index.php?&123123=123&f=q (154us)
php-fpm(9665) GET /index.php?&123123=123&f=q (151us)
It tracing swoole-redis write and read subroutine
./swoole-redis-watch -l /opt/php/lib/php/extensions/no-debug-non-zts-20131226/swoole.so -t 100000
WARNING: Tracing swoole.so(/opt/php/lib/php/extensions/no-debug-non-zts-20131226/swoole.so) for pid:0
php(25927) is writing for 10.200.175.90:6379 to *3
$4
hGet
$4
step
$10
attachment
len:41
10.200.175.90:6379 get reply: integer:710263830
php(29486) is writing for 10.200.175.90:6379 to *3
$4
hGet
$4
step
$10
attachment
len:41
10.200.175.90:6379 get reply: integer:709720993