SystemTap Tutorial, Part 1

This is the first of a two-part series on SystemTap, a dynamic method to monitor and trace the operations of a running Linux kernel. SystemTap is useful to systems administrators, kernel developers, support engineers, researchers and students.

“Who is doing the maximum read/write operations on my server?”
“Can I add some debug statements in the kernel without rebuilding it and rebooting the system?”

These are questions you might have asked yourself, if you are a systems administrator or a kernel developer. Let’s see what options are available in answer to these questions:

  • Tracing: Provides information while the program runs and gives a quick overview of the code flow, but produces a large volume of output. Tools like strace, ltrace and ftrace are used for tracing.
  • Profiling: Samples the system while it runs, so that we can analyse the data after the event. OProfile is used for sampling.
  • Debugging: We can set breakpoints and look at variables, memory, registers, stack traces, etc. We can debug only one program at a time, and the debugger stops the program while we inspect it. GDB and KDB are used for such debugging.

So which of these tools would you use? You’re probably thinking of using a combination of them; wouldn’t it be great to have the capabilities of all these tools combined into one? The answer to just such a wish is SystemTap!

Welcome to SystemTap!

SystemTap can monitor multiple system-wide synchronous and asynchronous events at the same time. It can do scriptable filtering and statistics collection. It’s a dynamic method of monitoring and tracing the operations of a running Linux kernel.

To instrument the running kernel, SystemTap uses Kprobes and return probes. With kernel debug information, it gets the addresses for functions and variables referenced in the script. With utrace, SystemTap supports probing user-space executables and shared libraries as well. SystemTap is, therefore, useful to systems administrators, kernel developers, support engineers, researchers and students.

Installation

To install SystemTap on Fedora, run the following commands as root:

yum install systemtap kernel-devel
debuginfo-install kernel

To use SystemTap on Ubuntu or any other distro, you need to install the systemtap package, and the debuginfo packages corresponding to the kernel you’re running.

You need to be the root user to run SystemTap scripts. Alternatively, you can add a normal user account to the stapdev and stapusr groups to allow that account to compile and run scripts; membership of stapusr alone only allows running pre-compiled modules.
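One quick way to confirm that the installation works is to run a trivial probe. The sketch below assumes a POSIX shell on a machine where stap may or may not be installed yet; the messages and the group check are illustrative and not part of SystemTap itself.

```shell
#!/bin/sh
# Sanity check for a SystemTap installation (a sketch; messages are illustrative).

if command -v stap >/dev/null 2>&1; then
    # Fire on the first VFS read seen system-wide, print one line, then exit.
    # Needs root (or suitable group membership) and matching kernel debuginfo.
    stap -v -e 'probe vfs.read { printf("read performed\n"); exit() }' \
        || echo "stap failed: check privileges and kernel debuginfo"
else
    echo "stap not found: install the systemtap package first"
fi

# Report which SystemTap groups, if any, the current user belongs to.
stap_groups=$(id -nG | tr ' ' '\n' | grep -E '^stap(dev|usr)$' | tr '\n' ' ')
echo "SystemTap groups for this user: ${stap_groups:-none}"
```

If all five passes complete and "read performed" is printed, the kernel, the debuginfo packages and the SystemTap runtime match each other.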

How does it work?

To understand how SystemTap works, let’s run a script in verbose mode (with the -v switch). The stap program is the front-end to SystemTap. The -e switch instructs it to execute the script in the following argument:

$ stap -v -e 'probe syscall.read {printf("syscall %s arguments %s \n", name, argstr); exit()}'
Pass 1: parsed user script and 65 library script(s) using 83596virt/20428res/2412shr kb, in 150usr/10sys/249real ms.
Pass 2: analyzed script: 1 probe(s), 4 function(s), 0 embed(s), 0 global(s) using 216260virt/115660res/73964shr kb, in 560usr/20sys/946real ms.
Pass 3: translated to C into "/tmp/stapUGVeZi/stap_b40c8268c87acc683f75ded62a52ee66_2113.c" using 216260virt/117180res/75484shr kb, in 320usr/40sys/1014real ms.
Pass 4: compiled C into "stap_b40c8268c87acc683f75ded62a52ee66_2113.ko" in 3010usr/1210sys/12818real ms.
Pass 5: starting run.
syscall read arguments 4, 0x00007fffa773b4c0, 8196
Pass 5: run completed in 20usr/60sys/174real ms.

Let’s see what happened at each of the passes mentioned:

  • Passes 1 and 2: The script we want to run is parsed, and the code is checked for semantic and syntactic errors. Any tapset reference is imported. Debug data (provided via debuginfo packages) is read to find the addresses for functions and variables referenced in the script.
  • Pass 3: The script is translated into C code.
  • Pass 4: The translated C code is compiled to create a kernel module.
  • Pass 5: The compiled module is inserted into the running kernel.

Once the module is loaded, probes are inserted at proper locations. From now on, whenever a probe is hit, the handler for that probe is called.
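Since the passes are distinct steps, stap can be told to stop early. As a sketch, the commands below use the -p option to stop after pass 4 and the -m option to name the module, then load the resulting module separately with staprun. The file and module names (read_demo) are made up for illustration.

```shell
#!/bin/sh
# Stop after pass 4, then run the compiled module by hand (illustrative names).

# Save the one-line script from the earlier example to a file.
cat > read_demo.stp <<'EOF'
probe syscall.read {
        printf("syscall %s arguments %s\n", name, argstr)
        exit()
}
EOF

if command -v stap >/dev/null 2>&1; then
    # Passes 1-4: parse, analyze, translate to C, compile into read_demo.ko.
    # Pass 5 by hand: staprun loads the module into the running kernel.
    stap -p4 -m read_demo read_demo.stp && staprun ./read_demo.ko \
        || echo "build or run failed: check privileges and kernel debuginfo"
else
    echo "stap not found; script saved as read_demo.stp"
fi
```

Splitting the work this way can be handy when the module is built on one machine and run on another that has the same kernel but no debuginfo installed.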

The basic syntax we used in our one-line script was to write a probe for an event, and the handler to run when that event occurred:

probe <event> { handler }

In this syntax:

  • event is one of kernel.function, process.statement, timer.ms, begin, end, or a (tapset) alias. For more information, look at the man page for stapprobes.
  • handler can have:
    • filtering/conditionals (if, next)
    • control structures (foreach, while)

In the script, you don’t need to declare the type of a variable; it is inferred from the context. To make our life easier, helper functions like pid, execname, log, etc, are predefined. Look at the language reference guide for more information. If you have installed the package, you can find it at /usr/share/doc/systemtap-<version>/langref.pdf.
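To make the syntax concrete, here is a small sketch that combines several event types with a conditional, the next statement and a few predefined helper functions. The script name (read_hits.stp), the variable hits and the two-second window are arbitrary choices for illustration.

```shell
#!/bin/sh
# Write a small multi-probe script and run it if stap is available.

cat > read_hits.stp <<'EOF'
global hits     # no type declaration: inferred as an integer from its use

probe begin {
        log("counting read() calls for 2 seconds")
}

probe syscall.read {
        if (execname() == "stapio")
                next            # skip SystemTap's own process
        hits++
        if (hits <= 3)          # print only the first few callers
                printf("%s[%d] read(%s)\n", execname(), pid(), argstr)
}

probe timer.ms(2000) {
        exit()                  # ends the session; the end probe runs next
}

probe end {
        printf("total read() calls seen: %d\n", hits)
}
EOF

if command -v stap >/dev/null 2>&1; then
    stap read_hits.stp || echo "stap failed: check privileges and kernel debuginfo"
else
    echo "stap not found; script saved as read_hits.stp"
fi
```

The begin and end probes run once each, at the start and end of the session, while the syscall.read and timer.ms handlers run every time their event occurs.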

How to run stap

The stap program can be invoked with multiple syntaxes:

stap -e '<script>' [-c <target program>]
stap script.stp [-c <target program>]
stap -l '<event*>'
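As a rough illustration of these forms, the commands below list probe points with -l, list them together with the variables they make available with -L, and trace a single command with -c. The choice of ls / as the target program is arbitrary; target() returns the PID of the program started by -c.

```shell
#!/bin/sh
# Illustrative invocations; the traced command (ls /) is an arbitrary choice.

# Print the arguments of read() calls made by the target program only.
script='probe syscall.read { if (pid() == target()) printf("read(%s)\n", argstr) }'

if command -v stap >/dev/null 2>&1; then
    # List probe points whose names match the wildcard.
    stap -l 'syscall.re*' || echo "listing failed: check kernel debuginfo"
    # Same, but also show the variables each probe point provides.
    stap -L 'syscall.read' || echo "listing failed: check kernel debuginfo"
    # Run the script for the lifetime of the target command.
    stap -e "$script" -c 'ls /' || echo "stap failed: check privileges"
else
    echo "stap not found"
fi
```

With -c, stap starts the probes, runs the command and exits when the command finishes, so no explicit exit() is needed in the script.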

Tapset libraries

In the example shown earlier, after probing the read system call, we printed the name of the system call and the arguments passed to it, using the variables name and argstr. This was possible because one of the tapset libraries, /usr/share/systemtap/tapset/syscalls2.stp, defines the following:

probe syscall.read = kernel.function("SyS_read").call !,
                   kernel.function("sys_read").call
{
        name = "read"
        fd = $fd
        buf_uaddr = $buf
        count = $count
        argstr = sprintf("%d, %p, %d", $fd, $buf, $count)
}

Tapsets provide an abstraction over common probe points, and define functions that you can use in your scripts. Because they contain probe aliases rather than probes, tapsets are not runnable by themselves.
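The same alias mechanism is available in ordinary scripts. The sketch below defines a hypothetical alias, my_reads, on top of syscall.read and then probes it; the alias name, the variable who and the file name are invented for illustration and do not exist in any shipped tapset.

```shell
#!/bin/sh
# Define a probe alias in a script and use it (all names are illustrative).

cat > my_reads.stp <<'EOF'
# Alias: statements in this block run before the handler of any probe
# that uses my_reads, so the variable who is already set there.
probe my_reads = syscall.read {
        who = sprintf("%s[%d]", execname(), pid())
}

# An actual probe on the alias; argstr comes from the syscall.read tapset.
probe my_reads {
        printf("%s called read(%s)\n", who, argstr)
        exit()
}
EOF

if command -v stap >/dev/null 2>&1; then
    stap my_reads.stp || echo "stap failed: check privileges and kernel debuginfo"
else
    echo "stap not found; script saved as my_reads.stp"
fi
```

The alias on its own does nothing; only the second probe block attaches a handler to the event.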

Examples

$ cat syscount.stp
global syscalls
probe syscall.* { syscalls[name] += 1 }
probe timer.s(10) {
        foreach(n in syscalls- limit 5)
                printf("%s = %d\n", n, syscalls[n])
        delete syscalls
}

Here we have used an associative array, syscalls. An associative array is a collection of unique keys, each of which has a value associated with it. Here, the name of each system call is a unique key into the array. Whenever a system call is made, we increment the value of the element that corresponds to the system call name. Every 10 seconds, we print the five most frequent system calls (the - suffix in the foreach statement sorts by value in descending order, and limit 5 stops after five entries) and then clear the array.

$ stap syscount.stp
read = 116
poll = 55
ppoll = 49
setitimer = 24
writev = 22

Let’s look at another script, with which we want to get the name and PID of the processes that make the most system calls. We also want to exclude the SystemTap process that launches the script (stapio) from consideration.

$ cat syscount_per_process.stp
global syscalls
probe syscall.* {
        if (execname() == "stapio")
                next
        syscalls[execname(), pid()] += 1
}
probe timer.s(10) {
        foreach([procname, id] in syscalls- limit 5)
                printf("%s[%d] = %d\n", procname, id, syscalls[procname, id])
        delete syscalls
}

(To immediately return from a probe handler, we use the next statement.)

Running the script yields the following:

$ stap syscount_per_process.stp
hald-addon-stor[1074] = 30
sendmail[1157] = 14
rtkit-daemon[1387] = 8
gdm-simple-gree[1374] = 8
gnome-power-man[1370] = 7
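A variation on the same idea is to count system calls for one command only, instead of the whole system. The sketch below assumes the command is started with the -c option, filters on target() and prints the totals from an end probe; the file name syscount_target.stp and the traced command (ls /) are arbitrary.

```shell
#!/bin/sh
# Count system calls made by a single command (illustrative names).

cat > syscount_target.stp <<'EOF'
global syscalls

probe syscall.* {
        if (pid() != target())
                next            # ignore every process except the target
        syscalls[name] += 1
}

# Runs once, when the target command has finished and stap shuts down.
probe end {
        foreach (n in syscalls- limit 5)
                printf("%s = %d\n", n, syscalls[n])
}
EOF

if command -v stap >/dev/null 2>&1; then
    stap syscount_target.stp -c 'ls /' \
        || echo "stap failed: check privileges and kernel debuginfo"
else
    echo "stap not found; script saved as syscount_target.stp"
fi
```

Because the report is printed from the end probe, no timer is needed: the session lasts exactly as long as the target command.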

We can do other interesting stuff like aggregation, getting a call graph, and even modifying a kernel variable in the running kernel. We will cover this in next month’s issue.

All published articles are released under Creative Commons Attribution-NonCommercial 3.0 Unported License, unless otherwise noted.