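SystemTap is a Linux tracing and probing tool.
Official site: [https://sourceware.org/systemtap/](https://sourceware.org/systemtap/)

### Probe points

A SystemTap script consists mainly of probe points and their handlers. The sections below list the probe points that are available.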
The essential idea behind a systemtap script is to name events, and to give them handlers.
Systemtap works by translating the script to C, running the system C compiler to create a kernel module from that.
When the module is loaded, it activates all the probed events by hooking into the kernel.

**(1) Where to probe**
Built-in events (probe point syntax and semantics):
begin: The startup of the systemtap session.
end: The end of the systemtap session.
kernel.function("sys_open"): The entry to the function named sys_open in the kernel.
syscall.close.return: The return from the close system call.
module("ext3").statement(0xdeadbeef): The addressed instruction in the ext3 filesystem driver.
timer.ms(200): A timer that fires every 200 milliseconds.
timer.jiffies(200): A timer that fires every 200 jiffies.
timer.profile: A timer that fires periodically on every CPU.
perf.hw.cache_misses: A particular number of CPU cache misses have occurred.
procfs("status").read: A process trying to read a synthetic file.
process("a.out").statement("*@main.c:200"): Line 200 of the a.out program.
For more information, see the stapprobes manual page:
[https://sourceware.org/systemtap/man/stapprobes.3stap.html](https://sourceware.org/systemtap/man/stapprobes.3stap.html)
[http://linux.die.net/man/5/stapprobes](http://linux.die.net/man/5/stapprobes)
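As a rough illustration of how several of these probe points can be combined in one script, here is a minimal sketch. The global name closes and the 5-second run time are arbitrary choices for the example, and the syscall.* probe assumes the syscall tapset is usable on the running kernel.

```systemtap
global closes   # hypothetical counter name, used only in this example

probe begin {
  printf("session started\n")
}

# Count every return from the close system call.
probe syscall.close.return {
  closes++
}

# Report the running total every 200 milliseconds.
probe timer.ms(200) {
  printf("close() has returned %d times so far\n", closes)
}

# End the session after 5 seconds.
probe timer.s(5) {
  exit()
}

probe end {
  printf("session ended\n")
}
```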

**(2) What to print**
Systemtap provides a variety of contextual data, ready for formatting.
They usually appear as function calls within the handler.
tid(): The id of the current thread.
pid(): The process (task group) id of the current thread.
uid(): The id of the current user.
execname(): The name of the current process.
cpu(): The current cpu number.
gettimeofday_s(): Number of seconds since epoch.
get_cycles(): Snapshot of hardware cycle counter.
pp(): A string describing the probe point being currently handled.
probefunc(): If known, the name of the function in which this probe was placed.
$$vars: If available, a pretty-printed listing of all local variables in scope.
print_backtrace(): If possible, print a kernel backtrace.
print_ubacktrace(): If possible, print a user-space backtrace.
$$parms: A string listing the function parameters.
$$return: A string giving the function return value (in return probes).
thread_indent(): A very useful function from the tapset library. Its output contains:
A timestamp (number of microseconds since the initial indentation for the thread)
A process name and the thread id itself.
For more information, see the stapfuncs manual page:
[https://sourceware.org/systemtap/man/stapfuncs.3stap.html](https://sourceware.org/systemtap/man/stapfuncs.3stap.html)
[http://linux.die.net/man/5/stapfuncs](http://linux.die.net/man/5/stapfuncs)
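The sketch below, modelled on the socket tracing example in the SystemTap tutorial, shows how a few of these functions might be used together. It assumes that kernel debuginfo is installed and that the running kernel was built from a tree containing net/socket.c; the 5-second run time is arbitrary.

```systemtap
# On entry, increase the indentation for this thread and print the caller's context.
probe kernel.function("*@net/socket.c").call {
  printf("%s -> %s (uid %d, cpu %d)\n",
         thread_indent(1), probefunc(), uid(), cpu())
}

# On return, decrease the indentation again.
probe kernel.function("*@net/socket.c").return {
  printf("%s <- %s\n", thread_indent(-1), probefunc())
}

probe timer.s(5) {
  exit()
}
```

Each output line starts with the timestamp, process name, and thread id produced by thread_indent(), followed by indentation that reflects the call depth.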

**(3) Built-in probe point types (DWARF probes)**
These built-in probe points are available once debuginfo is installed.
This family of probe points uses symbolic debugging information for the target kernel or module,
as may be found in executables that have not been stripped, or in the separate debuginfo packages.

Built-in probe point types currently supported:
kernel.function(PATTERN) // places a probe at function entry; function parameters are available as $PARM
kernel.function(PATTERN).return // places a probe at function return; the return value is available as $return, along with the (possibly modified) parameters $PARM
kernel.function(PATTERN).call // selects only non-inlined functions (the complement of .inline)
kernel.function(PATTERN).inline // selects only matching inlined functions; inlined functions cannot use .return
kernel.function(PATTERN).exported // selects only exported functions

module(MPATTERN).function(PATTERN)
module(MPATTERN).function(PATTERN).return
module(MPATTERN).function(PATTERN).call
module(MPATTERN).function(PATTERN).inline

kernel.statement(PATTERN)
kernel.statement(ADDRESS).absolute
module(MPATTERN).statement(PATTERN)

Examples:
# Refers to all kernel functions with "init" or "exit" in the name
kernel.function("*init*"), kernel.function("*exit*")
# Refers to any functions within the "kernel/time.c" file that span line 240
kernel.function("*@kernel/time.c:240")
# Refers to all functions in the ext3 module
module("ext3").function("*")
# Refers to the statement at line 296 within the kernel/time.c file
kernel.statement("*@kernel/time.c:296")
# Refers to the statement at line bio_init+3 within the fs/bio.c file
kernel.statement("bio_init@fs/bio.c+3")
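
Some source-level variables that are visible in the compilation unit, such as function parameters, local variables, or global variables, are also visible in probe handlers.
In a script, refer to them by prefixing the variable name with $.

There are two styles of variable reference:
$varname // refers to the variable varname
$var->field // refers to a member of a structure
$var[N] // refers to an array element
&$var // the address of the variable

@var("varname") // refers to the variable varname
@var("varname@src/file.c") // refers to the global variable varname as defined when src/file.c was compiled
@var("varname@file.c")->field // refers to a member of a structure
@var("var@file.c")[N] // refers to an array element
&@var("var@file.c") // the address of the variable

$var$ // a string with the values of all basic-type fields of var, excluding nested complex types
$var$$ // a string that also includes the values of nested data types

$$vars // a string containing all function parameters and local variables
$$locals // a string containing all local variables
$$parms // a string containing all function parameters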
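To make the target-variable notation more concrete, here is a small sketch that probes vfs_read. It assumes kernel debuginfo is installed and that vfs_read in the running kernel has parameters named file and count (true for many kernel versions, but not guaranteed); the 2-second run time is arbitrary.

```systemtap
probe kernel.function("vfs_read") {
  # $count is a plain parameter; $file->f_flags follows a pointer to a struct member.
  # $$parms expands to a string listing all parameters.
  printf("%s: count=%d flags=0x%x parms: %s\n",
         execname(), $count, $file->f_flags, $$parms)
}

probe kernel.function("vfs_read").return {
  # $return is only available in .return probes.
  printf("%s: vfs_read returned %d\n", execname(), $return)
}

probe timer.s(2) {
  exit()
}
```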

**(4) DWARF-less probing**
When debuginfo is not installed, the built-in DWARF probe points cannot be used.
In the absence of debugging information, you can still use the kprobe family of probes to examine the
entry and exit points of kernel and module functions. You cannot look up the arguments or local variables
of a function by name using these probes.

SystemTap still provides a way to access the arguments:
when a function is stopped at its entry point because it is being probed, its arguments can be referred to by number.
For example, suppose the probed function is declared as:
ssize_t sys_read(unsigned int fd, char __user *buf, size_t count)
The values of fd, buf, and count can then be obtained with uint_arg(1), pointer_arg(2), and ulong_arg(3) respectively.

Although these probe points do not support $return, you can call returnval() to read the register in which
the function's return value is normally stored, or call returnstr() to get the return value in string form.
Inside the handler, you can call register("regname") to get the value of a particular CPU register at the time the probe was hit.

Syntax (wildcards are not allowed):
kprobe.function(FUNCTION)
kprobe.function(FUNCTION).return
kprobe.module(NAME).function(FUNCTION)
kprobe.module(NAME).function(FUNCTION).return
kprobe.statement(ADDRESS).absolute
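The numbered-argument approach described above might look like the following sketch. It assumes the kernel exports a symbol named sys_read with the signature shown earlier; on newer kernels the symbol name and the calling convention of system call entry points differ, so treat the function name as a placeholder.

```systemtap
probe kprobe.function("sys_read") {
  # Arguments are fetched by position, since no debuginfo is available.
  printf("%s: fd=%d buf=0x%x count=%d\n",
         execname(), uint_arg(1), pointer_arg(2), ulong_arg(3))
}

probe kprobe.function("sys_read").return {
  # returnval() reads the register that normally holds the return value.
  printf("%s: sys_read returned %d\n", execname(), returnval())
}

probe timer.s(2) {
  exit()
}
```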
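
### Syntax

**(1) Basic format**
probe probe-point probe-handler, that is: probe PROBE-POINT { statements }
The probe keyword names a probe point and the handler that runs when that probe point is hit.

Statements need no terminator; a semicolon ";" is the null statement. Handler and function bodies are enclosed in {}.
Several comment styles are allowed:
Shell-style: #
C-style: /* */
C++-style: //

The next statement returns immediately from the probe handler.
The string concatenation operator is ".", and the comparison operator is "==".
For example, "hello" . "world" yields "helloworld".

Variables do not need to be declared in advance or given a type; the type is inferred from use.
Converting between strings and numbers:
s = sprint(123) # s becomes the string "123"

Variables defined in a probe handler are local and cannot be used in other probe handlers.
The global keyword declares global variables.
Because of possible concurrency (multiple probe handlers running on different CPUs), each global variable
used by a probe is automatically read-locked or write-locked while the handler is running.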
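The basic syntax rules can be seen together in a short sketch such as the one below. The variable names reads, who, and n are made up for the example, the syscall.read probe assumes the syscall tapset works on the running kernel, and strtol() is the tapset function assumed here for the string-to-number direction.

```systemtap
global reads   # global: visible in every handler

probe syscall.read {
  # Shell-style comment
  if (execname() == "stapio") next   // skip SystemTap's own helper process
  reads++
  who = execname() . ":" . sprint(pid())   /* local string, built by concatenation */
  n = strtol("123", 10)                    /* string to number, base 10 */
  if (reads % 1000 == 0)
    printf("%d reads so far, latest from %s (n=%d)\n", reads, who, n)
}

probe timer.s(5) {
  exit()
}
```

Note that who and n are local to the syscall.read handler, while reads keeps its value across handler invocations.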

**(2) Functions**
function name(param1, param2)
{
    statements
    return ret
}
Recursion is possible, up to a nesting depth limit.
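A minimal sketch of a recursive function follows; the name factorial is just an example. The recursion depth of 5 stays below the default nesting limit.

```systemtap
function factorial(n) {
  if (n <= 1) return 1
  return n * factorial(n - 1)
}

probe begin {
  printf("5! = %d\n", factorial(5))   # prints "5! = 120"
  exit()
}
```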

**(3) Conditional statements**
if (EXPR) STATEMENT [else STATEMENT]

**(4) Loops**
while (EXPR) STATEMENT
for (A; B; C) STATEMENT
break exits a loop early, and continue skips the rest of the current iteration.
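The control-flow statements above can be tried without touching the kernel by putting them in a begin probe, as in this small sketch (the variable names are arbitrary):

```systemtap
probe begin {
  total = 0
  for (i = 1; i <= 10; i++) {
    if (i % 2 == 0) continue   # skip even numbers
    if (i > 7) break           # leave the loop once i reaches 9
    total += i
  }
  printf("total = %d\n", total)   # 1 + 3 + 5 + 7, prints "total = 16"
  exit()
}
```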

**(5) Context variables**
Context variables allow access to the probe point context. To know which variables are likely to be available, you will need to
be familiar with the kernel source you are probing.
You can use stap -L PROBEPOINT to enumerate the variables available there.
Two functions, user_string and kernel_string, can copy char * target variables into systemtap strings.
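As an illustration of copying a char * target variable into a script string, consider the sketch below. It assumes kernel debuginfo is installed and that the running kernel has a probeable do_sys_open function with a user-space filename parameter; on some kernels that function is inlined or absent, so the probe point is only an example.

```systemtap
# To list the variables available at this probe point first, one could run:
#   stap -L 'kernel.function("do_sys_open")'
probe kernel.function("do_sys_open") {
  # $filename points into user space, so it is copied with user_string().
  printf("%s opens %s\n", execname(), user_string($filename))
}

probe timer.s(2) {
  exit()
}
```

For a pointer into kernel memory, kernel_string() would be used in the same way.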

**(6) Associative arrays**
These arrays are implemented as hash tables with a maximum size that is fixed at startup.
Because they are too large to be created dynamically for individual probe handler runs, they must be
declared as global.

An array can have up to 9 indexes, separated by commas; each index can be a number or a string.

For example: global array[400]
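
6.1 Arrays
Several indexes can be used together to locate an array element.
Element values can be of three types: numbers, strings, and statistics.
If the array size is not specified, it defaults to the maximum, MAXMAPENTRIES (2048).

For example:
foo[4, "hello"]++
processusage[uid(), execname()]++

6.2 Testing whether an element exists
For example: if ([4, "hello"] in foo) { }

6.3 Deleting elements
For example:
delete times[tid()] # deletion of a single element
delete times # deletion of all elements

6.4 Deleting variables
For example: delete var
If var is a number, it is reset to 0; if var is a string, it is reset to "";
if var is a statistics variable, its set of accumulated samples is cleared.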

6.5 Iteration
Use the foreach keyword. break and continue are allowed, but the array must not be modified while it is being iterated.
foreach (x = [a, b] in foo) { fuss_with(x) } # simple loop in arbitrary sequence
foreach ([a, b] in foo+ limit 5) {} # loop in increasing sequence of value, stop after 5
foreach ([a-, b] in foo) {} # loop in decreasing sequence of first key
# Print the first 10 tuples and values in the array in decreasing sequence
foreach (v = [i, j] in foo- limit 10)
    printf("foo[%d, %s] = %d\n", i, j, v)

Three forms of iteration:
foreach (VAR in ARRAY) STMT // for a single-index array; VAR receives each index (key)
foreach ([VAR1, VAR2, ...] in ARRAY) STMT // iterate over index tuples
foreach (VAR = [VAR1, VAR2, ...] in ARRAY) STMT // get both the element value and its indexes

6.6 Wrapping
A % suffix means that when the array is full, new elements are allowed to overwrite old ones.
global ARRAY%[<size>], ARRAY2%
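Putting the array operations together, a script along the following lines could count read system calls per process and print the busiest ones. The array name reads, its size of 1024, and the 5-second interval are arbitrary choices for the example, and the syscall.read probe assumes the syscall tapset works on the running kernel.

```systemtap
global reads[1024]   # room for at most 1024 (pid, name) pairs

probe syscall.read {
  reads[pid(), execname()]++
}

probe timer.s(5) {
  # Ten busiest entries, in decreasing order of count.
  foreach (n = [p, cmd] in reads- limit 10)
    printf("%-16s %6d %8d\n", cmd, p, n)
  delete reads   # clear all elements (done after the loop, not inside it)
  exit()
}
```

Declaring the array as reads%[1024] instead would let new entries overwrite old ones once the array is full, rather than raising an error.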
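
**(7) Statistics aggregates**
Statistics aggregates are a data type specific to SystemTap, used to accumulate statistics in global variables.
The operator is "<<<".
For example: g_value <<< b # adds b as a sample; loosely comparable to g_value += b in C

These variables can only be read through dedicated extractor functions, mainly:
@count(g_value): the number of samples added
@sum(g_value): the sum of all samples
@min(g_value): the smallest sample
@max(g_value): the largest sample
@avg(g_value): the average of all samples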
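The extractor functions can be checked with a self-contained sketch that feeds known samples into an aggregate. The name samples is arbitrary, and the squares of 1 to 10 are used only because their statistics are easy to verify by hand.

```systemtap
global samples

probe begin {
  # Add the squares 1, 4, 9, ..., 100 as samples.
  for (i = 1; i <= 10; i++)
    samples <<< i * i
  # Prints "count=10 sum=385 min=1 max=100 avg=38" (the average is an integer).
  printf("count=%d sum=%d min=%d max=%d avg=%d\n",
         @count(samples), @sum(samples), @min(samples),
         @max(samples), @avg(samples))
  exit()
}
```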

**(8) Language safety**
8.1 Time limits
Probe handlers are limited in how long they may run. A handler that runs too long is aborted with an error by the checks that SystemTap adds when it translates the script to C.
By default each probe handler may execute at most 1000 statements; this limit is configurable.

8.2 Dynamic memory allocation
No dynamic memory allocation whatsoever takes place during the execution of probe handlers.
Arrays, function contexts, and buffers are allocated during initialization.

8.3 Locking
Access to a global variable takes a lock, which prevents concurrent modification.
If multiple probes seek conflicting locks on the same global variables, one or more of them will time out and be
aborted. Such events are tallied as skipped probes, and a count is displayed at session end.

8.4 Bugs
A few highly timing-sensitive places in the kernel (context switching, interrupt handling) should not be used as probe points.
Putting probes indiscriminately into unusually sensitive parts of the kernel (low level context switching, interrupt
dispatching) has reportedly caused crashes in the past. We are fixing these bugs as they are found, and
constructing a probe "blacklist", but it is not complete.

8.5 Changing the limits
Some of the default limits can be changed with the -D option.
-D NM=VAL emits a macro definition into the generated C code.

**MAXNESTING** - The maximum number of recursive function call levels. The default is 10.
**MAXSTRINGLEN** - The maximum length of strings. The default is 256 bytes for 32 bit machines and 512 bytes for all other machines.
**MAXTRYLOCK** - The maximum number of iterations to wait for locks on global variables before declaring possible deadlock and skipping the probe. The default is 1000.
**MAXACTION** - The maximum number of statements to execute during any single probe hit. The default is 1000.
**MAXMAPENTRIES** - The maximum number of rows in an array if the array size is not specified explicitly when declared. The default is 2048.
**MAXERRORS** - The maximum number of soft errors before an exit is triggered. The default is 0.
**MAXSKIPPED** - The maximum number of skipped reentrant probes before an exit is triggered. The default is 100.
**MINSTACKSPACE** - The minimum number of free kernel stack bytes required in order to run a probe handler. This number should be large enough for the probe handler's own needs, plus a safety margin. The default is 1024.
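To see one of these limits in action, a handler can deliberately execute more statements than the default MAXACTION allows. In the sketch below the file name loop.stp and the value 100000 are hypothetical choices; with the default limit the handler is expected to be aborted with a MAXACTION error, while raising the limit should let it finish.

```systemtap
# Run as: stap -D MAXACTION=100000 loop.stp
probe begin {
  # 5000 iterations exceed the default budget of 1000 statements per probe hit.
  for (i = 0; i < 5000; i++)
    total += i
  printf("total = %d\n", total)   # sum of 0..4999 is 12497500
  exit()
}
```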
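
**(9) Command-line arguments**
Two kinds of arguments can be passed from the command line: strings and numbers.
9.1 Numbers
$1 ... $<N> refer to the command-line arguments as numbers in the script.
9.2 Strings
@1 ... @<N> refer to the command-line arguments as strings in the script.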
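A minimal sketch of both argument styles follows. The file name args.stp and the argument values are made up for the example.

```systemtap
# Run as: stap args.stp 42 hello
probe begin {
  # $1 is substituted as a number, @2 as a string literal.
  printf("number: %d, string: %s\n", $1, @2)   # prints "number: 42, string: hello"
  exit()
}
```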
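
**(10) Conditional compilation**
%( CONDITION %? TRUE-TOKENS %)
%( CONDITION %? TRUE-TOKENS %: FALSE-TOKENS %)

The condition can be:
@defined($var) // whether the target variable is available
kernel_v > "2.6.37" // compares the kernel version
kernel_vr // compares the kernel version, including the release suffix
arch == "x86_64" // the CPU architecture

Kernel CONFIG options (build configuration):
%( CONFIG_UTRACE == "y" %?
    do something
%)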
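The conditional forms above could be used as in the following sketch, which selects different statements at translation time depending on the kernel version and architecture. The messages are only illustrative.

```systemtap
probe begin {
  # Only one of the two printf statements is kept in the translated script.
  %( kernel_v >= "2.6.37" %?
    printf("kernel is 2.6.37 or newer\n")
  %:
    printf("kernel is older than 2.6.37\n")
  %)
  # Without a %: branch, nothing is emitted when the condition is false.
  %( arch == "x86_64" %? printf("running on x86_64\n") %)
  exit()
}
```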