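# AWK tutorial

> Original: [https://zetcode.com/lang/awk/](https://zetcode.com/lang/awk/)

This is an AWK tutorial. It covers the basics of the AWK tool.

## AWK

AWK is a pattern scanning and processing language. An AWK program consists of a set of actions to be taken against streams of textual data. AWK makes extensive use of regular expressions. It is a standard feature of most Unix-like operating systems.

AWK was created at Bell Labs in 1977. Its name is derived from the surnames of its authors: Alfred Aho, Peter Weinberger, and Brian Kernighan.

## AWK program

An AWK program consists of a sequence of pattern-action statements and optional function definitions. It processes text files. AWK is a line-oriented language. It divides a file into lines called records. Each line is broken up into a sequence of fields. The fields are accessed through special variables: `$1` reads the first field, `$2` the second field, and so on. The `$0` variable refers to the whole record.

The structure of an AWK program has the following form:

```sh
pattern { action }
```

The pattern is a test that is performed on each record. If the condition is met, the action is performed. Either the pattern or the action can be omitted, but not both. The default pattern matches every line, and the default action is to print the record.

```sh
awk -f program-file [file-list]
awk program [file-list]
```

An AWK program can be run in two basic ways: a) the program is read from a separate file, whose name follows the `-f` option; b) the program is given on the command line, enclosed in quotes.

## AWK one-liners

AWK one-liners are simple one-shot programs run from the command line. Let us have the following text file:

```sh
$ cat mywords
brown
tree
craftsmanship
book
beautiful
existence
ministerial
computer
town
```

We want to print all words in the `mywords` file that are longer than five characters.

```sh
$ awk 'length($1) > 5 {print}' mywords
craftsmanship
beautiful
existence
ministerial
computer
```

The AWK program is placed between two single quote characters. The pattern comes first; we specify that the length of the first field is greater than five. The `length()` function returns the length of a string. The `$1` variable refers to the first field of the record; in our case, each record has only one field. The action is placed between curly brackets.

```sh
$ awk 'length($1) > 5' mywords
craftsmanship
beautiful
existence
ministerial
computer
```

As we have specified earlier, the action can be omitted. In that case, the default action is performed: printing the whole record.

Regular expressions are often applied to AWK fields. The `~` is the regular expression match operator. It checks whether a string matches the provided regular expression.

```sh
$ awk '$1 ~ /^[bc]/ {print $1}' mywords
brown
craftsmanship
book
beautiful
computer
```

In this program, we print all the words that begin with the `b` or `c` character. The regular expression is placed between two slash characters. Inside a bracket expression the characters are listed without separators; a comma would itself be treated as one of the characters to match.

AWK provides important built-in variables. For instance, `NR` is a built-in variable that holds the number of the record currently being processed.

```sh
$ awk 'NR % 2 == 0 {print}' mywords
tree
book
existence
computer
```

The above program prints every second record of the `mywords` file. Taking `NR` modulo 2 selects the even-numbered lines.

Say we want to print the lines of the file together with their line numbers.

```sh
$ awk '{print NR, $0}' mywords
1 brown
2 tree
3 craftsmanship
4 book
5 beautiful
6 existence
7 ministerial
8 computer
9 town
```

Again, we use the `NR` variable. We omit the pattern; therefore, the action is performed on every line. The `$0` variable refers to the whole record.

For the following example, we have this C source file.

```sh
$ cat source.c
1  #include <stdio.h>
2
3  int main(void) {
4
5      char *countries[5] = { "Germany", "Slovakia", "Poland",
6              "China", "Hungary" };
7
8      size_t len = sizeof(countries) / sizeof(*countries);
9
10     for (size_t i=0; i < len; i++) {
11
12         printf("%s\n", countries[i]);
13     }
14 }
```

It happens that we have copied the source together with line numbers. Our task is to remove the numbers from the text.

```sh
$ awk '{print substr($0, 4)}' source.c
#include <stdio.h>

int main(void) {

    char *countries[5] = { "Germany", "Slovakia", "Poland",
            "China", "Hungary" };

    size_t len = sizeof(countries) / sizeof(*countries);

    for (size_t i=0; i < len; i++) {

        printf("%s\n", countries[i]);
    }
}
```

We use the `substr()` function. It returns a substring of the given string. We apply the function to each line, skipping the first three characters. In other words, we print each record from the fourth character to its end.

## BEGIN and END patterns

`BEGIN` and `END` are special patterns that are executed before and after all records have been read. The two keywords are followed by curly brackets in which we specify the statements to be executed.

We have the following two files:

```sh
$ cat mywords;
brown
tree
craftsmanship
book
beautiful
existence
ministerial
computer
town
$ cat mywords2;
pleasant
curly
storm
hering
immune
```

We want to know the total number of lines in these two files.

```sh
$ awk 'END {print NR}' mywords mywords2
14
```

We pass two files to the AWK program. AWK processes the file names given on the command line in order. The block following the `END` keyword is executed at the end of the program; we print the `NR` variable, which holds the number of the last processed line.

```sh
$ awk 'BEGIN {srand()} {lines[NR] = $0} END { r=int(rand()*NR + 1); print lines[r]}' mywords
tree
```

The above program prints a random line from the `mywords` file. The `srand()` function seeds the random number generator. It has to be called only once. In the main part of the program, we store the current record in the `lines` array. At the end, we compute a random number between 1 and `NR` and print the randomly chosen line from the array.

## The match function

`match()` is a built-in string manipulation function. It tests whether the given string contains a regular expression pattern. The first parameter is the string, the second is the regular expression pattern. It is similar to the `~` operator.

```sh
$ awk 'match($0, /^[cb]/)' mywords
brown
craftsmanship
book
beautiful
computer
```

The program prints the lines that begin with `c` or `b`. The regular expression is placed between two slash characters.

The `match()` function sets the `RSTART` variable; it is the index at which the matching pattern starts.

```sh
$ awk 'match($0, /i/) {print $0 " has i character at " RSTART}' mywords
craftsmanship has i character at 12
beautiful has i character at 6
existence has i character at 3
ministerial has i character at 2
```

The program prints the words that contain the `i` character. In addition, it prints the position of the first occurrence of the character.

## AWK built-in variables

AWK has several built-in variables. They are set by AWK when the program is run. We have already seen the `NR`, `$0`, and `RSTART` variables.

```sh
$ awk 'BEGIN { print ARGC, ARGV[0], ARGV[1]}' mywords
2 awk mywords
```

The program prints the number of arguments of the AWK program and the first two arguments. `ARGC` is the number of command-line arguments; in our case, there are two arguments, including AWK itself. `ARGV` is the array of command-line arguments. The array is indexed from 0 to `ARGC` - 1.

`FS` is the input field separator, a space by default. `NF` is the number of fields in the current input record.

For the following program, we use this file:

```sh
$ cat values
2, 53, 4, 16, 4, 23, 2, 7, 88
4, 5, 16, 42, 3, 7, 8, 39, 21
23, 43, 67, 12, 11, 33, 3, 6
```

We have three lines of comma-separated values.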
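To see how `FS` and `NF` interact on this data, here is a minimal sketch that prints the number of fields in each record. It recreates the sample data in a temporary file so that it can be run on its own; the `mktemp` file and the `values` and `field_counts` shell variables are assumptions made for this illustration only.

```sh
# Recreate the sample data in a temporary file (illustrative only).
values=$(mktemp)
cat > "$values" <<'EOF'
2, 53, 4, 16, 4, 23, 2, 7, 88
4, 5, 16, 42, 3, 7, 8, 39, 21
23, 43, 67, 12, 11, 33, 3, 6
EOF

# With FS set to a comma, NF is the number of comma-separated fields
# in the current record; NR is the record number.
field_counts=$(awk 'BEGIN { FS = "," } { print NR ": " NF " fields" }' "$values")
echo "$field_counts"

rm -f "$values"
```

This prints `1: 9 fields`, `2: 9 fields`, and `3: 8 fields`. `NF` is recomputed for every record, so its value after the last line is read reflects only that last record.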
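`stats.awk`

```sh
BEGIN {

    FS=","
    max = 0
    min = 10**10
    sum = 0
    avg = 0
    count = 0
}

{
    for (i=1; i<=NF; i++) {

        sum += $i
        count++

        if (max < $i) {
            max = $i
        }

        if (min > $i) {
            min = $i
        }

        printf("%d ", $i)
    }
}

END {

    # divide by the total number of values; in the END block,
    # NF only holds the field count of the last record
    avg = sum / count
    printf("\n")
    printf("Min: %d, Max: %d, Sum: %d, Average: %d\n", min, max, sum, avg)
}
```

The program computes basic statistics from the provided values.

```sh
FS=","
```

The values in the file are separated by commas; therefore, we set the `FS` variable to the comma character.

```sh
max = 0
min = 10**10
sum = 0
avg = 0
count = 0
```

We define default values for the maximum, the minimum, the sum, the average, and the number of values. AWK variables are dynamic; their values are either floating-point numbers or strings, or both, depending on how they are used.

```sh
{
    for (i=1; i<=NF; i++) {

        sum += $i
        count++

        if (max < $i) {
            max = $i
        }

        if (min > $i) {
            min = $i
        }

        printf("%d ", $i)
    }
}
```

In the main part of the script, we go through each line and compute the maximum, the minimum, and the sum of the values. `NF` is used to determine the number of values on each line; `count` accumulates the number of values across all lines.

```sh
END {

    avg = sum / count
    printf("\n")
    printf("Min: %d, Max: %d, Sum: %d, Average: %d\n", min, max, sum, avg)
}
```

At the end of the script, we compute the average and print the results to the console. In the `END` block, `NF` would refer only to the last record, so we divide the sum by `count`, the total number of values. The `%d` format truncates the average to an integer.

```sh
$ awk -f stats.awk values
2 53 4 16 4 23 2 7 88 4 5 16 42 3 7 8 39 21 23 43 67 12 11 33 3 6
Min: 2, Max: 88, Sum: 542, Average: 20
```

This is the output of the `stats.awk` program.

The `FS` variable can also be specified as a command-line option with the `-F` flag.

```sh
$ awk -F: '{print $1, $7}' /etc/passwd | head -7
root /bin/bash
daemon /usr/sbin/nologin
bin /usr/sbin/nologin
sys /usr/sbin/nologin
sync /bin/sync
games /usr/sbin/nologin
man /usr/sbin/nologin
```

The example prints the first field (the user name) and the seventh field (the user's shell) from the system `/etc/passwd` file. The `head` command is used to print only the first seven lines. The data in the `/etc/passwd` file is separated by colons; therefore, the colon is given to the `-F` option.

`RS` is the input record separator, a newline by default.

```sh
$ echo "Jane 17#Tom 23#Mark 34" | awk 'BEGIN {RS="#"} {print $1, "is", $2, "years old"}'
Jane is 17 years old
Tom is 23 years old
Mark is 34 years old
```

In the example, the related data is separated by the `#` character. `RS` is used to split it into records. AWK can receive input from other commands such as `echo`.

## Passing variables to AWK

AWK has the `-v` option, which is used to assign values to variables. For the next program, we have the `text` file:

```sh
$ cat text
The French nation, oppressed, degraded during many centuries
by the most insolent despotism, has finally awakened to a
consciousness of its rights and of the power to which its
destinies summon it.
```

`mygrep.awk`

```sh
{
    for (i=1; i<=NF; i++) {

        field = $i

        if (field ~ word) {
            c = index($0, field)
            print NR "," c, $0
            next
        }
    }
}
```

The example simulates the `grep` utility. It finds the provided word and prints its line and the index at which it starts. (The program finds only the first occurrence of the word on each line.) The `word` variable is passed to the program with the `-v` option.

```sh
$ awk -f mygrep.awk -v word=the text
2,4 by the most insolent despotism, has finally awakened to a
3,36 consciousness of its rights and of the power to which its
```

We have looked for the word `"the"` in the `text` file.

## Pipes

AWK can receive input from and send output to other commands through pipes.

```sh
$ echo -e "1 2 3 5\n2 2 3 8" | awk '{print $(NF)}'
5
8
```

In this case, AWK receives the output of the `echo` command. It prints the values of the last column.

```sh
$ awk -F: '$7 ~ /bash/ {print $1}' /etc/passwd | wc -l
3
```

Here, the AWK program sends data to the `wc` command through a pipe. In the AWK program, we find the users who use bash. Their names are passed to the `wc` command, which counts them. In our case, there are three users who use bash.

## Spell checking

We create an AWK program for spell checking.

`spellcheck.awk`

```sh
BEGIN {
    count = 0

    i = 0
    # stop at end of file (0) and on a read error (-1)
    while ((getline myword < "/usr/share/dict/words") > 0) {
        dict[i] = myword
        i++
    }
}

{
    for (i=1; i<=NF; i++) {

        field = $i

        if (match(field, /[[:punct:]]$/)) {
            # string indexes start at 1
            field = substr(field, 1, RSTART-1)
        }

        mywords[count] = field
        count++
    }
}

END {

    for (w_i in mywords) {
        for (w_j in dict) {
            if (mywords[w_i] == dict[w_j] ||
                      tolower(mywords[w_i]) == dict[w_j]) {
                delete mywords[w_i]
            }
        }
    }

    for (w_i in mywords) {
        if (mywords[w_i] != "") {
            print mywords[w_i]
        }
    }
}
```

The script compares the words of the provided text file against a dictionary. Under the standard `/usr/share/dict/words` path we can find an English dictionary; each word is on a separate line.

```sh
BEGIN {
    count = 0

    i = 0
    while ((getline myword < "/usr/share/dict/words") > 0) {
        dict[i] = myword
        i++
    }
}
```

Inside the `BEGIN` block, we read the words from the dictionary into the `dict` array. The `getline` command reads one record from the given file name; the record is stored in the `myword` variable. `getline` returns 1 on success, 0 at end of file, and -1 on error, so comparing the result with 0 ends the loop in both of the latter cases.

```sh
{
    for (i=1; i<=NF; i++) {

        field = $i

        if (match(field, /[[:punct:]]$/)) {
            field = substr(field, 1, RSTART-1)
        }

        mywords[count] = field
        count++
    }
}
```

In the main part of the program, we place the words of the file that we are spell checking into the `mywords` array. We remove a punctuation mark (such as a comma or a dot) from the end of each word. Since AWK strings are indexed from 1, the substring starts at index 1 and has `RSTART-1` characters.

```sh
END {

    for (w_i in mywords) {
        for (w_j in dict) {
            if (mywords[w_i] == dict[w_j] ||
                      tolower(mywords[w_i]) == dict[w_j]) {
                delete mywords[w_i]
            }
        }
    }
    ...
}
```

We compare the words from the `mywords` array against the dictionary array. If a word is in the dictionary, it is removed with the `delete` command. Words that begin a sentence start with an uppercase letter; therefore, we also check the lowercase alternative using the `tolower()` function.

```sh
for (w_i in mywords) {
    if (mywords[w_i] != "") {
        print mywords[w_i]
    }
}
```

The remaining words have not been found in the dictionary; they are printed to the console.

```sh
$ awk -f spellcheck.awk text
consciosness
finaly
```

We have run the program on a text file; we have found two misspelled words. Note that the program takes some time to finish.

## Rock-paper-scissors

Rock-paper-scissors is a popular hand game in which each player simultaneously forms one of three shapes with an outstretched hand. We create this game in AWK.

`rock_scissors_paper.awk`

```sh
# This program creates a rock-paper-scissors game.

BEGIN {

    srand()

    opts[1] = "rock"
    opts[2] = "paper"
    opts[3] = "scissors"

    do {

        print "1 - rock"
        print "2 - paper"
        print "3 - scissors"
        print "9 - end game"

        ret = getline < "-"

        if (ret == 0 || ret == -1) {
            exit
        }

        val = $0

        if (val == 9) {
            exit
        } else if (val != 1 && val != 2 && val != 3) {
            print "Invalid option"
            continue
        } else {
            play_game(val)
        }

    } while (1)
}

function play_game(val) {

    r = int(rand()*3) + 1

    print "I have " opts[r] " you have " opts[val]

    if (val == r) {
        print "Tie, next throw"
        return
    }

    if (val == 1 && r == 2) {

        print "Paper covers rock, you lose"
    } else if (val == 2 && r == 1) {

        print "Paper covers rock, you win"
    } else if (val == 2 && r == 3) {

        print "Scissors cut paper, you lose"
    } else if (val == 3 && r == 2) {

        print "Scissors cut paper, you win"
    } else if (val == 3 && r == 1) {

        print "Rock blunts scissors, you lose"
    } else if (val == 1 && r == 3) {

        print "Rock blunts scissors, you win"
    }
}
```

We play the game against the computer, which chooses its options randomly.

```sh
srand()
```

We seed the random number generator with the `srand()` function.

```sh
opts[1] = "rock"
opts[2] = "paper"
opts[3] = "scissors"
```

The three options are stored in the `opts` array.

```sh
do {

    print "1 - rock"
    print "2 - paper"
    print "3 - scissors"
    print "9 - end game"
...
```

The cycle of the game is controlled by the `do-while` loop. First, the options are printed to the terminal.

```sh
ret = getline < "-"

if (ret == 0 || ret == -1) {
    exit
}

val = $0
```

Our choice is read from standard input with the `getline` command; the value is stored in the `val` variable.

```sh
if (val == 9) {
    exit
} else if (val != 1 && val != 2 && val != 3) {
    print "Invalid option"
    continue
} else {
    play_game(val)
}
```

If we choose option 9, we exit the program. If the value is outside the printed menu options, we print an error message and start a new cycle with the `continue` command. If we have correctly chosen one of the three options, we call the `play_game()` function.

```sh
r = int(rand()*3) + 1
```

A random value from `1..3` is chosen with the `rand()` function. This is the choice of the computer.

```sh
if (val == r) {
    print "Tie, next throw"
    return
}
```

If both players choose the same option, it is a tie. We return from the function and start a new cycle.

```sh
if (val == 1 && r == 2) {

    print "Paper covers rock, you lose"
} else if (val == 2 && r == 1) {
...
```

We compare the values chosen by the players and print the result to the console.

```sh
$ awk -f rock_scissors_paper.awk
1 - rock
2 - paper
3 - scissors
9 - end game
1
I have scissors you have rock
Rock blunts scissors, you win
1 - rock
2 - paper
3 - scissors
9 - end game
3
I have paper you have scissors
Scissors cut paper, you win
1 - rock
2 - paper
3 - scissors
9 - end game
```

A sample run of the game.

## Marking keywords

In the following example, we mark Java keywords in a source file.

`mark_keywords.awk`

```sh
# the program adds tags around Java keywords
# it works on keywords that are separate words

BEGIN {

    # load java keywords
    i = 0
    while ((getline kwd < "javakeywords2") > 0) {
        keywords[i] = kwd
        i++
    }
}

{
    mtch = 0
    ln = ""
    space = ""

    # calculate the beginning space;
    # RSTART is the index of the first non-space character
    if (match($0, /[^[:space:]]/)) {
        if (RSTART > 1) {
            space = sprintf("%*s", RSTART - 1, "")
        }
    }

    # add the space to the line
    ln = ln space

    for (i=1; i <= NF; i++) {

        field = $i

        # go through keywords
        for (w_i in keywords) {

            kwd = keywords[w_i]

            # check if a field is a keyword
            if (field == kwd) {
                mtch = 1
            }
        }

        # add tags to the line
        if (mtch == 1) {
            ln = ln "<kwd>" field "</kwd> "
        } else {
            ln = ln field " "
        }

        mtch = 0
    }

    print ln
}
```

The program adds `<kwd>` and `</kwd>` tags around each keyword that it recognizes. This is a basic example; it works on keywords that are separate words. It does not handle more complicated structures.

```sh
# load java keywords
i = 0
while ((getline kwd < "javakeywords2") > 0) {
    keywords[i] = kwd
    i++
}
```

We load the Java keywords from a file; each keyword is on a separate line. The keywords are stored in the `keywords` array.

```sh
# calculate the beginning space
if (match($0, /[^[:space:]]/)) {
    if (RSTART > 1) {
        space = sprintf("%*s", RSTART - 1, "")
    }
}
```

Using a regular expression, we calculate the space at the beginning of the line, if there is any. `RSTART` is the index of the first non-space character, so the line starts with `RSTART - 1` whitespace characters. `space` is a string variable whose width equals the width of the leading whitespace of the current line. The space is calculated in order to keep the indentation of the program.

```sh
# add the space to the line
ln = ln space
```

The space is added to the `ln` variable. In AWK, strings are concatenated by writing them next to each other, separated by a space.

```sh
for (i=1; i <= NF; i++) {

    field = $i
    ...
}
```

We go through the fields of the current line; the current field is stored in the `field` variable.

```sh
# go through keywords
for (w_i in keywords) {

    kwd = keywords[w_i]

    # check if a field is a keyword
    if (field == kwd) {
        mtch = 1
    }
}
```

In the `for` loop, we go through the Java keywords and check whether the field is a Java keyword.

```sh
# add tags to the line
if (mtch == 1) {
    ln = ln "<kwd>" field "</kwd> "
} else {
    ln = ln field " "
}
```

If the field is a keyword, we attach the tags around it; otherwise, we just append the field to the line.

```sh
print ln
```

The constructed line is printed to the console.

```sh
$ awk -f mark_keywords.awk program.java
<kwd>package</kwd> com.zetcode;
<kwd>class</kwd> Test {
    <kwd>int</kwd> x = 1;
    <kwd>public</kwd> <kwd>void</kwd> exec1() {
        System.out.println(this.x);
        System.out.println(x);
    }
    <kwd>public</kwd> <kwd>void</kwd> exec2() {
        <kwd>int</kwd> z = 5;
        System.out.println(x);
        System.out.println(z);
    }
}
<kwd>public</kwd> <kwd>class</kwd> MethodScope {
    <kwd>public</kwd> <kwd>static</kwd> <kwd>void</kwd> main(String[] args) {
        Test ts = <kwd>new</kwd> Test();
        ts.exec1();
        ts.exec2();
    }
}
```

A sample run on a small Java program.

This was the AWK tutorial.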
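One follow-up on the spell-checking and keyword-marking scripts: both compare every word against every array element, which likely explains why the spell checker takes a while on a full dictionary. AWK arrays are associative, so the words can instead be stored as array keys and tested with the `in` operator. The following is a minimal sketch of that idea, not a drop-in replacement; the miniature dictionary, the sample sentence, the temporary files, and the `dictfile` variable are all assumptions made up for illustration.

```sh
# Hypothetical miniature dictionary and input text, created only for this sketch.
dict=$(mktemp)
text=$(mktemp)
printf '%s\n' the french nation has finally awakened > "$dict"
printf '%s\n' 'The French nation, has finaly awakened.' > "$text"

misspelled=$(awk -v dictfile="$dict" '
BEGIN {
    # Store each dictionary word as an array key.
    while ((getline w < dictfile) > 0) {
        dict[w] = 1
    }
    close(dictfile)
}
{
    for (i = 1; i <= NF; i++) {
        field = $i
        # Strip trailing punctuation (a small explicit set, chosen for portability).
        sub(/[.,;:!?]+$/, "", field)
        # A single key lookup replaces the inner loop over the dictionary.
        if (field != "" && !(field in dict) && !(tolower(field) in dict)) {
            print field
        }
    }
}' "$text")
echo "$misspelled"

rm -f "$dict" "$text"
```

With this sample input, the only word reported is `finaly`. The same key-lookup pattern could be applied to the keyword array when marking keywords.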