Chapter 5: String Handling · The Boost C++ Libraries

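# Chapter 5: String Handling

### Table of Contents

* [5.1 General](stringhandling.html#stringhandling_general)
* [5.2 Locales](stringhandling.html#stringhandling_locale)
* [5.3 String Algorithms Library Boost.StringAlgorithms](stringhandling.html#stringhandling_stringalgorithms)
* [5.4 Regular Expressions Library Boost.Regex](stringhandling.html#stringhandling_regex)
* [5.5 Tokenizer Library Boost.Tokenizer](stringhandling.html#stringhandling_tokenizer)
* [5.6 Formatted Output Library Boost.Format](stringhandling.html#stringhandling_format)
* [5.7 Exercises](stringhandling.html#stringhandling_exercises)

[![](https://box.kancloud.cn/2016-02-29_56d41c2d6e214.gif)](http://creativecommons.org/licenses/by-nc-nd/3.0/de/deed.zh) This book is licensed under a [Creative Commons License](http://creativecommons.org/licenses/by-nc-nd/3.0/de/deed.zh).

## 5.1. General

In standard C++, strings are handled by the `std::string` class. It offers many string operations, including functions that search for a given character or substring. Although `std::string`, with more than a hundred functions, is one of the most bloated classes in standard C++, it still does not cover what many developers need in their daily work. For example, Java and .Net provide functions that convert a string to uppercase, while `std::string` has no equivalent. The Boost C++ Libraries try to close this gap.

## 5.2. Locales

Before getting to the actual topic, it is worth looking at locales, because many functions in this chapter take an additional locale parameter.

In standard C++, a locale encapsulates culture-specific conventions: currency symbols, date and time formats, the symbol that separates the integer part from the fractional part (the radix character), and the separator used in numbers with more than three digits (the thousands separator).

For string handling, a locale describes the order of characters and the special characters of a particular culture. For example, whether the alphabet contains umlauts, and where they are placed in it, depends on the language and culture. If a function converts a string to uppercase, the steps it performs depend on the locale. In German, the letter 'ä' obviously has to be converted to 'Ä'; in other languages this is not necessarily the case.

When working with `std::string`, locales can be ignored because none of its functions depends on a particular language. To use the Boost C++ Libraries covered in this chapter, however, some knowledge of locales is indispensable.

The C++ standard defines the class `std::locale` in the header `locale`. Every C++ program automatically has one instance of this class, the global locale, which cannot be accessed directly. To read it, construct an object of type `std::locale` with the default constructor; the object is initialized with the same properties as the global locale.

```
#include <locale>
#include <iostream>

int main()
{
  // A default-constructed locale is a copy of the global locale.
  std::locale loc;
  std::cout << loc.name() << std::endl;
}
```

* [Download source code](src/5.2.1/main.cpp)

This program writes `C` to the standard output stream. `C` is the name of the basic locale, which contains the conventions used by default in programs written in C. It is also the default global locale of every C++ application and reflects American conventions: the currency symbol is the dollar sign, the radix character is a period, and month names in dates are written in English.

The global locale can be changed with the static function `global()` of the class `std::locale`.

```
#include <locale>
#include <iostream>

int main()
{
  // Replace the global locale, then read it back.
  std::locale::global(std::locale("German"));
  std::locale loc;
  std::cout << loc.name() << std::endl;
}
```

* [Download source code](src/5.2.2/main.cpp)

The static function `global()` takes an object of type `std::locale` as its only parameter. Another constructor of this class takes a string of type `const char*` and creates a locale object for a specific culture. However, apart from the C locale, which is named "C", locale names are not standardized, so they depend on the C++ standard library that receives the name. For Visual Studio 2008, the [language strings documentation](http://msdn.microsoft.com/en-us/library/39cwe7zf.aspx) states that the language string "German" selects the German culture.

Running the program prints `German_Germany.1252`. The language string "German" selects the German culture as both primary language and sublanguage, together with character map 1252.

To specify a sublanguage different from Germany, for example Switzerland, a different language string is required.

```
#include <locale>
#include <iostream>

int main()
{
  std::locale::global(std::locale("German_Switzerland"));
  std::locale loc;
  std::cout << loc.name() << std::endl;
}
```

* [Download source code](src/5.2.3/main.cpp)

The program now prints `German_Switzerland.1252`.

With a basic understanding of locales and of how to change the global one, the following example shows how a locale affects string operations.

```
#include <locale>
#include <iostream>
#include <cstring>

int main()
{
  // Compared under the global C locale.
  std::cout << std::strcoll("ä", "z") << std::endl;
  std::locale::global(std::locale("German"));
  // Compared again under the German locale.
  std::cout << std::strcoll("ä", "z") << std::endl;
}
```

* [Download source code](src/5.2.4/main.cpp)

This example uses the function `std::strcoll()`, defined in the header `cstring`, which checks whether the first string is lexicographically less than the second. In other words, it determines which of the two strings comes first in a dictionary.

Running the program prints `1` and `-1`. Although the arguments are identical, the results differ. The reason is simple: the first call to `std::strcoll()` uses the global C locale, whereas by the second call the global locale has been changed to the German culture. As the output shows, the characters 'ä' and 'z' are ordered differently in these two locales.

Many C functions as well as C++ streams depend on locales. While the functions of `std::string` work independently of locales, the functions covered in the following sections do not. Locales will therefore come up several more times in this chapter.

## 5.3. String Algorithms Library Boost.StringAlgorithms

The Boost C++ string algorithms library [Boost.StringAlgorithms](http://www.boost.org/doc/libs/1_36_0/doc/html/string_algo.html) provides many functions for manipulating strings. The strings can be of type `std::string`, `std::wstring`, or any other instance of the class template `std::basic_string`.

The functions are grouped by category into different header files. For example, the functions for case conversion are defined in `boost/algorithm/string/case_conv.hpp`. Because Boost.StringAlgorithms consists of more than 20 categories and as many header files, the header `boost/algorithm/string.hpp` includes all of them for convenience. All following examples use this header.

As mentioned in the previous section, many functions in Boost.StringAlgorithms accept an object of type `std::locale` as an additional parameter. The parameter is optional; if it is omitted, the global locale is used.

```
#include <boost/algorithm/string.hpp>
#include <locale>
#include <iostream>
#include <clocale>

int main()
{
  // Set the C locale so that std::cout can display umlauts.
  std::setlocale(LC_ALL, "German");
  std::string s = "Boris Schäling";
  std::cout << boost::algorithm::to_upper_copy(s) << std::endl;
  std::cout << boost::algorithm::to_upper_copy(s, std::locale("German")) << std::endl;
}
```

* [Download source code](src/5.3.1/main.cpp)

The function `boost::algorithm::to_upper_copy()` converts a string to uppercase, and `boost::algorithm::to_lower_copy()` does the opposite and converts a string to lowercase. Both functions return the converted string as their result. To convert the string passed as an argument in place instead, use `boost::algorithm::to_upper()` or `boost::algorithm::to_lower()`.

The example above converts the string "Boris Schäling" to uppercase with `boost::algorithm::to_upper_copy()`. The first call uses the global locale; the second call explicitly passes a locale for the German culture.

Only the second conversion is correct, because the lowercase letter 'ä' has a corresponding uppercase form 'Ä'. In the C locale, 'ä' is an unknown character and cannot be converted. To get the correct result, either pass the right locale explicitly or change the global locale before calling `boost::algorithm::to_upper_copy()`.

Note that the program calls `std::setlocale()`, defined in the header `clocale`, to set the locale for C functions, because `std::cout` uses C functions to display information on the screen. Only after the correct locale has been set are umlauts such as 'ä' and 'Ä' displayed correctly.

```
#include <boost/algorithm/string.hpp>
#include <locale>
#include <iostream>

int main()
{
  std::locale::global(std::locale("German"));
  std::string s = "Boris Schäling";
  std::cout << boost::algorithm::to_upper_copy(s) << std::endl;
  std::cout << boost::algorithm::to_upper_copy(s, std::locale("German")) << std::endl;
}
```

* [Download source code](src/5.3.2/main.cpp)

This program sets the global locale to the German culture, so that the first call to `boost::algorithm::to_upper_copy()` also converts 'ä' to 'Ä'.

Note that this example does not call `std::setlocale()`. Setting the global locale with `std::locale::global()` automatically sets the C locale as well. In practice, C++ programs almost always set the global locale with `std::locale::global()` rather than with `std::setlocale()` as in the previous example.

```
#include <boost/algorithm/string.hpp>
#include <locale>
#include <iostream>

int main()
{
  std::locale::global(std::locale("German"));
  std::string s = "Boris Schäling";
  std::cout << boost::algorithm::erase_first_copy(s, "i") << std::endl;
  std::cout << boost::algorithm::erase_nth_copy(s, "i", 0) << std::endl;
  std::cout << boost::algorithm::erase_last_copy(s, "i") << std::endl;
  std::cout << boost::algorithm::erase_all_copy(s, "i") << std::endl;
  std::cout << boost::algorithm::erase_head_copy(s, 5) << std::endl;
  std::cout << boost::algorithm::erase_tail_copy(s, 8) << std::endl;
}
```

* [Download source code](src/5.3.3/main.cpp)

Boost.StringAlgorithms provides several functions for deleting individual characters from a string; they let you state exactly where and how to delete. For example, `boost::algorithm::erase_all_copy()` removes a particular character from the entire string. To remove only the first occurrence, use `boost::algorithm::erase_first_copy()`. To remove a number of characters from the beginning or end of the string, use `boost::algorithm::erase_head_copy()` and `boost::algorithm::erase_tail_copy()`.

```
#include <boost/algorithm/string.hpp>
#include <locale>
#include <iostream>

int main()
{
  std::locale::global(std::locale("German"));
  std::string s = "Boris Schäling";
  boost::iterator_range<std::string::iterator> r = boost::algorithm::find_first(s, "Boris");
  std::cout << r << std::endl;
  // No match: the returned range is empty.
  r = boost::algorithm::find_first(s, "xyz");
  std::cout << r << std::endl;
}
```

* [Download source code](src/5.3.4/main.cpp)

The functions `boost::algorithm::find_first()`, `boost::algorithm::find_last()`, `boost::algorithm::find_nth()`, `boost::algorithm::find_head()` and `boost::algorithm::find_tail()` search for substrings within a string.

What they have in common is that each returns a pair of iterators of type `boost::iterator_range`. This class comes from the Boost C++ library [Boost.Range](http://www.boost.org/libs/range/), which builds the concept of a "range" on top of iterators. Because the operator `<<` is overloaded for `boost::iterator_range`, the result of a search can be written directly to the standard output stream. The program above prints `Boris` for the first result and an empty string for the second.

```
#include <boost/algorithm/string.hpp>
#include <locale>
#include <iostream>
#include <vector>

int main()
{
  std::locale::global(std::locale("German"));
  std::vector<std::string> v;
  v.push_back("Boris");
  v.push_back("Schäling");
  std::cout << boost::algorithm::join(v, " ") << std::endl;
}
```

* [Download source code](src/5.3.5/main.cpp)

The function `boost::algorithm::join()` takes a container of strings as its first parameter and concatenates them, separated by the second parameter. Accordingly, this example prints `Boris Schäling`.

```
#include <boost/algorithm/string.hpp>
#include <locale>
#include <iostream>

int main()
{
  std::locale::global(std::locale("German"));
  std::string s = "Boris Schäling";
  std::cout << boost::algorithm::replace_first_copy(s, "B", "D") << std::endl;
  std::cout << boost::algorithm::replace_nth_copy(s, "B", 0, "D") << std::endl;
  std::cout << boost::algorithm::replace_last_copy(s, "B", "D") << std::endl;
  std::cout << boost::algorithm::replace_all_copy(s, "B", "D") << std::endl;
  std::cout << boost::algorithm::replace_head_copy(s, 5, "Doris") << std::endl;
  std::cout << boost::algorithm::replace_tail_copy(s, 8, "Becker") << std::endl;
}
```

* [Download source code](src/5.3.6/main.cpp)

Besides functions that find substrings or delete characters, Boost.StringAlgorithms also provides functions that replace a substring with another string, including `boost::algorithm::replace_first_copy()`, `boost::algorithm::replace_nth_copy()`, `boost::algorithm::replace_last_copy()`, `boost::algorithm::replace_all_copy()`, `boost::algorithm::replace_head_copy()` and `boost::algorithm::replace_tail_copy()`. They are used in much the same way as the find and erase functions, except that they take the replacement string as an additional parameter.

```
#include <boost/algorithm/string.hpp>
#include <locale>
#include <iostream>

int main()
{
  std::locale::global(std::locale("German"));
  std::string s = "\t Boris Schäling \t";
  // The dots make the remaining whitespace visible.
  std::cout << "." << boost::algorithm::trim_left_copy(s) << "." << std::endl;
  std::cout << "." << boost::algorithm::trim_right_copy(s) << "." << std::endl;
  std::cout << "." << boost::algorithm::trim_copy(s) << "." << std::endl;
}
```

* [Download source code](src/5.3.7/main.cpp)

The trim functions `boost::algorithm::trim_left_copy()`, `boost::algorithm::trim_right_copy()` and `boost::algorithm::trim_copy()` remove whitespace from the beginning of a string, from its end, or from both. Which characters count as whitespace depends on the global locale.

Functions in Boost.StringAlgorithms can accept an additional predicate that decides which characters of the string the function applies to. The predicate versions of the trim functions are named `boost::algorithm::trim_left_copy_if()`, `boost::algorithm::trim_right_copy_if()` and `boost::algorithm::trim_copy_if()`.

```
#include <boost/algorithm/string.hpp>
#include <locale>
#include <iostream>

int main()
{
  std::locale::global(std::locale("German"));
  std::string s = "--Boris Schäling--";
  std::cout << "." << boost::algorithm::trim_left_copy_if(s, boost::algorithm::is_any_of("-")) << "." << std::endl;
  std::cout << "." << boost::algorithm::trim_right_copy_if(s, boost::algorithm::is_any_of("-")) << "." << std::endl;
  std::cout << "." << boost::algorithm::trim_copy_if(s, boost::algorithm::is_any_of("-")) << "." << std::endl;
}
```

* [Download source code](src/5.3.8/main.cpp)

This program calls the helper function `boost::algorithm::is_any_of()`, which creates a predicate that checks whether a character occurs in a given string. With `boost::algorithm::is_any_of()`, the characters to trim can be specified freely; the example uses the hyphen.

Boost.StringAlgorithms also provides many helper functions that return commonly used predicates.

```
#include <boost/algorithm/string.hpp>
#include <locale>
#include <iostream>

int main()
{
  std::locale::global(std::locale("German"));
  std::string s = "123456789Boris Schäling123456789";
  std::cout << "." << boost::algorithm::trim_left_copy_if(s, boost::algorithm::is_digit()) << "." << std::endl;
  std::cout << "." << boost::algorithm::trim_right_copy_if(s, boost::algorithm::is_digit()) << "." << std::endl;
  std::cout << "." << boost::algorithm::trim_copy_if(s, boost::algorithm::is_digit()) << "." << std::endl;
}
```

* [Download source code](src/5.3.9/main.cpp)

The predicate returned by `boost::algorithm::is_digit()` returns `true` if a character is a digit. The helper functions that check whether a character is uppercase or lowercase are `boost::algorithm::is_upper()` and `boost::algorithm::is_lower()`. All of these functions use the global locale by default unless a different locale is passed as a parameter.

Besides predicates that test individual characters, Boost.StringAlgorithms also offers functions that work on whole strings.

```
#include <boost/algorithm/string.hpp>
#include <locale>
#include <iostream>

int main()
{
  std::locale::global(std::locale("German"));
  std::string s = "Boris Schäling";
  std::cout << boost::algorithm::starts_with(s, "Boris") << std::endl;
  std::cout << boost::algorithm::ends_with(s, "Schäling") << std::endl;
  std::cout << boost::algorithm::contains(s, "is") << std::endl;
  std::cout << boost::algorithm::lexicographical_compare(s, "Boris") << std::endl;
}
```

* [Download source code](src/5.3.10/main.cpp)

The functions `boost::algorithm::starts_with()`, `boost::algorithm::ends_with()`, `boost::algorithm::contains()` and `boost::algorithm::lexicographical_compare()` each compare two strings.

The next example introduces a function that splits a string.

```
#include <boost/algorithm/string.hpp>
#include <locale>
#include <iostream>
#include <vector>

int main()
{
  std::locale::global(std::locale("German"));
  std::string s = "Boris Schäling";
  std::vector<std::string> v;
  // Split at every whitespace character and store the parts in v.
  boost::algorithm::split(v, s, boost::algorithm::is_space());
  std::cout << v.size() << std::endl;
}
```

* [Download source code](src/5.3.11/main.cpp)

Given a delimiter, the function `boost::algorithm::split()` splits a string into a container of strings. Its third parameter is a predicate that decides at which positions the string is split. The example uses the helper function `boost::algorithm::is_space()` to create a predicate that splits the string at every whitespace character.

Many functions in this section also exist in a version that ignores case. These versions usually have the same name as the original function with a leading 'i'. For example, the counterpart of `boost::algorithm::erase_all_copy()` is `boost::algorithm::ierase_all_copy()`.

Finally, many functions in Boost.StringAlgorithms also support regular expressions. The following program uses `boost::algorithm::find_regex()` to search for a regular expression.

```
#include <boost/algorithm/string.hpp>
#include <boost/algorithm/string/regex.hpp>
#include <locale>
#include <iostream>

int main()
{
  std::locale::global(std::locale("German"));
  std::string s = "Boris Schäling";
  boost::iterator_range<std::string::iterator> r = boost::algorithm::find_regex(s, boost::regex("\\w\\s\\w"));
  std::cout << r << std::endl;
}
```

* [Download source code](src/5.3.12/main.cpp)

To work with regular expressions, this program uses the class `boost::regex` from the Boost C++ Libraries, which is introduced in the next section.

## 5.4. Regular Expressions Library Boost.Regex

The Boost C++ regular expressions library [Boost.Regex](http://www.boost.org/libs/regex/) makes regular expressions available in C++. Regular expressions greatly simplify searching for strings that follow a particular pattern and are a powerful feature of many languages. Although C++ currently still relies on the Boost C++ Libraries for this, regular expressions will become part of the C++ standard library: Boost.Regex is expected to be included in the next revision of the C++ standard.

The two most important classes in Boost.Regex are `boost::regex` and `boost::smatch`, both defined in `boost/regex.hpp`. The former defines a regular expression, the latter stores search results.

The following examples introduce the three functions Boost.Regex provides for searching with regular expressions.

```
#include <boost/regex.hpp>
#include <locale>
#include <iostream>

int main()
{
  std::locale::global(std::locale("German"));
  std::string s = "Boris Schäling";
  boost::regex expr("\\w+\\s\\w+");
  std::cout << boost::regex_match(s, expr) << std::endl;
}
```

* [Download source code](src/5.4.1/main.cpp)

The function `boost::regex_match()` compares a string with a regular expression. It returns `true` only if the entire string matches the expression.

The function `boost::regex_search()` searches a string for a regular expression.

```
#include <boost/regex.hpp>
#include <locale>
#include <iostream>

int main()
{
  std::locale::global(std::locale("German"));
  std::string s = "Boris Schäling";
  boost::regex expr("(\\w+)\\s(\\w+)");
  boost::smatch what;
  if (boost::regex_search(s, what, expr))
  {
    // what[0] is the whole match, what[1] and what[2] are the groups.
    std::cout << what[0] << std::endl;
    std::cout << what[1] << " " << what[2] << std::endl;
  }
}
```

* [Download source code](src/5.4.2/main.cpp)

The function `boost::regex_search()` accepts a reference to an object of type `boost::smatch` in which it stores the results. Because the regular expression in this example contains two groups, the search yields two substrings, one per group.

The result class `boost::smatch` is in fact a container holding elements of type `boost::sub_match`, and it can be accessed through an interface similar to that of `std::vector`. For example, elements can be accessed with `operator[]()`.

The class `boost::sub_match`, in turn, stores iterators to the positions in the string that correspond to a group of the regular expression. Because it is derived from `std::pair`, the iterators delimiting the substring can be accessed through `first` and `second`. If, as in the example above, the substring only needs to be written to the standard output stream, the overloaded operator `<<` does this directly and the iterators do not need to be accessed.

Note that the results are stored as iterators and that `boost::sub_match` does not copy the characters. The results are therefore only accessible as long as the string the iterators refer to exists.

Also note that the first element of the container `boost::smatch` refers to the entire string matching the regular expression. The first substring, matching the first group, is accessed at index 1.

The third function provided by Boost.Regex is `boost::regex_replace()`.

```
#include <boost/regex.hpp>
#include <locale>
#include <iostream>

int main()
{
  std::locale::global(std::locale("German"));
  std::string s = " Boris Schäling ";
  boost::regex expr("\\s");
  std::string fmt("_");
  std::cout << boost::regex_replace(s, expr, fmt) << std::endl;
}
```

* [Download source code](src/5.4.3/main.cpp)

In addition to the string to search and the regular expression, `boost::regex_replace()` takes a format that determines how substrings matching individual groups of the regular expression are replaced. If the regular expression contains no groups, every matching substring is replaced with the given format, one after another. The program above therefore prints `_Boris_Schäling_`.

`boost::regex_replace()` always searches the entire string for the regular expression, so the program replaces all three spaces with underscores.

```
#include <boost/regex.hpp>
#include <locale>
#include <iostream>

int main()
{
  std::locale::global(std::locale("German"));
  std::string s = "Boris Schäling";
  boost::regex expr("(\\w+)\\s(\\w+)");
  // \2 and \1 refer to the second and first group.
  std::string fmt("\\2 \\1");
  std::cout << boost::regex_replace(s, expr, fmt) << std::endl;
}
```

* [Download source code](src/5.4.4/main.cpp)

The format can refer to the substrings matched by the groups of the regular expression. This example uses that technique to swap the first and last name, so the output is `Schäling Boris`.

Note that there are different standards for regular expressions and for formats. Each of the three functions accepts an additional parameter that selects a particular standard. The same parameter can also specify whether special characters in the format are interpreted and whether the format replaces the entire string matching the regular expression.

```
#include <boost/regex.hpp>
#include <locale>
#include <iostream>

int main()
{
  std::locale::global(std::locale("German"));
  std::string s = "Boris Schäling";
  boost::regex expr("(\\w+)\\s(\\w+)");
  std::string fmt("\\2 \\1");
  // format_literal: the format is inserted verbatim.
  std::cout << boost::regex_replace(s, expr, fmt, boost::regex_constants::format_literal) << std::endl;
}
```

* [Download source code](src/5.4.5/main.cpp)

This program passes the flag `boost::regex_constants::format_literal` as the fourth parameter to `boost::regex_replace()`, which suppresses the handling of special characters in the format. Because the entire string matches the regular expression, it is replaced by the format as a whole, and the output is `\2 \1`.

As pointed out at the end of the previous section, regular expressions can also be used with Boost.StringAlgorithms. That library builds on Boost.Regex to offer functions such as `boost::algorithm::find_regex()`, `boost::algorithm::replace_regex()`, `boost::algorithm::erase_regex()` and `boost::algorithm::split_regex()`. Since Boost.Regex is very likely to become part of the upcoming C++ standard, it is worth becoming proficient with regular expressions independently of Boost.StringAlgorithms.

## 5.5. Tokenizer Library Boost.Tokenizer

The library [Boost.Tokenizer](http://www.boost.org/libs/tokenizer/) iterates over the partial expressions (tokens) of a string once certain characters have been designated as separators.

```
#include <boost/tokenizer.hpp>
#include <string>
#include <iostream>

int main()
{
  typedef boost::tokenizer<boost::char_separator<char> > tokenizer;
  std::string s = "Boost C++ libraries";
  tokenizer tok(s);
  for (tokenizer::iterator it = tok.begin(); it != tok.end(); ++it)
    std::cout << *it << std::endl;
}
```

* [Download source code](src/5.5.1/main.cpp)

Boost.Tokenizer defines the class template `boost::tokenizer` in `boost/tokenizer.hpp`. Its template parameter is a class that identifies the tokens. The example above uses `boost::char_separator`, which treats spaces and punctuation marks as separators.

A tokenizer must be initialized with a string of type `std::string`. Through its member functions `begin()` and `end()`, it can be accessed like a container, and its iterators yield the tokens of the string. The class passed as the template parameter determines how the tokens are found.

Because `boost::char_separator` treats spaces and punctuation marks as separators by default, this example prints `Boost`, `C`, `+`, `+` and `libraries`. To recognize these separators, `boost::char_separator` calls `std::isspace()` and `std::ispunct()`.

Boost.Tokenizer distinguishes between separators that are hidden and separators that are displayed. By default, spaces are hidden and punctuation marks are displayed, which is why the example prints the two plus signs.

If punctuation marks should not act as separators, initialize a `boost::char_separator` object accordingly before passing it to the tokenizer. The following example does exactly that.

```
#include <boost/tokenizer.hpp>
#include <string>
#include <iostream>

int main()
{
  typedef boost::tokenizer<boost::char_separator<char> > tokenizer;
  std::string s = "Boost C++ libraries";
  // Only the space is a separator, and it is hidden.
  boost::char_separator<char> sep(" ");
  tokenizer tok(s, sep);
  for (tokenizer::iterator it = tok.begin(); it != tok.end(); ++it)
    std::cout << *it << std::endl;
}
```

* [Download source code](src/5.5.2/main.cpp)

The constructor of `boost::char_separator` accepts up to three parameters, of which only the first is required. The first parameter lists the separators to hide; in this example the space is still a separator. The second parameter lists the separators to display. If it is omitted, no separators are displayed. Running the program prints `Boost`, `C++` and `libraries`.

If the plus sign is passed as the second parameter, the result is the same as in the first example.

```
#include <boost/tokenizer.hpp>
#include <string>
#include <iostream>

int main()
{
  typedef boost::tokenizer<boost::char_separator<char> > tokenizer;
  std::string s = "Boost C++ libraries";
  // The space is hidden, the plus sign is displayed.
  boost::char_separator<char> sep(" ", "+");
  tokenizer tok(s, sep);
  for (tokenizer::iterator it = tok.begin(); it != tok.end(); ++it)
    std::cout << *it << std::endl;
}
```

* [Download source code](src/5.5.3/main.cpp)

The third parameter determines whether empty tokens are displayed. If two separators are found in a row, the token between them is empty. By default, such empty tokens are not displayed; the third parameter changes this behavior.

```
#include <boost/tokenizer.hpp>
#include <string>
#include <iostream>

int main()
{
  typedef boost::tokenizer<boost::char_separator<char> > tokenizer;
  std::string s = "Boost C++ libraries";
  boost::char_separator<char> sep(" ", "+", boost::keep_empty_tokens);
  tokenizer tok(s, sep);
  for (tokenizer::iterator it = tok.begin(); it != tok.end(); ++it)
    std::cout << *it << std::endl;
}
```

* [Download source code](src/5.5.4/main.cpp)

Running this program displays two additional empty tokens: the first lies between the two plus signs, the second between the second plus sign and the following space.

A tokenizer can also be used with other string types.

```
#include <boost/tokenizer.hpp>
#include <string>
#include <iostream>

int main()
{
  typedef boost::tokenizer<boost::char_separator<wchar_t>, std::wstring::const_iterator, std::wstring> tokenizer;
  std::wstring s = L"Boost C++ libraries";
  boost::char_separator<wchar_t> sep(L" ");
  tokenizer tok(s, sep);
  for (tokenizer::iterator it = tok.begin(); it != tok.end(); ++it)
    std::wcout << *it << std::endl;
}
```

* [Download source code](src/5.5.5/main.cpp)

This example iterates over a string of type `std::wstring`. To support this string type, the tokenizer must be instantiated with additional template parameters. The same applies to `boost::char_separator`: both have to be instantiated with `wchar_t`.

Besides `boost::char_separator`, Boost.Tokenizer provides two more classes for identifying tokens.

```
#include <boost/tokenizer.hpp>
#include <string>
#include <iostream>

int main()
{
  typedef boost::tokenizer<boost::escaped_list_separator<char> > tokenizer;
  std::string s = "Boost,\"C++ libraries\"";
  tokenizer tok(s);
  for (tokenizer::iterator it = tok.begin(); it != tok.end(); ++it)
    std::cout << *it << std::endl;
}
```

* [Download source code](src/5.5.6/main.cpp)

The class `boost::escaped_list_separator` reads multiple values separated by commas, a format commonly known as CSV (comma-separated values). It also handles double quotes and escape sequences, so this example prints `Boost` and `C++ libraries`.

The other class is `boost::offset_separator`, which is best explained with an example. An object of this class must be passed as the second parameter to the constructor of `boost::tokenizer`.

```
#include <boost/tokenizer.hpp>
#include <string>
#include <iostream>

int main()
{
  typedef boost::tokenizer<boost::offset_separator> tokenizer;
  std::string s = "Boost C++ libraries";
  // Token lengths in characters: 5, then 5, then 9.
  int offsets[] = { 5, 5, 9 };
  boost::offset_separator sep(offsets, offsets + 3);
  tokenizer tok(s, sep);
  for (tokenizer::iterator it = tok.begin(); it != tok.end(); ++it)
    std::cout << *it << std::endl;
}
```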
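* [Download source code](src/5.5.7/main.cpp)

`boost::offset_separator` specifies the positions in the string at which tokens end. The program above states that the first token ends after 5 characters, the second after another 5 characters, and the third and last after the following 9 characters. The output is `Boost`, ` C++ ` (with a leading and a trailing space) and `libraries`.

## 5.6. Formatted Output Library Boost.Format

The library [Boost.Format](http://www.boost.org/libs/format/) can replace the function `std::printf()`, defined in the header `cstdio`. `std::printf()` originates from the C standard and provides formatted output, but it is neither type-safe nor extensible. In C++ applications, Boost.Format is therefore usually the better choice for formatted output.

Boost.Format defines the class `boost::format` in `boost/format.hpp`. As with `std::printf()`, the constructor of `boost::format` takes a string that contains special characters controlling the format. The actual data is attached with the operator `%` and replaces the special characters in the output, as the following example shows.

```
#include <boost/format.hpp>
#include <iostream>

int main()
{
  std::cout << boost::format("%1%.%2%.%3%") % 16 % 9 % 2008 << std::endl;
}
```

* [Download source code](src/5.6.1/main.cpp)

Boost.Format uses numbers placed between two percent signs as placeholders, which are later linked to the actual data with the operator `%`. The program above combines the numbers 16, 9 and 2008 into a date string and prints it as `16.9.2008`. To show the month before the day, as in the American notation, only the placeholders need to be swapped.

```
#include <boost/format.hpp>
#include <iostream>

int main()
{
  std::cout << boost::format("%2%/%1%/%3%") % 16 % 9 % 2008 << std::endl;
}
```

* [Download source code](src/5.6.2/main.cpp)

The program now prints `9/16/2008`.

To format data with C++ manipulators, Boost.Format provides the function `boost::io::group()`.

```
#include <boost/format.hpp>
#include <iostream>

int main()
{
  // The manipulator is bound to the value 99, not to a placeholder.
  std::cout << boost::format("%1% %2% %1%") % boost::io::group(std::showpos, 99) % 100 << std::endl;
}
```

* [Download source code](src/5.6.3/main.cpp)

This example prints `+99 100 +99`. Because the manipulator `std::showpos()` is linked to the number 99 through `boost::io::group()`, a plus sign is added automatically wherever 99 is displayed.

If the plus sign should only appear the first time 99 is printed, the placeholder itself has to be changed.

```
#include <boost/format.hpp>
#include <iostream>

int main()
{
  std::cout << boost::format("%|1$+| %2% %1%") % 99 % 100 << std::endl;
}
```

To change the output to `+99 100 99`, the reference to the data must be changed from 1% to 1$, and a pipe symbol must be added on each side of it: the placeholder %1% is replaced with %|1$+|.

Note that although references to data are generally optional, all placeholders must consistently either use references or omit them. The following example fails at run time because it uses references for the second and third placeholder but omits one for the first.

```
#include <boost/format.hpp>
#include <iostream>

int main()
{
  try
  {
    std::cout << boost::format("%|+| %2% %1%") % 99 % 100 << std::endl;
  }
  catch (boost::io::format_error &ex)
  {
    std::cout << ex.what() << std::endl;
  }
}
```

* [Download source code](src/5.6.5/main.cpp)

This program throws an exception of type `boost::io::format_error`. Strictly speaking, Boost.Format throws `boost::io::bad_format_string`. However, since all exception classes of the library are derived from `boost::io::format_error`, it is easier to catch exceptions of that type.

The following example shows how to write the program without references to data.

```
#include <boost/format.hpp>
#include <iostream>

int main()
{
  std::cout << boost::format("%|+| %|| %||") % 99 % 100 % 99 << std::endl;
}
```

* [Download source code](src/5.6.6/main.cpp)

The pipe symbols of the second and third placeholder can safely be omitted, because those placeholders do not specify a format. The resulting syntax looks very much like that of `std::printf()`.

```
#include <boost/format.hpp>
#include <iostream>

int main()
{
  std::cout << boost::format("%+d %d %d") % 99 % 100 % 99 << std::endl;
}
```

* [Download source code](src/5.6.7/main.cpp)

Although this looks like `std::printf()`, Boost.Format has the advantage of type safety. The letter 'd' in the format string does not mean that a number is printed. Instead, it applies the manipulator `std::dec()` to the internal stream object used by `boost::format`. This makes it possible to use format strings that would be meaningless for `std::printf()` and would crash a program at run time if passed to it.

```
#include <boost/format.hpp>
#include <iostream>

int main()
{
  std::cout << boost::format("%+s %s %s") % 99 % 100 % 99 << std::endl;
}
```

* [Download source code](src/5.6.8/main.cpp)

In `std::printf()`, the letter 's' is only used for strings of type `const char*`, yet the program above runs without problems. In Boost.Format, 's' does not force the data to be a string; it merely applies the appropriate manipulators to adjust the mode of the internal stream. Writing a number to the internal stream therefore works in this case as well.

## 5.7. Exercises

You can buy [solutions to all exercises](http://en.highscore.de/shop/index.php?p=boost-solution) in this book as a ZIP file.

1. Write a program that extracts and displays the name, the date of birth and the account balance from the following XML stream: **`<person><name>Karl-Heinz Huber</name><dob>1970-9-30</dob><account>2,900.64 USD</account></person>`**. First and last name must be displayed separately, the date of birth must use the format "day.month.year", and the account balance must be displayed without decimal places. Test your program with other XML streams, for example ones containing extra whitespace, other names or a negative account balance.

2. Write a program that formats and displays data records as follows. The input **`Munich Hamburg 92.12 8:25 9:45`** describes a flight from Munich to Hamburg that costs 92.12 euros, departs at 8:25 a.m. and arrives at 9:45. The required output is `Munich    -> Hamburg      92.12 EUR (08:25-09:45)`. Specifically, city names are left-aligned in a field of width 10, the price is right-aligned in a field of width 7, and the currency follows the price. Departure and arrival times are displayed together in parentheses, separated by a hyphen and without spaces. Times before 10 o'clock (a.m. or p.m.) must be padded with a leading 0. Test your program with different data records, for example with city names longer than 10 characters.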