關聯容器 — hashtable · STL源碼分析

C++ 11已將哈希表納入了標準之列。hashtable是hash_set、hash_map、hash_multiset、hash_multimap的底層機制，即這四種容器中都包含一個hashtable。解決碰撞問題的辦法有許多，線性探測、二次探測、開鏈等等。SGI STL的hashtable采用的開鏈方法，每個hash table中的元素用vector承載，每個元素稱為桶（bucket），一個桶指向一個存儲了實際元素的鏈表（list），鏈表節點（node）結構如下： ~~~ template <class Value> struct __hashtable_node { __hashtable_node* next; Value val; // 存儲實際值 }; ~~~ 再來看看hash table的迭代器定義： ~~~ template <class Value, class Key, class HashFcn, class ExtractKey, class EqualKey, class Alloc> struct __hashtable_iterator { // 迭代器 typedef hashtable<Value, Key, HashFcn, ExtractKey, EqualKey, Alloc> hashtable; .... typedef __hashtable_node<Value> node; // 定義迭代器相應類型 typedef forward_iterator_tag iterator_category; // 前向迭代器 typedef Value value_type; typedef ptrdiff_t difference_type; typedef size_t size_type; typedef Value& reference; typedef Value* pointer; node* cur; // 迭代器目前所指節點 hashtable* ht; // 和hashtable之間的紐帶 __hashtable_iterator(node* n, hashtable* tab) : cur(n), ht(tab) {} __hashtable_iterator() {} reference operator*() const { return cur->val; } pointer operator->() const { return &(operator*()); } iterator& operator++(); iterator operator++(int); bool operator==(const iterator& it) const { return cur == it.cur; } bool operator!=(const iterator& it) const { return cur != it.cur; } }; ~~~ hash table的迭代器不能后退，這里關注迭代器的自增操作，代碼如下： ~~~ template <class V, class K, class HF, class ExK, class EqK, class A> __hashtable_iterator<V, K, HF, ExK, EqK, A>& __hashtable_iterator<V, K, HF, ExK, EqK, A>::operator++() // 注意類模板成員函數的定義 { const node* old = cur; cur = cur->next; // 移動到下一個node if (!cur) { // 到了list結尾 size_type bucket = ht->bkt_num(old->val); // 根據節點值定位舊節點所在桶號 while (!cur && ++bucket < ht->buckets.size()) // 計算下一個可用桶號 cur = ht->buckets[bucket]; // 找到，另cur指向新桶的第一個node } return *this; } ~~~ hashtable數據結構內容很多，這里只列出少量代碼： ~~~ template <class Value, class Key, class HashFcn, class ExtractKey, class EqualKey, class Alloc> class hashtable { // hash table數據結構 public: typedef Key key_type; typedef Value value_type; typedef HashFcn hasher; // 散列函數類型 typedef EqualKey key_equal; typedef size_t size_type; typedef ptrdiff_t difference_type; .... private: hasher hash; // 散列函數 key_equal equals; // 判斷鍵值是否相等 ExtractKey get_key; // 從節點取出鍵值 typedef __hashtable_node<Value> node; typedef simple_alloc<node, Alloc> node_allocator; // 空間配置器 vector<node*,Alloc> buckets; // 桶的集合，可以看出一個桶實值上是一個node* size_type num_elements; // node個數 .... } ~~~ SGI STL將hash table的大小，也就是vector的大小設計為28個質數，并存放在一個數組中： ~~~ static const int __stl_num_primes = 28; // 28個質數 static const unsigned long __stl_prime_list[__stl_num_primes] = { 53, 97, 193, 389, 769, 1543, 3079, 6151, 12289, 24593, 49157, 98317, 196613, 393241, 786433, 1572869, 3145739, 6291469, 12582917, 25165843, 50331653, 100663319, 201326611, 402653189, 805306457, 1610612741, 3221225473, 4294967291 }; ~~~ 當vector容量不足時，會以兩倍的容量進行擴充。下面介紹插入操作，以insert_unique為例： ~~~ // 插入新元素，鍵值不能重復 pair<iterator, bool> insert_unique(const value_type& obj) { resize(num_elements + 1); // 判斷vector是否需要擴充 return insert_unique_noresize(obj); // 直接插入obj } ~~~ insert操作大致分兩步：第一步是擴充（如果需要的話），第二步是插入。 resize代碼如下： ~~~ template <class V, class K, class HF, class Ex, class Eq, class A> void hashtable<V, K, HF, Ex, Eq, A>::resize(size_type num_elements_hint) // 判斷是否需要擴充vector { const size_type old_n = buckets.size(); if (num_elements_hint > old_n) { // 元素個數大于vector容量，則需要擴充vector const size_type n = next_size(num_elements_hint); if (n > old_n) { vector<node*, A> tmp(n, (node*) 0); // 建立一個臨時的vector作為轉移目的地 for (size_type bucket = 0; bucket < old_n; ++bucket) { // 一個桶一個桶進行轉移 node* first = buckets[bucket]; while (first) { // 一個節點一個節點進行轉移 size_type new_bucket = bkt_num(first->val, n); // 散列過程，對n取模 buckets[bucket] = first->next; first->next = tmp[new_bucket]; // 這一句和下一句表示從鏈表前端插入 tmp[new_bucket] = first; first = buckets[bucket]; // first指向舊vector的下一個node } buckets.swap(tmp); // 兩個vector的內容互換，使buckets徹底改變 } } } } ~~~ 上述代碼基本思路就是：先擴充，再移動，最后交換。 - 擴充利用next_size函數。next_size的作用就是從質數表中選取最接近并且不小于num_elements_hint的質數并返回，利用這個較大值開辟一個新vector。 - 移動實質上就是指針的移動。重新對每個節點進行散列，然后從前鏈入到新的vector中。 - 交換過程就是上面代碼紅色部分。這里使用了vector內部的swap成員函數，將*this和tmp的內容進行了互換，這是copy-and-swap技術，《Effective C++》條款11有說明這個技術。擴充完vector后，就可以順利插入需要插入的元素了。 insert_unique_noresize代碼如下： ~~~ template <class V, class K, class HF, class Ex, class Eq, class A> pair<typename hashtable<V, K, HF, Ex, Eq, A>::iterator, bool> // 注意，返回一個pair hashtable<V, K, HF, Ex, Eq, A>::insert_unique_noresize(const value_type& obj) // 直接插入節點，無需擴充 { const size_type n = bkt_num(obj); // 對obj進行散列，然后模上vector大小，從而確定桶號 node* first = buckets[n]; // first指向對應桶的第一個node for (node* cur = first; cur; cur = cur->next) if (equals(get_key(cur->val), get_key(obj))) // 遇到相同node，則直接返回這個node return pair<iterator, bool>(iterator(cur, this), false); // 沒有遇到相同node，則在list開頭插入 node* tmp = new_node(obj); tmp->next = first; buckets[n] = tmp; ++num_elements; return pair<iterator, bool>(iterator(tmp, this), true); } ~~~ 這里也是將新節點插入list的開頭，詳細過程已在注釋中說明。參考：《STL源碼剖析》 P253.