MapReduce Core Classes · Hadoop2.x

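![](https://img.kancloud.cn/f7/70/f770105fd9104aabc3e424915f070811_922x555.png)

[TOC]

# 1. InputFormat
The main job of InputFormat is to decide which data each map task reads and how it reads it. <ins>Which data a map task reads is determined by its InputSplit (data split); how the data is read is determined by the RecordReader</ins>. InputFormat provides the methods for obtaining both the InputSplits and the RecordReader.
![](https://img.kancloud.cn/b2/b7/b2b7cd0e46a25b58a5c5fe7a8a900cb5_1129x474.png)

**InputSplit:**
InputSplits are created from the input files before the map phase starts.
* Each InputSplit corresponds to one Mapper task
* An input split stores the split length and an array of the locations that hold the data

![](https://img.kancloud.cn/cb/b4/cbb4dbb9868bcad5777dbeb0a89641c3_896x453.png)

**Differences between a block and a split:**
* A block is the physical representation of the data; a split is a logical representation of the data in the blocks
* Splits are divided at record boundaries
* The number of splits should not exceed the number of blocks (with the default settings the two are usually equal)

<br/>

# 2. InputFormat interface implementation classes
![](https://img.kancloud.cn/dd/34/dd34c30fa28f9b5c4aeef6bb7dfd45d3_1153x501.png)
InputFormat has many implementations, but the ones used most often in development are the file type (FileInputFormat) and the database type (DBInputFormat). This course focuses on FileInputFormat; for DBInputFormat it is enough to know that the feature exists.

1. **FileInputFormat source code walkthrough** (see the FileInputFormat source code for details)
![](https://img.kancloud.cn/44/f9/44f92f16148a87147b97382c00c30987_1038x525.png)
(1) Locate the directory where the input data is stored.
(2) Iterate over every file in that directory and plan its splits.
(3) Process the first file, hello.txt.
   a) Get the file size: fs.sizeOf(hello.txt).
   b) Compute the split size: <ins>computeSplitSize(Math.max(minSize, Math.min(maxSize, blocksize))) = blocksize = 128M</ins>.
   c) <ins>By default, split size = blocksize</ins>.
   d) Start cutting. Split 1 is hello.txt 0:128M, split 2 is hello.txt 128M:256M, and split 3 is hello.txt 256M:300M (<ins>before each cut, check whether the remaining part is larger than 1.1 times the split size; if it is not, the whole remainder becomes a single split</ins>).
   e) Write the split information to a split plan file.
   f) The core of the splitting process lives in the getSplits() method of the FileInputFormat class; see the source code.
   g) <ins>A data split only divides the input logically; nothing is cut into pieces and stored separately on disk</ins>. An InputSplit records only the metadata of the split, such as its start offset, its length, and the list of nodes that hold it.
   h) Note: <ins>a block is the data that HDFS physically stores, while a split is a logical division of that data</ins>.
(4) Submit the split plan file to YARN. The MRAppMaster on YARN then uses the split plan to work out how many MapTasks to start.

2. **FileInputFormat split size configuration**
According to the source code, FileInputFormat computes the split size as <ins>Math.max(minSize, Math.min(maxSize, blockSize))</ins>. The result depends on these values:
```
mapreduce.input.fileinputformat.split.minsize=1                # default: 1
mapreduce.input.fileinputformat.split.maxsize=Long.MAX_VALUE   # default: Long.MAX_VALUE
```
Therefore, by default, split size = blocksize.
```
maxsize (maximum split size): if set lower than blocksize, splits become smaller and are exactly the configured value.
minsize (minimum split size): if set higher than blockSize, splits become larger than blocksize.
```

3. **API for reading split information: use the MapTask context object**
```java
// Inside a Mapper's setup() or map(); needs
// import org.apache.hadoop.mapreduce.lib.input.FileSplit;
// For file-based input the split can be cast to FileSplit
FileSplit inputSplit = (FileSplit) context.getInputSplit();
// Get the name of the file this split belongs to
String name = inputSplit.getPath().getName();
```

4. **Summary**
Default split rules of FileInputFormat:
(1) Splits are cut simply by the byte length of the file contents
(2) The split size defaults to the block size
(3) Splitting does not consider the data set as a whole; each file is split on its own

<br/>

# 3. FileInputFormat implementation classes
FileInputFormat is an abstract class with many implementations. The default one is TextInputFormat.

1. **TextInputFormat**
TextInputFormat is the default InputFormat. Each record is one line of input. <ins>The key is a LongWritable holding the byte offset of the line within the whole file. The value is the content of the line, without any line terminators (newline and carriage return)</ins>. For example, suppose a split contains the following 4 text records.
```txt
Rich learning form
Intelligent learning engine
Learning more convenient
From the real demand for more close to the enterprise
```
Each record is represented as the following key/value pair.
```txt
(0,Rich learning form)
(19,Intelligent learning engine)
(47,Learning more convenient)
(72,From the real demand for more close to the enterprise)
```
Clearly, the key is not a line number. Line numbers are generally hard to obtain, because files are cut into splits by bytes, not by lines.

2. **KeyValueTextInputFormat** (extended material)
Each line is one record, divided by a separator into a key and a value. The separator can be set in the driver class with `conf.set(KeyValueLineRecordReader.KEY_VALUE_SEPERATOR, " ");`. The default separator is a tab (`\t`).<br/>
For example, take a split that contains 4 records, where `——>` stands for a (horizontal) tab character.
```
line1 ——>Rich learning form
line2 ——>Intelligent learning engine
line3 ——>Learning more convenient
line4 ——>From the real demand for more close to the enterprise
```
Each record is represented as the following key/value pair.
```
(line1,Rich learning form)
(line2,Intelligent learning engine)
(line3,Learning more convenient)
(line4,From the real demand for more close to the enterprise)
```
Here the key is the Text sequence that precedes the tab on each line.

3. **NLineInputFormat** (extended material)
With NLineInputFormat, the InputSplit handled by each map task is no longer derived from blocks but from the number of lines N configured for NLineInputFormat. That is, `total number of lines in the input file / N = number of splits`; if the division is not exact, `number of splits = quotient + 1`. The example below reuses the 4 lines of input from above.
```
Rich learning form
Intelligent learning engine
Learning more convenient
From the real demand for more close to the enterprise
```
If N is 2, each input split contains two lines and 2 MapTasks are started. The first mapper receives the first two lines:
```
(0,Rich learning form)
(19,Intelligent learning engine)
```
The other mapper receives the last two lines:
```
(47,Learning more convenient)
(72,From the real demand for more close to the enterprise)
```
The keys and values are the same as those produced by TextInputFormat.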
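
The split arithmetic described in sections 2 and 3 can be checked with a small stand-alone sketch. It is not Hadoop code: `SplitMath` and its method names are hypothetical, and it only assumes what the text states, namely the formula `Math.max(minSize, Math.min(maxSize, blockSize))`, the 1.1 threshold for the last split, and the NLineInputFormat rule `number of splits = quotient + 1` when the division is not exact.

```java
/**
 * Illustrative helper (hypothetical, not part of Hadoop) that reproduces
 * the split arithmetic described in this page using plain Java only.
 */
final class SplitMath {
    // Assumption taken from the text: a remainder of at most
    // 1.1 x splitSize is kept as one split instead of being cut again.
    static final double SPLIT_SLOP = 1.1;

    /** Split size = max(minSize, min(maxSize, blockSize)). */
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    /** Number of splits planned for one file of the given length. */
    static int countFileSplits(long fileLength, long splitSize) {
        int splits = 0;
        long remaining = fileLength;
        // Keep cutting full-size splits while the remainder exceeds the threshold.
        while ((double) remaining / splitSize > SPLIT_SLOP) {
            splits++;
            remaining -= splitSize;
        }
        // Whatever is left (if anything) becomes the final split.
        if (remaining > 0) {
            splits++;
        }
        return splits;
    }

    /** NLineInputFormat rule: lines / N, rounded up. */
    static long countNLineSplits(long totalLines, long n) {
        return (totalLines + n - 1) / n;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long splitSize = computeSplitSize(128 * mb, 1L, Long.MAX_VALUE);
        System.out.println(splitSize / mb);                        // 128
        System.out.println(countFileSplits(300 * mb, splitSize));  // 3
        System.out.println(countFileSplits(140 * mb, splitSize));  // 1
        System.out.println(countNLineSplits(4, 2));                // 2
        System.out.println(countNLineSplits(5, 2));                // 3
    }
}
```

The 300M case reproduces the hello.txt example (128M, 128M, 44M). The 140M case shows the effect of the threshold: 140/128 is about 1.09, which is not larger than 1.1, so the file stays in a single split even though it spans two blocks.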