InputFormat接口實現類 · JAVA

[TOC] # 簡介 MapReduce任務的輸入文件一般是存儲在HDFS里面.輸入的文件格式包括:基于行的日志文件,二進制格式文件.這些文件一般很大,達到數十G,甚至更大,那么MapReduce是如何讀取這些數據的? InputFormat常見的接口實現類包括: TextInputFormat, KeyValueTextInputFormat, NLineInputFormat, CombineTextInputFormat和自定義InputFormat等 # TextInputFormat TextInputFormat是默認的InputFormat.每條記錄是一行輸入. 鍵是LongWritable類型,存儲該行在整個文件中的字節偏移量,值是這行的內容,不包括任何行終止符(換行符和回車符) 一些是一個示例,比如一個分片包含了如下4條文本記錄 ~~~ Rich learning form Intelligent learning engine Learning more convenient From the real demand for more close to the enterprise ~~~ 每條記錄表示為以下鍵值對 ~~~ (0, Rich learning form) (19, Intelligent learning engine) (47, Learning more convenient) (72, From the real demand for more close to the enterprise) ~~~ 很明顯,鍵并不是行號.一般情況下,很難取得行號,因為文件按字節而不是按行切分為分片 # KeyValueTextInputFormat **每一行均為一條記錄**,被分割符分割成key, value. 可以通過在驅動類中設置 ~~~ conf.set(keyValueLineRecordReader.KEY_VALUE_SEPERATOR, " "); ~~~ 來設定分割符,默認分割符是tab(\t) 以下是一個示例,輸入是一個包含4條記錄的分片.其中--->表示一個(水平方向的)制表符 ~~~ line1 ---> Rich learning form line2 ---> Intelligent learning engine line3 ---> Learning more convenient line4 ---> From the real demand for more close to the enterprise ~~~ 每條記錄表示為以下鍵/值對 ~~~ (line1, Rich learning form) (line2, Intelligent learning engine) (line3, Learning more convenient) (line4, From the real demand for more close to the enterprise) ~~~ 此時的鍵是每行排在制表符之前的Text序列 ## 代碼 **map** ~~~ import org.apache.hadoop.io.LongWritable; import org.apache.hadoop.io.Text; import org.apache.hadoop.mapreduce.Mapper; import java.io.IOException; public class KVTextMapper extends Mapper<Text, Text, Text, LongWritable> { Text k = new Text(); LongWritable v = new LongWritable(); @Override protected void map(Text key, Text value, Context context) throws IOException, InterruptedException { //設置key和value k.set(key); //設置key的個數 v.set(1); //寫出 context.write(k, v); } } ~~~ **reducer** ~~~ import org.apache.hadoop.io.LongWritable; import org.apache.hadoop.io.Text; import org.apache.hadoop.mapreduce.Reducer; import java.io.IOException; public class KVTextReducer extends Reducer<Text, LongWritable, Text, LongWritable> { LongWritable v = new LongWritable(); @Override protected void reduce(Text key, Iterable<LongWritable> values, Context context) throws IOException, InterruptedException { long count = 0l; //匯總統計 for (LongWritable value : values) { count += value.get(); } v.set(count); //輸出 context.write(key, v); } } ~~~ **驅動** ~~~ import org.apache.hadoop.conf.Configuration; import org.apache.hadoop.fs.Path; import org.apache.hadoop.io.LongWritable; import org.apache.hadoop.io.Text; import org.apache.hadoop.mapreduce.Job; import org.apache.hadoop.mapreduce.lib.input.FileInputFormat; import org.apache.hadoop.mapreduce.lib.input.KeyValueLineRecordReader; import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat; import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat; import java.io.IOException; public class MyDriver { public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException { Configuration conf = new Configuration(); //設置切割符 conf.set(KeyValueLineRecordReader.KEY_VALUE_SEPERATOR, " "); //獲取job對象 Job job = Job.getInstance(conf); //設置jar包關系,關聯mapper和reducer job.setJarByClass(MyDriver.class); //告訴框架,我們程序所用的mapper類和reduce類是什么 job.setMapperClass(KVTextMapper.class); job.setReducerClass(KVTextReducer.class); //告訴框架我們程序輸出的類型 job.setMapOutputKeyClass(Text.class); job.setMapOutputValueClass(LongWritable.class); job.setOutputKeyClass(Text.class); job.setOutputValueClass(LongWritable.class); //告訴框架,我們程序使用的數據讀取組件,結果輸出所用的組件是什么 //TextInputFormat是mapreduce程序中內置的一種讀取數據組件,準備的叫做讀取文本的輸入組件 job.setInputFormatClass(KeyValueTextInputFormat.class); //job.setOutputFormatClass(TextOutputFormat.class); //告訴框架,我們要處理的數據文件在那個路徑下 FileInputFormat.setInputPaths(job,new Path("/Users/jdxia/Desktop/website/data/input/")); //告訴框架我們的處理結果要輸出到什么地方 FileOutputFormat.setOutputPath(job,new Path("/Users/jdxia/Desktop/website/data/output/")); //提交后,然后等待服務器端返回值,看是不是true boolean res = job.waitForCompletion(true); //設置成功就退出碼為0 System.exit(res?0:1); } } ~~~ # NLineInputFormat 如果使用NLineInputFormat,代表每個map進程處理的InputSplit不再按block塊去劃分,而是按NLineInputFormat指定的行數N來劃分.即輸入文件的總行數/N=切片數,如果不整除,切片數=商+1 以下是一個示例,仍然以上面的4行輸入為例子 ~~~ Rich learning form Intelligent learning engine Learning more convenient From the real demand for more close to the enterprise ~~~ 例如,如果N是2,則每個輸入分片包含兩行,開啟2個maptask ~~~ (0, Rich learning form) (19, Intelligent learning engine) ~~~ 另一個mapper則收到后兩行 ~~~ (47, Learning more convenient) (72, From the real demand for more close to the enterprise) ~~~ 這里的鍵和值與TextInputFormat生成的一樣 ## 代碼 **map** ~~~ import org.apache.hadoop.io.LongWritable; import org.apache.hadoop.io.Text; import org.apache.hadoop.mapreduce.Mapper; import java.io.IOException; public class NLineMapper extends Mapper<LongWritable, Text, Text, LongWritable> { private Text k = new Text(); private LongWritable v = new LongWritable(1); @Override protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException { //獲取一行 String line = value.toString(); //切割 String[] splited = line.split(" "); //循環寫出 for (int i = 0; i < splited.length; i++) { k.set(splited[i]); context.write(k, v); } } } ~~~ **reducer** ~~~ import org.apache.hadoop.io.LongWritable; import org.apache.hadoop.io.Text; import org.apache.hadoop.mapreduce.Reducer; import java.io.IOException; public class NLineReducer extends Reducer<Text, LongWritable, Text, LongWritable> { LongWritable v = new LongWritable(); @Override protected void reduce(Text key, Iterable<LongWritable> values, Context context) throws IOException, InterruptedException { long count= 0l; //匯總 for (LongWritable value : values) { count += value.get(); } v.set(count); //輸出 context.write(key, v); } } ~~~ **驅動** ~~~ import org.apache.hadoop.conf.Configuration; import org.apache.hadoop.fs.Path; import org.apache.hadoop.io.LongWritable; import org.apache.hadoop.io.Text; import org.apache.hadoop.mapreduce.Job; import org.apache.hadoop.mapreduce.lib.input.FileInputFormat; import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat; import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat; import java.io.IOException; public class NLineDriver { public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException { //獲取job對象 Configuration conf = new Configuration(); Job job = Job.getInstance(conf); //設置每個切片InputSplit中劃分三條記錄 NLineInputFormat.setNumLinesPerSplit(job, 3); //使用NLineInputFormat處理記錄 job.setInputFormatClass(NLineInputFormat.class); //設置jar包位置,關聯mapper和reducer job.setJarByClass(NLineDriver.class); job.setMapperClass(NLineMapper.class); job.setReducerClass(NLineReducer.class); //設置map輸出kv類型 job.setMapOutputKeyClass(Text.class); job.setMapOutputValueClass(LongWritable.class); //設置最終輸出kv類型 job.setOutputKeyClass(Text.class); job.setOutputValueClass(LongWritable.class); //告訴框架,我們要處理的數據文件在那個路徑下 FileInputFormat.setInputPaths(job,new Path("/Users/jdxia/Desktop/website/data/input/")); //告訴框架我們的處理結果要輸出到什么地方 FileOutputFormat.setOutputPath(job,new Path("/Users/jdxia/Desktop/website/data/output/")); //這邊不用submit,因為一提交就和我這個沒關系了,我這就斷開了就看不見了 // job.submit(); //提交后,然后等待服務器端返回值,看是不是true boolean res = job.waitForCompletion(true); //設置成功就退出碼為0 System.exit(res?0:1); } } ~~~ 輸出結果的切片數 ~~~ number of splits ~~~