Avro · Hadoop2.x

[TOC]

# 1. Avro Features and Storage Format

Apache Avro is a data serialization system created by Doug Cutting, the creator of Hadoop. An Avro file stores the data definition (schema) as JSON and the data itself in a binary format.

Official site: http://avro.apache.org/docs/current/

Features:

- Rich data structures
- A fast, compact binary data format
- A container file for persisting data
- Built-in remote procedure call (RPC)
- Easy handling of Avro data from dynamic languages

:-: ![](https://img.kancloud.cn/c9/70/c97011a2287a821c780862ac9b16b2fe_1075x628.png)
Avro storage format

Primitive types: null, boolean, int, long, float, double, bytes, string

Complex types: record, enum, array, map, union, fixed

You can produce Avro files from your own code, or with the avro-tools application (a single jar).

# 2. Creating Avro Files with avro-tools

(1) Define the storage format (schema) of the User object in `user.avsc`.

```json
{
  "namespace": "example.avro",
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "name", "type": "string"},
    {"name": "favorite_number", "type": "int"},
    {"name": "favorite_color", "type": "string"}
  ]
}
```

(2) Store the data in `user.json`, one JSON record per line.

```json
{"name": "Alyssa", "favorite_number": 256, "favorite_color": "black"}
{"name": "Ben", "favorite_number": 7, "favorite_color": "red"}
{"name": "Charlie", "favorite_number": 12, "favorite_color": "blue"}
```

(3) Run avro-tools.jar to combine the schema and the data into `user.avro`. The jar can be downloaded from https://mvnrepository.com/artifact/org.apache.avro/avro-tools.

```shell
[root@hadoop101 /]# java -jar /opt/software/avro-tools-1.8.2.jar fromjson --schema-file \
/hdatas/user.avsc /hdatas/user.json > /hdatas/user.avro
```

Or write a compressed file:

```shell
[root@hadoop101 /]# java -jar /opt/software/avro-tools-1.8.2.jar fromjson --codec snappy --schema-file \
/hdatas/user.avsc /hdatas/user.json > /hdatas/user.avro
```

(4) `user.avro` can also be converted back to JSON.

```shell
# Print the records as JSON
[root@hadoop101 /]# java -jar /opt/software/avro-tools-1.8.2.jar tojson /hdatas/user.avro
{"name":"Alyssa","favorite_number":256,"favorite_color":"black"}
{"name":"Ben","favorite_number":7,"favorite_color":"red"}
{"name":"Charlie","favorite_number":12,"favorite_color":"blue"}

# Save the output to user_002.json
[root@hadoop101 /]# java -jar /opt/software/avro-tools-1.8.2.jar tojson \
/hdatas/user.avro > /hdatas/user_002.json
```

Or output pretty-printed JSON:

```shell
[root@hadoop101 /]# java -jar /opt/software/avro-tools-1.8.2.jar tojson --pretty /hdatas/user.avro
{
  "name" : "Alyssa",
  "favorite_number" : 256,
  "favorite_color" : "black"
}
{
  "name" : "Ben",
  "favorite_number" : 7,
  "favorite_color" : "red"
}
{
  "name" : "Charlie",
  "favorite_number" : 12,
  "favorite_color" : "blue"
}

[root@hadoop101 /]# java -jar /opt/software/avro-tools-1.8.2.jar tojson --pretty \
/hdatas/user.avro > /hdatas/user_002.json
```

(5) Get the metadata of `user.avro`.

```shell
# Print the metadata of user.avro
[root@hadoop101 /]# java -jar /opt/software/avro-tools-1.8.2.jar getmeta /hdatas/user.avro
avro.schema     {"type":"record","name":"User","namespace":"example.avro",
                "fields":[{"name":"name","type":"string"},
                {"name":"favorite_number","type":"int"},
                {"name":"favorite_color","type":"string"}]}
avro.codec      snappy
```

(6) Get the schema of `user.avro`.

```shell
# Print the schema
[root@hadoop101 /]# java -jar /opt/software/avro-tools-1.8.2.jar getschema /hdatas/user.avro
{
  "type" : "record",
  "name" : "User",
  "namespace" : "example.avro",
  "fields" : [ {
    "name" : "name",
    "type" : "string"
  }, {
    "name" : "favorite_number",
    "type" : "int"
  }, {
    "name" : "favorite_color",
    "type" : "string"
  } ]
}

# Save the output to user_002.avsc
[root@hadoop101 /]# java -jar /opt/software/avro-tools-1.8.2.jar getschema /hdatas/user.avro > /hdatas/user_002.avsc
```

**Listing the available commands**

```shell
[root@hadoop101 /]# java -jar /opt/software/avro-tools-1.8.2.jar
Version 1.8.2 of Apache Avro
Copyright 2010-2015 The Apache Software Foundation

This product includes software developed at
The Apache Software Foundation (http://www.apache.org/).
----------------
Available tools:
          cat  extracts samples from files
      compile  Generates Java code for the given schema.
       concat  Concatenates avro files without re-compressing.
   fragtojson  Renders a binary-encoded Avro datum as JSON.
     fromjson  Reads JSON records and writes an Avro data file.
     fromtext  Imports a text file into an avro data file.
      getmeta  Prints out the metadata of an Avro data file.
    getschema  Prints out schema of an Avro data file.
          idl  Generates a JSON schema from an Avro IDL file
 idl2schemata  Extract JSON schemata of the types from an Avro IDL file
       induce  Induce schema/protocol from Java class/interface via reflection.
   jsontofrag  Renders a JSON-encoded Avro datum as binary.
       random  Creates a file with randomly generated instances of a schema.
      recodec  Alters the codec of a data file.
       repair  Recovers data from a corrupt Avro Data file
  rpcprotocol  Output the protocol of a RPC service
   rpcreceive  Opens an RPC Server and listens for one message.
      rpcsend  Sends a single RPC message.
       tether  Run a tethered mapreduce job.
       tojson  Dumps an Avro data file as JSON, record per line or pretty.
       totext  Converts an Avro data file to a text file.
     totrevni  Converts an Avro data file to a Trevni file.
  trevni_meta  Dumps a Trevni file's metadata as JSON.
trevni_random  Create a Trevni file filled with random instances of a schema.
trevni_tojson  Dumps a Trevni file as JSON.
```

**Listing the options of a command**

```shell
[root@hadoop101 /]# java -jar /opt/software/avro-tools-1.8.2.jar fromjson
Expected 1 arg: input_file
Option               Description
------               -----------
--codec              Compression codec (default: null)
--level <Integer>    Compression level (only applies to deflate and xz) (default: -1)
--schema             Schema
--schema-file        Schema File
```

# 3. Reading and Writing Avro in Java

Add the following build plugins to `pom.xml`. The `avro-maven-plugin` generates Java classes from the schemas under `src/main/avro/` during the `generate-sources` phase. The examples below also need the `org.apache.avro:avro` library and JUnit on the classpath.

```xml
<build>
    <plugins>
        <plugin>
            <groupId>org.apache.avro</groupId>
            <artifactId>avro-maven-plugin</artifactId>
            <version>1.10.1</version>
            <executions>
                <execution>
                    <phase>generate-sources</phase>
                    <goals>
                        <goal>schema</goal>
                    </goals>
                    <configuration>
                        <sourceDirectory>${project.basedir}/src/main/avro/</sourceDirectory>
                        <outputDirectory>${project.basedir}/src/main/java/</outputDirectory>
                    </configuration>
                </execution>
            </executions>
        </plugin>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-compiler-plugin</artifactId>
            <configuration>
                <source>1.8</source>
                <target>1.8</target>
            </configuration>
        </plugin>
    </plugins>
</build>
```

## 3.1 Reading and Writing Avro with Code Generated by avro-tools

(1) Define the storage format (schema) in `user.avsc`.

```json
{
  "namespace": "example.avro",
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "name", "type": "string"},
    {"name": "favorite_number", "type": "int"},
    {"name": "favorite_color", "type": "string"}
  ]
}
```

(2) Avro can generate the corresponding Java class from the schema.

```shell
[root@hadoop101 /]# java -jar /opt/software/avro-tools-1.8.2.jar \
compile schema /hdatas/user.avsc /hdatas/User.java
```

The last argument is the output directory, so this command creates a directory named `User.java` and writes the class to `/hdatas/User.java/example/avro/User.java`.

(3) Use the generated `User` class to create the Avro file.

```java
package datamodel;

import example.avro.User;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.io.DatumReader;
import org.apache.avro.io.DatumWriter;
import org.apache.avro.specific.SpecificDatumReader;
import org.apache.avro.specific.SpecificDatumWriter;
import org.junit.Test;

import java.io.File;
import java.io.IOException;

public class CreateAvro1 {

    @Test
    public void createAvro() throws IOException {
        // 1. Create User objects. The generated class offers three ways to build one:
        //    setters, the all-args constructor, and the builder.
        User user1 = new User();
        user1.setName("Alyssa");
        user1.setFavoriteNumber(256);
        user1.setFavoriteColor("black");

        User user2 = new User("Ben", 7, "red");

        User user3 = User.newBuilder()
                .setName("Charlie")
                .setFavoriteNumber(12)
                .setFavoriteColor("blue")
                .build();

        /*
         * 2. Serialize: write the records to user.avro.
         *    DatumWriter converts Java objects into Avro's serialized form;
         *    SpecificDatumWriter is its implementation for generated classes (here User);
         *    DataFileWriter writes the schema and the records to the container file.
         */
        DatumWriter<User> userDatumWriter = new SpecificDatumWriter<>(User.class);
        // try-with-resources closes the writer even if an append fails
        try (DataFileWriter<User> dataFileWriter = new DataFileWriter<>(userDatumWriter)) {
            // Create user.avro
            dataFileWriter.create(user1.getSchema(), new File("user.avro"));
            // Append the records to user.avro
            dataFileWriter.append(user1);
            dataFileWriter.append(user2);
            dataFileWriter.append(user3);
        }

        /*
         * 3. Deserialize: read the records back from user.avro.
         */
        File file = new File("user.avro");
        DatumReader<User> userDatumReader = new SpecificDatumReader<>(User.class);
        try (DataFileReader<User> dataFileReader = new DataFileReader<>(file, userDatumReader)) {
            User user = null;
            while (dataFileReader.hasNext()) {
                // Passing the previous object lets Avro reuse it instead of allocating a new one
                user = dataFileReader.next(user);
                System.out.println(user);
            }
        }
    }
}
```

The code above prints:

```json
{"name": "Alyssa", "favorite_number": 256, "favorite_color": "black"}
{"name": "Ben", "favorite_number": 7, "favorite_color": "red"}
{"name": "Charlie", "favorite_number": 12, "favorite_color": "blue"}
```

## 3.2 Reading and Writing Avro with Custom Code

Here we create the Avro file without the class generated by avro-tools. Records are built as `GenericRecord` instances directly from the schema.

(1) Define the storage format (schema) in `user.avsc`.

```json
{
  "namespace": "example.avro",
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "name", "type": "string"},
    {"name": "favorite_number", "type": "int"},
    {"name": "favorite_color", "type": "string"}
  ]
}
```

(2) Java code

```java
package datamodel;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.DatumReader;
import org.apache.avro.io.DatumWriter;
import org.junit.Test;

import java.io.File;
import java.io.IOException;

public class CreateAvro2 {

    @Test
    public void createAvro() throws IOException {
        // 1. Load the schema from user.avsc
        Schema schema = new Schema.Parser().parse(new File("user.avsc"));

        // 2. Create the records
        GenericRecord user1 = new GenericData.Record(schema);
        user1.put("name", "Alyssa");
        user1.put("favorite_number", 256);
        user1.put("favorite_color", "black");

        GenericRecord user2 = new GenericData.Record(schema);
        user2.put("name", "Ben");
        user2.put("favorite_number", 7);
        user2.put("favorite_color", "red");

        GenericRecord user3 = new GenericData.Record(schema);
        user3.put("name", "Charlie");
        user3.put("favorite_number", 12);
        user3.put("favorite_color", "blue");

        /*
         * 3. Serialize: write the records to user.avro.
         *    DatumWriter converts Java objects into Avro's serialized form;
         *    GenericDatumWriter is its implementation for GenericRecord and is
         *    driven by the schema rather than by a generated class;
         *    DataFileWriter writes the schema and the records to the container file.
         */
        DatumWriter<GenericRecord> userDatumWriter = new GenericDatumWriter<>(schema);
        // try-with-resources closes the writer even if an append fails
        try (DataFileWriter<GenericRecord> dataFileWriter = new DataFileWriter<>(userDatumWriter)) {
            // Create user.avro
            dataFileWriter.create(schema, new File("user.avro"));
            // Append the records to user.avro
            dataFileWriter.append(user1);
            dataFileWriter.append(user2);
            dataFileWriter.append(user3);
        }

        // 4. Deserialize: read the records back from user.avro
        File file = new File("user.avro");
        DatumReader<GenericRecord> userDatumReader = new GenericDatumReader<>(schema);
        try (DataFileReader<GenericRecord> dataFileReader = new DataFileReader<>(file, userDatumReader)) {
            GenericRecord user = null;
            while (dataFileReader.hasNext()) {
                user = dataFileReader.next(user);
                System.out.println(user);
            }
        }
    }
}
```

The code above prints:

```json
{"name": "Alyssa", "favorite_number": 256, "favorite_color": "black"}
{"name": "Ben", "favorite_number": 7, "favorite_color": "red"}
{"name": "Charlie", "favorite_number": 12, "favorite_color": "blue"}
```

# 4. Using Avro as the Storage Format in Hive

```sql
-- Option 1: declare the columns and store the table as Avro
create external table user_avro_ext(
    name string,
    favorite_number int,
    favorite_color string
)
stored as avro;

-- Option 2: name the SerDe and file formats explicitly and supply the Avro schema
create table customers
row format serde 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
stored as
    inputformat 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
    outputformat 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
tblproperties ('avro.schema.literal'='{
    "name": "customer",
    "type": "record",
    "fields": [
        {"name":"firstName", "type":"string"},
        {"name":"lastName", "type":"string"},
        {"name":"age", "type":"int"},
        {"name":"salary", "type":"double"},
        {"name":"department", "type":"string"},
        {"name":"title", "type":"string"},
        {"name": "address", "type": "string"}]}');
```
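
The first table definition above lists the same three fields as the `User` schema, with each Avro type replaced by its Hive counterpart. The following is a minimal sketch of that correspondence, assuming the primitive-type conversions documented for Hive's AvroSerDe (for example, Avro `long` becomes Hive `bigint` and `bytes` becomes `binary`). The class `AvroHiveColumns` and the method `hiveColumns` are hypothetical helpers written for this illustration; they are not part of Avro or Hive, and complex types such as record, array, and union are deliberately left out.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

/** Illustrative helper (hypothetical): builds a Hive column list from Avro primitive field types. */
final class AvroHiveColumns {

    // Avro primitive type -> Hive type (assumed from the Hive AvroSerDe conversion rules)
    private static final Map<String, String> AVRO_TO_HIVE = new HashMap<>();

    static {
        AVRO_TO_HIVE.put("boolean", "boolean");
        AVRO_TO_HIVE.put("int", "int");
        AVRO_TO_HIVE.put("long", "bigint");
        AVRO_TO_HIVE.put("float", "float");
        AVRO_TO_HIVE.put("double", "double");
        AVRO_TO_HIVE.put("bytes", "binary");
        AVRO_TO_HIVE.put("string", "string");
    }

    /** Turns (field name -> Avro primitive type) into "name type, name type, ..." in insertion order. */
    static String hiveColumns(Map<String, String> avroFields) {
        StringJoiner columns = new StringJoiner(", ");
        for (Map.Entry<String, String> field : avroFields.entrySet()) {
            String hiveType = AVRO_TO_HIVE.get(field.getValue());
            if (hiveType == null) {
                // Complex or unknown types are outside the scope of this sketch
                throw new IllegalArgumentException("unsupported Avro type: " + field.getValue());
            }
            columns.add(field.getKey() + " " + hiveType);
        }
        return columns.toString();
    }

    public static void main(String[] args) {
        // Fields of the User schema from user.avsc, in declaration order
        Map<String, String> user = new LinkedHashMap<>();
        user.put("name", "string");
        user.put("favorite_number", "int");
        user.put("favorite_color", "string");

        // prints: name string, favorite_number int, favorite_color string
        System.out.println(hiveColumns(user));
    }
}
```

The printed column list matches the column definitions of `user_avro_ext`. For schemas with nested or nullable fields it is usually simpler to hand Hive the schema itself, as the second table definition does with `avro.schema.literal`.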