select · Big Data

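[TOC]

## select

* Basic SELECT operations
    * Syntax

~~~
SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list [HAVING condition]]
[CLUSTER BY col_list
  | [DISTRIBUTE BY col_list] [SORT BY | ORDER BY col_list]
]
[LIMIT number]
~~~

**Notes:**

1. **`order by` sorts the input globally, so it runs with a single reducer. On large inputs this can take a long time.**
2. **`sort by` is not a global sort; it sorts the data before it enters each reducer. If you sort with `sort by` and set `mapred.reduce.tasks>1`, only the output of each individual reducer is guaranteed to be ordered, not the overall result.**
3. `distribute by (column)` sends rows to different reducers according to the given column, using a hash of that column.
4. `cluster by (column)` does what `distribute by` does and also sorts on that column. So when the distribution column and the sort column are the same, `cluster by = distribute by + sort by`.

What bucketed tables are for: their main use is to make join operations more efficient. (Think about this query: `select a.id,a.name,b.addr from a join b on a.id = b.id;` If tables a and b are both bucketed, and both are bucketed on the id column, does this join still need a Cartesian product over the full tables?)

**Note: Hive provides a "strict mode" setting that blocks queries which may have unexpected harmful effects.**

Setting the property hive.mapred.mode to strict blocks the following three kinds of queries:

1. Queries on a partitioned table that have no partition filter in the WHERE clause. Partitioned tables usually hold a lot of data, and a query that does not restrict the partition scans every partition and consumes a lot of resources. Not allowed: `select * from logs;` Allowed: `select * from logs where day=20151212;`
2. Queries that use `order by` without a `limit` clause. `order by` sends all results to a single reducer for sorting, which is slow.
3. Cartesian products. Put simply, a JOIN written without ON that relies on WHERE instead.

**Example**

~~~
create external table student_ext(Sno int,Sname string,Sex string,Sage int,Sdept string)
row format delimited fields terminated by ','
location '/stu';
~~~

~~~
-- filter with where
select * from student_ext where sno=95020;

-- group by
select sex,count(*) from student_ext group by sex;
~~~

~~~
-- distribute and sort; with only 1 reducer this has no real effect
select * from student_ext cluster by sex;
~~~

~~~
-- use 4 reducers;
-- each reducer then sorts its own share of the rows
set mapred.reduce.tasks=4;
create table tt_1 as select * from student_ext cluster by sno;

-- check the result: the tt_1 directory now contains 4 files
dfs -cat /user/hive/warehouse/db1.db/tt_1/000000_0;

-- same result as above, also split across 4 reducers
create table tt_2 as select * from student_ext distribute by sno sort by sno;

-- the sort column can differ from the distribution column
create table tt_3 as select * from student_ext distribute by sno sort by sage;
~~~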
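
The join question raised in the notes can be made concrete with a small sketch. It assumes that tables `a(id, name)` and `b(id, addr)` already exist. The names `a_bucketed` and `b_bucketed`, the column types, and the bucket count of 4 are hypothetical choices for illustration, not taken from the original. Whether `hive.enforce.bucketing` is needed depends on the Hive version (older releases require it; newer ones always enforce bucketing), so treat that line as an assumption about your environment.

~~~
-- Hypothetical bucketed copies of a and b, both bucketed on the join key id.
create table a_bucketed (id int, name string)
clustered by (id) sorted by (id) into 4 buckets;

create table b_bucketed (id int, addr string)
clustered by (id) sorted by (id) into 4 buckets;

-- Assumption: an older Hive release, where inserts only honor the
-- bucket definition when this is set.
set hive.enforce.bucketing=true;

insert overwrite table a_bucketed select id, name from a;
insert overwrite table b_bucketed select id, addr from b;

-- Allow Hive to join bucket against matching bucket instead of
-- comparing every row of one table with every row of the other.
set hive.optimize.bucketmapjoin=true;

select a.id, a.name, b.addr
from a_bucketed a join b_bucketed b on a.id = b.id;
~~~

Because both tables are hashed on `id` into the same number of buckets, rows with equal `id` land in buckets with the same index, so each bucket of `a_bucketed` only needs to be matched against the corresponding bucket of `b_bucketed`.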