1. Conditionals
if(condition, value_if_true, value_if_false)
case when condition1 then value1 when condition2 then value2 else default_value end AS column_name
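A minimal sketch of both forms (score/exam are made-up names for illustration):

```sql
-- if(): two-way branch
SELECT if(score >= 60, 'pass', 'fail') AS result FROM exam;

-- case when: multi-way branch
SELECT case when score >= 90 then 'A'
            when score >= 60 then 'B'
            else 'C'
       end AS grade
FROM exam;
```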
2. parse_url: parse parts out of a URL string
parse_url(url, part_to_extract, key)
URL parts: HOST, PATH, QUERY, REF, PROTOCOL, FILE, AUTHORITY, USERINFO (the third argument is only used with QUERY, to pull out a single parameter)
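For example (the URL is made up):

```sql
SELECT parse_url('http://example.com/path?uid=123&page=index', 'HOST');          -- example.com
SELECT parse_url('http://example.com/path?uid=123&page=index', 'QUERY');         -- uid=123&page=index
SELECT parse_url('http://example.com/path?uid=123&page=index', 'QUERY', 'uid');  -- 123
```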
3. Parsing map-type columns: column_name[key]
For data like [uid -> 119024341, currPage -> indexpage, bannerType -> yueke, timestamp -> 1619440226820] (column type: map),
access a value by key, e.g. props['presaleId']
4. Null filling
nvl(a, b): if a is NULL, use b instead; suited to checking and filling between two columns
coalesce(a, b, c): checks a, b, c in turn and returns the first non-NULL value
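A quick illustration with literal values, runnable as-is:

```sql
SELECT nvl(NULL, 'b');            -- b
SELECT coalesce(NULL, NULL, 'c'); -- c
SELECT coalesce('a', 'b', 'c');   -- a
```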
5. get_json_object(context, '$.field')
The context column is a JSON string.
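For example, with an inline JSON literal (the field names are made up):

```sql
SELECT get_json_object('{"uid": 119024341, "page": "index"}', '$.uid');        -- 119024341
-- nested fields use dot paths, array elements use [i]
SELECT get_json_object('{"items": [{"id": 1}, {"id": 2}]}', '$.items[0].id');  -- 1
```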
6. Cutting a string around a delimiter
substring_index(str, delim, count)
Explanation: substring_index(column_to_cut, delimiter, number_of_occurrences)
Example: select substring_index("blog.jlb51.net", "l", 2)
Result: blog.j
(Note: if the count is negative, e.g. -2, the occurrences are counted from the end of the string, and everything to the right is returned)
Example: select substring_index("blog.jlb51.net", "l", -2)
Result: og.jlb51.net
7. cache table table_name — cache an intermediate table
8. Within a single row, take the maximum (greatest) or minimum (least) across several columns
For SQL that needs the max/min over multiple columns rather than over rows
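For example:

```sql
SELECT greatest(3, 7, 5);  -- 7
SELECT least(3, 7, 5);     -- 3
-- typical use: the latest of several timestamp columns in the same row
-- (orders and its columns are hypothetical)
-- SELECT greatest(create_time, update_time, pay_time) FROM orders;
```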
9. explode drops rows whose array is NULL or empty (use explode_outer to keep them with a NULL element)
10. UDFs
Official Spark UDF reference: Spark SQL, Built-in Functions
11. NULLs in comparisons
Table A needs the rows where column a is not equal to 'aaa' (a contains NULLs)
Wrong: select * from A where a != 'aaa' (NULL rows are filtered out too, because NULL != 'aaa' evaluates to NULL, not true)
Right: select * from A where (a != 'aaa' or a is null)
12. ARRAY operations
Build: collect_set(struct(a.lesson_id, b.lesson_title, b.lesson_type_id))
Query: where array_contains(column, 17) -- 17 being the target value
13. Renaming a table
ALTER TABLE source_table RENAME TO target_table
14. first_value(), last_value() — window functions returning the first/last value in the window frame
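One caveat worth noting (my addition, not from the original notes): with only an ORDER BY, the default frame ends at the current row, so last_value() just returns the current row's value unless the frame is widened explicitly:

```sql
SELECT sid,
       first_value(pv) OVER (PARTITION BY sid ORDER BY day) AS first_pv,
       last_value(pv)  OVER (PARTITION BY sid ORDER BY day
                             ROWS BETWEEN UNBOUNDED PRECEDING
                                      AND UNBOUNDED FOLLOWING) AS last_pv
FROM VALUES ('2020-04-01', 'd1', 21),
            ('2020-04-02', 'd1', 11) AS log(day, sid, pv);
```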
15. Get the day of the week
date_format(column (timestamp), 'u') -- 1 = Monday … 7 = Sunday; note that newer Spark versions may reject the 'u' pattern, where dayofweek() (1 = Sunday … 7 = Saturday) is an alternative
16. struct column type
17. == (equality with implicit casts)
select 1 == '1'      -- true
select 1 == 1        -- true
select 1 == '2'      -- false
select 1 == 'jiang'  -- NULL (the string cannot be cast to a number)
18. case when a = 'xx' then 1
         when a = 'yy' then 2
         else 3 end AS column_name
19. row_number() over(partition by trade_order_no order by some_column) -- an order by column is required; with ties the numbering is non-deterministic (see item 21)
20. not in
Note: rows where the column is NULL are also excluded by not in (the comparison evaluates to NULL, not true)
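For example, on inline data:

```sql
-- the NULL row disappears from the plain NOT IN result
SELECT * FROM VALUES ('aaa'), ('bbb'), (NULL) AS t(a)
WHERE a NOT IN ('aaa');                 -- only bbb

SELECT * FROM VALUES ('aaa'), ('bbb'), (NULL) AS t(a)
WHERE a NOT IN ('aaa') OR a IS NULL;    -- bbb and the NULL row
```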
21. cache does not just improve computation efficiency; leaving it out can sometimes produce incorrect data
table1:
user_id  course  time      order_no
001      math    20210701  20210701002
001      math    20210701  20210701003
select *
,row_number() over(partition by user_id, course order by time) px
from table1
as table1_order;
select *
from table1_order
where px = 1
as table1_part1;
select *
from table1 a
left anti join table1_part1 b on a.order_no = b.order_no -- first pass
as table1_part2;
select * from table1_part1
union
select * from table1_part2
as result;
In the end, result may contain only a single row.
Reason: if table1_part1 is not cached, it gets computed twice, and because the ordering ties on time, row_number is non-deterministic. The first computation may assign px = 1 to 20210701002 (so table1_part2 holds 20210701003); the second computation may assign px = 1 to 20210701003. After union deduplicates, only the 20210701003 row survives. The fix is to add cache right after table1_part1 is computed, pinning the result so it is not recomputed.
Background reading: "spark程序中cache的作用 + 實驗" — 程序員大本營
22. The order of union all results is random
a
union all
b
union all
c
The result may come out as b, c, a
23. 2 - null = null
Fill NULLs before doing arithmetic on a column
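For example:

```sql
SELECT 2 - NULL;              -- NULL
SELECT 2 - coalesce(NULL, 0); -- 2
```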
24. Rows to columns, columns to rows
Rows to columns (collapse a group into one value):
collect_set (deduplicates) / collect_list (keeps duplicates)
concat_ws(':', collect_set(column))
Columns to rows (explode a delimited value into rows):
Single column:
SELECT stu_id,
       stu_name,
       ecourse
FROM student_score_new
lateral view explode(split(course, ',')) cr as ecourse -- the exploded column must be given a new name
Multiple columns (posexplode also emits a position, used to align the two lists):
SELECT stu_id,
       stu_name,
       ecourse,
       escore
FROM student_score_new
lateral view posexplode(split(course, ',')) cr as a, ecourse
lateral view posexplode(split(score, ',')) sc as b, escore
WHERE a = b -- keep only pairs at the same position
25. Random numbers
rand() generates a double in [0, 1)
Random integer in [min, max]: floor(rand() * (max - min + 1)) + min
Random value from {1, 2, 3}: floor(rand() * 3) + 1
Random value from {5, 6, 7, 8}: floor(rand() * 4) + 5
order by rand() limit 200 -- sample 200 rows at random
26. cube
The cube function is mostly used for drill-down queries.
group by cube(...) aggregates over every subset of the listed dimensions; it is equivalent to UNION ALLing the GROUP BY result sets of all the dimension combinations.
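A small sketch of that equivalence (sales/city/category/amt are made-up names):

```sql
SELECT city, category, sum(amt)
FROM sales
GROUP BY cube(city, category);

-- equivalent to UNION ALL over the four groupings:
--   GROUP BY city, category
--   GROUP BY city       (category comes back NULL)
--   GROUP BY category   (city comes back NULL)
--   no GROUP BY         (grand total; both NULL)
```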
27. grouping_id
Marks which dimension combination a row belongs to; rows from the same combination share the same grouping_id value
28. rollup
Aggregates hierarchically, with the leftmost dimensions as the primary levels; the row where all dimensions are NULL represents the overall total. rollup is a subset of cube and makes left-to-right drill-down analysis quick.
29. lag
Looks upward; lag(col, n, DEFAULT) returns the value n rows above the current row within the window
First argument: the column; second: how many rows up (optional, default 1); third: the default value (used when there is no row n positions above; NULL if not specified)
30. lead
Looks downward; lead(col, n, DEFAULT) returns the value n rows below the current row within the window
First argument: the column; second: how many rows down (optional, default 1); third: the default value (used when there is no row n positions below; NULL if not specified)
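Both in one query, on inline data:

```sql
SELECT day, pv,
       lag(pv, 1, 0)  OVER (ORDER BY day) AS prev_pv,
       lead(pv, 1, 0) OVER (ORDER BY day) AS next_pv
FROM VALUES ('2020-04-01', 21), ('2020-04-02', 11), ('2020-04-03', 51) AS log(day, pv);
-- for the 2020-04-02 row: prev_pv = 21, next_pv = 51
```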
31. ntile
Splits the rows into n roughly equal slices along some ordering; when the rows do not divide evenly, the earlier slices receive the extra rows first
32. ROWS BETWEEN
ROWS BETWEEN is a window-frame clause; it limits the range the accumulation runs over
select day, sid, pv,
       sum(pv) over(partition by sid order by day) pv1,
       sum(pv) over(partition by sid order by day ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) pv2
from VALUES ('2020-04-04', 'a1', 11),
            ('2020-04-03', 'd1', 51),
            ('2020-04-02', 'd1', 11),
            ('2020-04-01', 'd1', 21),
            ('2020-04-04', 'd1', 1) log(day, sid, pv)
33. SQL that, for every 15-minute point from midnight up to now, counts events from midnight to that point
WITH table1 AS (
SELECT '2022-11-28 00:15:00' AS window_start,1 AS test
UNION ALL
SELECT '2022-11-28 00:30:00' AS window_start,1 AS test
UNION ALL
SELECT '2022-11-28 00:45:00' AS window_start,1 AS test
UNION ALL
SELECT '2022-11-28 01:00:00' AS window_start,1 AS test
UNION ALL
SELECT '2022-11-28 01:15:00' AS window_start,1 AS test
UNION ALL
SELECT '2022-11-28 01:30:00' AS window_start,1 AS test
UNION ALL
SELECT '2022-11-28 01:45:00' AS window_start,1 AS test
UNION ALL
SELECT '2022-11-28 02:00:00' AS window_start,1 AS test
),
table2 AS (
SELECT FROM_UNIXTIME((FLOOR(unix_timestamp(appl_finished_dt, "yyyy-MM-dd HH:mm:ss")/900)*900)) as window_start,1 AS test
from fin_dw.xxx
where appl_finished_dt <= '2022-11-28 02:00:00' and appl_finished_dt >= '2022-11-28 00:00:00'
)
SELECT
b.window_start,
SUM(1) AS pv
FROM table2 a
JOIN table1 b
ON a.test = b.test and a.window_start <= b.window_start
GROUP BY b.window_start;
34. Computing the number of storage files for a table
Check the table's storage size: desc extended fin_dw.test
Convert to GB: https://tool.browser.qq.com/byte_cal.html
Estimate the table size, assume one file per 128 MB, compute number accordingly, then
distribute by cast(rand() * number AS INT);
35. Comparing two data sets (EXCEPT returns the rows of the first query that do not appear in the second)
--spark safelycc
select '張三' AS name,41 AS age
EXCEPT
select '張三' AS name,40 AS age