I. What is Spark SQL
Spark SQL is the Spark module for processing structured data. It provides two programming abstractions, DataFrame and Dataset, and acts as a distributed SQL query engine that translates traditional SQL into Spark's big-data computation model. The relationship between RDD, DataFrame and Dataset is sketched below.
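As a rough sketch (my own illustration, assuming a spark-shell session where sc, spark and its implicits are already in scope), the three abstractions convert into each other like this:
scala> case class Person(name: String, age: Int)
scala> val rdd = sc.parallelize(Seq(Person("Andy", 30), Person("Justin", 19)))   // RDD[Person]: raw objects, no schema
scala> val df = rdd.toDF()       // DataFrame (Dataset[Row]): the same data plus a schema
scala> val ds = df.as[Person]    // Dataset[Person]: a typed view over the same data
scala> ds.rdd                    // and back to an RDD when needed
In short, a DataFrame is an RDD with a schema, and a Dataset adds compile-time types on top of a DataFrame.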
二、hive 與spark sql
Hive translates Hive SQL into MapReduce jobs and submits them to the cluster, which greatly simplifies writing MapReduce programs; however, the MapReduce computation model is relatively slow. Spark SQL arose in response: it translates SQL into RDD operations and submits them to the cluster, which executes much faster. As an analogy: Hive turns SQL into MapReduce, while Spark SQL turns SQL into RDDs (Hive: SQL --> MapReduce; Spark SQL: SQL --> RDD).
III. Test data
We use two CSV files as test data:
dept.csv:
10,ACCOUNTING,NEW YORK
20,RESEARCH,DALLAS
30,SALES,CHICAGO
40,OPERATIONS,BOSTON
emp.csv:
7369,SMITH,CLERK,7902,1980/12/17,800,,20
7499,ALLEN,SALESMAN,7698,1981/2/20,1600,300,30
7521,WARD,SALESMAN,7698,1981/2/22,1250,500,30
7566,JONES,MANAGER,7839,1981/4/2,2975,,20
7654,MARTIN,SALESMAN,7698,1981/9/28,1250,1400,30
7698,BLAKE,MANAGER,7839,1981/5/1,2850,,30
7782,CLARK,MANAGER,7839,1981/6/9,2450,,10
7788,SCOTT,ANALYST,7566,1987/4/19,3000,,20
7839,KING,PRESIDENT,,1981/11/17,5000,,10
7844,TURNER,SALESMAN,7698,1981/9/8,1500,0,30
7876,ADAMS,CLERK,7788,1987/5/23,1100,,20
7900,JAMES,CLERK,7698,1981/12/3,950,,30
7902,FORD,ANALYST,7566,1981/12/3,3000,,20
7934,MILLER,CLERK,7782,1982/1/23,1300,,10
Put these two CSV files into the /input/ directory on HDFS so they can be used later.
Start HDFS first:
start-dfs.sh
[root@bigdata111 ~]# hdfs dfs -ls /input
-rw-r--r-- 3 root supergroup 80 2019-03-01 05:05 /input/dept.csv
-rw-r--r-- 3 root supergroup 603 2019-03-01 05:05 /input/emp.csv
IV. Creating a DataFrame
1. Start the Spark cluster
spark-start-all.sh
2. Start the spark-shell
spark-shell
3. Method 1: define the table with a case class
scala> val rdd = sc.textFile("hdfs://hadoop3/input/dept.csv")
scala> val rdd2 = rdd.filter(_.length > 0).map(_.split(","))
scala> case class Dept(deptno: Int, dname: String, loc: String)
scala> val deptrdd = rdd2.map(x => Dept(x(0).toInt, x(1), x(2)))
scala> val deptdf = deptrdd.toDF
Output (equivalent SQL: select * from dept):
scala> deptdf.show
+------+----------+--------+
|deptno| dname| loc|
+------+----------+--------+
| 10|ACCOUNTING|NEW YORK|
| 20| RESEARCH| DALLAS|
| 30| SALES| CHICAGO|
| 40|OPERATIONS| BOSTON|
+------+----------+--------+
Equivalent SQL: desc dept
scala> deptdf.printSchema
root
|-- deptno: integer (nullable = false)
|-- dname: string (nullable = true)
|-- loc: string (nullable = true)
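Since the introduction also mentions Dataset, a hedged one-liner (my addition, relying on the spark-shell implicits) turns this DataFrame into a typed Dataset of the same Dept case class:
scala> val deptds = deptdf.as[Dept]   // Dataset[Dept]: typed access, e.g. deptds.map(_.dname)
scala> deptds.show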
4. Method 2: create the DataFrame with a SparkSession and an explicit schema
scala>val lines = sc.textFile("/root/temp/csv/emp.csv").map(_.split(","))//讀取Linux數據
scala>val lines = sc.textFile("hdfs://10.30.30.146:9000/input/emp.csv").map(_.split(","))//讀取HDFS數據
scala>import org.apache.spark.sql._
scala>import org.apache.spark.sql.types._
scala>val myschema = StructType(List(StructField("empno", DataTypes.IntegerType)
, StructField("ename", DataTypes.StringType)
,StructField("job", DataTypes.StringType)
,StructField("mgr", DataTypes.StringType)
,StructField("hiredate", DataTypes.StringType)
,StructField("sal", DataTypes.IntegerType)
,StructField("comm", DataTypes.StringType)
,StructField("deptno", DataTypes.IntegerType)))//定義schema:StructType
scala>val rowRDD = lines.map(x=>Row(x(0).toInt,x(1),x(2),x(3),x(4),x(5).toInt,x(6),x(7).toInt))// 把讀入的每一行數據映射成一個個Row
scala>val df = spark.createDataFrame(rowRDD,myschema)//使用SparkSession.createDataFrame創建表
Output:
scala> df.show
+-----+------+---------+----+----------+----+----+------+
|empno| ename| job| mgr| hiredate| sal|comm|deptno|
+-----+------+---------+----+----------+----+----+------+
| 7369| SMITH| CLERK|7902|1980/12/17| 800| | 20|
| 7499| ALLEN| SALESMAN|7698| 1981/2/20|1600| 300| 30|
| 7521| WARD| SALESMAN|7698| 1981/2/22|1250| 500| 30|
| 7566| JONES| MANAGER|7839| 1981/4/2|2975| | 20|
| 7654|MARTIN| SALESMAN|7698| 1981/9/28|1250|1400| 30|
| 7698| BLAKE| MANAGER|7839| 1981/5/1|2850| | 30|
| 7782| CLARK| MANAGER|7839| 1981/6/9|2450| | 10|
| 7788| SCOTT| ANALYST|7566| 1987/4/19|3000| | 20|
| 7839| KING|PRESIDENT| |1981/11/17|5000| | 10|
| 7844|TURNER| SALESMAN|7698| 1981/9/8|1500| 0| 30|
| 7876| ADAMS| CLERK|7788| 1987/5/23|1100| | 20|
| 7900| JAMES| CLERK|7698| 1981/12/3| 950| | 30|
| 7902| FORD| ANALYST|7566| 1981/12/3|3000| | 20|
| 7934|MILLER| CLERK|7782| 1982/1/23|1300| | 10|
+-----+------+---------+----+----------+----+----+------+
5. Method 3: read a structured file (JSON, CSV, etc.) directly, the simplest approach
Preparation:
The directory /opt/modules/app/spark/examples/src/main/resources ships with sample files, so simply upload one of the JSON samples to HDFS:
[root@hadoop3 resources]# pwd
/opt/modules/app/spark/examples/src/main/resources
[root@hadoop3 resources]# hadoop fs -put ./people.json /input/
scala> val peopleDF = spark.read.json("hdfs://hadoop3/input/people.json")
19/03/01 06:16:51 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
peopleDF: org.apache.spark.sql.DataFrame = [age: bigint, name: string]
Output:
scala> peopleDF.show
+----+-------+
| age| name|
+----+-------+
|null|Michael|
| 30| Andy|
| 19| Justin|
+----+-------+
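The CSV test files can be read the same way; a sketch (the column names passed to toDF and the inferSchema option are my assumptions, adjust as needed):
scala> val deptDF = spark.read.option("inferSchema", "true").csv("hdfs://hadoop3/input/dept.csv").toDF("deptno", "dname", "loc")
scala> deptDF.show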
V. Working with DataFrames
1. Add the Maven dependencies (pom.xml):
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>com.neusoft</groupId>
<artifactId>sparkdemo</artifactId>
<version>1.0-SNAPSHOT</version>
<name>sparkdemo</name>
<!-- FIXME change it to the project's website -->
<url>http://www.example.com</url>
<properties>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
<spark.version>2.3.1</spark.version>
<maven.compiler.source>1.7</maven.compiler.source>
<maven.compiler.target>1.7</maven.compiler.target>
</properties>
<dependencies>
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>4.11</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>${spark.version}</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.11</artifactId>
<version>2.3.1</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_2.11</artifactId>
<version>${spark.version}</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming-kafka_2.11</artifactId>
<version>1.6.2</version>
</dependency>
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-library</artifactId>
<version>2.11.8</version>
</dependency>
<dependency>
<groupId>mysql</groupId>
<artifactId>mysql-connector-java</artifactId>
<version>5.1.47</version>
</dependency>
</dependencies>
<build>
<pluginManagement><!-- lock down plugins versions to avoid using Maven defaults (may be moved to parent pom) -->
<plugins>
<!-- clean lifecycle, see https://maven.apache.org/ref/current/maven-core/lifecycles.html#clean_Lifecycle -->
<plugin>
<artifactId>maven-clean-plugin</artifactId>
<version>3.1.0</version>
</plugin>
<!-- default lifecycle, jar packaging: see https://maven.apache.org/ref/current/maven-core/default-bindings.html#Plugin_bindings_for_jar_packaging -->
<plugin>
<artifactId>maven-resources-plugin</artifactId>
<version>3.0.2</version>
</plugin>
<plugin>
<artifactId>maven-compiler-plugin</artifactId>
<version>3.8.0</version>
</plugin>
<plugin>
<artifactId>maven-surefire-plugin</artifactId>
<version>2.22.1</version>
</plugin>
<plugin>
<artifactId>maven-jar-plugin</artifactId>
<version>3.0.2</version>
</plugin>
<plugin>
<artifactId>maven-install-plugin</artifactId>
<version>2.5.2</version>
</plugin>
<plugin>
<artifactId>maven-deploy-plugin</artifactId>
<version>2.8.2</version>
</plugin>
<!-- site lifecycle, see https://maven.apache.org/ref/current/maven-core/lifecycles.html#site_Lifecycle -->
<plugin>
<artifactId>maven-site-plugin</artifactId>
<version>3.7.1</version>
</plugin>
<plugin>
<artifactId>maven-project-info-reports-plugin</artifactId>
<version>3.0.0</version>
</plugin>
</plugins>
</pluginManagement>
</build>
</project>
2. Query a table
select * from <table>
package com.neusoft

import org.apache.spark.sql.SparkSession

object Sparksqldemo2 {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .master("local")
      .appName("test")
      .config("spark.sql.shuffle.partitions", "5")
      .getOrCreate()
    val df = spark.read.json("hdfs://192.168.159.133:8020/input/people.json")
    df.show()
  }
}
3. Query the DataFrame with the DSL
(The examples below use the employee DataFrame df created in method 2 above.)
1. Show all employees === select * from emp
df.show
2. Show every employee's name and department number (with or without the $ prefix the result is the same) === select ename,deptno from emp
df.select("ename", "deptno").show
df.select($"ename", $"deptno").show
3. Show every employee's name and salary, with 100 added to the salary === select ename,sal,sal+100 from emp
df.select($"ename", $"sal", $"sal" + 100).show
4. Show employees whose salary is greater than 2000 === select * from emp where sal>2000
df.filter($"sal" > 2000).show
5. Group by department === select deptno,count(*) from emp group by deptno
df.groupBy("deptno").count().show
df.groupBy("deptno").avg().show
df.groupBy($"deptno").max().show
4. Query the DataFrame with SQL
(1) Prerequisite: the DataFrame must first be registered as a table or view
df.createOrReplaceTempView("emp")
(2) Run queries through the SparkSession
spark.sql("select * from emp").show
spark.sql("select * from emp where deptno=10").show
(3) Compute the total salary per department
spark.sql("select deptno,sum(sal) from emp group by deptno").show
VI. Views
When querying a DataFrame with SQL, you must first create a table or view from it:
df.createOrReplaceTempView("emp")
In Spark SQL, if you want a temporary view that can be shared across sessions and stays alive for the lifetime of the application, you need to create a global temporary view. Remember to reference it with the global_temp prefix, because global temporary views are bound to the system-reserved database global_temp.
① Create an ordinary (session-scoped) view and a global view
df.createOrReplaceTempView("emp1")
df.createGlobalTempView("emp2")
② In the current session, both queries return results.
scala>spark.sql("select * from emp1").show
scala>spark.sql("select * from global_temp.emp2").show
③ Open a new session and run the same queries
scala> spark.newSession.sql("select * from emp1").show               // fails: emp1 is only visible in the session that created it
scala> spark.newSession.sql("select * from global_temp.emp2").show   // works: the global temp view is shared across sessions
VII. Data sources
1. The generic load/save functions
(*) What is a Parquet file?
Parquet is a columnar storage file format. The core advantages of columnar storage are:
- Data that does not satisfy the query can be skipped and only the needed data is read, reducing I/O.
- Compression saves disk space. Because all values in a column have the same type, more efficient encodings (such as run-length encoding and delta encoding) can be applied to save even more space.
- Only the required columns are read and vectorized operations are supported, giving better scan performance.
Parquet is the default data source of Spark SQL; this can be configured through spark.sql.sources.default.
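You can check or change that setting from the running session; a small sketch (my addition):
scala> spark.conf.get("spark.sql.sources.default")           // "parquet" by default
scala> spark.conf.set("spark.sql.sources.default", "json")   // make load/save default to JSON instead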
(*) The generic load/save functions
- The load function reads a Parquet file:
scala> val userDF = spark.read.load("hdfs://bigdata111:9000/input/users.parquet")
Compare with the following statements:
scala>val peopleDF = spark.read.json("hdfs://bigdata111:9000/input/people.json")
scala>val peopleDF = spark.read.format("json").load("hdfs://bigdata111:9000/input/people.json")
Inspect the schema and the data:
scala> userDF.show
- The save function writes data out; the default file format is Parquet (columnar storage):
scala>userDF.select("name","favorite_color").write.save("/root/temp/result1")
scala>userDF.select("name","favorite_color").write.format("csv").save("/root/temp/result2")
scala>userDF.select("name","favorite_color").write.csv("/root/temp/result3")
(*) Specifying the file format explicitly: loading JSON
Loading the file directly fails, because load expects Parquet by default:
val usersDF = spark.read.load("/root/resources/people.json")      // error
Specify the format explicitly instead:
val usersDF = spark.read.format("json").load("/root/resources/people.json")
(*) Save modes
A save operation can take a SaveMode, which defines how existing data at the target is handled. Note that these save modes do not use any locking and are not atomic. Also, when Overwrite is used, the existing data is deleted before the new data is written. The modes are described below.
The default is SaveMode.ErrorIfExists: if the target already exists, an exception is thrown and nothing is written. The other three modes are:
SaveMode.Append: if the target exists, the data is appended to it; otherwise it is created first and the data inserted.
SaveMode.Overwrite: the existing target and its data are dropped, it is recreated, and the new data is inserted.
SaveMode.Ignore: if the target does not exist, it is created and the data saved; if it already exists, the save is silently skipped and no error is raised.
Demo:
usersDF.select($"name").write.save("/root/result/parquet1")
--> error: /root/result/parquet1 already exists
usersDF.select($"name").write.mode("overwrite").save("/root/result/parquet1")
2. Reading from and writing to MySQL
2.1 JDBC
Spark SQL can create a DataFrame by loading data from a relational database over JDBC; after a series of computations on the DataFrame, the data can be written back to the relational database.
2.1.1 Loading data from MySQL (spark-shell)
Start the spark-shell; the MySQL JDBC driver jar must be specified:
spark-shell --master spark://hadoop1:7077 --jars mysql-connector-java-5.1.35-bin.jar --driver-class-path mysql-connector-java-5.1.35-bin.jar
Load the data from MySQL:
val jdbcDF = spark.read.format("jdbc").options(
  Map("url" -> "jdbc:mysql://hadoop1:3306/bigdata",
    "driver" -> "com.mysql.jdbc.Driver",
    "dbtable" -> "person",   // or: "dbtable" -> "(select * from person where id = 12) as person"
    "user" -> "root",
    "password" -> "123456")
).load()
Run a query:
jdbcDF.show()
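A DataFrame can also be written back over JDBC straight from the shell; a hedged sketch (the target table person_copy and the connection details are my assumptions):
scala> import java.util.Properties
scala> val props = new Properties()
scala> props.setProperty("user", "root")
scala> props.setProperty("password", "123456")
scala> props.setProperty("driver", "com.mysql.jdbc.Driver")
scala> jdbcDF.write.mode("append").jdbc("jdbc:mysql://hadoop1:3306/bigdata", "person_copy", props)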
2.1.2 Writing data to MySQL (packaged jar)
The pom.xml dependencies are the same as in section V.1 above.
Write the Spark program. (This example uses Spark Streaming to consume records from Kafka and writes each record to MySQL over JDBC.)
package com.neusoft

import java.sql
import java.sql.DriverManager
import java.util.Date

import kafka.serializer.StringDecoder
import org.apache.spark.streaming._
import org.apache.spark.streaming.kafka._
import org.apache.spark.SparkConf

/**
  * Created by Administrator on 2019/3/7.
  */
object SparkDemo {
  def main(args: Array[String]): Unit = {
    System.setProperty("HADOOP_USER_NAME", "root")
    val sparkConf = new SparkConf().setAppName("DirectKafkaWordCount").setMaster("local[2]")
    val ssc = new StreamingContext(sparkConf, Seconds(5))
    ssc.checkpoint("hdfs://hadoop3:8020/spark_check_point")

    // the Kafka topics to subscribe to; several topics can be passed, separated by commas
    val topicsSet = Set("ss_kafka")
    // Kafka parameters: the list of brokers
    val kafkaParams = Map[String, String]("metadata.broker.list" -> "192.168.159.133:9092,192.168.159.130:9092,192.168.159.134:9092")
    val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topicsSet)
    print("---------:" + messages)

    val lines = messages.map(_._2)
    lines.foreachRDD(rdd => {
      // inner function: write one partition of records to MySQL
      def func(records: Iterator[String]) {
        var conn: sql.Connection = null
        var stmt: sql.PreparedStatement = null
        try {
          val url = "jdbc:mysql://localhost:3306/test"
          val user = "root"
          val password = "root" // change to your own MySQL password
          conn = DriverManager.getConnection(url, user, password)
          records.foreach(p => {
            val arr = p.split("\\t")
            val phoneno = arr(0)
            val jingwei = arr(1)
            var arrjingwei = jingwei.split(",")
            // order in the input: latitude, longitude
            var sql = "insert into location(time,latitude,longtitude) values (?,?,?)"
            stmt = conn.prepareStatement(sql)
            stmt.setLong(1, new Date().getTime)
            stmt.setDouble(2, java.lang.Double.parseDouble(arrjingwei(0).trim))
            stmt.setDouble(3, java.lang.Double.parseDouble(arrjingwei(1).trim))
            stmt.executeUpdate()
          })
        } catch {
          case e: Exception => e.printStackTrace()
        } finally {
          if (stmt != null) {
            stmt.close()
          }
          if (conn != null) {
            conn.close()
          }
        }
      }

      val repartitionedRDD = rdd.repartition(1)
      repartitionedRDD.foreachPartition(func)
    })

    ssc.start()
    ssc.awaitTermination()
  }
}
Package the program with the maven-shade-plugin.
Submit the jar to the Spark cluster:
spark-submit \
--class com.neusoft.SparkDemo \
--master spark://hadoop1:7077 \
--jars mysql-connector-java-5.1.35-bin.jar \
--driver-class-path mysql-connector-java-5.1.35-bin.jar \
/root/demo.jar