Spark 2.4 and later support an image data source for reading image files into a DataFrame.
import org.apache.spark.sql.SparkSession

object ImageDataSourceTest {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder
      .master("local[2]")
      .appName("ImageDataSourceTest")
      .getOrCreate()
    // $example on$
    val df = spark.read.format("image")
      .option("dropInvalid", value = true) // drop invalid images from the result
      .load("D:\\data\\image")
    df.select("image.origin", "image.width", "image.height")
      .show(truncate = true)
    // $example off$
    spark.stop()
  }
}
Schema of df:
root
 |-- image: struct (nullable = true)
 |    |-- origin: string (nullable = true)
 |    |-- height: integer (nullable = true)
 |    |-- width: integer (nullable = true)
 |    |-- nChannels: integer (nullable = true)
 |    |-- mode: integer (nullable = true)
 |    |-- data: binary (nullable = true)
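If the images sit in nested directories and you need the directory name as data, name each directory in the form cls=string, where string is the value you want (for example cls=cat). Partition discovery then adds a column to the schema at the same level as image: "|-- cls: string (nullable = true)".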
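The directory naming convention above can be sketched as follows. This is a minimal example under stated assumptions, not a reference implementation: the object name PartitionedImageExample, the helper names, the labels cat and dog, and the temporary directory are all hypothetical, and the tiny PNG files are generated only so the example has something to read.

```scala
import java.awt.image.BufferedImage
import java.io.File
import java.nio.file.Files
import javax.imageio.ImageIO

import org.apache.spark.sql.SparkSession

object PartitionedImageExample {
  // Writes a tiny solid-black PNG into dir (illustrative sample data only).
  def writeSampleImage(dir: File, name: String): Unit = {
    dir.mkdirs()
    val img = new BufferedImage(4, 3, BufferedImage.TYPE_3BYTE_BGR)
    require(ImageIO.write(img, "png", new File(dir, name)), "no PNG writer available")
  }

  // Builds <root>/cls=cat/a.png and <root>/cls=dog/b.png (hypothetical layout),
  // reads the root with the image data source, and returns the distinct cls values.
  def classLabels(): Set[String] = {
    val root = Files.createTempDirectory("images").toFile
    writeSampleImage(new File(root, "cls=cat"), "a.png")
    writeSampleImage(new File(root, "cls=dog"), "b.png")

    val spark = SparkSession
      .builder
      .master("local[2]")
      .appName("PartitionedImageExample")
      .getOrCreate()
    try {
      val df = spark.read.format("image")
        .option("dropInvalid", value = true)
        .load(root.getAbsolutePath)
      // The schema should now list cls next to image.
      df.printSchema()
      df.select("cls").distinct().collect().map(_.getString(0)).toSet
    } finally {
      spark.stop()
    }
  }

  def main(args: Array[String]): Unit =
    println(classLabels())
}
```

Because cls is an ordinary column, it can be selected or filtered like any other, for example df.filter("cls = 'cat'").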
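- origin: path of the image file
- height: image height
- width: image width
- nChannels: number of image channels; typically 1 for grayscale images, 3 for color images (e.g. RGB), and 4 for color images with an alpha channel
- mode: OpenCV-compatible type code: "CV_8UC1" -> 0, "CV_8UC3" -> 16, "CV_8UC4" -> 24, corresponding to 1, 3, and 4 channels respectively
- data: the image bytes (BinaryType), arranged in an OpenCV-compatible way; in most cases row-wise BGR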
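To make the mode and data fields more concrete, here is a small plain-Scala sketch that needs no Spark. It assumes 8-bit unsigned data and a row-wise layout with interleaved B, G, R bytes, as described in the list above; the object and function names (ImageLayoutSketch, cv8uMode, offset, rgbAt) are made up for illustration and are not part of any Spark or OpenCV API.

```scala
object ImageLayoutSketch {
  // OpenCV packs the channel count above the three depth bits.
  // For 8-bit unsigned data the depth code is 0, so the type code
  // reduces to (nChannels - 1) << 3.
  def cv8uMode(nChannels: Int): Int = (nChannels - 1) << 3

  // Byte offset of channel c of pixel (x, y), assuming rows are stored
  // one after another and channels are interleaved per pixel.
  def offset(x: Int, y: Int, c: Int, width: Int, nChannels: Int): Int =
    (y * width + x) * nChannels + c

  // Reads one pixel of a 3-channel image and returns it as (red, green, blue).
  // Bytes are stored in B, G, R order; & 0xff converts the signed byte
  // to an unsigned value in 0..255.
  def rgbAt(data: Array[Byte], x: Int, y: Int, width: Int): (Int, Int, Int) = {
    val base = offset(x, y, 0, width, 3)
    val b = data(base) & 0xff
    val g = data(base + 1) & 0xff
    val r = data(base + 2) & 0xff
    (r, g, b)
  }

  def main(args: Array[String]): Unit = {
    // A 2x2, 3-channel image with bytes 1..12 as placeholder pixel data.
    val data = (1 to 12).map(_.toByte).toArray
    println(Seq(1, 3, 4).map(cv8uMode)) // mode codes for 1, 3, and 4 channels
    println(rgbAt(data, x = 1, y = 1, width = 2))
  }
}
```

With a row collected from the DataFrame, the same rgbAt helper could be applied to the data field together with the row's width, provided nChannels is 3.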