
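Spark's machine learning library comes in two implementations. spark.mllib is the original API, built on top of RDDs. spark.ml is a higher-level API built on top of DataFrames, and it is the more convenient and flexible of the two.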

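The feature-handling APIs in Spark machine learning cover three areas: feature extraction, feature transformation, and feature selection. This article works through examples of the feature selection (Feature Selectors) part of the spark.ml feature APIs.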

Feature Selectors

1. VectorSlicer

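VectorSlicer cuts a sub-vector out of an existing feature vector to form a new feature vector. For example, if the original feature vector has length 10 and we want entries 5 through 10 as the new feature vector (indices 4 to 9, since VectorSlicer indices are zero-based), VectorSlicer does this in one step.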


package com.lxw1234.spark.features.selectors

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.ml.attribute.{Attribute, AttributeGroup, NumericAttribute}
import org.apache.spark.ml.feature.VectorSlicer
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.StructType

/**
 * By http://lxw1234.com
 */
object TestVectorSlicer extends App {
    val conf = new SparkConf().setMaster("local").setAppName("lxw1234.com")
    val sc = new SparkContext(conf)
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    import sqlContext.implicits._

    // Build the feature array
    val data = Array(Row(Vectors.dense(-2.0, 2.3, 0.0)))

    // Attach attribute (field) names f1, f2, f3 to the features
    val defaultAttr = NumericAttribute.defaultAttr
    val attrs = Array("f1", "f2", "f3").map(defaultAttr.withName)
    val attrGroup = new AttributeGroup("userFeatures", attrs.asInstanceOf[Array[Attribute]])

    // Build the DataFrame
    val dataRDD = sc.parallelize(data)
    val dataset = sqlContext.createDataFrame(dataRDD, StructType(Array(attrGroup.toStructField())))
    print("Original features: ")
    dataset.take(1).foreach(println)

    // Create the slicer
    var slicer = new VectorSlicer().setInputCol("userFeatures").setOutputCol("features")

    // Select by index: the 1st and 3rd entries of the original feature vector
    slicer.setIndices(Array(0, 2))
    print("output1: ")
    println(slicer.transform(dataset).select("userFeatures", "features").first())

    // Select by field name: f2 and f3 of the original feature vector
    slicer = new VectorSlicer().setInputCol("userFeatures").setOutputCol("features")
    slicer.setNames(Array("f2", "f3"))
    print("output2: ")
    println(slicer.transform(dataset).select("userFeatures", "features").first())

    // Indices and names can also be combined: the 1st entry plus f2
    slicer = new VectorSlicer().setInputCol("userFeatures").setOutputCol("features")
    slicer.setIndices(Array(0)).setNames(Array("f2"))
    print("output3: ")
    println(slicer.transform(dataset).select("userFeatures", "features").first())
}

Program output:

Original features:

[[-2.0,2.3,0.0]]

output1:

[[-2.0,2.3,0.0],[-2.0,0.0]]

output2:

[[-2.0,2.3,0.0],[2.3,0.0]]

output3:

[[-2.0,2.3,0.0],[-2.0,2.3]]
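The three selections above (by index, by name, and by both) can be mimicked in plain Scala to make the mechanics visible. This is only a minimal sketch, not Spark's implementation: `SliceSketch` and `slice` are hypothetical names, and the ordering rule (selected indices first, then selected names, with no overlap allowed) is written here as an assumption based on how the VectorSlicer documentation describes its output.

```scala
// Plain-Scala sketch of index- and name-based slicing (hypothetical helper, not Spark code).
object SliceSketch {
  // Pick entries by position first, then by attribute name.
  def slice(values: Array[Double],
            names: Array[String],
            indices: Array[Int] = Array.empty[Int],
            selectedNames: Array[String] = Array.empty[String]): Array[Double] = {
    // Resolve every requested name to its position in the vector.
    val nameIndices = selectedNames.map { n =>
      val i = names.indexOf(n)
      require(i >= 0, s"unknown feature name: $n")
      i
    }
    val all = indices ++ nameIndices
    // Assumption: a feature may be selected only once.
    require(all.distinct.length == all.length, "indices and names must not overlap")
    all.map(i => values(i))
  }

  def main(args: Array[String]): Unit = {
    val v = Array(-2.0, 2.3, 0.0)
    val names = Array("f1", "f2", "f3")
    println(slice(v, names, indices = Array(0, 2)).mkString("[", ",", "]"))                    // [-2.0,0.0]
    println(slice(v, names, selectedNames = Array("f2", "f3")).mkString("[", ",", "]"))        // [2.3,0.0]
    println(slice(v, names, Array(0), Array("f2")).mkString("[", ",", "]"))                    // [-2.0,2.3]
  }
}
```

The printed slices match the `features` column in output1 to output3 above.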

2. RFormula

RFormula uses an R model formula to turn columns of the data into features. Its output is a feature vector plus a label of type Double. For an introduction to R model formulae, see: https://stat.ethz.ch/R-manual/R-devel/library/stats/html/formula.html
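To make the idea of "formula in, feature vector and label out" concrete, here is a minimal plain-Scala sketch of what a formula such as clicked ~ country + hour amounts to: the string column becomes indicator (one-hot) values with one reference level dropped, the numeric column is cast to Double, and the left-hand side becomes the label. `FormulaSketch`, `Click`, `encode`, and the `levels` ordering are all hypothetical and chosen for illustration; Spark determines its own category ordering, so the exact vectors it produces may differ.

```scala
// Plain-Scala sketch of encoding "clicked ~ country + hour" (hypothetical, not Spark code).
object FormulaSketch {
  // One input row; the field names follow the dataset used in this section.
  final case class Click(id: Int, country: String, hour: Int, clicked: Double)

  // `levels` is an assumed category order. The last level acts as the reference
  // level and is represented by all-zero indicators.
  def encode(row: Click, levels: Seq[String]): (Array[Double], Double) = {
    val indicators = levels.init.map(l => if (l == row.country) 1.0 else 0.0)
    val features = (indicators :+ row.hour.toDouble).toArray
    (features, row.clicked)
  }

  def main(args: Array[String]): Unit = {
    val rows = Seq(Click(7, "US", 18, 1.0), Click(8, "CA", 12, 0.0), Click(9, "NZ", 15, 0.0))
    val levels = Seq("NZ", "CA", "US") // assumption: arbitrary order for this sketch
    rows.foreach { r =>
      val (features, label) = encode(r, levels)
      println(s"${r.id} ${features.mkString("[", ",", "]")} $label")
    }
  }
}
```

With three countries the sketch emits two indicator values plus the hour, so each feature vector has length 3.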

package com.lxw1234.spark.features.selectors

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.ml.feature.RFormula

/**
 * By http://lxw1234.com
 */
object TestRFormula extends App {
    val conf = new SparkConf().setMaster("local").setAppName("lxw1234.com")
    val sc = new SparkContext(conf)
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    import sqlContext.implicits._

    // Build the dataset
    val dataset = sqlContext.createDataFrame(Seq(
      (7, "US", 18, 1.0),
      (8, "CA", 12, 0.0),
      (9, "NZ", 15, 0.0)
    )).toDF("id", "country", "hour", "clicked")
    dataset.select("id", "country", "hour", "clicked").show()

    // To predict clicked from country and hour,
    // build an RFormula with the formula clicked ~ country + hour
    val formula = new RFormula().setFormula("clicked ~ country + hour").setFeaturesCol("features").setLabelCol("label")

    // Produce the feature vector and the label
    val output = formula.fit(dataset).transform(dataset)
    output.select("id", "country", "hour", "clicked", "features", "label").show()
}

Program output: (not preserved in this copy)

3. ChiSqSelector

ChiSqSelector selects features using the chi-squared test (dimensionality reduction).
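As a rough illustration of the test behind the selector, the sketch below computes the Pearson chi-squared statistic of each feature against the label in plain Scala, treating every distinct feature value as a category, and keeps the highest-scoring features. `ChiSqSketch`, `chiSquare`, and `topFeatures` are hypothetical names, and ranking by the raw statistic is a simplifying assumption; Spark's own selector may rank differently (for example by p-value), so treat this as a sketch of the idea only.

```scala
// Plain-Scala sketch of chi-squared feature scoring (hypothetical, not Spark code).
object ChiSqSketch {
  // Pearson chi-squared statistic for one categorical feature against the label.
  def chiSquare(feature: Seq[Double], label: Seq[Double]): Double = {
    require(feature.nonEmpty && feature.length == label.length)
    val n = feature.length.toDouble
    // Observed count for every (feature value, label value) pair.
    val observed = feature.zip(label).groupBy(identity).map { case (k, v) => k -> v.size.toDouble }
    val featureTotals = feature.groupBy(identity).map { case (k, v) => k -> v.size.toDouble }
    val labelTotals = label.groupBy(identity).map { case (k, v) => k -> v.size.toDouble }
    val terms = for {
      (f, fTotal) <- featureTotals.toSeq
      (l, lTotal) <- labelTotals.toSeq
    } yield {
      val expected = fTotal * lTotal / n // count expected under independence
      val obs = observed.getOrElse((f, l), 0.0)
      (obs - expected) * (obs - expected) / expected
    }
    terms.sum
  }

  // Indices of the k features with the largest statistic, in ascending index order.
  def topFeatures(rows: Seq[Array[Double]], labels: Seq[Double], k: Int): Seq[Int] = {
    val scores = rows.head.indices.map(j => j -> chiSquare(rows.map(r => r(j)), labels))
    scores.sortBy { case (_, s) => -s }.take(k).map(_._1).sorted
  }

  def main(args: Array[String]): Unit = {
    // Same feature rows and clicked labels as the dataset used in this section.
    val rows = Seq(
      Array(0.0, 0.0, 18.0, 1.0),
      Array(0.0, 1.0, 12.0, 0.0),
      Array(1.0, 0.0, 15.0, 0.1)
    )
    val labels = Seq(1.0, 0.0, 0.0)
    println(topFeatures(rows, labels, 2).mkString(","))
  }
}
```

On this three-row dataset the first two features tie and the last two tie, so the sketch asks for the top 2 to keep the result unambiguous.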

package com.lxw1234.spark.features.selectors

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.ml.feature.ChiSqSelector
import org.apache.spark.mllib.linalg.Vectors

/**
 * By http://lxw1234.com
 */
object TestChiSqSelector extends App {
    val conf = new SparkConf().setMaster("local").setAppName("lxw1234.com")
    val sc = new SparkContext(conf)
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    import sqlContext.implicits._

    // Build the dataset
    val data = Seq(
      (7, Vectors.dense(0.0, 0.0, 18.0, 1.0), 1.0),
      (8, Vectors.dense(0.0, 1.0, 12.0, 0.0), 0.0),
      (9, Vectors.dense(1.0, 0.0, 15.0, 0.1), 0.0)
    )
    val df = sc.parallelize(data).toDF("id", "features", "clicked")
    df.select("id", "features", "clicked").show()

    // Use the chi-squared test to reduce the original feature vector (4 features) to 3 features
    val selector = new ChiSqSelector().setNumTopFeatures(3).setFeaturesCol("features").setLabelCol("clicked").setOutputCol("selectedFeatures")
    val result = selector.fit(df).transform(df)
    result.show()
}

Program output: (not preserved in this copy)
