Spark Optimization: GroupByKey Produces an RDD[(K, Iterable[V])]

RDD Trigger Mechanism

In Spark, RDD actions are triggered through the SparkContext (runJob), while the transformations themselves are implemented on top of Scala Iterators: each transformation simply wraps the partition iterator of its parent, as the following methods from Spark's RDD.scala show.

  /**
   * Return a new RDD by applying a function to all elements of this RDD.
   */
  def map[U: ClassTag](f: T => U): RDD[U] = withScope {
    val cleanF = sc.clean(f)
    new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.map(cleanF))
  }

  /**
   *  Return a new RDD by first applying a function to all elements of this
   *  RDD, and then flattening the results.
   */
  def flatMap[U: ClassTag](f: T => TraversableOnce[U]): RDD[U] = withScope {
    val cleanF = sc.clean(f)
    new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.flatMap(cleanF))
  }

  /**
   * Return a new RDD containing only the elements that satisfy a predicate.
   */
  def filter(f: T => Boolean): RDD[T] = withScope {
    val cleanF = sc.clean(f)
    new MapPartitionsRDD[T, T](
      this,
      (context, pid, iter) => iter.filter(cleanF),
      preservesPartitioning = true)
  }
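
For intuition, the per-partition functions above, such as (context, pid, iter) => iter.map(cleanF), just wrap one Scala Iterator in another, so no work happens until an action consumes the final iterator. Below is a minimal sketch of the same behaviour with plain Scala Iterators, outside Spark; the names are illustrative only.

  // Each step wraps the previous Iterator; nothing executes until the
  // final iterator is consumed by the "action" at the end.
  val source: Iterator[Int] = Iterator(1, 2, 3, 4)

  val pipeline = source
    .map { x => println(s"map $x"); x * 2 }       // lazy, like MapPartitionsRDD
    .filter { x => println(s"filter $x"); x > 2 } // also lazy

  println("nothing has run yet")

  // Consuming the iterator drives the whole chain, one element at a time.
  pipeline.foreach(x => println(s"consume $x"))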

GroupByKey Analysis

groupByKey is a very resource-intensive operation. After the shuffle, all of the values belonging to each key are held in memory as a single Iterable[V].

  def groupByKey(): RDD[(K, Iterable[V])] = self.withScope {
    groupByKey(defaultPartitioner(self))
  }

  def groupByKey(partitioner: Partitioner): RDD[(K, Iterable[V])] = self.withScope {
    // groupByKey shouldn't use map side combine because map side combine does not
    // reduce the amount of data shuffled and requires all map side data be inserted
    // into a hash table, leading to more objects in the old gen.
    val createCombiner = (v: V) => CompactBuffer(v)
    val mergeValue = (buf: CompactBuffer[V], v: V) => buf += v
    val mergeCombiners = (c1: CompactBuffer[V], c2: CompactBuffer[V]) => c1 ++= c2
    val bufs = combineByKeyWithClassTag[CompactBuffer[V]](
      createCombiner, mergeValue, mergeCombiners, partitioner, mapSideCombine = false)
    bufs.asInstanceOf[RDD[(K, Iterable[V])]]
  }
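
As an aside, when the downstream logic only needs a per-key aggregate rather than the whole group, a combiner-based operator such as reduceByKey avoids buffering every value of a key in a CompactBuffer. A minimal sketch under that assumption; sparkContext is assumed to be an existing SparkContext and the variable names are illustrative:

  val pairs = sparkContext.makeRDD(Seq(("wang", 25), ("wang", 26), ("wang", 18)))

  // groupByKey: ships every value and buffers the whole group per key, then sums.
  val summedViaGroup = pairs.groupByKey().mapValues(_.sum)

  // reduceByKey: combines values before and after the shuffle;
  // no per-key buffer of all values is ever built.
  val summedViaReduce = pairs.reduceByKey(_ + _)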

If you then apply flatMap to this RDD[(K, Iterable[V])] and try to process the results downstream in fixed-size batches, for example emitting one aggregate per batch of records, a problem appears.

For example:

  import org.apache.spark.{SparkConf, SparkContext}
  import scala.collection.mutable

  val sc = new SparkConf().setAppName("demo").setMaster("local[1]")
  val sparkContext = new SparkContext(sc)
  val rdd = sparkContext.makeRDD(Seq(
      ("wang", 25), ("wang", 26), ("wang", 18), ("wang", 15), ("wang", 7), ("wang", 1)
    ))
    .groupByKey().flatMap(kv => {
      // kv._2 is an Iterable[Int]; Iterable#map is strict, so every element of
      // the group is mapped (and printed) before the downstream code sees any of it.
      var i = 0
      kv._2.map(r => {
        i = i + 1
        println(r)
        r
      })
    })

  sparkContext.runJob(rdd, add _)

  // Consumes the partition iterator in batches of 3 and prints each batch.
  def add(list: Iterator[Int]): Unit = {
    var i = 0
    val items = new mutable.MutableList[Int]()
    while (list.hasNext) {
      items.+=(list.next())
      if (i >= 2) {
        println(items.mkString(","))
        items.clear()
        i = 0
      } else if (!list.hasNext) {
        println(items.mkString(","))
      }
      i = i + 1
    }
  }

Output:

  25
  26
  18
  15
  7
  1
  25,26,18
  15,7
  1

If the Iterable is first converted to an Iterator (kv._2.toIterator), the mapping becomes lazy and the per-element work interleaves with the downstream batching:

  // Same imports and setup as above; the only change is kv._2.toIterator.
  val sc = new SparkConf().setAppName("demo").setMaster("local[1]")
  val sparkContext = new SparkContext(sc)
  val rdd = sparkContext.makeRDD(Seq(
      ("wang", 25), ("wang", 26), ("wang", 18), ("wang", 15), ("wang", 7), ("wang", 1)
    ))
    .groupByKey().flatMap(kv => {
      // Iterator#map is lazy: each element is printed only when the downstream
      // consumer actually pulls it.
      var i = 0
      kv._2.toIterator.map(r => {
        i = i + 1
        println(r)
        r
      })
    })

  sparkContext.runJob(rdd, add _)

  def add(list: Iterator[Int]): Unit = {
    var i = 0
    val items = new mutable.MutableList[Int]()
    while (list.hasNext) {
      items.+=(list.next())
      if (i >= 2) {
        println(items.mkString(","))
        items.clear()
        i = 0
      } else if (!list.hasNext) {
        println(items.mkString(","))
      }
      i = i + 1
    }
  }

Output:

  25
  26
  18
  25,26,18
  15
  7
  15,7
  1
  1

Conclusion

If flatMap over an RDD[(K, Iterable[V])] uses the Iterable directly, the downstream action has no control over the flow: every element of a group is processed inside flatMap before any later operation runs. Converting the group to an Iterator first (kv._2.toIterator) keeps the pipeline lazy, so the work inside flatMap interleaves with the downstream consumption.
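
The difference comes down to strict versus lazy collections and can be reproduced without Spark at all. A minimal sketch; the value names are illustrative:

  val values: Iterable[Int] = Seq(25, 26, 18)

  // Iterable#map is strict: all "produce" lines print before any "consume" line.
  val eager = values.map { v => println(s"produce $v"); v }
  eager.foreach(v => println(s"consume $v"))

  // Iterator#map is lazy: "produce" and "consume" lines interleave.
  val interleaved = values.toIterator.map { v => println(s"produce $v"); v }
  interleaved.foreach(v => println(s"consume $v"))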
