zipWithIndex
Official documentation:
Zips this RDD with its element indices. The ordering is first based on the partition index
and then the ordering of items within each partition. So the first item in the first partition
gets index 0, and the last item in the last partition receives the largest index.
This is similar to Scala's zipWithIndex but it uses Long instead of Int as the index type.
This method needs to trigger a spark job when this RDD contains more than one partition.
Function signature:
def zipWithIndex(): JavaPairRDD[T, JLong]
This function zips each element of the RDD with that element's index within the RDD, producing an RDD of key/value pairs in which the key is the element and the value is its global (Long) index.
Source code analysis:
def zipWithIndex(): RDD[(T, Long)] = withScope {
  new ZippedWithIndexRDD(this)
}
/** The start index of each partition. */
@transient private val startIndices: Array[Long] = {
  val n = prev.partitions.length
  if (n == 0) {
    Array[Long]()
  } else if (n == 1) {
    Array(0L)
  } else {
    prev.context.runJob(
      prev,
      Utils.getIteratorSize _,
      0 until n - 1, // do not need to count the last partition
      allowLocal = false
    ).scanLeft(0L)(_ + _)
  }
}
override def compute(splitIn: Partition, context: TaskContext): Iterator[(T, Long)] = {
  val split = splitIn.asInstanceOf[ZippedWithIndexRDDPartition]
  firstParent[T].iterator(split.prev, context).zipWithIndex.map { x =>
    (x._1, split.startIndex + x._2)
  }
}
As the source shows, zipWithIndex returns a ZippedWithIndexRDD. That RDD first computes startIndices, the starting global index of every partition: when there is more than one partition, it runs a Spark job that counts the elements of all partitions except the last, then prefix-sums those counts with scanLeft. compute then applies Scala's zipWithIndex to each partition's iterator and adds the partition's startIndex to obtain the global index.
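To make the scanLeft step concrete, here is a minimal plain-Java sketch of the same prefix-sum logic. The partition sizes are hard-coded as an assumption (they match the 7-element, 3-partition example below):
long[] sizes = {2, 2, 3};                // assumed element counts of partitions 0..2
long[] startIndices = new long[sizes.length];
long acc = 0L;
for (int i = 0; i < sizes.length; i++) {
    startIndices[i] = acc;               // partition i starts at the running total
    acc += sizes[i];                     // this is what scanLeft(0L)(_ + _) computes
}
// startIndices is now {0, 2, 4}: partition 1's first element gets global index 2, etc.
// Note that the last partition's size never influences any start index, which is
// why the Spark source skips counting it.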
Example:
List<Integer> data = Arrays.asList(5, 1, 1, 4, 4, 2, 2);
JavaRDD<Integer> javaRDD = javaSparkContext.parallelize(data, 3);
JavaPairRDD<Integer, Long> zipWithIndexRDD = javaRDD.zipWithIndex();
// Indices are contiguous across partitions, in partition order:
// [(5,0), (1,1), (1,2), (4,3), (4,4), (2,5), (2,6)]
System.out.println("zipWithIndexRDD: " + zipWithIndexRDD.collect());
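A typical use of the resulting pair RDD is to key elements by position so that a row can be fetched by its index. The swap-and-lookup pattern below is an illustrative sketch, not part of the original example; it reuses javaRDD from above and assumes import scala.Tuple2:
JavaPairRDD<Long, Integer> byIndex = javaRDD.zipWithIndex()
        .mapToPair(t -> new Tuple2<>(t._2(), t._1()));
// lookup(3L) returns every element whose global index is 3; here it prints [4].
System.out.println(byIndex.lookup(3L));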
zipWithUniqueId
Official documentation:
Zips this RDD with generated unique Long ids. Items in the kth partition will get ids k,
n+k, 2*n+k, ..., where n is the number of partitions. So there may exist gaps,
but this method won't trigger a spark job, which is different from [[org.apache.spark.rdd.RDD#zipWithIndex]].
Function signature:
def zipWithUniqueId(): JavaPairRDD[T, JLong]
This function zips each element of the RDD with a unique Long ID. The IDs are generated as follows: the first element of each partition gets that partition's index as its ID, and in general the Nth element (counting from 0) of partition k gets the ID N * n + k, where n is the total number of partitions of the RDD.
Source code analysis:
def zipWithUniqueId(): RDD[(T, Long)] = withScope {
  val n = this.partitions.length.toLong
  this.mapPartitionsWithIndex { case (k, iter) =>
    iter.zipWithIndex.map { case (item, i) =>
      (item, i * n + k)
    }
  }
}
As the source shows, zipWithUniqueId() uses mapPartitionsWithIndex() to obtain the index k of each partition together with each element's position i inside that partition, and computes the ID as i * n + k. Because every ID depends only on locally available information, no Spark job has to be triggered.
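To see the i * n + k rule from the Java side, the following sketch reproduces the same IDs by hand with the Java mapPartitionsWithIndex API (illustration only; in practice you would simply call zipWithUniqueId()). It assumes imports of java.util.ArrayList, java.util.Iterator, java.util.List, and scala.Tuple2, plus the javaRDD from the examples:
final long n = javaRDD.partitions().size();      // total number of partitions
JavaRDD<Tuple2<Integer, Long>> manualIds = javaRDD.mapPartitionsWithIndex(
    (Integer k, Iterator<Integer> iter) -> {
        List<Tuple2<Integer, Long>> out = new ArrayList<>();
        long i = 0;                              // element position inside partition k
        while (iter.hasNext()) {
            out.add(new Tuple2<>(iter.next(), i * n + k));
            i++;
        }
        return out.iterator();
    }, false);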
Example:
List<Integer> data = Arrays.asList(5, 1, 1, 4, 4, 2, 2);
JavaRDD<Integer> javaRDD = javaSparkContext.parallelize(data, 3);
JavaPairRDD<Integer, Long> zipWithUniqueIdRDD = javaRDD.zipWithUniqueId();
// With the default split [5,1], [1,4], [4,2,2] over 3 partitions this prints
// [(5,0), (1,3), (1,1), (4,4), (4,2), (2,5), (2,8)]
System.out.println("zipWithUniqueIdRDD: " + zipWithUniqueIdRDD.collect());
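Comparing the two runs on the same three-partition RDD: zipWithIndex assigns the contiguous indices 0 through 6 but must first run a job to count every partition except the last, whereas zipWithUniqueId assigns 0,3 / 1,4 / 2,5,8 without triggering any job, at the cost of gaps in the ID sequence (6 and 7 are never used here).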