[Spark Java API] Transformation (13): zipWithIndex and zipWithUniqueId

zipWithIndex


Official documentation:

Zips this RDD with its element indices. The ordering is first based on the partition index 
and then the ordering of items within each partition. So the first item in the first partition 
gets index 0, and the last item in the last partition receives the largest index. 
This is similar to Scala's zipWithIndex but it uses Long instead of Int as the index type.
This method needs to trigger a spark job when this RDD contains more than one partitions.

Function signature:

def zipWithIndex(): JavaPairRDD[T, JLong]

This function zips each element of the RDD with its index within the RDD, producing an RDD of (element, index) key/value pairs.
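A common follow-up is to swap the pair so that the index becomes the key, which lets you look elements up by position. A minimal sketch, assuming an existing `javaSparkContext` plus the usual `java.util.Arrays` and `scala.Tuple2` imports (the variable names are only illustrative):

JavaRDD<Integer> nums = javaSparkContext.parallelize(Arrays.asList(5, 1, 1, 4, 4, 2, 2), 3);
// (element, index) -> (index, element), so elements can be looked up by position
JavaPairRDD<Long, Integer> byPosition = nums
        .zipWithIndex()
        .mapToPair(t -> new Tuple2<>(t._2(), t._1()));
System.out.println(byPosition.lookup(1L));   // the element at global index 1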

Source code analysis:

def zipWithIndex(): RDD[(T, Long)] = withScope {
  new ZippedWithIndexRDD(this)
}

/** The start index of each partition. */
@transient private val startIndices: Array[Long] = {
  val n = prev.partitions.length
  if (n == 0) {
    Array[Long]()
  } else if (n == 1) {
    Array(0L)
  } else {
    prev.context.runJob(
      prev,
      Utils.getIteratorSize _,
      0 until n - 1, // do not need to count the last partition
      allowLocal = false
    ).scanLeft(0L)(_ + _)
  }
}

override def compute(splitIn: Partition, context: TaskContext): Iterator[(T, Long)] = {
  val split = splitIn.asInstanceOf[ZippedWithIndexRDDPartition]
  firstParent[T].iterator(split.prev, context).zipWithIndex.map { x =>
    (x._1, split.startIndex + x._2)
  }
}

As the source shows, the function returns a ZippedWithIndexRDD. That RDD first computes startIndices, the starting global index of every partition, which requires running a job to count all partitions except the last; then, in compute, it applies Scala's zipWithIndex to each partition's iterator and adds the partition's startIndex to obtain the global index.
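The same two-step idea can be reproduced by hand with the Java API, which makes the extra job visible: one action learns the partition sizes, and a second pass adds each partition's start offset to the local positions. This is only a rough sketch, assuming an existing `javaSparkContext` and imports for `java.util.Arrays`/`List`/`ArrayList` and `scala.Tuple2`; the names `sizes`, `start` and `manual` are illustrative:

List<Integer> data = Arrays.asList(5, 1, 1, 4, 4, 2, 2);
JavaRDD<Integer> rdd = javaSparkContext.parallelize(data, 3);

// step 1: a first job to learn the partition sizes (done here via glom() for brevity;
// Spark itself only counts each iterator and skips the last partition)
List<Integer> sizes = rdd.glom().map(List::size).collect();

// step 2: running sum of the sizes = start index of each partition (the scanLeft above)
long[] start = new long[sizes.size()];
for (int p = 1; p < start.length; p++) {
    start[p] = start[p - 1] + sizes.get(p - 1);
}

// step 3: local position within the partition + start offset = global index
JavaRDD<Tuple2<Integer, Long>> manual = rdd.mapPartitionsWithIndex((k, it) -> {
    List<Tuple2<Integer, Long>> out = new ArrayList<>();
    long i = 0;
    while (it.hasNext()) {
        out.add(new Tuple2<>(it.next(), start[k] + i++));
    }
    return out.iterator();
}, true);
System.out.println(manual.collect());   // same pairs as rdd.zipWithIndex()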

Example:

List<Integer> data = Arrays.asList(5, 1, 1, 4, 4, 2, 2);
JavaRDD<Integer> javaRDD = javaSparkContext.parallelize(data, 3);
JavaPairRDD<Integer, Long> zipWithIndexRDD = javaRDD.zipWithIndex();
System.out.println("~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~" + zipWithIndexRDD.collect());
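With 7 elements in 3 partitions the indices are simply 0 through 6 in element order, so the printed result should be [(5,0), (1,1), (1,2), (4,3), (4,4), (2,5), (2,6)].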

zipWithUniqueId


Official documentation:

Zips this RDD with generated unique Long ids. Items in the kth partition will get ids k,
n+k, 2*n+k, ..., where n is the number of partitions. So there may exist gaps,
but this method won't trigger a spark job, which is different from [[org.apache.spark.rdd.RDD#zipWithIndex]].

Function signature:

def zipWithUniqueId(): JavaPairRDD[T, JLong]

This function zips each element of the RDD with a unique Long id. The ids are generated per partition: the element at (0-based) position i in partition k gets the id i * n + k, where n is the total number of partitions; in particular, the first element of partition k simply gets the id k.
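For example, with n = 3 partitions, partition 0 hands out the ids 0, 3, 6, ..., partition 1 hands out 1, 4, 7, ..., and partition 2 hands out 2, 5, 8, ...; the ids are therefore unique across partitions, but gaps appear as soon as the partitions have different sizes.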

Source code analysis:

def zipWithUniqueId(): RDD[(T, Long)] = withScope {
  val n = this.partitions.length.toLong
  this.mapPartitionsWithIndex { case (k, iter) =>
    iter.zipWithIndex.map { case (item, i) =>
      (item, i * n + k)
    }
  }
}

As the source shows, zipWithUniqueId() uses mapPartitionsWithIndex() to obtain each partition's index k, and assigns the element at position i within that partition the id i * n + k. Since an id depends only on the partition index and the position inside the partition, no counting job is needed.
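The same one-pass logic can be written directly against the Java API; a minimal sketch, assuming the `javaRDD` from the example below and imports for `java.util.ArrayList`/`List` and `scala.Tuple2` (`uniqueIds` is an illustrative name):

final long n = javaRDD.partitions().size();
JavaRDD<Tuple2<Integer, Long>> uniqueIds = javaRDD.mapPartitionsWithIndex((k, it) -> {
    List<Tuple2<Integer, Long>> out = new ArrayList<>();
    long i = 0;
    while (it.hasNext()) {
        out.add(new Tuple2<>(it.next(), i++ * n + k));   // i * n + k, mirroring the Scala source
    }
    return out.iterator();
}, true);

Because no counting across partitions is required, nothing runs until an action such as collect() is called, which is why zipWithUniqueId does not trigger a job of its own.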

Example:

List<Integer> data = Arrays.asList(5, 1, 1, 4, 4, 2, 2);
JavaRDD<Integer> javaRDD = javaSparkContext.parallelize(data, 3);
JavaPairRDD<Integer, Long> zipWithUniqueIdRDD = javaRDD.zipWithUniqueId();
System.out.println("~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~" + zipWithUniqueIdRDD.collect());
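With the default slicing of parallelize, the 7 elements land in the 3 partitions as (5, 1), (1, 4) and (4, 2, 2), so with n = 3 the printed result should be [(5,0), (1,3), (1,1), (4,4), (4,2), (2,5), (2,8)]: every id is unique, but 6 and 7 are skipped because the partitions have different sizes.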