The spark sorting helpers library is a set of convenience functions for leveraging the secondary sort functionality of Spark partitioning. Secondary sorting allows an RDD or Dataset to be partitioned by key while the values are sorted, pushing that sort into the underlying shuffle machinery. This provides an efficient way to sort values within a partition when a shuffle operation (e.g. a join or groupBy) is already happening anyway.
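For context, here is a rough sketch of what that secondary sort pattern looks like when written by hand against Spark's RDD API: a composite (key, value) key, a partitioner that routes on the key alone, and `repartitionAndSortWithinPartitions` to let the shuffle do the sorting. The partitioner and names below are illustrative assumptions, not part of this library.

```scala
import org.apache.spark.Partitioner
import org.apache.spark.rdd.RDD

// Illustrative only: a hand-rolled secondary sort against Spark's own RDD API,
// showing the kind of boilerplate this library wraps.
class KeyOnlyPartitioner(partitions: Int) extends Partitioner {
  override def numPartitions: Int = partitions
  // Route each composite key (key, value) by the real key only.
  override def getPartition(compositeKey: Any): Int = compositeKey match {
    case (key: String, _) => ((key.hashCode % partitions) + partitions) % partitions
  }
}

val rdd: RDD[(String, Int)] = ???
val sortedWithinKeys: RDD[(String, Int)] = rdd
  .map { case (key, value) => ((key, value), ()) }                            // composite key
  .repartitionAndSortWithinPartitions(new KeyOnlyPartitioner(partitions = 8)) // sorted during the shuffle
  .map { case ((key, value), _) => (key, value) }                             // back to plain pairs
```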
This library uses the extension methods pattern to add methods to RDDs or Datasets of pairs. You can import the implicits with:
import net.gonzberg.spark.sorting.implicits._
You can then call additional functions on certain RDDs or Datasets, e.g.
val rdd: RDD[(String, Int)] = ???
val groupedRDD: RDD[(String, Iterable[Int])] = rdd.sortedGroupByKey
groupedRDD.foreach { case (_, group) => assert(group.toSeq == group.toSeq.sorted) }
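Below is a minimal, self-contained version of that example. The SparkSession setup and the sample data are made up for illustration; only the implicits import and `sortedGroupByKey` come from this library.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession
import net.gonzberg.spark.sorting.implicits._

// Minimal runnable sketch: session setup and sample data are illustrative only.
object SortedGroupByKeyExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("sorting-helpers-example").getOrCreate()

    val rdd: RDD[(String, Int)] = spark.sparkContext.parallelize(
      Seq(("a", 3), ("a", 1), ("b", 2), ("a", 2), ("b", 1))
    )

    // Values for each key arrive already sorted, without a separate in-memory sort.
    val groupedRDD: RDD[(String, Iterable[Int])] = rdd.sortedGroupByKey
    groupedRDD.collect().foreach { case (key, group) =>
      println(s"$key -> ${group.mkString(", ")}")
    }

    spark.stop()
  }
}
```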
This library aims to support Scala 2.11, 2.12, and 2.13. Since there is no single version of Spark that supports all three of those Scala versions, this library is built against a different version of Spark depending on the Scala version.
Scala | Spark |
---|---|
2.11 | 2.4.8 |
2.12 | 3.3.2 |
2.13 | 3.3.2 |
Other combinations of versions may also work, but these are the ones for which the tests run automatically. We will likely drop Scala 2.11 support in a later release, depending on when it becomes too difficult to support.
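One way such a matrix can be expressed in an sbt build is shown below, purely as a sketch: the patch versions, the `spark-sql` module choice, and the layout are assumptions, not necessarily this repository's actual build.sbt.

```scala
// Hypothetical build.sbt sketch: choose the Spark dependency from the Scala binary version.
crossScalaVersions := Seq("2.11.12", "2.12.17", "2.13.10")

libraryDependencies += "org.apache.spark" %% "spark-sql" % {
  CrossVersion.partialVersion(scalaVersion.value) match {
    case Some((2, 11)) => "2.4.8" // the last Spark line published for Scala 2.11
    case _             => "3.3.2"
  }
} % Provided
```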
Scaladocs are available here.
This package is built using sbt. You can run the tests with `sbt test`. You can lint with `sbt scalafmt`. You can use `+` in front of a directive to cross-build, though you'll need Java 8 (as opposed to Java 11) to cross-build to Scala 2.11.