Spark concepts
Job : an action triggers a Spark job in the driver program.
Executor : contains slots for running tasks.
Stage : a collection of tasks executing the same code on different subsets of the data; a sequence of transformations that can be done without shuffling the full data.
Narrow / Wide transformation : in a narrow transformation (e.g. map, filter) each output partition depends on at most one partition of its parent, so no shuffle is needed; a wide transformation (e.g. reduceByKey, join) needs data from many parent partitions and marks a stage boundary. Transformations that may trigger a stage boundary typically accept a numPartitions argument that determines how many partitions to split the data into in the child stage.
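For illustration, a minimal sketch (the data and the partition count are arbitrary): reduceByKey takes an optional numPartitions argument that sets the parallelism of the stage it starts.
scala> val pairs = sc.parallelize(Seq("a", "b", "a", "c", "b", "a")).map(w => (w, 1))
// reduceByKey is a wide transformation: it begins a new stage, and the
// second argument sets how many partitions that child stage uses.
scala> pairs.reduceByKey(_ + _, 4).foreach(println)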
SchemaRDD : an RDD that carries schema information, which opens up Spark’s Catalyst optimizer to programmers using Spark’s core APIs.
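A minimal sketch, assuming the pre-DataFrame Spark SQL API (Spark 1.1/1.2), where an RDD of case classes converts implicitly to a SchemaRDD; the Person class and the data are hypothetical.
scala> case class Person(name: String, age: Int)
scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
scala> import sqlContext.createSchemaRDD   // implicit RDD -> SchemaRDD conversion
scala> val people = sc.parallelize( Person("ann", 30) :: Person("bob", 25) :: Nil )
scala> people.registerTempTable("people")
// Catalyst optimizes the logical plan of this query before execution.
scala> sqlContext.sql("SELECT name FROM people WHERE age > 28").foreach(println)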
All shuffle data must be written to disk and then transferred over the network.
Common tips
- Use reduceByKey instead of groupByKey.
- Don’t create a new object for each record.
- Avoid the flatMap-join-groupBy pattern: use cogroup (see the sketch at the end of this section).
scala> val rdd1 = sc.parallelize( ('a',1) :: ('b',2) :: ('c',3) :: ('c',4) :: Nil )
// BAD !!!! groupByKey shuffles all the values before summing !!!
scala> rdd1.groupByKey().mapValues(_.sum).foreach(println)
(c,7)
(a,1)
(b,2)
// Good: reduceByKey combines values locally on each partition before shuffling.
scala> rdd1.reduceByKey(_ + _).foreach(println)
(a,1)
(c,7)
(b,2)
// BAD !!!! map allocates a new mutable Set for every record !!!
scala> import collection.mutable
scala> val rdd1 = sc.parallelize( 1->"apple" :: 2->"pie" :: 1->"banana" :: 1->"apple" :: 2->"pie" :: Nil )
scala> val rdd2 = rdd1.map(kv => (kv._1, mutable.Set[String]() + kv._2))
scala> rdd2.reduceByKey(_ ++ _).foreach(println)
(2,Set(pie))
(1,Set(banana, apple))
// Good: aggregateByKey mutates one set per partition instead of allocating per record.
scala> val zero = mutable.Set[String]()
scala> rdd1.aggregateByKey(zero)( (set, v) => set += v, (set1, set2) => set1 ++= set2 ).foreach(println)
(1,Set(banana, apple))
(2,Set(pie))
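Note that sharing one mutable zero value is safe here: Spark serializes zeroValue and deserializes a fresh copy for each partition, so tasks never mutate the same Set instance.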
cogroup
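A minimal sketch (the data is hypothetical) of replacing join-then-groupBy with a single cogroup, which groups both RDDs by key in one pass:
scala> val orders = sc.parallelize( 1->"apple" :: 1->"pie" :: 2->"tea" :: Nil )
scala> val users = sc.parallelize( 1->"ann" :: 2->"bob" :: Nil )
// BAD !!!! join materializes one record per matching pair (|left| x |right| per key),
// and groupByKey then has to rebuild the groups.
scala> orders.join(users).groupByKey().foreach(println)
// Good: cogroup yields (key, (values from orders, values from users)) in a single grouping.
scala> orders.cogroup(users).foreach(println)
// output (order may vary):
(1,(CompactBuffer(apple, pie),CompactBuffer(ann)))
(2,(CompactBuffer(tea),CompactBuffer(bob)))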