How Spark Distributes Partitions To Executors
Solution 1:
When you read data from HDFS, the number of partitions depends on the number of blocks being read. From the images attached, it looks like your data is not distributed evenly across the cluster. Try repartitioning your data and tweaking the number of cores and executors.
If you are repartitioning your data and the hash partitioner returns some values far more often than others, that can lead to data skew.
If this happens after performing a join, then your data is skewed.
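A minimal sketch of the repartitioning idea (the SparkSession name, HDFS path, and partition count below are placeholders for illustration, not values from the question):

```scala
// Assumes an existing SparkSession named `spark`; path is a placeholder.
val df = spark.read.parquet("hdfs:///path/to/data")
// Repartitioning to a multiple of the total executor cores gives every
// core work; 48 (e.g. 12 executors x 4 cores) is only an illustrative value.
val balanced = df.repartition(48)
```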
Solution 2:
repartition() will not give you an evenly distributed dataset, because Spark internally uses a HashPartitioner. To spread your data evenly across all partitions, in my view a custom partitioner is the way to go. In that case you need to extend the org.apache.spark.Partitioner class and use your own logic instead of the HashPartitioner. To do this, the RDD first needs to be converted to a PairRDD, as shown in the sketch below.
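A minimal sketch of that approach (the key type and partition count are assumptions for illustration, not from the question):

```scala
import org.apache.spark.Partitioner

// Sketch of a custom partitioner: assumes keys are non-negative Long
// record indices, so a simple modulo spreads records evenly across
// partitions. Adapt the matching logic to whatever your real keys are.
class EvenPartitioner(override val numPartitions: Int) extends Partitioner {
  override def getPartition(key: Any): Int = key match {
    case i: Long => (i % numPartitions).toInt
    case other   => // fall back to a non-negative hash for other key types
      val mod = other.hashCode % numPartitions
      if (mod < 0) mod + numPartitions else mod
  }
}
```

partitionBy is only available on pair RDDs, hence the conversion step first:

```scala
// Turn each record into an (index, record) pair, then repartition by index.
val pairRdd  = rdd.zipWithIndex().map(_.swap)
val balanced = pairRdd.partitionBy(new EvenPartitioner(8)) // 8 is illustrative
```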
The blog post below may help in your case: https://blog.clairvoyantsoft.com/custom-partitioning-spark-datasets-25cbd4e2d818
Thanks