How Spark Distributes Partitions To Executors
Solution 1:
When you read data from HDFS, the number of partitions depends on the number of blocks being read. From the images attached, it looks like your data is not distributed evenly across the cluster. Try repartitioning your data and tweaking the number of cores and executors.
If you are repartitioning your data and the hash partitioner returns some values far more often than others, that can lead to data skew.
If this happens after performing a join, then your data is skewed.
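A minimal sketch of the repartitioning idea (the SparkSession name, HDFS path, and partition count below are placeholders for illustration, not values from the question):

```scala
// Assumes an existing SparkSession named `spark`; path is a placeholder.
val df = spark.read.parquet("hdfs:///path/to/data")
// Repartitioning to a multiple of the total executor cores gives every
// core work; 48 (e.g. 12 executors x 4 cores) is only an illustrative value.
val balanced = df.repartition(48)
```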
Solution 2:
repartition() will not give you an evenly distributed dataset, because Spark internally uses a HashPartitioner. To spread your data evenly across all partitions, in my view a custom partitioner is the way to go. In that case you need to extend the org.apache.spark.Partitioner class and use your own logic instead of the HashPartitioner. To do this, the RDD first needs to be converted to a PairRDD, as shown in the sketch below.
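A minimal sketch of that approach (the key type and partition count are assumptions for illustration, not from the question):

```scala
import org.apache.spark.Partitioner

// Sketch of a custom partitioner: assumes keys are non-negative Long
// record indices, so a simple modulo spreads records evenly across
// partitions. Adapt the matching logic to whatever your real keys are.
class EvenPartitioner(override val numPartitions: Int) extends Partitioner {
  override def getPartition(key: Any): Int = key match {
    case i: Long => (i % numPartitions).toInt
    case other   => // fall back to a non-negative hash for other key types
      val mod = other.hashCode % numPartitions
      if (mod < 0) mod + numPartitions else mod
  }
}
```

partitionBy is only available on pair RDDs, hence the conversion step first:

```scala
// Turn each record into an (index, record) pair, then repartition by index.
val pairRdd  = rdd.zipWithIndex().map(_.swap)
val balanced = pairRdd.partitionBy(new EvenPartitioner(8)) // 8 is illustrative
```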
The blog post below may help in your case: https://blog.clairvoyantsoft.com/custom-partitioning-spark-datasets-25cbd4e2d818
Thanks