
How Does the collectAsMap() Function Work in the Spark API?

I am trying to understand what happens when we run the collectAsMap() function in Spark. As per the PySpark docs, it says: collectAsMap(self) Return the key-value pairs in

Solution 1:

The semantics of collectAsMap are identical between the Scala and Python APIs, so I'll look at the former WLOG. The documentation for PairRDDFunctions.collectAsMap explicitly states:

Warning: this doesn't return a multimap (so if you have multiple values to the same key, only one value per key is preserved in the map returned)

In particular, the current implementation inserts the key-value pairs into the resulting map in order, so for each duplicate key only the last-inserted value survives; that is why only the last two pairs remain in each of your two examples.

If you use collect instead, it will return an Array[(Int, Int)] without losing any of your pairs.
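The difference can be sketched in plain Python. This is a minimal simulation of the semantics, not Spark itself: collectAsMap builds a dictionary from the pairs in order, so the last value per key wins, while collect preserves every pair.

```python
# A hypothetical pair RDD's contents, with duplicate keys.
pairs = [(1, 2), (1, 3), (2, 4), (2, 5)]

# collect() returns every pair unchanged.
collected = list(pairs)

# collectAsMap() inserts the pairs into a map in order,
# so for each duplicate key only the last value survives.
as_map = dict(pairs)

print(collected)  # [(1, 2), (1, 3), (2, 4), (2, 5)]
print(as_map)     # {1: 3, 2: 5}
```

Note that four pairs went in but only two entries came out of the map, exactly as the warning in the docs describes.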

Solution 2:

collectAsMap returns the results of a pair RDD as a Map collection. Because a Map holds only unique keys, duplicate keys are collapsed: for each key, just one value is kept and the rest are discarded.
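If you actually need to keep every value for a duplicate key, you want multimap-like behavior rather than collectAsMap's last-value-wins behavior. A plain-Python sketch of that grouping step (in Spark itself you would group values by key before collecting, e.g. with groupByKey):

```python
from collections import defaultdict

# The same hypothetical pairs with duplicate keys.
pairs = [(1, 2), (1, 3), (2, 4), (2, 5)]

# Group all values under their key instead of letting the last one win.
multimap = defaultdict(list)
for k, v in pairs:
    multimap[k].append(v)

print(dict(multimap))  # {1: [2, 3], 2: [4, 5]}
```

Every value survives, at the cost of each key now mapping to a list rather than a single value.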
