How Does The collectAsMap() Function Work For The Spark API
Solution 1:
The semantics of collectAsMap are identical between the Scala and Python APIs, so I'll look at the Scala one without loss of generality. The documentation for PairRDDFunctions.collectAsMap explicitly states:
Warning: this doesn't return a multimap (so if you have multiple values to the same key, only one value per key is preserved in the map returned)
In particular, the current implementation inserts the key-value pairs into the resultant map in order, so for each duplicated key only the last pair inserted survives; that is why only the last pairs remain in each of your two examples.
If you use collect instead, it will return an Array[(Int, Int)] without losing any of your pairs.
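The difference can be sketched in plain Python without a Spark cluster, since collectAsMap behaves like building a dict from the pairs while collect keeps every pair. The sample data below is hypothetical:

```python
# Hypothetical key-value pairs with a duplicate key (1 appears twice).
pairs = [(1, "a"), (2, "b"), (1, "c")]

# Like rdd.collect(): every pair is kept, duplicates included.
collected = list(pairs)

# Like rdd.collectAsMap(): pairs are inserted into a map in order,
# so the last value for each key overwrites the earlier ones.
as_map = dict(pairs)

print(collected)  # [(1, 'a'), (2, 'b'), (1, 'c')]
print(as_map)     # {1: 'c', 2: 'b'}  -- the (1, 'a') pair was lost
```

The same last-write-wins behavior applies regardless of value types; the Int/Int example in the answer above behaves identically.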
Solution 2:
collectAsMap returns the results of a pair RDD as a Map collection. Because it returns a Map, you will only get one pair per unique key; the other pairs sharing that key are dropped.
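If you need to keep every value for a duplicated key, one common workaround is to group the values per key yourself, producing a multimap-style dict of lists. This is a plain-Python sketch with hypothetical data; in Spark itself, rdd.groupByKey().collectAsMap() gives a comparable per-key grouping:

```python
from collections import defaultdict

# Hypothetical pairs with a duplicate key.
pairs = [(1, "a"), (2, "b"), (1, "c")]

# Build a dict of lists so no value is discarded,
# analogous to collecting the pairs and grouping them driver-side.
multimap = defaultdict(list)
for key, value in pairs:
    multimap[key].append(value)

print(dict(multimap))  # {1: ['a', 'c'], 2: ['b']}
```

Whether this grouping should happen on the driver (after collect) or in the cluster (groupByKey before collecting) depends on how large the data is; collecting a big RDD to the driver is its own problem.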