Calling Pandas Df.sort_values() Multiple Times On The Same Column Gives Different Results?
Solution 1:
For me, specifying kind='mergesort' works. It is the only stable sorting method in DataFrame.sort_values, and when sorting by a single column the default method is kind='quicksort':
kind{‘quicksort’, ‘mergesort’, ‘heapsort’}, default quicksort
Choice of sorting algorithm. See also ndarray.np.sort for more information. mergesort is the only stable algorithm. For DataFrames, this option is only applied when sorting on a single column or label.
When sorting by multiple columns, the default is mergesort.
print (df.sort_values('b', kind='mergesort').head())
     a  b
0    0  1
3    3  1
6    6  1
9    9  1
12  12  1
print (df.sort_values('b', kind='mergesort').sort_values('b', kind='mergesort').head())
     a  b
0    0  1
3    3  1
6    6  1
9    9  1
12  12  1
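The question's DataFrame is not shown here, so the following is only a minimal sketch with made-up data (column a mirrors the index, column b cycles through 1, 2, 3). It illustrates that with kind='mergesort', sorting a second time on the same column leaves the row order unchanged:

```python
import pandas as pd

# Hypothetical data, not the asker's: 'a' mirrors the index and
# 'b' cycles through 1, 2, 3, so every value of 'b' has many ties.
df = pd.DataFrame({"a": range(30), "b": [1, 2, 3] * 10})

once = df.sort_values("b", kind="mergesort")
twice = once.sort_values("b", kind="mergesort")

# A stable sort keeps tied rows in their incoming order, so sorting
# already-sorted data a second time does not move any row.
print(once.head())
print(once.equals(twice))
```

Without kind='mergesort', the order of the tied rows is not guaranteed, which is why repeated calls can disagree.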
Solution 2:
This should be a comment, but it is too long.
According to the docs for DataFrame.sort_values:
kind: ... mergesort is the only stable algorithm.
You are getting different results for column a because there is no guarantee that the order of equal elements in column b will be retained during sorting. Since column b consists of 1s only, the order of the rows after sorting is undetermined. You can either use mergesort, as suggested by jezrael, or sort by column b and then by column a.
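A minimal sketch of the second option, using made-up data in which column b contains only 1s. Passing both columns makes a the tie-breaker, so the result no longer depends on how the algorithm treats equal values of b:

```python
import pandas as pd

# Hypothetical data: every value in 'b' is tied, 'a' is shuffled.
df = pd.DataFrame({"a": [4, 2, 3, 1, 0], "b": [1, 1, 1, 1, 1]})

# Sort by 'b' first, then break ties on 'a'.
by_b_then_a = df.sort_values(["b", "a"])
again = by_b_then_a.sort_values(["b", "a"])

print(by_b_then_a)
print(by_b_then_a.equals(again))
```

This only gives a fully determined order when the combination of b and a is unique for each row, as it is in this toy data.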
Also, please see Quick Sort vs Merge Sort for additional info. The most important point regarding your question is:
- Stability: Merge sort is stable, as two elements with equal value appear in the same order in the sorted output as they did in the unsorted input array. Quick sort is unstable in this scenario.
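To make the stability property concrete without pandas, here is a small sketch using Python's built-in sorted(), which is guaranteed to be stable. The records are invented for illustration; the first field is a label and the second is the sort key:

```python
# Hypothetical records: (label, key). Three of them share key 1.
rows = [("x", 1), ("y", 1), ("z", 0), ("w", 1)]

# sorted() is stable: records with equal keys keep their input order.
stable = sorted(rows, key=lambda r: r[1])

print(stable)  # [('z', 0), ('x', 1), ('y', 1), ('w', 1)]
```

The labels x, y, w come out in the same relative order they went in. An unstable algorithm is free to emit them in any order, which is the behaviour seen with the default quicksort.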