
Calling Pandas Df.sort_values() Multiple Times On The Same Column Gives Different Results?

Example below... why does this happen and how can I prevent it? >>> df = pd.DataFrame({'a': list(range(150)), 'b': [1, 2, 3] * 50}) >>> df.sort_values('b').equals

Solution 1:

What works for me is specifying mergesort, the only stable sorting method in DataFrame.sort_values, because when sorting by a single column the default is kind='quicksort':

kind : {‘quicksort’, ‘mergesort’, ‘heapsort’}, default ‘quicksort’

Choice of sorting algorithm. See also ndarray.np.sort for more information. mergesort is the only stable algorithm. For DataFrames, this option is only applied when sorting on a single column or label.

When sorting by multiple columns, the kind option is not applied (as the quoted docs note) and the sort is stable, as with mergesort.

print(df.sort_values('b', kind='mergesort').head())
     a  b
0    0  1
3    3  1
6    6  1
9    9  1
12  12  1

print(df.sort_values('b', kind='mergesort').sort_values('b', kind='mergesort').head())
     a  b
0    0  1
3    3  1
6    6  1
9    9  1
12  12  1
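To check this on the whole frame rather than just the first five rows, the two results can be compared with DataFrame.equals. This is a minimal sketch that rebuilds the df from the question; the variable names once and twice are only illustrative.

```python
import pandas as pd

# Same frame as in the question: 'b' cycles through 1, 2, 3.
df = pd.DataFrame({'a': list(range(150)), 'b': [1, 2, 3] * 50})

# Sort once, then sort the already sorted result again, both with a stable algorithm.
once = df.sort_values('b', kind='mergesort')
twice = once.sort_values('b', kind='mergesort')

# A stable sort keeps tied rows in their existing order, so re-sorting changes nothing.
same = once.equals(twice)
print(same)
```

Because mergesort keeps rows with equal b in their input order, the comparison should hold no matter how many times the sort is repeated.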

Solution 2:

This should be a comment, but it is too long.

According to the docs for DataFrame.sort_values:

kind: .. mergesort is the only stable algorithm.

You are getting different results for column a because there is no guarantee that the order of equal elements in column b is retained during sorting. Since column b contains only the values 1, 2 and 3, each repeated 50 times, the order of the rows within each group of equal values is undetermined. You can either use mergesort as suggested by jezrael, or sort by column b and then by column a.
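The second option, sorting by b and then by a, could look roughly like the sketch below. It assumes the df from the question; the shuffle with sample (and its random_state) is only there to show that the result does not depend on the input order, and the names shuffled, by_both and again are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'a': list(range(150)), 'b': [1, 2, 3] * 50})

# Shuffle the rows so the input order is arbitrary.
shuffled = df.sample(frac=1, random_state=0)

# Ties in 'b' are broken by 'a'; every (b, a) pair is unique,
# so the final row order is fully determined.
by_both = shuffled.sort_values(['b', 'a'])
again = by_both.sort_values(['b', 'a'])

print(by_both.equals(again))
print(by_both.head())
```

With a as a tie-breaker no two rows compare equal, so the stability of the algorithm no longer matters for the result.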

Also, please see Quick Sort vs Merge Sort for additional info. The most important point regarding your question is:

  1. Stability: Merge sort is stable, meaning that two elements with equal value appear in the same order in the sorted output as they did in the unsorted input. Quick sort is not stable, so it gives no such guarantee.
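The stability property can also be seen directly at the NumPy level. This is a small sketch using np.argsort with kind='stable' on the same values as column b; the expected array is built by hand here purely for comparison.

```python
import numpy as np

# Same values as column 'b' in the question.
b = np.array([1, 2, 3] * 50)

# A stable argsort returns tied positions in ascending original order.
order = np.argsort(b, kind='stable')

# So all positions holding 1 come first (0, 3, 6, ...), then those holding 2, then 3.
expected = np.concatenate([np.arange(start, 150, 3) for start in range(3)])

is_stable = bool(np.array_equal(order, expected))
print(is_stable)
```

An unstable algorithm is free to return the tied positions in any order, which is why the a column can differ between runs of the default single-column sort.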
