Calling Pandas Df.sort_values() Multiple Times On The Same Column Gives Different Results?

December 27, 2023

Example below... why does this happen and how can I prevent it?

>>> df = pd.DataFrame({'a': list(range(150)), 'b': [1, 2, 3] * 50})
>>> df.sort_values('b').equals

Solution 1:

For me it works to specify kind='mergesort', the only stable sorting method in DataFrame.sort_values. When sorting by a single column, the default method is kind='quicksort', which is not stable:

kind{‘quicksort’, ‘mergesort’, ‘heapsort’}, default quicksort
Choice of sorting algorithm. See also ndarray.np.sort for more information. mergesort is the only stable algorithm. For DataFrames, this option is only applied when sorting on a single column or label.

When sorting by multiple columns, the default is mergesort.

print (df.sort_values('b', kind='mergesort').head())
     a  b
0    0  1
3    3  1
6    6  1
9    9  1
12  12  1

print (df.sort_values('b', kind='mergesort').sort_values('b', kind='mergesort').head())
     a  b
0    0  1
3    3  1
6    6  1
9    9  1
12  12  1
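
As a quick check of the claim above, the following sketch rebuilds the frame from the question and compares one stable sort against two stable sorts in a row. The variable names once and twice are only illustrative.

```python
import pandas as pd

# Same frame as in the question: 'b' repeats 1, 2, 3, so it has many ties.
df = pd.DataFrame({'a': list(range(150)), 'b': [1, 2, 3] * 50})

once = df.sort_values('b', kind='mergesort')
twice = once.sort_values('b', kind='mergesort')

# A stable sort keeps tied rows in their incoming order, so sorting
# already sorted data a second time leaves every row where it was.
print(once.equals(twice))
```

With the default kind for a single column there is no such guarantee, which is why repeated calls may reorder the tied rows.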

Solution 2:

This should be a comment, but it is too long.

According to the docs for DataFrame.sort_values:

kind: .. mergesort is the only stable algorithm.

You are getting different results for column a because an unstable sort does not guarantee that rows with equal values in column b keep their relative order. Since column b contains only the values 1, 2 and 3, each repeated 50 times, the order of the rows within each group of ties is undetermined. You can either use mergesort as suggested by jezrael, or sort by column b and then by column a.
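
The second option can be sketched as follows, assuming the frame from the question. Because every (b, a) pair is unique, no ties remain and the result does not depend on the incoming row order. The names by_both and shuffled are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'a': list(range(150)), 'b': [1, 2, 3] * 50})

# Sort by 'b' first and use 'a' to break ties.
by_both = df.sort_values(['b', 'a'])

# Shuffle the rows, then sort the same way: with no ties left,
# the outcome is the same regardless of the starting order.
shuffled = df.sample(frac=1, random_state=0)
print(shuffled.sort_values(['b', 'a']).equals(by_both))
```

This approach changes what is being asked for (a defined order within each group of b) rather than relying on the stability of the algorithm.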

Also, please see Quick Sort vs Merge Sort for additional info. The most important point regarding your question is:

Stability : Merge sort is stable as two elements with equal value appear in the same order in sorted output as they were in the input unsorted array. Quick sort is unstable in this scenario.
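
To make the stability property concrete, here is a small sketch using NumPy's argsort with kind='mergesort'. The array keys is made-up illustration data, not taken from the question.

```python
import numpy as np

# Two distinct key values, each appearing more than once.
keys = np.array([2, 1, 2, 1, 1])

# A stable argsort lists tied elements in their original position order:
# the 1s sit at positions 1, 3, 4 and the 2s at positions 0, 2.
order = np.argsort(keys, kind='mergesort')
print(order.tolist())  # [1, 3, 4, 0, 2]
```

An unstable algorithm is free to return the tied positions in any order, so its output for the same keys is not something to rely on.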
