
How To Delete A Column In Pandas Dataframe Based On A Condition?

I have a pandas DataFrame with many NaN values in it. How can I drop columns such that number_of_na_values > 2000? I tried to do it like this:

toRemove = set()
naNumbersPerCol
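The attempt above is cut off mid-line. Here is a hedged reconstruction of where it was likely headed; the loop body and the toy frame are my guesses for illustration, not the asker's actual code:

import numpy as np
import pandas as pd

# stand-in for the asker's frame (toy data; the real one isn't shown)
df = pd.DataFrame(np.random.randn(3000, 4), columns=list('ABCD'))
df[df < 0] = np.nan

# one plausible completion of the truncated attempt:
toRemove = set()
naNumbersPerCol = df.isnull().sum()       # NaN count per column
for col in naNumbersPerCol.index:
    if naNumbersPerCol[col] > 2000:       # the question's threshold
        toRemove.add(col)
df = df.drop(list(toRemove), axis=1)

The answers below replace this loop with vectorized equivalents.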

Solution 1:

Here's another alternative that keeps the columns with at most the specified number of NaNs in each column:

max_number_of_nas = 3000
df = df.loc[:, (df.isnull().sum(axis=0) <= max_number_of_nas)]

In my tests this seems to be slightly faster than the drop-columns method suggested by Jianxun Li, at least in the cases I tested (as shown below). I should note, though, that the performance becomes more similar if you simply don't use apply (e.g. df.drop(df.columns[df.isnull().sum(axis=0) > max_number_of_nans], axis=1)). Just a reminder that when it comes to performance in pandas, vectorization almost always wins out over apply.

np.random.seed(0)
df = pd.DataFrame(np.random.randn(10000, 5), columns=list('ABCDE'))
df[df < 0] = np.nan
max_number_of_nans = 5010

%timeit c = df.loc[:, (df.isnull().sum(axis=0) <= max_number_of_nans)]
>> 1.1 ms ± 4.08 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit c = df.drop(df.columns[df.isnull().sum(axis=0) > max_number_of_nans], axis=1)
>> 1.3 ms ± 11.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit c = df.drop(df.columns[df.apply(lambda col: col.isnull().sum() > max_number_of_nans)], axis=1)
>> 2.11 ms ± 29.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Performance often varies with data size, so don't forget to test whichever case is closest to your own data.

np.random.seed(0)
df = pd.DataFrame(np.random.randn(10, 5), columns=list('ABCDE'))
df[df < 0] = np.nan
max_number_of_nans = 5

%timeit c = df.loc[:, (df.isnull().sum(axis=0) <= max_number_of_nans)]
>> 755 µs ± 4.84 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit c = df.drop(df.columns[df.isnull().sum(axis=0) > max_number_of_nans], axis=1)
>> 777 µs ± 12 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit c = df.drop(df.columns[df.apply(lambda col: col.isnull().sum() > max_number_of_nans)], axis=1)
>> 1.71 ms ± 17.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
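As a side note (not part of the original answer, and not timed against the variants above): pandas also has this condition built into DataFrame.dropna. With axis=1, the thresh argument keeps only the columns holding at least that many non-NaN values, which is the same test written from the other side:

# keep columns with at most max_number_of_nans NaNs,
# i.e. at least len(df) - max_number_of_nans non-NaN values
c = df.dropna(axis=1, thresh=len(df) - max_number_of_nans)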

Solution 2:

Same logic, but with everything in one line.

import pandas as pd
import numpy as np

# artificial data
# ====================================
np.random.seed(0)
df = pd.DataFrame(np.random.randn(10, 5), columns=list('ABCDE'))
df[df < 0] = np.nan

        A       B       C       D       E
0  1.7641  0.4002  0.9787  2.2409  1.8676
1     NaN  0.9501     NaN     NaN  0.4106
2  0.1440  1.4543  0.7610  0.1217  0.4439
3  0.3337  1.4941     NaN  0.3131     NaN
4     NaN  0.6536  0.8644     NaN  2.2698
5     NaN  0.0458     NaN  1.5328  1.4694
6  0.1549  0.3782     NaN     NaN     NaN
7  0.1563  1.2303  1.2024     NaN     NaN
8     NaN     NaN     NaN  1.9508     NaN
9     NaN     NaN  0.7775     NaN     NaN

# processing: drop columns with no. of NaN > 3
# ====================================
df.drop(df.columns[df.apply(lambda col: col.isnull().sum() > 3)], axis=1)


Out[183]:
        B
0  0.4002
1  0.9501
2  1.4543
3  1.4941
4  0.6536
5  0.0458
6  0.3782
7  1.2303
8     NaN
9     NaN
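Following the vectorization reminder from Solution 1, the same one-liner can also be written without apply; on the frame above it should return the same single-column result:

df.drop(df.columns[df.isnull().sum(axis=0) > 3], axis=1)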
