
Pandas Data Frame Behavior

This code works--it sets each column to its mean: def setSerNanToMean(serAll): return serAll.replace(np.NaN, serAll.mean()) def setPdfNanToMean(pdfAll, listCols): pdfAll.ix

Solution 1:

No, apply does not work inplace*.
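As a rough sketch of what that means in practice (the frame and the lambda here are made up for illustration, not taken from the question): apply hands back a new object, so the original only changes if you assign the result back.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one NaN per column
df = pd.DataFrame({"A": [1.0, np.nan], "B": [np.nan, 4.0]})

# apply builds and returns a new DataFrame; df itself is left alone
result = df.apply(lambda col: col.fillna(col.mean()))

n_missing_before = int(df.isna().sum().sum())     # NaNs still present in df
n_missing_after = int(result.isna().sum().sum())  # NaNs in the returned frame

# To keep the filled values you would rebind the name: df = result
```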

Here's another for you: the inplace flag doesn't actually guarantee that the operation happens in place (!). To give an example:

In [11]: s = pd.Series([1, 2, np.nan, 4])

In [12]: s._data._values
Out[12]: array([  1.,   2.,  nan,   4.])

In [13]: vals = s._data._values

In [14]: s.fillna(s.mean(), inplace=True)

In [15]: vals is s._data._values  # values are the same
Out[15]: True

In [16]: vals
Out[16]: array([ 1.        ,  2.        ,  2.33333333,  4.        ])

In [21]: s = pd.Series([1, 2, np.nan, 4])  # start again

In [22]: vals = s._data._values

In [23]: s.fillna('mean', inplace=True)

In [24]: vals is s._data._values  # values are *not* the same
Out[24]: False

In [25]: s._data._values
Out[25]: array([1.0, 2.0, 'mean', 4.0], dtype=object)

Note: often, if the dtype stays the same, then so does the values array, but pandas does not guarantee this.

In general apply is slow (since you are basically iterating through each row in Python), and the "game" is to rewrite that function in terms of pandas/numpy native functions and indexing. If you want to delve into the internals, check out the BlockManager in core/internals.py; this is the object which holds the underlying numpy arrays. But to be honest, I think your most useful tools are %timeit and reading the source code for specific functions (?? in IPython).
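To illustrate the "game", here is a minimal sketch (the toy frame is an assumption for the demo) that writes the same NaN-to-mean fill twice: once through apply and once with native calls only. DataFrame.fillna accepts a Series keyed by column label, so df.mean() can be passed straight in. Timing the two with %timeit on your own data is the way to see whether the rewrite pays off.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column means are A=2.0 and B=5.0
df = pd.DataFrame({"A": [1.0, np.nan, 3.0], "B": [np.nan, 4.0, 6.0]})

# apply version: one Python-level function call per column
via_apply = df.apply(lambda col: col.fillna(col.mean()))

# native version: df.mean() is a Series indexed by column name,
# and fillna aligns it against the columns
via_native = df.fillna(df.mean())

# Both routes should produce the same frame
pd.testing.assert_frame_equal(via_apply, via_native)
```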

In this specific example I would consider using fillna in an explicit for loop of the columns you want:

In [31]: df = pd.DataFrame([[1, 2, np.nan], [4, np.nan, 6]], columns=['A', 'B', 'C'])

In [32]: for col in ["A", "B"]:
   ....:     df[col].fillna(df[col].mean(), inplace=True)
   ....:

In [33]: df
Out[33]:
   A  B   C
0  1  2 NaN
1  4  2   6

(Perhaps it makes sense for fillna to have a columns argument for this use case?)
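One way to get the effect of a columns argument without a loop, as a sketch rather than the definitive approach: pass fillna a Series that only contains the columns you care about, since labels missing from that Series are left unfilled. The name cols is just a placeholder. This returns a new frame, which also sidesteps df[col].fillna(..., inplace=True); as far as I know, recent pandas versions with Copy-on-Write may not propagate that chained inplace call back to df.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2, np.nan], [4, np.nan, 6]], columns=["A", "B", "C"])

cols = ["A", "B"]  # placeholder: the columns to fill

# The Series of means only has entries for A and B,
# so column C is not touched by fillna
filled = df.fillna(df[cols].mean())

# Rebind if you want to keep the result: df = filled
```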

None of this is to say pandas is memory inefficient... but efficient (and memory-efficient) code sometimes takes some thought.

*apply is not usually going to make sense inplace (and IMO this behaviour would rarely be desired).
