Skip to content Skip to sidebar Skip to footer

Pandas Rolling Sum Of Last Five Minutes

Assume I have below data frame Date, A 2014-11-21 11:00:00, 1 2014-11-21 11:03:00, 4 2014-11-21 11:04:00, 1 2014-11-21 11:05:00, 2 2014-11-21 11:07:00, 4 2014-11-21 11:08:00, 1 201

Solution 1:

In general, if the dates are completely arbitrary, I think you would be forced to use a Python for-loop over the rows or use df.apply, (which under the hood, also uses a Python loop.)

However, if your Dates share a common frequency, as is the case above, then there is a trick which should be much quicker than using df.apply: Expand the timeseries according to the common frequency -- in this case, 1 minute -- fill in the NaNs with zeros, and then call rolling_sum:

In [279]: pd.rolling_sum(df.set_index(['Date']).asfreq('1T').fillna(0), window=5, min_periods=1).reindex(df['Date'])
Out[279]: 
                      A
Date                   
2014-11-21 11:00:00   1
2014-11-21 11:03:00   5
2014-11-21 11:04:00   6
2014-11-21 11:05:00   7
2014-11-21 11:07:00  11
2014-11-21 11:08:00   8
2014-11-21 11:12:00   2
2014-11-21 11:13:00   3

Of course, any time series has a common frequency if you are willing to accept a small enough granularity, but the required size of df.asfreq(...) may make this trick impractical.


Here is an example of the more general approach using df.apply. Note that calling searchsorted relies on df['Date'] being in sorted order.

import numpy as np
import pandas as pd
df = pd.read_csv('data', parse_dates=[0], sep=',\s*')
start_dates = df['Date'] - pd.Timedelta(minutes=5)
df['start_index'] = df['Date'].values.searchsorted(start_dates, side='right')
df['end_index'] = np.arange(len(df))

def sum_window(row):
    return df['A'].iloc[row['start_index']:row['end_index']+1].sum()
df['rolling_sum'] = df.apply(sum_window, axis=1)

print(df[['Date', 'A', 'rolling_sum']])

yields

                 Date  A  rolling_sum
0 2014-11-21 11:00:00  1            1
1 2014-11-21 11:03:00  4            5
2 2014-11-21 11:04:00  1            6
3 2014-11-21 11:05:00  2            7
4 2014-11-21 11:07:00  4           11
5 2014-11-21 11:08:00  1            8
6 2014-11-21 11:12:00  1            2
7 2014-11-21 11:13:00  2            3

Here is a benchmark comparing the df.asfreq trick versus calling df.apply:

import numpy as np
import pandas as pd
df = pd.read_csv('data', parse_dates=[0], sep=',\s*')

def big_df(df):
    df = df.copy()
    for i in range(7):
        dates = df['Date'] + pd.Timedelta(df.iloc[-1]['Date']-df.iloc[0]['Date']) + pd.Timedelta('1 minute')
        df2 = pd.DataFrame({'Date': dates, 'A': df['A']})
        df = pd.concat([df, df2])
    df = df.reset_index(drop=True)
    return df

def using_apply():
    start_dates = df['Date'] - pd.Timedelta(minutes=5)
    df['start_index'] = df['Date'].values.searchsorted(start_dates, side='right')
    df['end_index'] = np.arange(len(df))

    def sum_window(row):
        return df['A'].iloc[row['start_index']:row['end_index']+1].sum()

    df['rolling_sum'] = df.apply(sum_window, axis=1)
    return df[['Date', 'rolling_sum']]

def using_asfreq():
    result = (pd.rolling_sum(
        df.set_index(['Date']).asfreq('1T').fillna(0), 
        window=5, min_periods=1).reindex(df['Date']))
    return result

In [364]: df = big_df(df)

In [367]: %timeit using_asfreq()
1000 loops, best of 3: 1.21 ms per loop

In [368]: %timeit using_apply()
1 loops, best of 3: 208 ms per loop

Post a Comment for "Pandas Rolling Sum Of Last Five Minutes"