
Splitting A Pandas Dataframe

I would like to filter and split my original dataframe into a number of dataframes, using the condition that progressPercentage goes from 1.0 to 100, as in the example below.

Solution 1:

If the Finish value is sometimes missing and you need to rely only on the progressPercentage column, use:

shifted = df['progressPercentage'].shift()
# start a new group only after the *last* 100 of a run,
# so repeated 100s (e.g. rows 15 and 16) stay in one group
m = shifted.diff(-1).ne(0) & shifted.eq(100)
a = m.cumsum()

aa = df.groupby([df.id_B, a])

for k, gp in aa: 
    print('key=' + str(k))
    print(gp)
    print('A NEW ONE...')

key=('id1', 0)
  id_B                 ts_B  course  weight   Phase remainingTime  progressPercentage
0  id1  2017-04-27 01:35:30  cotton     3.5       A      01:15:00                23.0
1  id1  2017-04-27 01:37:30  cotton     3.5       B      01:13:00                24.0
2  id1  2017-04-27 01:38:00  cotton     3.5       B      01:13:00                24.0
3  id1  2017-04-27 01:38:30  cotton     3.5       C      01:13:00                24.0
4  id1  2017-04-27 01:39:00  cotton     3.5       C      00:02:00                99.0
5  id1  2017-04-27 01:39:30  cotton     3.5       C      00:01:00               100.0
6  id1  2017-04-27 01:40:00  cotton     3.5  Finish      00:01:00               100.0
A NEW ONE...
key=('id1', 1)
   id_B                 ts_B  course  weight   Phase remainingTime  progressPercentage
7   id1  2017-04-27 02:35:30  cotton     3.5       A      03:15:00                 1.0
8   id1  2017-04-27 02:36:00  cotton     3.5       A      03:14:00                 2.0
9   id1  2017-04-27 02:36:30  cotton     3.5       A      03:14:00                 2.0
10  id1  2017-04-27 02:37:00  cotton     3.5       B      03:13:00                 3.0
11  id1  2017-04-27 02:37:30  cotton     3.5       B      03:13:00                 4.0
12  id1  2017-04-27 02:38:00  cotton     3.5       B      03:13:00                 5.0
13  id1  2017-04-27 02:38:30  cotton     3.5       C      03:13:00                98.0
14  id1  2017-04-27 02:39:00  cotton     3.5       C      00:02:00                99.0
15  id1  2017-04-27 02:39:30  cotton     3.5       C      00:01:00               100.0
16  id1  2017-04-27 02:40:00  cotton     3.5  Finish      00:01:00               100.0
A NEW ONE...
key=('id2', 2)
...
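The shift/mask/cumsum trick above can be shown on a toy progress series (hypothetical data, not the original file): the mask is True only on the row after the last 100 of a run, and its cumulative sum becomes a group id.

```python
import pandas as pd

# toy progress column: two cycles; the first ends with a repeated 100
s = pd.Series([23.0, 99.0, 100.0, 100.0, 1.0, 99.0, 100.0])

shifted = s.shift()
# a new cycle starts where the previous value is 100
# and differs from the value that follows it
m = shifted.diff(-1).ne(0) & shifted.eq(100)
group_id = m.cumsum()

print(group_id.tolist())  # [0, 0, 0, 0, 1, 1, 1]
```

Note that both 100s of the first cycle land in group 0; a plain `shifted.eq(100)` mask would have split them apart.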

Solution 2:

You can split the dataframe at the rows where progressPercentage equals 100, removing the earlier boundary when two 100s are consecutive. Then slice the dataframe and append each chunk to a list. Hope this helps.

import numpy as np
import pandas as pd

df = pd.read_csv('input.csv', delimiter=',')  # the input csv provided
# note the trailing space in the column name, as it appears in the csv
df1 = df[df["progressPercentage "] == 100]
x = (np.array(df1.index) + 1).tolist()
x.insert(0, 0)
# drop the first of two consecutive boundaries so repeated 100s stay in one dataframe
x = [begin for begin, end in zip(x, x[1:]) if begin != end - 1]
x.insert(len(x), df.shape[0])
frames = [df.iloc[begin:end] for begin, end in zip(x, x[1:])]

You can print the dataframes using a for loop, e.g.:

for df in frames:
    print(df)

Output of the dataframes:

  id_B                 ts_B  course  weight   Phase remainingTime  \
0  id1  2017-04-27 01:35:30  cotton     3.5       A      01:15:00   
1  id1  2017-04-27 01:37:30  cotton     3.5       B      01:13:00   
2  id1  2017-04-27 01:38:00  cotton     3.5       B      01:13:00   
3  id1  2017-04-27 01:38:30  cotton     3.5       C      01:13:00   
4  id1  2017-04-27 01:39:00  cotton     3.5       C      00:02:00   
5  id1  2017-04-27 01:39:30  cotton     3.5       C      00:01:00   
6  id1  2017-04-27 01:40:00  cotton     3.5  Finish      00:01:00   

   progressPercentage   
0                 23.0  
1                 24.0  
2                 24.0  
3                 24.0  
4                 99.0  
5                100.0  
6                100.0  
   id_B                 ts_B  course  weight   Phase remainingTime  \
7   id1  2017-04-27 02:35:30  cotton     3.5       A      03:15:00   
8   id1  2017-04-27 02:36:00  cotton     3.5       A      03:14:00   
9   id1  2017-04-27 02:36:30  cotton     3.5       A      03:14:00   
10  id1  2017-04-27 02:37:00  cotton     3.5       B      03:13:00   
11  id1  2017-04-27 02:37:30  cotton     3.5       B      03:13:00   
12  id1  2017-04-27 02:38:00  cotton     3.5       B      03:13:00   
13  id1  2017-04-27 02:38:30  cotton     3.5       C      03:13:00   
14  id1  2017-04-27 02:39:00  cotton     3.5       C      00:02:00   
15  id1  2017-04-27 02:39:30  cotton     3.5       C      00:01:00   
16  id1  2017-04-27 02:40:00  cotton     3.5  Finish      00:01:00   

    progressPercentage   
7                   1.0  
8                   2.0  
9                   2.0  
10                  3.0  
11                  4.0  
12                  5.0  
13                 98.0  
14                 99.0  
15                100.0  
16                100.0  
   id_B                 ts_B  course  weight   Phase remainingTime  \
17  id2  2017-04-27 03:36:00  cotton     3.5       A      03:15:00   
18  id2  2017-04-27 03:36:30  cotton     3.5       A      03:14:00   
19  id2  2017-04-27 03:37:00  cotton     3.5       B      03:13:00   
20  id2  2017-04-27 03:37:30  cotton     3.5       B      03:13:00   
21  id2  2017-04-27 03:38:00  cotton     3.5       B      03:13:00   
22  id2  2017-04-27 03:38:30  cotton     3.5       C      03:13:00   
23  id2  2017-04-27 03:39:00  cotton     3.5       C      00:02:00   
24  id2  2017-04-27 03:39:30  cotton     3.5       C      00:01:00   
25  id2  2017-04-27 03:40:00  cotton     3.5  Finish      00:01:00   

    progressPercentage   
17                  1.0  
18                  1.0  
19                  2.0  
20                  2.0  
21                  3.0  
22                 98.0  
23                 99.0  
24                100.0  
25                100.0  
   id_B                 ts_B  course  weight   Phase remainingTime  \
26  id1  2017-05-27 01:35:30  cotton     3.5       A      03:15:00   
27  id1  2017-05-27 01:37:30  cotton     3.5       B      03:13:00   
28  id1  2017-05-27 01:38:00  cotton     3.5       B      03:13:00   
29  id1  2017-05-27 01:38:30  cotton     3.5       C      03:13:00   
30  id1  2017-05-27 01:39:00  cotton     3.5       C      00:02:00   
31  id1  2017-05-27 01:39:30  cotton     3.5       C      00:01:00   
32  id1  2017-05-27 01:40:00  cotton     3.5  Finish      00:01:00   

    progressPercentage   
26                 23.0  
27                 24.0  
28                 24.0  
29                 24.0  
30                 99.0  
31                100.0  
32                100.0  
   id_B                 ts_B  course  weight Phase remainingTime  \
33  id1  2017-05-27 02:35:30  cotton     3.5     A      01:15:00   
34  id1  2017-05-27 02:36:00  cotton     3.5     A      01:14:00   
35  id1  2017-05-27 02:36:30  cotton     3.5     A      01:13:00   
36  id1  2017-05-27 02:37:00  cotton     3.5     B      01:12:00   
37  id1  2017-05-27 02:37:30  cotton     3.5     B      01:11:00   
38  id1  2017-05-27 02:38:00  cotton     3.5     B      01:10:00   
39  id1  2017-05-27 02:38:30  cotton     3.5     C      01:09:00   
40  id1  2017-05-27 02:39:00  cotton     3.5     C      00:08:00   
41  id1  2017-05-27 02:39:00  cotton     3.5     C      00:08:00   

    progressPercentage   
33                  1.0  
34                  2.0  
35                  2.0  
36                  3.0  
37                  4.0  
38                  5.0  
39                 98.0  
40                 99.0  
41                100.0  
   id_B                 ts_B  course  weight   Phase remainingTime  \
42  id2  2017-04-27 03:36:00  cotton     3.5       A      03:15:00   
43  id2  2017-04-27 03:36:30  cotton     3.5       A      03:14:00   
44  id2  2017-04-27 03:37:00  cotton     3.5       B      03:13:00   
45  id2  2017-04-27 03:37:30  cotton     3.5       B      03:13:00   
46  id2  2017-04-27 03:38:00  cotton     3.5       B      03:13:00   
47  id2  2017-04-27 03:38:30  cotton     3.5       C      03:13:00   
48  id2  2017-04-27 03:39:00  cotton     3.5       C      00:02:00   
49  id2  2017-04-27 03:39:30  cotton     3.5       C      00:01:00   
50  id2  2017-04-27 03:40:00  cotton     3.5  Finish      00:01:00   

    progressPercentage   
42                  1.0  
43                  1.0  
44                  2.0  
45                  2.0  
46                  3.0  
47                 98.0  
48                 99.0  
49                100.0  
50                100.0  
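The boundary-slicing logic can be sketched on a toy frame (hypothetical data, not the original csv): collect the row after each 100 as a candidate split point, drop the first of any consecutive pair, and slice between the surviving boundaries.

```python
import pandas as pd

# toy frame standing in for the csv; the first cycle ends with a doubled 100
df = pd.DataFrame({'progressPercentage': [23.0, 100.0, 100.0, 1.0, 99.0, 100.0]})

df1 = df[df['progressPercentage'] == 100]
x = (df1.index + 1).tolist()   # candidate split points: the row *after* each 100
x.insert(0, 0)
# drop the first of two consecutive boundaries so a doubled 100 stays in one chunk
x = [begin for begin, end in zip(x, x[1:]) if begin != end - 1]
x.append(df.shape[0])
frames = [df.iloc[begin:end] for begin, end in zip(x, x[1:])]

print([len(f) for f in frames])  # [3, 3]
```

Both 100s of the first cycle end up in the first chunk, matching the behaviour shown in the output above.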

Solution 3:

The best way I found is the following:

    a = dfb['progressPercentage'].shift().eq(100).cumsum()
    df_output = dfb.groupby([dfb.id_B, a])

    for k, gp in df_output:
        print('key=' + str(k))
        print(gp.sort_values(['eventTime', 'wm_id'], ascending=[1, 0]).to_string())
        print('A NEW ONE...')
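A minimal sketch of this grouping on toy data (hypothetical columns; the `eventTime`/`wm_id` sort from the snippet above is omitted since those columns are not in the sample). The groupby result can also be materialized into a dict of dataframes keyed by `(id_B, cycle)`:

```python
import pandas as pd

# toy frame with two cycles for one machine id
df = pd.DataFrame({
    'id_B': ['id1'] * 6,
    'progressPercentage': [23.0, 99.0, 100.0, 1.0, 99.0, 100.0],
})

# increment the group id on every row that follows a 100
a = df['progressPercentage'].shift().eq(100).cumsum()
parts = {k: gp for k, gp in df.groupby([df.id_B, a])}

print(sorted(parts))  # [('id1', 0), ('id1', 1)]
```

Unlike the mask in Solution 1, this simpler `shift().eq(100)` mask would split two consecutive 100s into separate groups, so it assumes each cycle ends with a single 100.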
