Splitting A Pandas Dataframe
I would like to filter and split my original dataframe into a number of dataframes using the condition that progressPercentage goes from 1.0 to 100 as in the following example: Inp
Solution 1:
If the Finish value is sometimes missing and you need to rely only on the progressPercentage column, use:
shifted = df['progressPercentage'].shift()
# Start a new group on the row after a 100, unless that row repeats the 100
# (e.g. rows 15 and 16 are both 100 and must stay together)
m = shifted.diff(-1).ne(0) & shifted.eq(100)
a = m.cumsum()
aa = df.groupby([df.id_B, a])
for k, gp in aa:
    print('key=' + str(k))
    print(gp)
    print('A NEW ONE...')
key=('id1', 0)
    id_B                 ts_B  course  weight   Phase  remainingTime  progressPercentage
 0   id1  2017-04-27 01:35:30  cotton     3.5       A       01:15:00                23.0
 1   id1  2017-04-27 01:37:30  cotton     3.5       B       01:13:00                24.0
 2   id1  2017-04-27 01:38:00  cotton     3.5       B       01:13:00                24.0
 3   id1  2017-04-27 01:38:30  cotton     3.5       C       01:13:00                24.0
 4   id1  2017-04-27 01:39:00  cotton     3.5       C       00:02:00                99.0
 5   id1  2017-04-27 01:39:30  cotton     3.5       C       00:01:00               100.0
 6   id1  2017-04-27 01:40:00  cotton     3.5  Finish       00:01:00               100.0
A NEW ONE...
key=('id1', 1)
    id_B                 ts_B  course  weight   Phase  remainingTime  progressPercentage
 7   id1  2017-04-27 02:35:30  cotton     3.5       A       03:15:00                 1.0
 8   id1  2017-04-27 02:36:00  cotton     3.5       A       03:14:00                 2.0
 9   id1  2017-04-27 02:36:30  cotton     3.5       A       03:14:00                 2.0
10   id1  2017-04-27 02:37:00  cotton     3.5       B       03:13:00                 3.0
11   id1  2017-04-27 02:37:30  cotton     3.5       B       03:13:00                 4.0
12   id1  2017-04-27 02:38:00  cotton     3.5       B       03:13:00                 5.0
13   id1  2017-04-27 02:38:30  cotton     3.5       C       03:13:00                98.0
14   id1  2017-04-27 02:39:00  cotton     3.5       C       00:02:00                99.0
15   id1  2017-04-27 02:39:30  cotton     3.5       C       00:01:00               100.0
16   id1  2017-04-27 02:40:00  cotton     3.5  Finish       00:01:00               100.0
A NEW ONE...
key=('id2', 2)
...
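A hedged aside on the mask above: `diff(-1)` yields NaN on the final row, so a frame that ends in two consecutive 100 rows may see its very last row split into a group of its own. A minimal sketch of a variant that compares each row with its predecessor directly is shown below. The sample frame and the names `progress`, `starts`, and `run_id` are illustrative assumptions, not part of the original answer.

```python
import pandas as pd

# Made-up sample: two runs for id1 and one for id2, with repeated 100s
df = pd.DataFrame({
    "id_B": ["id1"] * 7 + ["id2"] * 4,
    "progressPercentage": [23.0, 99.0, 100.0, 100.0, 1.0, 50.0,
                           100.0, 1.0, 99.0, 100.0, 100.0],
})

progress = df["progressPercentage"]
# A new run starts on a row that follows a 100 and is itself not 100
starts = progress.shift().eq(100) & progress.ne(100)
run_id = starts.cumsum().rename("run")

# One dataframe per (id_B, run) pair
frames = {key: group for key, group in df.groupby([df["id_B"], run_id])}
for key, group in frames.items():
    print("key=" + str(key))
    print(group)
```

Because the comparison never looks ahead, the last row needs no special handling; repeated 100s simply stay in the run they finish.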
Solution 2:
You can split the dataframe at the rows where progressPercentage equals 100. Where several such rows are consecutive, drop the earlier indices so that they stay in the same dataframe. Then slice the dataframe and collect the pieces in a list. Hope this helps.
import numpy as np
import pandas as pd

df = pd.read_csv('input.csv', delimiter=',')  # The input csv provided
df.columns = df.columns.str.strip()  # guard against stray whitespace in the header names
df1 = df[df["progressPercentage"] == 100]
# Positions just after each 100 row, with 0 prepended as the first start
x = (np.array(df1.index) + 1).tolist()
x.insert(0, 0)
# Remove the consecutive elements so that they can be treated under one dataframe.
x = [begin for begin, end in zip(x, x[1:]) if begin != end - 1]
x.append(df.shape[0])
frames = [df.iloc[begin:end] for begin, end in zip(x, x[1:])]
You can print the dataframes using a for loop, i.e.:
for frame in frames:
    print(frame)
Output of the dataframes:
    id_B                 ts_B  course  weight   Phase  remainingTime  progressPercentage
 0   id1  2017-04-27 01:35:30  cotton     3.5       A       01:15:00                23.0
 1   id1  2017-04-27 01:37:30  cotton     3.5       B       01:13:00                24.0
 2   id1  2017-04-27 01:38:00  cotton     3.5       B       01:13:00                24.0
 3   id1  2017-04-27 01:38:30  cotton     3.5       C       01:13:00                24.0
 4   id1  2017-04-27 01:39:00  cotton     3.5       C       00:02:00                99.0
 5   id1  2017-04-27 01:39:30  cotton     3.5       C       00:01:00               100.0
 6   id1  2017-04-27 01:40:00  cotton     3.5  Finish       00:01:00               100.0

    id_B                 ts_B  course  weight   Phase  remainingTime  progressPercentage
 7   id1  2017-04-27 02:35:30  cotton     3.5       A       03:15:00                 1.0
 8   id1  2017-04-27 02:36:00  cotton     3.5       A       03:14:00                 2.0
 9   id1  2017-04-27 02:36:30  cotton     3.5       A       03:14:00                 2.0
10   id1  2017-04-27 02:37:00  cotton     3.5       B       03:13:00                 3.0
11   id1  2017-04-27 02:37:30  cotton     3.5       B       03:13:00                 4.0
12   id1  2017-04-27 02:38:00  cotton     3.5       B       03:13:00                 5.0
13   id1  2017-04-27 02:38:30  cotton     3.5       C       03:13:00                98.0
14   id1  2017-04-27 02:39:00  cotton     3.5       C       00:02:00                99.0
15   id1  2017-04-27 02:39:30  cotton     3.5       C       00:01:00               100.0
16   id1  2017-04-27 02:40:00  cotton     3.5  Finish       00:01:00               100.0

    id_B                 ts_B  course  weight   Phase  remainingTime  progressPercentage
17   id2  2017-04-27 03:36:00  cotton     3.5       A       03:15:00                 1.0
18   id2  2017-04-27 03:36:30  cotton     3.5       A       03:14:00                 1.0
19   id2  2017-04-27 03:37:00  cotton     3.5       B       03:13:00                 2.0
20   id2  2017-04-27 03:37:30  cotton     3.5       B       03:13:00                 2.0
21   id2  2017-04-27 03:38:00  cotton     3.5       B       03:13:00                 3.0
22   id2  2017-04-27 03:38:30  cotton     3.5       C       03:13:00                98.0
23   id2  2017-04-27 03:39:00  cotton     3.5       C       00:02:00                99.0
24   id2  2017-04-27 03:39:30  cotton     3.5       C       00:01:00               100.0
25   id2  2017-04-27 03:40:00  cotton     3.5  Finish       00:01:00               100.0

    id_B                 ts_B  course  weight   Phase  remainingTime  progressPercentage
26   id1  2017-05-27 01:35:30  cotton     3.5       A       03:15:00                23.0
27   id1  2017-05-27 01:37:30  cotton     3.5       B       03:13:00                24.0
28   id1  2017-05-27 01:38:00  cotton     3.5       B       03:13:00                24.0
29   id1  2017-05-27 01:38:30  cotton     3.5       C       03:13:00                24.0
30   id1  2017-05-27 01:39:00  cotton     3.5       C       00:02:00                99.0
31   id1  2017-05-27 01:39:30  cotton     3.5       C       00:01:00               100.0
32   id1  2017-05-27 01:40:00  cotton     3.5  Finish       00:01:00               100.0

    id_B                 ts_B  course  weight   Phase  remainingTime  progressPercentage
33   id1  2017-05-27 02:35:30  cotton     3.5       A       01:15:00                 1.0
34   id1  2017-05-27 02:36:00  cotton     3.5       A       01:14:00                 2.0
35   id1  2017-05-27 02:36:30  cotton     3.5       A       01:13:00                 2.0
36   id1  2017-05-27 02:37:00  cotton     3.5       B       01:12:00                 3.0
37   id1  2017-05-27 02:37:30  cotton     3.5       B       01:11:00                 4.0
38   id1  2017-05-27 02:38:00  cotton     3.5       B       01:10:00                 5.0
39   id1  2017-05-27 02:38:30  cotton     3.5       C       01:09:00                98.0
40   id1  2017-05-27 02:39:00  cotton     3.5       C       00:08:00                99.0
41   id1  2017-05-27 02:39:00  cotton     3.5       C       00:08:00               100.0

    id_B                 ts_B  course  weight   Phase  remainingTime  progressPercentage
42   id2  2017-04-27 03:36:00  cotton     3.5       A       03:15:00                 1.0
43   id2  2017-04-27 03:36:30  cotton     3.5       A       03:14:00                 1.0
44   id2  2017-04-27 03:37:00  cotton     3.5       B       03:13:00                 2.0
45   id2  2017-04-27 03:37:30  cotton     3.5       B       03:13:00                 2.0
46   id2  2017-04-27 03:38:00  cotton     3.5       B       03:13:00                 3.0
47   id2  2017-04-27 03:38:30  cotton     3.5       C       03:13:00                98.0
48   id2  2017-04-27 03:39:00  cotton     3.5       C       00:02:00                99.0
49   id2  2017-04-27 03:39:30  cotton     3.5       C       00:01:00               100.0
50   id2  2017-04-27 03:40:00  cotton     3.5  Finish       00:01:00               100.0
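The slicing idea in this solution can also be wrapped in a small function that runs without the CSV. The sketch below is a variant under stated assumptions, not the answer's own code: `split_on_complete`, its `column` parameter, and the `demo` frame are hypothetical names. It locates the 100 rows by position rather than by index label, so `iloc` slicing should stay correct when the index is not 0..n-1, and it keeps any rows after the last 100 as a final chunk. Like the solution above, it does not separate rows by id_B.

```python
import numpy as np
import pandas as pd

def split_on_complete(df, column="progressPercentage"):
    """Split df into consecutive chunks, each closed by a run of 100s."""
    is_done = df[column].to_numpy() == 100
    # A chunk ends at a 100 row that is not followed by another 100 row
    next_done = np.append(is_done[1:], False)
    ends = np.flatnonzero(is_done & ~next_done) + 1
    bounds = [0, *ends.tolist()]
    # Keep trailing rows that have not reached 100 yet as a last chunk
    if bounds[-1] != len(df):
        bounds.append(len(df))
    return [df.iloc[begin:end] for begin, end in zip(bounds, bounds[1:])]

# Made-up data with a non-default index and an unfinished final run
demo = pd.DataFrame(
    {"progressPercentage": [23, 99, 100, 100, 1, 50, 100, 1, 2]},
    index=range(10, 19),
)
parts = split_on_complete(demo)
for part in parts:
    print(part)
```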
Solution 3:
The best way I found is the following:
a = dfb['progressPercentage'].shift().eq(100).cumsum()
df_output = dfb.groupby([dfb.id_B, a])
for k, gp in df_output:
    print('key=' + str(k))
    print(gp.sort_values(['eventTime', 'wm_id'], ascending=[True, False]).to_string())
    print('A NEW ONE...')
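To see what the grouping key in this solution looks like, here is a small sketch on made-up values (the series is illustrative, not the poster's data). Every row that follows a 100 increments the counter, so a repeated 100 lands in a group of its own; this suggests the approach fits best when each run reaches 100 only once.

```python
import pandas as pd

# Illustrative progress values; note the repeated 100 at positions 4 and 5
progress = pd.Series([1.0, 50.0, 100.0, 2.0, 100.0, 100.0, 3.0])

# True on every row whose predecessor equals 100; cumsum turns that into a counter
key = progress.shift().eq(100).cumsum()

print(pd.DataFrame({"progress": progress, "key": key}))
```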