Pandas. Picking A Column Name Based On Row Data
In my previous question, I was trying to count blanks and build a dataframe with new columns for the subsequent analysis. The question became too exhaustive and I decided to split
Solution 1:
Here is a function that may be helpful, IIUC.
import pandas as pd
# create test data
t = pd.DataFrame({'x': [10, 20] + [None] * 3 + [30, 40, 50, 60] + [None] * 5 + [70]})
Create a function to find start location, end location, and size of each 'group', where a group is a sequence of repeated values (e.g., NaNs):
def extract_nans(df, field):
    # work on a copy so the caller's data frame is not modified
    df = df.copy()
    # identify NaNs
    df['is_na'] = df[field].isna()
    # identify groups (sequence of identical values is a group): X Y X => 3 groups
    # XOR with the previous row is True wherever the NaN flag changes;
    # fill_value=False keeps the shifted column boolean
    df['group_id'] = (df['is_na'] ^ df['is_na'].shift(1, fill_value=False)).cumsum()
    # how many members in this group?
    df['group_size'] = df.groupby('group_id')['group_id'].transform('size')
    # initial, final index of each group (assumes the default 0..n-1 index)
    df['min_index'] = df.reset_index().groupby('group_id')['index'].transform('min')
    df['max_index'] = df.reset_index().groupby('group_id')['index'].transform('max')
    return df
Results:
summary = extract_nans(t, 'x')
print(summary)
       x  is_na  group_id  group_size  min_index  max_index
0   10.0  False         0           2          0          1
1   20.0  False         0           2          0          1
2    NaN   True         1           3          2          4
3    NaN   True         1           3          2          4
4    NaN   True         1           3          2          4
5   30.0  False         2           4          5          8
6   40.0  False         2           4          5          8
7   50.0  False         2           4          5          8
8   60.0  False         2           4          5          8
9    NaN   True         3           5          9         13
10   NaN   True         3           5          9         13
11   NaN   True         3           5          9         13
12   NaN   True         3           5          9         13
13   NaN   True         3           5          9         13
14  70.0  False         4           1         14         14
Now you can drop 'x' from the summary, keep only the NaN rows (is_na == True), keep only runs above a certain length (e.g., at least 3 consecutive NaN values), and so on. If you then drop duplicates, the first row will summarize the first NaN run, the second row the second NaN run, etc.
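The filtering described above can be sketched in one step. This is only an illustration, not part of the original answer: the helper name nan_runs and its min_len parameter are made up here, and it builds one row per NaN run directly instead of dropping duplicates afterwards.

```python
import pandas as pd

def nan_runs(series, min_len=1):
    """Summarize runs of consecutive NaNs that are at least min_len long.

    Hypothetical helper for illustration; reports positions, not index labels.
    """
    is_na = series.isna()
    # a new group starts wherever the NaN flag differs from the previous row
    group_id = (is_na ^ is_na.shift(1, fill_value=False)).cumsum().rename('group_id')
    # positional index, so the result does not depend on the series' own index
    pos = pd.Series(range(len(series)), index=series.index)
    # keep only NaN rows, then compute size / first / last position per run
    runs = (
        pos[is_na]
        .groupby(group_id[is_na])
        .agg(group_size='size', min_index='min', max_index='max')
    )
    return runs[runs['group_size'] >= min_len]

# same test data as above
t = pd.DataFrame({'x': [10, 20] + [None] * 3 + [30, 40, 50, 60] + [None] * 5 + [70]})
print(nan_runs(t['x'], min_len=4))
```

With min_len=4 only the second run (positions 9 to 13) should remain; with the default min_len=1 both runs are kept.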
Finally, you can use this with apply() to process the whole data frame, if this is what you need.
Short version of results, for the test data frame:
print(summary[summary['is_na']].drop(columns='x').drop_duplicates())
   is_na  group_id  group_size  min_index  max_index
2   True         1           3          2          4
9   True         3           5          9         13
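For the apply() idea mentioned earlier, here is a rough sketch of running the same run-detection trick over every column of a data frame. The function name longest_nan_run and the small example frame are assumptions made for illustration only; the reduction to a single number per column (the longest NaN run) is just one possible summary.

```python
import pandas as pd

def longest_nan_run(series):
    """Length of the longest run of consecutive NaNs in a column (hypothetical helper)."""
    is_na = series.isna()
    if not is_na.any():
        return 0
    # same grouping trick: the id increases each time the NaN flag flips
    group_id = (is_na ^ is_na.shift(1, fill_value=False)).cumsum()
    # summing the boolean flag per group gives the run length for NaN groups, 0 otherwise
    return int(is_na.groupby(group_id).sum().max())

# made-up example frame with different NaN patterns per column
df = pd.DataFrame({
    'x': [10, None, None, None, 30],
    'y': [None, 1, None, None, 5],
    'z': [1, 2, 3, 4, 5],
})

# DataFrame.apply calls the function once per column by default (axis=0)
result = df.apply(longest_nan_run)
print(result)
```

The result is a Series indexed by column name, so it can be filtered or sorted like any other Series.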