Pandas. Picking A Column Name Based On Row Data

September 14, 2022

In my previous question, I was trying to count blanks and build a dataframe with new columns for the subsequent analysis. The question became too exhaustive and I decided to split

Solution 1:

If I understand correctly, here is a function that may be helpful.

import pandas as pd

# create test data
t = pd.DataFrame({'x': [10, 20] + [None] * 3 + [30, 40, 50, 60] + [None] * 5 + [70]})

Create a function to find start location, end location, and size of each 'group', where a group is a sequence of repeated values (e.g., NaNs):

def extract_nans(df, field):
    df = df.copy()
    
    # identify NaNs
    df['is_na'] = df[field].isna()

    # identify groups (sequence of identical values is a group):  X Y X => 3 groups
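    # fill_value=False keeps the shifted column boolean, so the XOR is well defined on the first row
    df['group_id'] = (df['is_na'] ^ df['is_na'].shift(1, fill_value=False)).cumsum()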

    # how many members in this group?
    df['group_size'] = df.groupby('group_id')['group_id'].transform('size')

    # initial, final index of each group
    # (assumes df has a default, unnamed RangeIndex, so reset_index() adds an 'index' column)
    df['min_index'] = df.reset_index().groupby('group_id')['index'].transform('min')
    df['max_index'] = df.reset_index().groupby('group_id')['index'].transform('max')

    return df

Results:

summary = extract_nans(t, 'x')
print(summary)

       x  is_na  group_id  group_size  min_index  max_index
0   10.0  False         0           2          0          1
1   20.0  False         0           2          0          1
2    NaN   True         1           3          2          4
3    NaN   True         1           3          2          4
4    NaN   True         1           3          2          4
5   30.0  False         2           4          5          8
6   40.0  False         2           4          5          8
7   50.0  False         2           4          5          8
8   60.0  False         2           4          5          8
9    NaN   True         3           5          9         13
10   NaN   True         3           5          9         13
11   NaN   True         3           5          9         13
12   NaN   True         3           5          9         13
13   NaN   True         3           5          9         13
14  70.0  False         4           1         14         14

Now you can drop 'x' from the summary, keep only the NaN rows (is_na == True), and optionally keep only runs above a certain length (e.g., at least 3 consecutive NaN values). If you then drop duplicates, each remaining row summarizes one NaN run: the first row describes the first run, the second row the second run, and so on.
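As a rough sketch of that filtering step, the snippet below rebuilds the per-row summary for the test data in condensed form and then keeps only the NaN runs of a minimum length. The names min_run and long_nan_runs, and the threshold of 4, are assumptions made for illustration; they are not part of the function above.

```python
import pandas as pd

# same test data as above
t = pd.DataFrame({'x': [10, 20] + [None] * 3 + [30, 40, 50, 60] + [None] * 5 + [70]})

# rebuild the per-row summary (condensed version of extract_nans)
summary = t.copy()
summary['is_na'] = summary['x'].isna()
summary['group_id'] = (summary['is_na'] ^ summary['is_na'].shift(1, fill_value=False)).cumsum()
grouped = summary.reset_index().groupby('group_id')['index']
summary['group_size'] = grouped.transform('size')
summary['min_index'] = grouped.transform('min')
summary['max_index'] = grouped.transform('max')

# hypothetical threshold: keep only runs of at least 4 consecutive NaNs
min_run = 4

long_nan_runs = (
    summary[summary['is_na'] & (summary['group_size'] >= min_run)]
    .drop(columns='x')        # the values themselves are all NaN here
    .drop_duplicates()        # one row per run
    .reset_index(drop=True)
)
print(long_nan_runs)
```

With this test data only the second NaN run (positions 9 through 13) is long enough to survive the filter.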

Finally, if you need to process the whole data frame, you can call the function once per column, for example in a loop over the column names.
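One possible way to cover a whole data frame is to summarize each column separately and stack the results. The sketch below is only an illustration: the helper nan_run_summary, the two-column frame df, and the 'column'/'run' index names are all hypothetical, and it uses a plain loop over the columns rather than apply().

```python
import pandas as pd

def nan_run_summary(series):
    """Return one row per run of consecutive NaNs in `series` (hypothetical helper)."""
    is_na = series.isna()
    # a new group starts wherever the NaN flag flips
    group_id = (is_na ^ is_na.shift(1, fill_value=False)).cumsum()
    frame = pd.DataFrame({
        'is_na': is_na.to_numpy(),
        'group_id': group_id.to_numpy(),
        'position': range(len(series)),   # 0-based row position
    })
    return (
        frame[frame['is_na']]
        .groupby('group_id')['position']
        .agg(group_size='count', min_index='min', max_index='max')
        .reset_index(drop=True)
    )

# hypothetical frame with NaN runs in two columns
df = pd.DataFrame({
    'a': [1, None, None, 4, None],
    'b': [None, 2, 3, None, None],
})

# summarize each column, then stack the results keyed by column name
all_runs = pd.concat(
    {col: nan_run_summary(df[col]) for col in df.columns},
    names=['column', 'run'],
)
print(all_runs)
```

Keying the result by column name keeps track of which column each NaN run came from, so the combined table can be filtered by run length in the same way as the single-column summary.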


Short version of results, for the test data frame:

print(summary[summary['is_na']].drop(columns='x').drop_duplicates())
   is_na  group_id  group_size  min_index  max_index
2   True         1           3          2          4
9   True         3           5          9         13
