Pandas Mean For Certain Column

February 28, 2024

I have a pandas DataFrame like this: How can I calculate the mean (or min/max, median) of a specific column where Cluster == 1 or Cluster == 2? Thanks!

Solution 1:

You can create a new DataFrame containing only the relevant rows:

newdf = df[df['cluster'].isin([1, 2])]

newdf.mean()

This gives the mean of every column. To calculate the mean of a specific column, select it first (min(), max() and median() work the same way):

newdf["page"].mean()
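As a minimal, self-contained sketch of this approach: the data below is made up for illustration, and the column names "cluster" and "page" are simply the ones used in this answer, not necessarily those of the asker's DataFrame.

```python
import pandas as pd

# Hypothetical example data; column names follow the answer above
df = pd.DataFrame({
    "cluster": [1, 1, 2, 2, 3, 3],
    "page": [10, 20, 30, 40, 50, 60],
})

# Keep only the rows whose cluster is 1 or 2
newdf = df[df["cluster"].isin([1, 2])]

# Statistics of a single column over the selected rows
page_mean = newdf["page"].mean()
page_min = newdf["page"].min()
page_max = newdf["page"].max()
page_median = newdf["page"].median()
```

Each call returns a single scalar, because a Series has only one axis to reduce over.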

Solution 2:

If you meant to take the mean only where Cluster is 1 or 2, then the other answers here address your issue. If you meant to take a separate mean for each value of Cluster, you can use pandas' aggregation functions, including groupby and agg:

df.groupby("Cluster").mean()

is the simplest and will take means of all columns, grouped by Cluster.

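df.groupby("Cluster").agg({"duration": "mean"})

is an example where you take the mean of just one specific column, grouped by Cluster. You can also use "min", "max", "median", etc. Passing NumPy functions such as np.mean works as well, but recent pandas versions emit a warning suggesting the string names instead.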

The groupby method produces a GroupBy object, which resembles a DataFrame but is not one. Think of it as the DataFrame already split into groups, waiting for an aggregation to be applied. The GroupBy object has simple built-in aggregation methods that apply to all columns (the mean() in the first example), and also a more general aggregation method (the agg() in the second example) that you can use to apply specific functions in a variety of ways. One way of using it is to pass a dict mapping column names to functions, so that specific functions are applied to specific columns.
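The dict form of agg() can be sketched as follows. The data is invented for illustration, and "duration" and "page" are assumed column names; each dict value may be a single function name or a list of them.

```python
import pandas as pd

# Hypothetical example data
df = pd.DataFrame({
    "Cluster": [1, 1, 2, 2, 3, 3],
    "duration": [1.0, 3.0, 5.0, 7.0, 9.0, 11.0],
    "page": [10, 20, 30, 40, 50, 60],
})

# Apply different aggregations to different columns, one row per cluster
per_cluster = df.groupby("Cluster").agg({
    "duration": ["mean", "median"],
    "page": ["min", "max"],
})

# The result has two-level column labels: (column name, function name)
duration_mean_cluster_1 = per_cluster.loc[1, ("duration", "mean")]

# Keep only the rows for clusters 1 and 2
selected = per_cluster.loc[[1, 2]]
```

Because the result is indexed by the Cluster values, selecting the clusters of interest afterwards is a plain .loc lookup.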

Solution 3:

You can do it in one line, using boolean indexing. For example you can do something like:

import numpy as np
import pandas as pd

# This will just produce an example DataFrame
# (np.int was removed from recent NumPy versions, so the built-in int is used)
df = pd.DataFrame({'a': np.arange(30), 'Cluster': np.ones(30, dtype=int)})
df.loc[10:19, "Cluster"] *= 2
df.loc[20:, "Cluster"] *= 3

# This line is all you need
df.loc[(df['Cluster'] == 1) | (df['Cluster'] == 2), 'a'].mean()

The boolean mask is True for the rows belonging to the selected clusters, and 'a' is just the name of the column to compute the mean over.
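Since the question also asks about min, max and median, the same boolean mask can feed Series.agg with a list of function names. This is a small sketch on the same kind of example data, where the column name 'a' is arbitrary:

```python
import numpy as np
import pandas as pd

# Example DataFrame: 'a' runs from 0 to 29, with ten rows in each of clusters 1, 2 and 3
df = pd.DataFrame({"a": np.arange(30), "Cluster": np.repeat([1, 2, 3], 10)})

# Boolean mask: True for rows in cluster 1 or cluster 2
mask = (df["Cluster"] == 1) | (df["Cluster"] == 2)

# Several statistics of column 'a' at once, returned as a Series keyed by function name
stats = df.loc[mask, "a"].agg(["mean", "min", "max", "median"])
```

Individual values can then be read with, for example, stats["median"].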

Solution 4:

Simple intuitive answer

First pick the rows of interest, then average, then pick the columns of interest.

clusters_of_interest = [1, 2]
columns_of_interest = ['page']

# rows of interest
newdf = df[ df.CLUSTER.isin(clusters_of_interest) ]
# average and pick columns of interest
newdf.mean(axis=0)[ columns_of_interest ]

More advanced

# Create a groups object according to the value in the 'CLUSTER' column
grp = df.groupby('CLUSTER')
# Apply the functions of interest to all cluster groupings
data_agg = grp.agg(['mean', 'max', 'min'])

This is also a good link which describes aggregation techniques. Note that the "simple answer" averages over clusters 1 and 2 together (or whatever is specified in clusters_of_interest), while the .agg function computes each statistic separately for every group of rows sharing the same CLUSTER value.
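The difference between the two approaches shows up most clearly when the groups have different sizes. The following is a small sketch on made-up data, with the column names CLUSTER and page assumed as in this answer:

```python
import pandas as pd

# Hypothetical data: cluster 1 has three rows, clusters 2 and 3 have one row each
df = pd.DataFrame({
    "CLUSTER": [1, 1, 1, 2, 3],
    "page": [10, 20, 30, 100, 50],
})

# "Simple" approach: one mean over all rows in clusters 1 and 2 together
pooled_mean = df.loc[df["CLUSTER"].isin([1, 2]), "page"].mean()

# "Advanced" approach: separate statistics for each CLUSTER value
per_group = df.groupby("CLUSTER")["page"].agg(["mean", "max", "min"])
```

Here pooled_mean is a single number for the combined rows, while per_group holds one row of statistics per cluster, so the two answer different questions.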
