Pandas Mean For Certain Column
Solution 1:
You can create new df with only the relevant rows, using:
newdf = df[df['cluster'].isin([1, 2])]
newdf.mean(axis=0)
To calculate the mean of a specific column, you can use:
newdf["page"].mean()
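As a minimal sketch of the filtering approach above, the following builds a small made-up DataFrame (the values are hypothetical; only the column names 'cluster' and 'page' come from the answer) and takes the mean of one column over the selected rows:

```python
import pandas as pd

# Hypothetical sample data; only the column names follow the answer above.
df = pd.DataFrame({
    "cluster": [1, 1, 2, 2, 3, 3],
    "page": [10, 20, 30, 40, 50, 60],
})

# Keep only the rows whose cluster is 1 or 2.
newdf = df[df["cluster"].isin([1, 2])]

# Mean of a single column over the selected rows.
page_mean = newdf["page"].mean()
print(page_mean)  # 25.0
```

Note that a Series has only one axis, so mean() is called without an axis argument here.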
Solution 2:
If you meant to take the mean only where Cluster is 1 or 2, then the other answers here address your issue. If you meant to take a separate mean for each value of Cluster, you can use pandas' aggregation functions, including groupby and agg:
df.groupby("Cluster").mean()
is the simplest and will take means of all columns, grouped by Cluster.
df.groupby("Cluster").agg({"duration": np.mean})
is an example where you take the mean of just one specific column, grouped by Cluster (this assumes numpy has been imported as np). You can also use np.min, np.max, np.median, etc.
The groupby method produces a GroupBy object, which is something like but not like a DataFrame. Think of it as the DataFrame grouped, waiting for aggregation to be applied to it. The GroupBy object has simple built-in aggregation functions that apply to all columns (the mean() in the first example), and also a more general aggregation function (the agg() in the second example) that you can use to apply specific functions in a variety of ways. One way of using it is passing a dict of column names keyed to functions, so specific functions can be applied to specific columns.
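To make the two styles concrete, here is a small sketch on made-up data (the values and the extra 'page' column are assumptions for illustration). It passes the aggregation names as strings ("mean", "max") rather than the numpy functions used above, which pandas also accepts:

```python
import pandas as pd

# Hypothetical sample data with a 'Cluster' column and two numeric columns.
df = pd.DataFrame({
    "Cluster": [1, 1, 2, 2, 3],
    "duration": [2.0, 4.0, 6.0, 10.0, 5.0],
    "page": [1, 3, 5, 7, 9],
})

# A GroupBy object: grouped, but not yet aggregated.
grouped = df.groupby("Cluster")

# Built-in aggregation applied to every column.
all_means = grouped.mean()

# A dict of column names keyed to functions: a different function per column.
per_column = grouped.agg({"duration": "mean", "page": "max"})
print(per_column)
```

Both results are DataFrames indexed by the Cluster values, so a single group can be looked up with something like per_column.loc[2].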
Solution 3:
You can do it in one line, using boolean indexing. For example you can do something like:
import numpy as np
import pandas as pd
# This will just produce an example DataFrame
df = pd.DataFrame({'a': np.arange(30), 'Cluster': np.ones(30, dtype=int)})
df.loc[10:19, "Cluster"] *= 2
df.loc[20:, "Cluster"] *= 3
# This line is all you need
df.loc[(df['Cluster']==1)|(df['Cluster']==2), 'a'].mean()
The boolean indexing array is True for the correct clusters. a is just the name of the column to compute the mean over.
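As a rough check of how the boolean mask behaves, the sketch below rebuilds the same example DataFrame and compares the mask written with | against an equivalent isin() mask (using isin here is a suggestion borrowed from the other answers, not part of this solution):

```python
import numpy as np
import pandas as pd

# Same example DataFrame as above: clusters 1, 2, 3 in blocks of ten rows.
df = pd.DataFrame({"a": np.arange(30), "Cluster": np.ones(30, dtype=int)})
df.loc[10:19, "Cluster"] *= 2
df.loc[20:, "Cluster"] *= 3

# Boolean Series: True where Cluster is 1 or 2, False elsewhere.
mask = (df["Cluster"] == 1) | (df["Cluster"] == 2)

mean_or = df.loc[mask, "a"].mean()
mean_isin = df.loc[df["Cluster"].isin([1, 2]), "a"].mean()
print(mean_or, mean_isin)  # 9.5 9.5
```

The parentheses around each comparison matter, because | binds more tightly than == in Python.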
Solution 4:
Simple intuitive answer
First pick the rows of interest, then average, then pick the columns of interest.
clusters_of_interest = [1, 2]
columns_of_interest = ['page']
# rows of interest
newdf = df[ df.CLUSTER.isin(clusters_of_interest) ]
# average and pick columns of interest
newdf.mean(axis=0)[ columns_of_interest ]
More advanced
# Create groups object according to the value in the 'cluster' column
grp = df.groupby('CLUSTER')
# apply functions of interest to all cluster groupings
data_agg = grp.agg(['mean', 'max', 'min'])
This is also a good link which describes aggregation techniques. It should be noted that the "simple answer" averages over clusters 1 AND 2, or whatever is specified in clusters_of_interest, while the .agg function averages over each group of values having the same CLUSTER value.
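The difference between the two approaches can be seen on a small made-up example (the values are hypothetical, chosen so that the groups have unequal sizes):

```python
import pandas as pd

# Hypothetical data: cluster 1 has three rows, clusters 2 and 3 have one each.
df = pd.DataFrame({
    "CLUSTER": [1, 1, 1, 2, 3],
    "page": [1.0, 2.0, 3.0, 10.0, 100.0],
})

# "Simple answer": one mean over all rows belonging to clusters 1 and 2 together.
pooled = df[df["CLUSTER"].isin([1, 2])]["page"].mean()

# Grouped: one mean per distinct CLUSTER value.
per_group = df.groupby("CLUSTER")["page"].mean()

print(pooled)      # 4.0
print(per_group)   # cluster 1 -> 2.0, cluster 2 -> 10.0, cluster 3 -> 100.0
```

Because the groups differ in size, the pooled mean (4.0) is not the same as the average of the per-group means for clusters 1 and 2 (6.0).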