Python Pandas Multiply Dataframe By Weights That Vary With Category In Vectorized Fashion
Solution 1:
You can either group by category and then apply dot() to each group, or apply dot() against every category's weights first and then select the result for each row's category. The latter is simpler and likely faster in pandas. Note that in the test data I used, the data frame and the weights frame share the same column names.
Code for dot() and then select:
df['dot'] = df[df_wgt.columns].dot(df_wgt.T).lookup(df.index, df.Category)
Steps performed:
1. Select the columns to use with df[df_wgt.columns]. This takes the column labels and their ordering from the weights dataframe, which matters because dot() needs the data columns in the same order as the weights.
2. Perform the dot product against the transposed weights dataframe with .dot(df_wgt.T). Transposing puts the weights in the correct orientation for dot(). This does the calculation for every weight category for each row of data, so in this case it performs four times as many multiplications as are needed, but it is still likely faster than grouping.
3. Select the needed dot product with .lookup(df.index, df.Category). lookup() gathers the result that matches the category of each row. Note that DataFrame.lookup() was deprecated in pandas 1.2 and removed in pandas 2.0, so on current versions this step has to be done with positional indexing instead.
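Since DataFrame.lookup() is no longer available in pandas 2.0 and later, the selection step can be sketched with positional indexing instead. This is only a minimal sketch under assumptions: the two small frames reuse a couple of rows from the test data in this answer, and the names all_dots, row_pos and col_pos are illustrative, not part of the original code.

```python
import numpy as np
import pandas as pd

# Two rows of the test data, enough to show the selection step
df = pd.DataFrame(
    {"var_1": [0.000443, -0.001354],
     "var_2": [0.006928, 0.001545],
     "var_3": [0.000000, 0.000007],
     "var_4": [0.012375, -0.008195],
     "Category": ["A", "C"]},
    index=pd.Index([1903, 1905], name="Index"))
df_wgt = pd.DataFrame(
    [[0.182022, 0.182022, 0.131243, 0.182022],
     [0.534814, 0.534814, 0.534814, 0.534814],
     [0.131243, 0.534814, 0.131243, 0.182022],
     [0.182022, 0.151921, 0.151921, 0.131243]],
    index=pd.Index(["A", "B", "C", "D"], name="Category"),
    columns=["var_1", "var_2", "var_3", "var_4"])

# One dot product per (data row, category) pair; columns are the categories
all_dots = df[df_wgt.columns].dot(df_wgt.T)

# Translate each row's category label into a column position
row_pos = np.arange(len(df))
col_pos = all_dots.columns.get_indexer(df["Category"])
if (col_pos < 0).any():
    # get_indexer marks unknown labels with -1, which would silently pick the last column
    raise KeyError("some categories have no weights")

# Positional stand-in for .lookup(df.index, df.Category)
df["dot"] = all_dots.to_numpy()[row_pos, col_pos]
print(df)
```

The explicit check for -1 is there because NumPy would otherwise treat it as a valid index for the last category.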
Code for select (groupby) and then dot():
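import numpy as np
import pandas as pd

def dot(group):
    # Every row in the group shares one category, so read it from the first row
    category = group['Category'].iloc[0]
    weights = df_wgt.loc[category].values
    # Weighted sum for each row, keeping the original index labels
    return pd.Series(
        np.dot(group[df_wgt.columns].values, weights), index=group.index)

df['dot'] = df.groupby(['Category']).apply(dot) \
    .reset_index().set_index('Index')[0]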
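A third option, which is not one of the two approaches described above, is to line up one weight row per data row and multiply element-wise. It avoids both the grouping and the extra multiplications. The sketch below assumes the same frame layout as the test data (weights indexed by Category, matching column names); var_cols and row_weights are illustrative names.

```python
import pandas as pd

df = pd.DataFrame(
    {"var_1": [0.000443, -0.000690, -0.001354, -0.001578, -0.001578, -0.001354],
     "var_2": [0.006928, -0.007873, 0.001545, 0.008796, 0.008796, 0.001545],
     "var_3": [0.000000, 0.000171, 0.000007, -0.000164, -0.000164, 0.000007],
     "var_4": [0.012375, 0.014824, -0.008195, 0.015955, 0.015955, -0.008195],
     "Category": ["A", "A", "C", "D", "A", "B"]},
    index=pd.Index([1903, 1904, 1905, 1906, 1907, 1909], name="Index"))
df_wgt = pd.DataFrame(
    [[0.182022, 0.182022, 0.131243, 0.182022],
     [0.534814, 0.534814, 0.534814, 0.534814],
     [0.131243, 0.534814, 0.131243, 0.182022],
     [0.182022, 0.151921, 0.151921, 0.131243]],
    index=pd.Index(["A", "B", "C", "D"], name="Category"),
    columns=["var_1", "var_2", "var_3", "var_4"])

var_cols = list(df_wgt.columns)

# For every data row, fetch the weight row of its category (raises KeyError on unknown labels)
row_weights = df_wgt.loc[df["Category"].to_numpy(), var_cols].to_numpy()

# Element-wise product followed by a row sum is the per-row dot product
df["dot"] = (df[var_cols].to_numpy() * row_weights).sum(axis=1)
print(df)
```

The trade-off is a temporary array with the same shape as the data, in exchange for doing only the multiplications that are actually needed.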
Test Code:
import pandas as pd
from io import StringIO

df = pd.read_fwf(StringIO(u"""
Index      var_1      var_2      var_3      var_4  Category
1903    0.000443   0.006928   0.000000   0.012375  A
1904   -0.000690  -0.007873   0.000171   0.014824  A
1905   -0.001354   0.001545   0.000007  -0.008195  C
1906   -0.001578   0.008796  -0.000164   0.015955  D
1907   -0.001578   0.008796  -0.000164   0.015955  A
1909   -0.001354   0.001545   0.000007  -0.008195  B"""),
    header=1, skiprows=0).set_index(['Index'])

df_wgt = pd.read_fwf(StringIO(u"""
Category     var_1     var_2     var_3     var_4
A         0.182022  0.182022  0.131243  0.182022
B         0.534814  0.534814  0.534814  0.534814
C         0.131243  0.534814  0.131243  0.182022
D         0.182022  0.151921  0.151921  0.131243"""),
    header=1, skiprows=0).set_index(['Category'])

# DataFrame.lookup() was removed in pandas 2.0, so this line needs pandas < 2.0
df['dot'] = df[df_wgt.columns].dot(df_wgt.T).lookup(df.index, df.Category)
print(df)
Results:
          var_1     var_2     var_3     var_4 Category       dot
Index
1903   0.000443  0.006928  0.000000  0.012375        A  0.003594
1904  -0.000690 -0.007873  0.000171  0.014824        A  0.001162
1905  -0.001354  0.001545  0.000007 -0.008195        C -0.000842
1906  -0.001578  0.008796 -0.000164  0.015955        D  0.003118
1907  -0.001578  0.008796 -0.000164  0.015955        A  0.004196
1909  -0.001354  0.001545  0.000007 -0.008195        B -0.004277
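As a quick sanity check, one entry of the dot column can be reproduced by hand. The sketch below recomputes the value for row 1903 with the category A weights in plain Python; the variable names are illustrative.

```python
# Row 1903 (category A) and the category A weights, copied from the test data
row_1903 = [0.000443, 0.006928, 0.000000, 0.012375]
weights_a = [0.182022, 0.182022, 0.131243, 0.182022]

# The dot product is the sum of the pairwise products
dot_1903 = sum(x * w for x, w in zip(row_1903, weights_a))
print(round(dot_1903, 6))  # 0.003594
```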
Solution 2:
Setup
print(df)
Out[655]:
           var_1     var_2     var_3     var_4 Category
Symbol
1903    0.000443  0.006928  0.000000  0.012375        A
1904   -0.000690 -0.007873  0.000171  0.014824        A
1905   -0.001354  0.001545  0.000007 -0.008195        C
1906   -0.001578  0.008796 -0.000164  0.015955        D
1907   -0.001578  0.008796 -0.000164  0.015955        A
1909   -0.001354  0.001545  0.000007 -0.008195        B
print(w)
Out[656]:
  Category  var_1_wgt  var_2_wgt  var_3_wgt  var_4_wgt
0        A   0.182022   0.182022   0.131243   0.182022
1        B   0.534814   0.534814   0.534814   0.534814
2        C   0.131243   0.534814   0.131243   0.182022
3        D   0.182022   0.151921   0.151921   0.131243
Solution
import numpy as np

# Convert Category to a numerical encoding (A -> 0, B -> 1, ...)
df['C_Number'] = df.Category.apply(lambda x: ord(x.lower()) - 97)
# Get the dot product of each row with every category's weights, then extract
# the product that matches the row's category number. The weights are transposed
# so that column j of the product holds the result for category j.
df['new_var'] = (df.iloc[:, :4].values.dot(w.iloc[:, -4:].values.T))[np.arange(len(df)), df.C_Number]
Out[654]:
           var_1     var_2     var_3     var_4 Category  C_Number   new_var
Symbol
1903    0.000443  0.006928  0.000000  0.012375        A         0  0.003594
1904   -0.000690 -0.007873  0.000171  0.014824        A         0  0.001162
1905   -0.001354  0.001545  0.000007 -0.008195        C         2 -0.000842
1906   -0.001578  0.008796 -0.000164  0.015955        D         3  0.003118
1907   -0.001578  0.008796 -0.000164  0.015955        A         0  0.004196
1909   -0.001354  0.001545  0.000007 -0.008195        B         1 -0.004277
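The ord()-based encoding only works when the categories are single letters starting at A and the rows of w are stored in that same order. A minimal sketch that looks the positions up by label instead is shown below. It assumes the df and w layout from the setup above (two rows of data are enough here); cat_pos, data and weights are illustrative names, and the weights are transposed so that each column of the product corresponds to one category.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"var_1": [0.000443, -0.001354],
     "var_2": [0.006928, 0.001545],
     "var_3": [0.000000, 0.000007],
     "var_4": [0.012375, -0.008195],
     "Category": ["A", "B"]},
    index=pd.Index([1903, 1909], name="Symbol"))
w = pd.DataFrame(
    {"Category": ["A", "B", "C", "D"],
     "var_1_wgt": [0.182022, 0.534814, 0.131243, 0.182022],
     "var_2_wgt": [0.182022, 0.534814, 0.534814, 0.151921],
     "var_3_wgt": [0.131243, 0.534814, 0.131243, 0.151921],
     "var_4_wgt": [0.182022, 0.534814, 0.182022, 0.131243]})

# Row position of each category inside w, found by label (no letter arithmetic)
cat_pos = pd.Index(w["Category"]).get_indexer(df["Category"])
if (cat_pos < 0).any():
    # -1 means the category is missing from w
    raise KeyError("some categories have no weights")

data = df[["var_1", "var_2", "var_3", "var_4"]].to_numpy()
weights = w[["var_1_wgt", "var_2_wgt", "var_3_wgt", "var_4_wgt"]].to_numpy()

# Column j of data.dot(weights.T) is the dot product with the weights in row j of w
df["new_var"] = data.dot(weights.T)[np.arange(len(df)), cat_pos]
print(df)
```

Selecting the columns by name rather than by position also keeps the calculation correct if helper columns are added to df later.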