Scikit-learn Labelencoder: How To Preserve Mappings Between Batches?
I have 185 million samples that will be about 3.8 MB per sample. To prepare my dataset, I will need to one-hot encode many of the features, after which I end up with over 15,000 features.
Solution 1:
I suggest using Pandas' get_dummies() for this, since sklearn's OneHotEncoder() needs to see all possible categorical values when .fit() is called; otherwise, with the default handle_unknown='error', it will throw an error when it encounters a new one during .transform().
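
That said, if you want to stay with sklearn, OneHotEncoder() also accepts an explicit categories= list and a handle_unknown='ignore' flag, which lets you fix the mapping up front and reuse it across batches. A minimal sketch (assuming scikit-learn >= 1.2 for the sparse_output argument; older versions spell it sparse=False, and the all_cities list is something you would collect in a first pass over your data):

import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Assumed full category list, gathered before batching
all_cities = ['Chicago', 'London', 'Paris', 'Rome', 'Tokyo']
enc = OneHotEncoder(categories=[all_cities], handle_unknown='ignore',
                    sparse_output=False)

batch_1 = np.array(['Paris', 'Tokyo', 'Rome']).reshape(-1, 1)
batch_2 = np.array(['London', 'Chicago', 'Paris']).reshape(-1, 1)

enc.fit(batch_1)               # columns come from the categories list, not the batch
print(enc.transform(batch_1))  # every batch maps to the same 5 columns
print(enc.transform(batch_2))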
import pandas as pd

# Create toy dataset and split it into batches
data_column = pd.Series(['Paris', 'Tokyo', 'Rome', 'London', 'Chicago', 'Paris'])
batch_1 = data_column[:3]
batch_2 = data_column[3:]

# Convert the categorical feature column to a matrix of dummy variables
batch_1_encoded = pd.get_dummies(batch_1, prefix='City')
batch_2_encoded = pd.get_dummies(batch_2, prefix='City')

# Row-bind (append) the encoded batches back together
final_encoded = pd.concat([batch_1_encoded, batch_2_encoded], axis=0)

# Final wrap-up: replace NaNs with 0, and convert flags from float to int
final_encoded = final_encoded.fillna(0)
final_encoded[final_encoded.columns] = final_encoded[final_encoded.columns].astype(int)
final_encoded
Output:

   City_Chicago  City_London  City_Paris  City_Rome  City_Tokyo
0             0            0           1          0           0
1             0            0           0          0           1
2             0            0           0          1           0
3             0            1           0          0           0
4             1            0           0          0           0
5             0            0           1          0           0
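
If you would rather have every batch come out with identical columns right away (so no fillna() step is needed after the concat), a variation on the same get_dummies() approach is to fix the category set with pd.Categorical first. A minimal sketch, again assuming the all_cities list is known up front:

import pandas as pd

# Assumed full category list, collected before batching
all_cities = ['Chicago', 'London', 'Paris', 'Rome', 'Tokyo']

batch = pd.Series(['Paris', 'Tokyo', 'Rome'])
# Cast to a categorical with the full category set; get_dummies() then
# emits one column per category, in a fixed order, even for categories
# absent from this batch
batch_cat = pd.Series(pd.Categorical(batch, categories=all_cities))
print(pd.get_dummies(batch_cat, prefix='City').astype(int))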