the decision blog

mushroom edibility prediction with a support vector classifier


Suppose we'd like to build a support vector classifier to help us decide whether a mushroom might be edible. Disclaimer: I personally don't eat mushrooms that I find in the forest and, if I did, I definitely wouldn't use a machine learning algorithm alone to decide which ones to eat! Now that that's out of the way, let's see how well we can do.

After extracting the data, we have a look at it:

import pandas as pd
mushrooms = pd.read_csv('mushrooms.csv')
mushrooms.head()
class cap-shape cap-surface cap-color bruises odor gill-attachment gill-spacing gill-size gill-color ... stalk-surface-below-ring stalk-color-above-ring stalk-color-below-ring veil-type veil-color ring-number ring-type spore-print-color population habitat
0 p x s n t p f c n k ... s w w p w o p k s u
1 e x s y t a f c b k ... s w w p w o p n n g
2 e b s w t l f c b n ... s w w p w o p n n m
3 p x y w t p f c n n ... s w w p w o p k s u
4 e x s g f n f w b k ... s w w p w o e n a g

5 rows × 23 columns

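The "class" field represents "poisonous" with a "p" and "edible" with an "e"; the other fields encode other mushroom properties as single letters in the same way. We'll need to convert these letters into numerical data; for that we'll use the Python built-in function "ord", which returns a character's code point. Since "a" has code point 97, subtracting 97 maps "a" to 0, "b" to 1, and so on: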

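# Encode each letter as an integer: 'a' -> 0, 'b' -> 1, ..., 'z' -> 25.
# Assigning whole columns (rather than writing through iloc) lets pandas
# store the result with an integer dtype.
for col in mushrooms.columns:
    mushrooms[col] = mushrooms[col].map(lambda x: ord(x) - 97)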
mushrooms.head()
class cap-shape cap-surface cap-color bruises odor gill-attachment gill-spacing gill-size gill-color ... stalk-surface-below-ring stalk-color-above-ring stalk-color-below-ring veil-type veil-color ring-number ring-type spore-print-color population habitat
0 15 23 18 13 19 15 5 2 13 10 ... 18 22 22 15 22 14 15 10 18 20
1 4 23 18 24 19 0 5 2 1 10 ... 18 22 22 15 22 14 15 13 13 6
2 4 1 18 22 19 11 5 2 1 13 ... 18 22 22 15 22 14 15 13 13 12
3 15 23 24 22 19 15 5 2 13 13 ... 18 22 22 15 22 14 15 10 18 20
4 4 23 18 6 5 13 5 22 1 10 ... 18 22 22 15 22 14 4 13 0 6

5 rows × 23 columns
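As a quick sanity check on the encoding, here is a minimal sketch of the letter-to-number mapping on its own. The helper names `encode` and `decode` are mine, purely for illustration. Two caveats worth keeping in mind: the resulting numbers have an order that the original letters don't really have, and any character that isn't a lowercase letter would land outside the range 0 to 25.

```python
def encode(letter):
    """Map 'a'..'z' to 0..25 using the character's code point."""
    return ord(letter) - ord('a')  # ord('a') is 97


def decode(number):
    """Inverse of encode: map 0..25 back to 'a'..'z'."""
    return chr(number + ord('a'))


# 'e' (edible) and 'p' (poisonous) become 4 and 15,
# matching the "class" column in the table above.
print(encode('e'), encode('p'))

# 23 in the cap-shape column decodes back to 'x'.
print(decode(23))
```

Keeping a `decode` around is handy when you want to turn a column of numbers back into the letters used in the original file.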

Ideally we wouldn't want to have to input all of these fields to classify a mushroom, so let's see what happens if we try to classify based only on, say, cap-shape and cap-surface.

from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
import pandas as pd

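# RBF-kernel support vector classifier; a large C penalises
# misclassified training points heavily.
model = SVC(kernel='rbf', C=1000)

y = mushrooms['class']
X = mushrooms[['cap-shape', 'cap-surface']]
# Hold out half of the data for testing. The split is random, so the
# scores below will vary slightly from run to run.
X_train, X_test, y_train, y_test = \
        train_test_split(X, y, train_size=0.5)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

print('accuracy score: ', accuracy_score(y_test, y_pred))

%matplotlib inline
# Transpose so that rows are predicted labels and columns are true labels.
mat = confusion_matrix(y_test, y_pred)
sns.heatmap(mat.T, square=True, annot=True,
            xticklabels=['edible', 'poisonous'],
            yticklabels=['edible', 'poisonous'], cbar=False)
plt.xlabel('true label')
plt.ylabel('predicted label')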
accuracy score:  0.6302314130969966
[Figure: confusion matrix heatmap for the two-feature model; x-axis: true label, y-axis: predicted label; classes: edible, poisonous]

This gives only about 63% accuracy, and 730 poisonous mushrooms in the test set are classified as edible. We can do better by including more mushroom attributes, for example cap-color and gill-color:

# Same model, now with four features instead of two.
X = mushrooms[['cap-shape', 'cap-surface', 'cap-color', 'gill-color']]
# A fresh random split, so the test set differs from the one used above.
X_train, X_test, y_train, y_test = \
        train_test_split(X, y, train_size=0.5)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print('accuracy score: ', accuracy_score(y_test, y_pred))

mat = confusion_matrix(y_test, y_pred)
sns.heatmap(mat.T, square=True, annot=True,
            xticklabels=['edible', 'poisonous'],
            yticklabels=['edible', 'poisonous'], cbar=False)
plt.xlabel('true label')
plt.ylabel('predicted label')
accuracy score:  0.8648449039881831
[Figure: confusion matrix heatmap for the four-feature model; x-axis: true label, y-axis: predicted label; classes: edible, poisonous]

We've gone up to 86% accuracy and now misclassify 400 poisonous mushrooms as edible. If we keep adding attributes then the accuracy increases, but of course that also means we have more work to do in describing the mushroom we've found.
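To make the "poisonous classified as edible" counts a bit more concrete, here is a minimal sketch of how the cells of a confusion matrix are tallied. The labels are made up for illustration (they are not taken from the mushroom data), and the names `EDIBLE`, `POISONOUS`, `false_edible` and `false_poisonous` are my own. In the heatmaps above, the count we care most about sits in the "edible" row (predicted) and the "poisonous" column (true).

```python
from collections import Counter

# Encoded class labels, as produced by ord(letter) - 97.
EDIBLE, POISONOUS = 4, 15

# Hypothetical labels for illustration only; not from the mushroom data.
y_true = [EDIBLE, EDIBLE, POISONOUS, POISONOUS, POISONOUS, EDIBLE]
y_pred = [EDIBLE, POISONOUS, EDIBLE, POISONOUS, POISONOUS, EDIBLE]

# Count how often each (true label, predicted label) pair occurs.
counts = Counter(zip(y_true, y_pred))

# Poisonous mushrooms predicted edible: the dangerous kind of mistake.
false_edible = counts[(POISONOUS, EDIBLE)]
# Edible mushrooms predicted poisonous: a missed meal, but harmless.
false_poisonous = counts[(EDIBLE, POISONOUS)]

# Accuracy is the share of predictions that match the true label.
correct = counts[(EDIBLE, EDIBLE)] + counts[(POISONOUS, POISONOUS)]
accuracy = correct / len(y_true)

print('poisonous predicted edible:', false_edible)
print('edible predicted poisonous:', false_poisonous)
print('accuracy:', accuracy)
```

Accuracy treats both kinds of mistake the same, which is why it's worth looking at the poisonous-as-edible count separately: for this problem it's the error that really matters.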