the decision blog

classifying D&D monsters with isomap and k-means

Written on

Machine learning algorithms that are trained on data that has been labeled, in order to predict the labels for unlabeled data, are called "supervised learning" algorithms; on the other hand, those that search for patterns in unlabeled data, in order to simplify it, are called "unsupervised learning" algorithms. How might unsupervised learning algorithms, such as Isomap and K-means, categorize D&D monsters according to their stats alone? Would they, e.g., group dragons together?

Let's have a look at the data:

import pandas as pd

data = pd.read_json('srd_5e_monsters.json')
name meta Armor Class Hit Points Speed STR STR_mod DEX DEX_mod CON ... Challenge Traits Actions Legendary Actions img_url Damage Immunities Condition Immunities Damage Resistances Damage Vulnerabilities Reactions
0 Aboleth Large aberration, lawful evil 17 (Natural Armor) 135 (18d10 + 36) 10 ft., swim 40 ft. 21 (+5) 9 (-1) 15 ... 10 (5,900 XP) <p><em><strong>Amphibious.</strong></em> The a... <p><em><strong>Multiattack.</strong></em> The ... <p>The aboleth can take 3 legendary actions, c... NaN NaN NaN NaN NaN
1 Acolyte Medium humanoid, any 10 9 (2d8) 30 ft. 10 (+0) 10 (+0) 10 ... 1/4 (50 XP) <p><em><strong>Spellcasting.</strong></em> The... <p><em><strong>Club.</strong></em> <em>Melee W... NaN NaN NaN NaN NaN NaN
2 Adult Black Dragon Huge dragon, chaotic evil 19 (Natural Armor) 195 (17d12 + 85) 40 ft., fly 80 ft., swim 40 ft. 23 (+6) 14 (+2) 21 ... 14 (11,500 XP) <p><em><strong>Amphibious.</strong></em> The d... <p><em><strong>Multiattack.</strong></em> The ... <p>The dragon can take 3 legendary actions, ch... Acid NaN NaN NaN NaN
3 Adult Blue Dragon Huge dragon, lawful evil 19 (Natural Armor) 225 (18d12 + 108) 40 ft., burrow 30 ft., fly 80 ft. 25 (+7) 10 (+0) 23 ... 16 (15,000 XP) <p><em><strong>Legendary Resistance (3/Day).</... <p><em><strong>Multiattack.</strong></em> The ... <p>The dragon can take 3 legendary actions, ch... Lightning NaN NaN NaN NaN
4 Adult Brass Dragon Huge dragon, chaotic good 18 (Natural Armor) 172 (15d12 + 75) 40 ft., burrow 30 ft., fly 80 ft. 23 (+6) 10 (+0) 21 ... 13 (10,000 XP) <p><em><strong>Legendary Resistance (3/Day).</... <p><em><strong>Multiattack.</strong></em> The ... <p>The dragon can take 3 legendary actions, ch... Fire NaN NaN NaN NaN

5 rows × 31 columns

There's a lot of information here, and it will need to be converted into fully numerical form before Isomap and K-means can act on it. Also, there's a column that includes urls for images of nearly all of the monsters; we'd like to use those in presenting our results. I've already run a script that downloads those images and turns them into thumbnails, so we just need to clean the data.

We start by eliminating the rows for monsters that don't have corresponding images as well as columns that have NaN values. We also create a separate column to keep track of the row numbers identifying each monster. (The image files have these same numbers as suffixes, and it will be important to keep track of which row goes with which image file.)

import numpy as np
import re
from sklearn.feature_extraction.text import CountVectorizer

data['monster num'] = data.index
rows_to_keep = pd.read_csv('rows_to_keep.csv')
data = data.iloc[rows_to_keep['keep']]
data = data.dropna(axis=1).reset_index(drop=True)
data = data.drop('img_url',axis=1)

Next we clean up the columns with numerical data (e.g., for the modifiers we need to have entries like "1" and not "(+1)") and vectorize the columns with linguistic data.

for i in range(data.shape[0]):
    data.loc[i,'Hit Points'] =int("(\d+)", \
                              data.loc[i]['Hit Points']).group())
    data.loc[i,'Challenge']  =int("(\d+)", \
    data.loc[i,'Armor Class']=int("(\d+)", \
                              data.loc[i]['Armor Class']).group())
    data.loc[i,'Speed']      =int("(\d+)", \
    data.loc[i,'STR_mod']    =int("(\d+)", \
    data.loc[i,'DEX_mod']    =int("(\d+)", \
    data.loc[i,'CON_mod']    =int("(\d+)", \
    data.loc[i,'CHA_mod']    =int("(\d+)", \
    data.loc[i,'INT_mod']    =int("(\d+)", \
    data.loc[i,'WIS_mod']    =int("(\d+)", \

vec = CountVectorizer()
langs = vec.fit_transform(data['Languages'])
langs = pd.DataFrame(langs.toarray(), \
data = data.drop(['Languages'],axis=1)

meta = vec.fit_transform(data['meta'])
meta = pd.DataFrame(meta.toarray(), \
data = data.drop(['meta'],axis=1)

senses = vec.fit_transform(data['Senses'])
senses = pd.DataFrame(senses.toarray(), \
data = data.drop(['Senses'],axis=1)

data = pd.concat((data,langs,meta,senses),axis=1)

Now that the data is cleaned, we store the monster number and name in a separate data frame and drop the "Challenge" column because we don't want the monster challenge rating to enter into the resulting categorization. In addition, we rescale the data so that, e.g., hit points don't dominate simply because they are on a larger numerical scale than the other attributes.

from sklearn.preprocessing import StandardScaler

X = data.drop(['name','monster num','Challenge'],axis=1)
names = data[['name','monster num']]

scaler = StandardScaler()
X = scaler.fit_transform(X)

Now we can apply Isomap to reduce the dimensionality of the space before applying K-means in the form of the spectral clustering algorithm.

from sklearn.manifold import Isomap
from sklearn.cluster import SpectralClustering

model = Isomap(n_neighbors=7)

X_iso = pd.DataFrame(model.fit_transform(X),index=data.index)
X_iso['monster num'] = data['monster num']
X_iso = X_iso.set_index('monster num')

kmeans = SpectralClustering(n_clusters=5, \
labels = kmeans.fit_predict(X_iso)

We have re-indexed X_iso above by monster number for plotting the results, with which we now proceed.

import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
from matplotlib import offsetbox
from PIL import Image
import os

%matplotlib inline

iso_x = 0
iso_y = 1
fig, ax = plt.subplots()
fig.set_size_inches(18, 10)
im = ax.scatter(X_iso.loc[:][iso_x],X_iso.loc[:][iso_y],s=2500, \

# load the monster images and plot them on top of the data
# points
directory = './monster_img'

pics = []
file_nums = np.arange(400)
for filename in os.listdir(directory):
    # strip the number off the file name
    img_id ="(\d+)", filename).group()
    f = os.path.join(directory, filename)
    # checking if it is a file
    if os.path.isfile(f):
images = pd.DataFrame(pics,columns=['monster num','jpeg'])
images = images.join(names.set_index('monster num'),on='monster num')
names = images['name']

def plot_components(X_iso, images, ax):
    props = dict(boxstyle='round', facecolor='wheat', alpha=0.6)
    for i in range(images.shape[0]):
        num = images['monster num'][i]
        coords = [X_iso.loc[num][iso_x],X_iso.loc[num][iso_y]]
        imagebox = offsetbox.AnnotationBbox( \
            offsetbox.OffsetImage( \
            images.set_index('monster num').loc[num][0], \
            cmap='gray', \
        ax.text(X_iso.loc[num][iso_x],X_iso.loc[num][iso_y], \
                names[i], \
    ax.set_xlabel('iso '+str(iso_x))
    ax.set_ylabel('iso '+str(iso_y))



The circles underlying the thumbnail images have been color-labeled and sorted into 5 groups via K-means; we see that, e.g., the dragons have indeed been placed together, as have mostly the undead with the other undead and likewise for the celestials (except that demons have been added in!). Isomap and K-means have done a reasonable job of collecting like monsters with like, with knowledge only of the stats and not including the challenge rating.

Discuss on Twitter