import sqlite3
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.metrics import confusion_matrix
League of Legends is a highly popular multiplayer game in which two teams of five face off in a match. Each team member fills a role, and players can choose from over 140 characters (known as champions). The game even has professional esports leagues and prize competitions.
In this project I use the Riot API to pull ranked match data from Riot's database. For each match I gather the team compositions, each player's position, and which team won, and store it all in a SQLite database. Only Diamond-rank matches are used, to limit noise caused by lack of player skill (as is common in lower ranks).
The data was collected with several Python scripts that generated lists of over 10,000 entries. All of it was loaded into a SQLite database, which this notebook reads from.
I am going to use machine learning to try to build an algorithm that predicts a match's outcome from both teams' compositions. Ideally we would want an algorithm that predicts the outcome more than 70% of the time. However, that would also suggest the game is not well balanced, since the outcome could then be determined largely by champion selection alone.
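The collection scripts themselves are not shown in this notebook. As a rough, hypothetical sketch only (the API key, match ID, and field handling below are placeholders, not the actual collection code), pulling one ranked match from Riot's Match-V5 endpoint looks roughly like this:
import requests
API_KEY = 'RGAPI-...'        #hypothetical placeholder; real keys come from developer.riotgames.com
match_id = 'NA1_4191686936'  #one of the match ids that appears later in this notebook
url = f'https://americas.api.riotgames.com/lol/match/v5/matches/{match_id}'
resp = requests.get(url, headers={'X-Riot-Token': API_KEY})
resp.raise_for_status()
info = resp.json()['info']
#Each match holds 10 participant entries: champion, position, team, and result
rows = [(match_id, p['championName'], p['teamPosition'], p['teamId'], int(p['win']))
        for p in info['participants']]
print(rows[:2])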
#Reading our SQLite3 Database
conn = sqlite3.connect('LOL_Match_Champion_Database.sqlite3')
cur = conn.cursor()
df_match_summary = pd.read_sql('''SELECT match_summary.summonerName, match_summary.teamId, match_summary.kills,
                                         match_summary.assists, match_summary.deaths, match_summary.goldEarned,
                                         match_summary.neutralMinionsKilled, match_summary.totalMinionsKilled,
                                         match_summary.win, champion_table.champion,
                                         teamPosition_table.teamPosition, matchid_table.match
                                  FROM match_summary
                                  JOIN champion_table ON match_summary.champion_id = champion_table.champion_id
                                  JOIN teamPosition_table ON match_summary.teamPosition_id = teamPosition_table.teamPosition_id
                                  JOIN matchid_table ON match_summary.match_id = matchid_table.match_id''', conn)
df_match_summary.shape
(10020, 12)
The data has been gathered; now we need to clean and organize it. Each row is keyed by a Match ID and represents a single player's champion, lane, and team. We need to collapse rows that share a Match ID so that each row represents one team's composition in one match.
The positions in League of Legends are: Top, Mid, Jungle, ADC, and Support. Each name in a position column is the champion selected for that role.
#Reorganizing the data: condense each match into a row. Each row holds one team and whether it won.
#Gather values from winning games
df_summary_win = df_match_summary.loc[df_match_summary['win']==1]
#Unique match ids; these will be our dictionary keys
matchlist = [i for i in df_match_summary['match']]
uniquematchlist = list({x for x in matchlist})
#A list of champions from each match key will be our dict values
windict = {}
for key in uniquematchlist:
    windict[key] = None
for i in uniquematchlist:
    a = df_summary_win[df_summary_win['match'] == i]
    championlist = [champ for champ in a['champion']]
    windict[i] = championlist
df_win_comp = pd.DataFrame.from_dict(windict, orient='index')
df_win_comp['win'] = 1
#Do the same thing but for losing games
df_summary_loss = df_match_summary.loc[df_match_summary['win']==0]
#A list of champions from each match key will be our dict values
lossdict = {}
for key in uniquematchlist:
    lossdict[key] = None
for i in uniquematchlist:
    a = df_summary_loss[df_summary_loss['match'] == i]
    championlist = [champ for champ in a['champion']]
    lossdict[i] = championlist
df_loss_comp = pd.DataFrame.from_dict(lossdict, orient='index')
df_loss_comp['win'] = 0
We have 1,002 unique matches in our data set, which gives 2,004 team-composition rows to analyze (one winning and one losing team per match).
#Join the dataframes together. The champion lists were generated in position order, so each champion lands in the right column
df_team_comp = pd.concat([df_win_comp, df_loss_comp])
df_team_comp = df_team_comp.rename(columns={0:'Top', 1:'Jungle', 2:'Mid', 3:'ADC', 4:'Support'})
df_team_comp.reset_index(inplace=True)
df_team_comp = df_team_comp.rename(columns={'index':'MatchID'})
df_team_comp.sort_values(by='MatchID')
| | MatchID | Top | Jungle | Mid | ADC | Support | win |
|---|---|---|---|---|---|---|---|
| 906 | NA1_4191686936 | Akshan | JarvanIV | Vladimir | Jinx | Lulu | 1 |
| 1908 | NA1_4191686936 | Graves | Viego | Sylas | Zeri | Nami | 0 |
| 1165 | NA1_4191752706 | Darius | FiddleSticks | Qiyana | Jinx | Thresh | 0 |
| 163 | NA1_4191752706 | Gwen | Nautilus | Ryze | Jhin | LeeSin | 1 |
| 688 | NA1_4191851012 | Darius | JarvanIV | Sylas | Jinx | Soraka | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 471 | NA1_4262790270 | Malphite | Khazix | AurelionSol | Jinx | Blitzcrank | 1 |
| 671 | NA1_4262799959 | DrMundo | LeeSin | Corki | Lucian | Xerath | 1 |
| 1673 | NA1_4262799959 | Volibear | Trundle | Viktor | Jhin | Morgana | 0 |
| 939 | NA1_4262821937 | Poppy | Kayn | Ryze | Draven | Nautilus | 1 |
| 1941 | NA1_4262821937 | Kled | Rengar | Annie | Zeri | TahmKench | 0 |
2004 rows × 7 columns
#Check for missing values
df_team_comp.isna().any()
MatchID    False
Top        False
Jungle     False
Mid        False
ADC        False
Support    False
win        False
dtype: bool
Now that the data has been organized, we can explore it before attempting to build our algorithm.
First, let's see how many unique champions appear in each lane.
#Explore the Data
print('Unique Entries:',df_team_comp['Top'].nunique())
plt.figure(figsize=(15,2))
ax = sns.countplot(x='Top', data=df_team_comp)
ax.bar_label(ax.containers[0])
plt.xticks(rotation=90)
plt.show()
Unique Entries: 115
print('Unique Entries:',df_team_comp['Jungle'].nunique())
plt.figure(figsize=(20,4))
ax = sns.countplot(x='Jungle', data=df_team_comp)
ax.bar_label(ax.containers[0])
plt.xticks(rotation=90)
plt.show()
Unique Entries: 86
print('Unique Entries:',df_team_comp['Mid'].nunique())
plt.figure(figsize=(20,4))
ax = sns.countplot(x='Mid', data=df_team_comp)
ax.bar_label(ax.containers[0])
plt.xticks(rotation=90)
plt.show()
Unique Entries: 127
print('Unique Entries:',df_team_comp['ADC'].nunique())
plt.figure(figsize=(20,4))
ax = sns.countplot(x='ADC', data=df_team_comp)
ax.bar_label(ax.containers[0])
plt.xticks(rotation=90)
plt.show()
Unique Entries: 81
print('Unique Entries:',df_team_comp['Support'].nunique())
plt.figure(figsize=(20,4))
ax = sns.countplot(x='Support', data=df_team_comp)
ax.bar_label(ax.containers[0])
plt.xticks(rotation=90)
plt.show()
Unique Entries: 94
Each role has roughly 80 to 130 unique champions. This is problematic for our algorithm: each champion name must be converted into its own numeric feature for the model to read, so the feature space grows with every unique champion.
This is an instance of the "Curse of Dimensionality": as the number of dimensions increases (here, unique champions per lane), the number of data points needed for good performance grows exponentially. As a result, our model may generalize poorly to new data.
To explore this, we will organize the original data into training and test sets, and compare the algorithms built from it.
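To put a rough number on the problem first (a quick sketch using the dataframe built above), compare the width of a one-hot encoding of the five role columns against the number of rows we have:
team_features = ['Top', 'Jungle', 'Mid', 'ADC', 'Support']
#Total one-hot columns = sum of unique champions across the five roles
n_features = sum(df_team_comp[col].nunique() for col in team_features)
n_rows = len(df_team_comp)
#With the unique counts printed above (115+86+127+81+94) this is ~503 binary
#columns for only 2,004 rows -- about 4 rows per feature
print(n_features, 'one-hot features vs', n_rows, 'rows')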
#Now to apply Machine Learning Processes
team_features = ['Top', 'Jungle', 'Mid', 'ADC', 'Support']
X = df_team_comp[team_features]
Y = df_team_comp.win
train_X, val_X, train_Y, val_Y = train_test_split(X, Y, random_state=0, train_size=0.8, test_size=0.2)
#Convenient variable for all categorical columns
s = (train_X.dtypes == 'object')
object_cols = list(s[s].index)
object_cols
['Top', 'Jungle', 'Mid', 'ADC', 'Support']
As we explore the data, we run into another issue. In League of Legends, one team may pick a champion in response to the other team's picks, known in the game as "counter picking": ignoring player skill, one champion's kit may simply beat another's.
We will need to reshape our dataframe once more to account for this. This time each row will hold both teams, and the predictor variable will be whether the blue team won. Our algorithm will predict whether blue team wins based on the full match composition.
#As champions are picked, the other team may respond with counter picks of their own
#To account for this, each row now holds both teams
#The win variable records whether blue team (teamId 100) won, given their comp + the opposing comp
bluewin = df_match_summary.loc[(df_match_summary['win'] == 1) & (df_match_summary['teamId'] == 100)]
bluelost = df_match_summary.loc[(df_match_summary['win'] == 0) & (df_match_summary['teamId'] == 100)]
matches = df_match_summary['match']
matchlist = list({x for x in matches})
#Match ids where blue won
bluewinlist = list({x for x in bluewin['match']})
#Match ids where blue lost
bluelostlist = list({x for x in bluelost['match']})
#Make the match ids into keys
matchdict = {}
for key in matchlist:
    matchdict[key] = None
#Make the values the list of champions, appending the blue-side outcome at the end
for match in matchdict.keys():
    a = df_match_summary[df_match_summary['match'] == match]
    championlist = [champ for champ in a['champion']]
    if match in bluewinlist:
        championlist.append(1)
    if match in bluelostlist:
        championlist.append(0)
    matchdict[match] = championlist
df_match_comp = pd.DataFrame.from_dict(matchdict, orient='index')
df_match_comp = df_match_comp.rename(columns={0:'BlueTop', 1:'BlueJungle', 2:'BlueMid', 3:'BlueADC', 4:'BlueSupport',
                                              5:'RedTop', 6:'RedJungle', 7:'RedMid', 8:'RedADC', 9:'RedSupport', 10:'win'})
df_match_comp.reset_index(inplace=True)
df_match_comp = df_match_comp.rename(columns={'index':'MatchID'})
df_match_comp.head()
| | MatchID | BlueTop | BlueJungle | BlueMid | BlueADC | BlueSupport | RedTop | RedJungle | RedMid | RedADC | RedSupport | win |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | NA1_4252115608 | Irelia | Diana | Ryze | Caitlyn | Yuumi | Illaoi | Nocturne | Viktor | Twitch | Shaco | 0 |
| 1 | NA1_4258396373 | Poppy | Graves | Viktor | Kaisa | Janna | Tryndamere | RekSai | Leblanc | Ashe | Lux | 0 |
| 2 | NA1_4260814646 | Gangplank | Diana | Renekton | Caitlyn | Senna | Mordekaiser | Shaco | Yasuo | Tristana | Rakan | 0 |
| 3 | NA1_4245763670 | Urgot | Khazix | Veigar | Vayne | Soraka | Aatrox | Graves | Orianna | Draven | TahmKench | 1 |
| 4 | NA1_4253659873 | Aatrox | JarvanIV | Katarina | Xayah | Renata | Galio | Nocturne | Ahri | Vayne | Zilean | 1 |
In preparation for building the model, we will generate a cleaned dataset that reduces the number of unique entries in each lane, decreasing dimensionality.
Exploration shows many champions are picked only once or twice in a role, compared to others picked over 20 times. We will relabel champions picked 15 times or fewer in a role as "Uncommon". The code counts how often each champion is selected in each role, then relabels it if it does not exceed 15 picks across the match list.
#First we clean the data further by reducing the number of categories.
#Exploration shows many champions picked only once or twice for a role. Champs picked 15 times or fewer become "Uncommon"
#The sheer number of champions gives our model a high cardinality that makes patterns hard to identify.
poslist = ['Top', 'Jungle', 'Mid', 'ADC', 'Support']
uncommposdict = {}
for key in poslist:
    uncommposdict[key] = None
def uncommoncount(position):
    count = df_team_comp[position].value_counts()
    countdict = count.to_dict()
    uncommposlist = [key for key, value in countdict.items() if value <= 15]
    uncommposdict[position] = uncommposlist
for i in poslist:
    uncommoncount(i)
#Apply the uncommon-champions-per-position dictionary to this new dataframe
#Blue and Red columns for the same position share the same uncommon list
extenduncomdict = {}
for side in ['Blue', 'Red']:
    for pos in poslist:
        extenduncomdict[side + pos] = uncommposdict[pos]
#Make the new dataframe with "Uncommon" champions
df_match_redux = df_match_comp.copy()
for key in extenduncomdict.keys():
    #pandas replace accepts a list, so every rare champion in this column becomes 'Uncommon'
    df_match_redux[key] = df_match_redux[key].replace(extenduncomdict[key], 'Uncommon')
As the output below shows, this reduces the number of unique champions per role by as much as 60%.
teamposlist = ['BlueTop', 'BlueJungle', 'BlueMid', 'BlueADC', 'BlueSupport',
               'RedTop', 'RedJungle', 'RedMid', 'RedADC', 'RedSupport']
for i in teamposlist:
    print('Original', i, df_match_comp[i].nunique(), 'Reduced', i, df_match_redux[i].nunique())
Original BlueTop 101 Reduced BlueTop 41
Original BlueJungle 72 Reduced BlueJungle 34
Original BlueMid 110 Reduced BlueMid 41
Original BlueADC 68 Reduced BlueADC 23
Original BlueSupport 76 Reduced BlueSupport 29
Original RedTop 96 Reduced RedTop 41
Original RedJungle 68 Reduced RedJungle 34
Original RedMid 106 Reduced RedMid 41
Original RedADC 59 Reduced RedADC 23
Original RedSupport 71 Reduced RedSupport 29
Since both our inputs and our predictor are categorical, we will use a confusion matrix to summarize our results.
We will print the accuracy on the training set (what the model learns from) and on the validation set (how the model performs on unseen data).
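A note on the layout: for binary 0/1 labels, sklearn's confusion_matrix puts true labels on rows and predictions on columns, so .ravel() returns the counts in the order (TN, FP, FN, TP). A quick sanity check on toy labels:
from sklearn.metrics import confusion_matrix
#Four toy matches: two correctly called losses, one correctly called win, one missed win
y_true = [0, 1, 1, 0]
y_pred = [0, 1, 0, 0]
TN, FP, FN, TP = confusion_matrix(y_true, y_pred).ravel()
print(TN, FP, FN, TP)                   #2 0 1 1
print((TP + TN) / (TP + FP + TN + FN))  #accuracy = 0.75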
def confusionprintout(model, train_X, train_Y, valid_X, val_Y):
    #Training set performance
    preds = model.predict(train_X)
    cm = confusion_matrix(train_Y, preds)
    TN, FP, FN, TP = cm.ravel()
    print('True Positive(TP) = ', TP)
    print('False Positive(FP) = ', FP)
    print('True Negative(TN) = ', TN)
    print('False Negative(FN) = ', FN)
    accuracy = (TP+TN) / (TP+FP+TN+FN)
    print('Accuracy of the binary classification Training = {:0.3f}'.format(accuracy))
    #Validation set performance
    preds = model.predict(valid_X)
    cm = confusion_matrix(val_Y, preds)
    TN, FP, FN, TP = cm.ravel()
    print('True Positive(TP) = ', TP)
    print('False Positive(FP) = ', FP)
    print('True Negative(TN) = ', TN)
    print('False Negative(FN) = ', FN)
    accuracy = (TP+TN) / (TP+FP+TN+FN)
    print('Accuracy of the binary classification Validation = {:0.3f}'.format(accuracy))
One-hot encoding takes a categorical feature (like champion names) and expands it into one binary column per category. This lets our algorithm read each data point numerically and use it for predictions.
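As a toy illustration (a hypothetical three-pick column, not the project data), the encoder turns each champion name into a binary indicator vector, one column per unique name:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
toy = pd.DataFrame({'Top': ['Darius', 'Gwen', 'Darius']})
enc = OneHotEncoder(handle_unknown='ignore', sparse=False)
print(enc.fit_transform(toy))
#[[1. 0.]
# [0. 1.]
# [1. 0.]]   -- column 0 is Darius, column 1 is Gwen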
#Do one-hot encoding on our new data
team_features = ['BlueTop', 'BlueJungle', 'BlueMid', 'BlueADC', 'BlueSupport',
                 'RedTop', 'RedJungle', 'RedMid', 'RedADC', 'RedSupport']
X = df_match_redux[team_features]
Y = df_match_redux.win
train_X, val_X, train_Y, val_Y = train_test_split(X, Y, random_state=0, train_size=0.8, test_size=0.2)
#Apply one-hot encoding to each column with categorical data...so all of them
OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False)
OH_cols_train = pd.DataFrame(OH_encoder.fit_transform(train_X))
OH_cols_valid = pd.DataFrame(OH_encoder.transform(val_X))
#One-hot encoding removes the index. Put it back. This completes our dataset.
OH_cols_train.index = train_X.index
OH_cols_valid.index = val_X.index
We will use a Random Forest classifier and an XGBoost classifier as our algorithms.
#Try a Random Forest classifier first
model = RandomForestClassifier(random_state=1, n_estimators=1000, criterion='entropy', max_depth=3)
model.fit(OH_cols_train, train_Y)
confusionprintout(model, OH_cols_train, train_Y, OH_cols_valid, val_Y)
True Positive(TP) = 273
False Positive(FP) = 45
True Negative(TN) = 356
False Negative(FN) = 127
Accuracy of the binary classification Training = 0.785
True Positive(TP) = 42
False Positive(FP) = 35
True Negative(TN) = 64
False Negative(FN) = 60
Accuracy of the binary classification Validation = 0.527
import xgboost as xgb
#XGBoost uses gradient boosting: it builds an ensemble of weak prediction models (decision trees)
model = xgb.XGBClassifier(objective='binary:logistic', use_label_encoder=False, eval_metric='logloss',
                          colsample_bytree=0.3, learning_rate=0.1, n_estimators=100, max_depth=3)
model.fit(OH_cols_train, train_Y)
confusionprintout(model, OH_cols_train, train_Y, OH_cols_valid, val_Y)
True Positive(TP) = 273
False Positive(FP) = 87
True Negative(TN) = 314
False Negative(FN) = 127
Accuracy of the binary classification Training = 0.733
True Positive(TP) = 45
False Positive(FP) = 40
True Negative(TN) = 59
False Negative(FN) = 57
Accuracy of the binary classification Validation = 0.517
Both models give similar results: about 50% accuracy on the validation set.
Does this mean match outcomes can only be predicted at chance level? Or do we still have dimensionality issues?
We will attempt to improve our models in a few ways. First, we will implement K-Fold cross validation, which ensures the models are evaluated using as much of the data as possible.
#Do K-Fold cross validation
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
cv = KFold(n_splits=10, random_state=1, shuffle=True)
#One-hot encoding, but on the unsplit data
X = df_match_redux[team_features]
Y = df_match_redux.win
#Apply one-hot encoding to each column with categorical data...so all of them
OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False)
OH_X = pd.DataFrame(OH_encoder.fit_transform(X))
#One-hot encoding removes the index. Put it back. This completes our dataset.
OH_X.index = X.index
model = RandomForestClassifier(random_state=1, n_estimators=2000, criterion='entropy', max_depth=3)
scores = cross_val_score(model, OH_X, Y, scoring='accuracy', cv=cv, n_jobs=-1)
print(scores)
print('Accuracy: %.3f (%.3f)' % (np.mean(scores), np.std(scores)))
[0.48514851 0.54455446 0.53 0.51 0.57 0.48 0.44 0.47 0.6 0.52]
Accuracy: 0.515 (0.046)
model = xgb.XGBClassifier(objective='binary:logistic', use_label_encoder=False, eval_metric='logloss',
                          colsample_bytree=0.3, learning_rate=0.1, n_estimators=100, max_depth=3)
scores = cross_val_score(model, OH_X, Y, scoring='accuracy', cv=cv, n_jobs=-1)
print(scores)
print('Accuracy: %.3f (%.3f)' % (np.mean(scores), np.std(scores)))
[0.54455446 0.55445545 0.49 0.52 0.55 0.53 0.54 0.46 0.54 0.55]
Accuracy: 0.528 (0.029)
Both models again land at roughly 50% validation accuracy.
Perhaps we can get better insight if we implement counter picking, as discussed earlier.
To simulate counter picking, we will need to generate a new database of which champion counters which, and then use it to flag whether each lane matchup carries a counter-pick advantage or disadvantage.
The counter data was gathered from counterstats.net. Bear in mind it was accessed on 04/01/2022, and counters may have changed since then as champions are updated or added.
We will also apply the uncommon champion relabeling to this dataframe.
#Read the new database of champion counters.
#This was constructed on 04/01/2022 from https://www.counterstats.net/
#We clean the strings in this new database.
conn.close()
conn = sqlite3.connect('Champ_Counter.sqlite3')
cur = conn.cursor()
df_counter = pd.read_sql("""SELECT champcounter_table.champion, champcounter_table.bestcounter, champcounter_table.worstcounter
                            FROM champcounter_table""", conn)
#Strip hyphens and whitespace, then normalize the names that differ from Riot's internal champion names
for col in ['champion', 'bestcounter', 'worstcounter']:
    df_counter[col] = df_counter[col].str.replace('-', '', regex=False).str.strip()
    df_counter[col] = df_counter[col].replace({'wukong': 'monkeyking',
                                               'nunuwillump': 'nunu',
                                               'renataglasc': 'renata'})
df_counter = df_counter.set_index('champion')
df_match_comp['topadv'] = 0
df_match_comp['jungadv'] = 0
df_match_comp['midadv'] = 0
df_match_comp['ADCadv'] = 0
df_match_comp['supadv'] = 0
#Flag each lane matchup from the counter table: 1 and 2 encode the two directions of the counter relationship, 0 means neither applies
def advantage(bluerole, redrole, advcol):
    for i, (champ1, champ2) in enumerate(zip(df_match_comp[bluerole], df_match_comp[redrole])):
        champ1 = champ1.lower()
        champ2 = champ2.lower()
        #Check blue's champion against red's pick
        bestcounter = df_counter['bestcounter'].loc[champ1]
        worstcounter = df_counter['worstcounter'].loc[champ1]
        if bestcounter in champ2:
            df_match_comp.loc[i, advcol] = 1
        else:
            df_match_comp.loc[i, advcol] = 0
        if worstcounter in champ2:
            df_match_comp.loc[i, advcol] = 2
        #The mirror check for red's champion, which overrides the result above
        bestcounter = df_counter['bestcounter'].loc[champ2]
        worstcounter = df_counter['worstcounter'].loc[champ2]
        if bestcounter in champ1:
            df_match_comp.loc[i, advcol] = 2
        if worstcounter in champ1:
            df_match_comp.loc[i, advcol] = 1
advantage('BlueTop', 'RedTop', 'topadv')
advantage('BlueJungle', 'RedJungle', 'jungadv')
advantage('BlueMid', 'RedMid', 'midadv')
advantage('BlueADC', 'RedADC', 'ADCadv')
advantage('BlueSupport', 'RedSupport', 'supadv')
#Now we added the new features in
df_match_comp.head()
| | MatchID | BlueTop | BlueJungle | BlueMid | BlueADC | BlueSupport | RedTop | RedJungle | RedMid | RedADC | RedSupport | win | topadv | jungadv | midadv | ADCadv | supadv |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | NA1_4252115608 | Irelia | Diana | Ryze | Caitlyn | Yuumi | Illaoi | Nocturne | Viktor | Twitch | Shaco | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | NA1_4258396373 | Poppy | Graves | Viktor | Kaisa | Janna | Tryndamere | RekSai | Leblanc | Ashe | Lux | 0 | 0 | 1 | 0 | 0 | 0 |
| 2 | NA1_4260814646 | Gangplank | Diana | Renekton | Caitlyn | Senna | Mordekaiser | Shaco | Yasuo | Tristana | Rakan | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | NA1_4245763670 | Urgot | Khazix | Veigar | Vayne | Soraka | Aatrox | Graves | Orianna | Draven | TahmKench | 1 | 0 | 0 | 0 | 0 | 0 |
| 4 | NA1_4253659873 | Aatrox | JarvanIV | Katarina | Xayah | Renata | Galio | Nocturne | Ahri | Vayne | Zilean | 1 | 1 | 0 | 0 | 0 | 0 |
#Make the new dataframe with "Uncommon" champions & the advantage columns saved.
df_match_final = df_match_comp.copy()
for key in extenduncomdict.keys():
    df_match_final[key] = df_match_final[key].replace(extenduncomdict[key], 'Uncommon')
teamposlist = ['BlueTop', 'BlueJungle', 'BlueMid', 'BlueADC', 'BlueSupport',
               'RedTop', 'RedJungle', 'RedMid', 'RedADC', 'RedSupport']
Now to apply one-hot encoding.
df_match_finalset = df_match_final.copy()
#Preprocess the data with one-hot encoding
OH_X_Features = df_match_finalset[teamposlist]
#Apply one-hot encoding to each column with categorical data...so all of them
OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False)
OH_cols_x = pd.DataFrame(OH_encoder.fit_transform(OH_X_Features))
#One-hot encoding removes the index. Put it back.
OH_cols_x.index = OH_X_Features.index
#Drop the categorical columns and replace them with their one-hot encoding
num_features = df_match_finalset.drop(teamposlist, axis=1)
df_OH = pd.concat([num_features, OH_cols_x], axis=1)
finalY = df_OH['win']
finalX = df_OH.drop(['win'], axis=1)
finalX = finalX.drop(['MatchID'], axis=1)
finalX
| | topadv | jungadv | midadv | ADCadv | supadv | 0 | 1 | 2 | 3 | 4 | ... | 326 | 327 | 328 | 329 | 330 | 331 | 332 | 333 | 334 | 335 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0 | 1 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 1 | 0 | 0 | 0 | 0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 997 | 0 | 0 | 0 | 0 | 0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 998 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 999 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1000 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1001 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
1002 rows × 341 columns
Implement K-Fold cross validation, create the models, and print out the results.
cv = KFold(n_splits=10, random_state=1, shuffle=True)
model = RandomForestClassifier(random_state=1, n_estimators=2000, criterion='entropy', max_depth=3)
scores = cross_val_score(model, finalX, finalY, scoring='accuracy', cv=cv, n_jobs=-1)
print(scores)
print('Accuracy: %.3f (%.3f)' % (np.mean(scores), np.std(scores)))
[0.5049505 0.56435644 0.52 0.44 0.57 0.52 0.44 0.46 0.57 0.51]
Accuracy: 0.510 (0.048)
model = xgb.XGBClassifier(objective='binary:logistic', use_label_encoder=False, eval_metric='logloss',
                          colsample_bytree=0.3, learning_rate=0.1, n_estimators=100, max_depth=3)
scores = cross_val_score(model, finalX, finalY, scoring='accuracy', cv=cv, n_jobs=-1)
print(scores)
print('Accuracy: %.3f (%.3f)' % (np.mean(scores), np.std(scores)))
[0.54455446 0.51485149 0.46 0.54 0.57 0.5 0.52 0.46 0.51 0.56]
Accuracy: 0.518 (0.036)
Despite this work, we still get an accuracy of only ~50%.
There are two plausible explanations. The first is that our model needs more work. Given the curse of dimensionality, it may simply need many more data points to make accurate predictions: even after the "Uncommon" cleanup, each role still has 20-40 unique entries. Improving the model further may require skills I do not have at this time.
The other explanation is that the model is predicting as well as composition allows, which would support the idea that the game is balanced as it is. Indeed, if a match's outcome could be predicted from champion composition alone, the game would not be very balanced. In reality, League of Legends match outcomes depend on player skill, how players "build" their champions, how well they communicate, and how they play (safe or aggressive). While composition matters, the developers may want the rest of the outcome to be dictated by the players. A match wouldn't be fun if you could reliably tell who would win from the composition alone!
This project was an exercise in exploring Riot's API, building databases with SQLite, manipulating datasets, grappling with the curse of dimensionality, and creating classification models with categorical predictors.