In the last two articles, on K-Means Clustering and Hierarchical Clustering, I focused on clustering algorithms (obviously!). Now we move on to a different genre of algorithm with K-Nearest Neighbors, or k-NN. k-NN is a classification algorithm that assumes that similar observations lie near one another. This means that, given a new observation, the class label assigned to that point is the label most frequently present around it. The ‘k’ in k-NN determines how many neighbors we look at before deciding on a class label. As can be seen in Figure A.1.1, we employ a 3-NN. Hence, for the new observation, we look at its three nearest neighbors. Since 2 out of the 3 neighbors belong to Class B, we assign Class B to the new observation. Choosing the value of ‘k’ depends largely on the input data; for example, data with a large number of outliers will benefit from a higher value of ‘k’. However, larger values of ‘k’ can lead to high bias and low variance. It is therefore a trade-off that the user needs to be careful about.
Figure A.1.1 |
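Before moving to real data, here is a minimal sketch in R of the majority-vote idea described above; the toy points, labels, and the new observation are made-up values purely for illustration.

```r
# Toy illustration of 3-NN: made-up 2-D training points with labels
train_x <- matrix(c(1, 1,
                    1.5, 2,
                    5, 5,
                    6, 5,
                    5.5, 6), ncol = 2, byrow = TRUE)
train_y <- factor(c("A", "A", "B", "B", "B"))
new_obs <- c(5, 4)

# Euclidean distance from the new observation to every training point
dists <- sqrt(rowSums((train_x - matrix(new_obs, nrow(train_x), 2, byrow = TRUE))^2))

# Take the 3 nearest neighbors and assign the majority class among them
nearest <- order(dists)[1:3]
names(which.max(table(train_y[nearest])))
```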
Data Analysis
Unlike the previous two algorithms, K-Means and Hierarchical Clustering, which are unsupervised clustering algorithms, k-NN is a supervised classification algorithm. This means that:
- We are no longer clustering given observations; instead we are classifying a new observation based on a given set of observations.
- We now need a label corresponding to our observations.
Because of this, we have to change the data we use for the k-NN analysis. We will start with our data variable from the K-Means article, which contains the pre-processed data. Let’s look at the data once again using head(data).
For k-NN, we will use this entire dataset. Our class label will be the “Channel” column, and the remaining six columns will be our predictor variables.
Now, we have to divide this dataset into training and testing sets.
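One way such a split might look is sketched below; the variable names (train_x, train_y, test_x, test_y) and the seed are my assumptions, so the exact observation counts may differ slightly from the ones reported next.

```r
# Sketch of a 3:1 train/test split on the pre-processed Wholesale data
set.seed(123)
train_idx <- sample(seq_len(nrow(data)), size = round(0.75 * nrow(data)))

# Separate predictor variables (everything except "Channel") and the target variable
predictors <- setdiff(names(data), "Channel")
train_x <- data[train_idx, predictors]
test_x  <- data[-train_idx, predictors]
train_y <- factor(data$Channel[train_idx])
test_y  <- factor(data$Channel[-train_idx])
```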
In the above code, we split the data using a 3:1 ratio and create separate variables for the predictor and target variables. We have 329 observations in our training data and 111 observations in our testing data. Now, just like with the clustering algorithms, we need to scale our data. We can scale our training data as usual, but we need to scale our testing data using the mean and standard deviation of the training data. This is because, in the real world, we do not know our test data in advance. All we have is the training data; hence, when scaling a new observation, we scale it using the parameters of our original set of observations.
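Continuing the same sketch, the test predictors can be scaled with the training means and standard deviations like this:

```r
# Compute scaling parameters from the training data only
train_means <- colMeans(train_x)
train_sds   <- apply(train_x, 2, sd)

# Scale both sets using the *training* parameters
train_x_scaled <- scale(train_x, center = train_means, scale = train_sds)
test_x_scaled  <- scale(test_x,  center = train_means, scale = train_sds)
```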
k-NN in R
We will now implement k-NN in R using the knn() function, which is provided by the class package. knn() takes four inputs: the predictor variables for the training data, the predictor variables for the test data, the target variable for the training data, and the value of ‘k’. For now, we will follow the rule of thumb of using \(k = \sqrt{n}\), where n is the number of observations in the training data. This generally works well and is good enough to give a baseline result for our data.
Given this, we can use \(k = 18\), since \(\sqrt{329} \approx 18.1\).
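A sketch of the call, reusing the variable names assumed earlier:

```r
library(class)

# k-NN with k = 18: training predictors, test predictors, training labels, and k
pred <- knn(train = train_x_scaled, test = test_x_scaled, cl = train_y, k = 18)
```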
We can evaluate this by comparing the predictions to our test target variables.
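For example, still using the assumed variable names:

```r
# Confusion matrix and overall accuracy on the test set
table(Predicted = pred, Actual = test_y)
mean(pred == test_y)
```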
The accuracy that we get from our model is 89.19%, which is quite good.
Analysis for ‘k’
In the previous section, we got an accuracy of 89.19% with \(k=18\). However, we should now run our previous code in a loop to see whether we can find a better value of ‘k’. For this, we will simply take our previous code and wrap it in a for loop.
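A sketch of such a loop, again under the variable-name assumptions made earlier:

```r
# Evaluate k-NN accuracy over a range of k values
k_values <- 1:30
accuracy <- numeric(length(k_values))

for (i in seq_along(k_values)) {
  pred_k <- knn(train = train_x_scaled, test = test_x_scaled,
                cl = train_y, k = k_values[i])
  accuracy[i] <- mean(pred_k == test_y)
}

# Plot accuracy against k
plot(k_values, accuracy, type = "b", xlab = "k", ylab = "Test accuracy")
```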
The resulting plot can be seen in Figure A.1.2. From the plot, we can observe that our rule-of-thumb ‘k’ gave a pretty good result. However, if we reduce it to \(k=15\), we can increase the accuracy to 91.89%. It is important to remember, though, that we are currently using a fairly small amount of data to “train” our model. A more robust analysis would be to use something like n-fold cross-validation to get a more thorough idea of the optimal value of ‘k’.
Figure A.1.2 |
With this, I complete the analysis of the Wholesale dataset, which started with two clustering algorithms and ends with a classification technique.