k-Nearest Neighbors

Guest author Rachel Gorenstein wrote this post on k-Nearest Neighbors. She did an awesome job! 👏

k-Nearest Neighbors, abbreviated kNN, is an example of a supervised learning algorithm used for classification and regression. Stripped down to its most basic assumptions, the algorithm is akin to “majority rules” decision making. It operates by first memorizing a given training set of observations and their labels, and subsequently predicting the labels of new, unseen instances based on those of their “nearest neighbors.”


kNN Classification. Let’s say that we are given a set of observations (x_1,y_1),(x_2,y_2),\ldots (x_n,y_n), where the x_i's denote features (also called predictors or attributes) and the y_i's denote their respective labels (also called the target). The goal of kNN is to capture the relationship between the features and their labels in order to learn some function h: X \rightarrow Y that can predict the corresponding y given an unseen observation, x. kNN tells us that the label of x will be determined by the majority vote of the k most similar instances to x. Here, similarity is defined according to a distance metric.

Using random and shuffle from the NumPy library, you can generate a set S of labeled observations. One such set is shown below:
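As a rough illustration, here is one way such a set might be generated. The cluster centers, the seed, and the variable names x, y, and colors are assumptions, carried into the classifier snippet further below:

import numpy as np

# Three clusters of 20 points each, one per color label (illustrative values)
np.random.seed(0)
x = np.concatenate([np.random.normal(mu, 0.5, 20) for mu in (0.5, 2.5, 4.0)])
y = np.concatenate([np.random.normal(mu, 0.5, 20) for mu in (3.0, 1.0, 3.5)])
colors = ['green'] * 20 + ['red'] * 20 + ['blue'] * 20

# Shuffle the observations while keeping features and labels aligned
idx = np.arange(len(colors))
np.random.shuffle(idx)
x, y = x[idx], y[idx]
colors = [colors[i] for i in idx]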

In the figures, the points in S are the features and each color is a label. The unseen point (1,3.25) (plotted in black) and its three nearest neighbors are circled. kNN concludes that it is safe to assume that the unseen point would be labeled green. This is confirmed using the KNeighborsClassifier from the scikit-learn library in Python as below:

from sklearn.neighbors import KNeighborsClassifier

# x, y: coordinates of the training points; colors: their labels (the set S above)
knn = KNeighborsClassifier(n_neighbors=3)
X = list(zip(x, y))
c = list(colors)
knn.fit(X, c)

# Predict the label of the unseen point (1, 3.25)
pred = knn.predict([[1, 3.25]])
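Under the assumed data above, pred comes back as ['green'], in agreement with the majority vote of the three circled neighbors.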

The process of kNN can be more formally described in a few steps. Given a set of training observations, S=\{(x_1,y_1),(x_2,y_2),\ldots (x_n,y_n)\}, a positive integer k, an unseen observation x, and a distance metric d:

    • Compute d between x and each x_i \in S to determine the k points nearest to x, and call this set A
    • For each distinct label, compute the fraction of points in A carrying that label
    • Assign x the label that makes up the highest fraction of A

Note: when there are only two classes, choosing an odd k will avoid ties.
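These steps translate almost directly into code. The sketch below is a plain-Python illustration of the procedure, assuming Euclidean distance as the metric d (it is not the scikit-learn implementation):

import math
from collections import Counter

def knn_classify(S, x, k):
    # Step 1: sort the training points by distance to x and keep the k nearest (the set A)
    A = sorted(S, key=lambda pair: math.dist(pair[0], x))[:k]
    # Step 2: count how often each label appears in A
    votes = Counter(label for _, label in A)
    # Step 3: return the label making up the highest fraction of A
    return votes.most_common(1)[0][0]

# Example: S is a list of ((x1, x2), label) pairs
S = [((0, 3), 'green'), ((1, 3), 'green'), ((2, 1), 'red'), ((4, 3), 'blue')]
print(knn_classify(S, (1, 3.25), k=3))   # 'green'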


When k = 1. Using Voronoi and voronoi_plot_2d from the SciPy library, you can plot the Voronoi tessellation of the plane¹ such that each training point is the generating point in its own Voronoi region² as in the figure.

Voronoi regions of a random set

Each region in the figure is labeled according to its center/generating point. Therefore, in the case of k=1 the label of an unseen observation will be determined according to which Voronoi region it falls in.
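For reference, here is a minimal sketch of how such a plot might be produced, assuming the x and y arrays from the generated set above:

import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial import Voronoi, voronoi_plot_2d

# Each training point becomes the generating point of its own Voronoi region
points = np.column_stack((x, y))
vor = Voronoi(points)
voronoi_plot_2d(vor)
plt.show()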

1. The partitioning of a plane with n points into regions such that each region contains exactly one generating point and every point in a given region is closer to its generating point than to any other.
2. What Are Voronoi Diagrams?


kNN Regression. In addition to the KNeighborsClassifier, the scikit-learn library has the KNeighborsRegressor, which can be used when the target variable (the label) is continuous in nature. The process by which the kNN regressor works is nearly identical to that of the kNN classifier (see section kNN Classification) except that the input x will be assigned the label which is either the average or median of all y’s in A, the set of training points that are the k nearest points to x.

Points sampled from the graph of a function.

We will use the two figures below to illustrate how the KNeighborsRegressor works.

The figure above is the plot of f(x)=x(x-1)(x-4) for 40 random values of x. The regressor can be fit to the data using varying values for k (aka n_neighbors) as below:

from sklearn.neighbors import KNeighborsRegressor

# X: the 40 sampled x values, shaped (40, 1); y: the corresponding f(x) values; k: number of neighbors
nnregress = KNeighborsRegressor(n_neighbors=k)
nnregress.fit(X, y)

Using this regressor, y is predicted for 500 random x values via nnregress.predict and plotted in orange along with the original function (blue) in the figures below.
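As a sketch of that prediction and plotting step (the range of x values and the plotting details are assumptions):

import numpy as np
import matplotlib.pyplot as plt

# Predict f(x) at 500 random x values with the fitted regressor
x_new = np.sort(np.random.uniform(-1, 5, 500)).reshape(-1, 1)
y_pred = nnregress.predict(x_new)

# Original function in blue, kNN predictions in orange
plt.plot(x_new, x_new * (x_new - 1) * (x_new - 4), color='blue', label='f(x) = x(x-1)(x-4)')
plt.plot(x_new, y_pred, color='orange', label='kNN regression predictions')
plt.legend()
plt.show()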


Applications. kNN is useful for classification problems across a wide variety of fields. The algorithm can be used to help determine how a potential voter may vote, or even whether they will vote at all. It can be used by financial institutions to predict the credit rating of customers, or by potential Airbnb hosts to determine how much to charge renters per night. Some more advanced applications include text classification, handwriting recognition, and even image or video detection.


Drawbacks of kNN. While the kNN algorithm certainly has its advantages, it also suffers from the curse of dimensionality³. With a small number of input variables, kNN operates with excellent efficiency. However, as the number of features and the size of the data set grow, the efficiency and speed of the algorithm decline. Prediction also carries a high computational cost, since the algorithm requires that the distance between each new x and every training observation be calculated.

3. The Curse of Dimensionality – Wikipedia