Concept#
Intuition#
This section is adapted from Machine Learning: The Basics by Alexander
Jung [Jung, 2023]. Nearest Neighbor (NN) methods, including $K$-nearest neighbors ($K$-NN), predict the label of a query point directly from the labels of the training examples closest to it.
Metric Space Requirement#
NN methods mandate that the feature space $\mathcal{X}$ be equipped with a distance (metric) $d: \mathcal{X} \times \mathcal{X} \to \mathbb{R}_{\geq 0}$ so that "closest" is well defined.
For example, when the feature space is the Euclidean space $\mathbb{R}^{D}$, the Euclidean distance

$$
d\left(\mathbf{x}, \mathbf{x}^{\prime}\right) = \left\|\mathbf{x} - \mathbf{x}^{\prime}\right\|_2
$$

between two vectors $\mathbf{x}, \mathbf{x}^{\prime} \in \mathbb{R}^{D}$ is a natural choice.
Training and Hypothesis Space#
Given a training set $\mathcal{S} = \left\{\left(\mathbf{x}^{(n)}, y^{(n)}\right)\right\}_{n=1}^{N}$, NN methods construct a hypothesis $h: \mathcal{X} \to \mathcal{Y}$ directly from the stored examples rather than by fitting parameters.
The function value $h(\mathbf{x})$ at any point $\mathbf{x}$ is determined entirely by the labels of the training examples nearest to $\mathbf{x}$.
Binary Classification Illustration#
To illustrate, let's consider a binary classification task where each label $y^{(n)} \in \{0, 1\}$.
For a given feature vector $\mathbf{x}$, the nearest neighbor hypothesis returns the label of the training example closest to $\mathbf{x}$ (for $K = 1$), or the majority label among the $K$ closest training examples.
Notable Characteristics#
- Dependence on Training Set: The hypothesis space of $K$-NN relies on the training set $\mathcal{S}$, requiring access to and storage of the training set to compute predictions, leading to significant memory demands.
- No Parameter Tuning: Unlike linear regression or deep learning, $K$-NN does not require the solution of optimization problems for model parameters, aside from choosing $K$.
- Conceptual Appeal: With low computational requirements (aside from memory), NN methods are conceptually appealing as natural approximations of Bayes estimators.
In summary, NN methods, including $K$-NN, are simple, memory-based learners: the training data itself is the model.
Since the models keep the training examples around at test time, we call them exemplar-based models. (This approach is also called instance-based learning, or memory-based learning)[Murphy, 2022].
Problem Formulation for K-Nearest Neighbors (KNN)#
Given a set of $N$ labeled data points

$$
\mathcal{S} \overset{\text{def}}{=} \left\{\left(\mathbf{x}^{(n)}, y^{(n)}\right)\right\}_{n=1}^{N} \subset \mathcal{X} \times \mathcal{Y},
$$

where

- $\mathcal{X} \subseteq \mathbb{R}^{D}$ is the feature space,
- $\mathcal{Y} \subseteq \mathbb{R}$ for regression or $\mathcal{Y} = \{1, \ldots, C\}$ for classification is the label space,
- $\mathbf{x}^{(n)} \in \mathcal{X}$ is the $n$-th sample with $D$ number of features,
- $y^{(n)} \in \mathcal{Y}$ is its corresponding label.

Here both the samples and their labels are treated as draws from an underlying (unknown) joint distribution $\mathbb{P}(X, Y)$.
The K-Nearest Neighbors algorithm is a non-parametric method that classifies
an unlabeled instance $\mathbf{x}^{(q)}$ based on the labels of the $K$ training samples closest to it under a chosen distance metric.
The Nearest Neighbour Set and the Distance Metric#
Define the nearest neighbour set of a query point $\mathbf{x}^{(q)}$ as

$$
\mathcal{N}_K\left(\mathbf{x}^{(q)}\right) \overset{\text{def}}{=} \left\{\left(\mathbf{x}^{(n)}, y^{(n)}\right) \in \mathcal{S} : \mathbf{x}^{(n)} \text{ is one of the } K \text{ points closest to } \mathbf{x}^{(q)} \text{ under } d\right\},
$$

where

- $d: \mathcal{X} \times \mathcal{X} \to \mathbb{R}_{\geq 0}$ is a distance metric that measures the similarity between two data points,
- $d\left(\mathbf{x}^{(q)}, \mathbf{x}^{(n)}\right)$ is the distance between $\mathbf{x}^{(q)}$ and $\mathbf{x}^{(n)}$.

The set $\mathcal{N}_K\left(\mathbf{x}^{(q)}\right)$ includes the labels of the $K$ nearest neighbors of $\mathbf{x}^{(q)}$ in the training set for convenience. In actual fact, we only need the $K$ nearest neighbors of $\mathbf{x}^{(q)}$ with respect to the distance metric $d$ evaluated over all training samples; the neighbor set itself does not depend on the labels $y^{(n)}$.
The most commonly used distance metric is the Euclidean distance:

$$
d\left(\mathbf{x}, \mathbf{x}^{\prime}\right) = \left\|\mathbf{x} - \mathbf{x}^{\prime}\right\|_2 = \sqrt{\sum_{j=1}^{D} \left(x_j - x^{\prime}_j\right)^2}.
$$
However, other distance metrics such as the Manhattan distance and the cosine distance can also be used.
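To make these metrics concrete, here is a small sketch in NumPy (the helper names below are ours, purely for illustration, not from any particular library):

```python
import numpy as np

def euclidean(x, z):
    # Square root of the sum of squared coordinate differences
    return np.sqrt(np.sum((x - z) ** 2))

def manhattan(x, z):
    # Sum of absolute coordinate differences
    return np.sum(np.abs(x - z))

def cosine_distance(x, z):
    # One minus the cosine similarity; small when the vectors point in similar directions
    return 1.0 - np.dot(x, z) / (np.linalg.norm(x) * np.linalg.norm(z))

x, z = np.array([1.0, 2.0]), np.array([4.0, 6.0])
print(euclidean(x, z))        # 5.0
print(manhattan(x, z))        # 7.0
print(cosine_distance(x, z))  # ~0.0077
```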
Probabilistic Interpretation of K-Nearest Neighbors#
Given a new input $\mathbf{x}^{(q)}$, KNN estimates the conditional class distribution empirically from the labels of the $K$ nearest neighbors:

$$
\mathbb{P}\left(Y = c \mid X = \mathbf{x}^{(q)}\right) \approx \frac{1}{K} \sum_{\left(\mathbf{x}^{(n)}, y^{(n)}\right) \in \mathcal{N}_K\left(\mathbf{x}^{(q)}\right)} \mathbb{1}\left\{y^{(n)} = c\right\},
$$

where

- $c$ is a class label in the label space $\mathcal{Y}$,
- $\mathbb{1}\{\cdot\}$ is the indicator function that returns $1$ if the condition is true and $0$ otherwise.
We can then return this distribution or the majority label.
Classification#
The K-Nearest Neighbors algorithm for classification defines a function $h: \mathcal{X} \to \mathcal{Y}$ that maps a query point to the majority label among its $K$ nearest neighbors:

$$
h\left(\mathbf{x}^{(q)}\right) = \operatorname{mode}\left\{y^{(n)} : \left(\mathbf{x}^{(n)}, y^{(n)}\right) \in \mathcal{N}_K\left(\mathbf{x}^{(q)}\right)\right\},
$$

or more explicitly:

$$
h\left(\mathbf{x}^{(q)}\right) = \underset{c \in \mathcal{Y}}{\operatorname{argmax}} \sum_{\left(\mathbf{x}^{(n)}, y^{(n)}\right) \in \mathcal{N}_K\left(\mathbf{x}^{(q)}\right)} \mathbb{1}\left\{y^{(n)} = c\right\}.
$$

Here, ties in the vote can be broken arbitrarily, for example by preferring the closest neighbor or by using an odd $K$.
In short, we do the following (a minimal code sketch follows the list):

1. Given one query point $\mathbf{x}^{(q)}$ from the test set, we aim to assign a label.
2. Calculate the distance from the test query point to all points in the train dataset.
3. Sort the distances and store the $K$ nearest neighbors (data points).
4. Get the majority class from these neighbors.
5. Return this class as the prediction.
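The following is a minimal brute-force sketch of these steps in NumPy (the function name `knn_classify` and the toy data are our own, purely for illustration); ties are broken implicitly by `np.argmax`, which returns the first class with the maximal count.

```python
import numpy as np

def knn_classify(X_train, y_train, x_query, k=3):
    """Predict the label of x_query by a majority vote over its k nearest neighbors."""
    # Steps 1-2: Euclidean distance from the query point to every training point, shape (N,)
    dists = np.linalg.norm(X_train - x_query, axis=1)
    # Step 3: indices of the k smallest distances
    nn_idx = np.argsort(dists)[:k]
    # Steps 4-5: majority vote over the neighbors' labels
    labels, counts = np.unique(y_train[nn_idx], return_counts=True)
    return labels[np.argmax(counts)]

X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array([0, 0, 1, 1])
print(knn_classify(X_train, y_train, np.array([0.2, 0.1]), k=3))  # -> 0
```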
This formalism can be related to the probabilistic interpretation by recognizing that the majority vote corresponds to selecting the class with the highest estimated probability:

$$
h\left(\mathbf{x}^{(q)}\right) = \underset{c \in \mathcal{Y}}{\operatorname{argmax}} \ \mathbb{P}\left(Y = c \mid X = \mathbf{x}^{(q)}\right).
$$
Note that the probabilistic interpretation of KNN aligns with the non-parametric
nature of the algorithm, estimating the conditional probability distribution
over labels based on local similarities in the feature space $\mathcal{X}$.
Regression#
For regression, the K-Nearest Neighbors algorithm defines a function $h: \mathcal{X} \to \mathbb{R}$ that returns the average of the neighbors' targets:

$$
h\left(\mathbf{x}^{(q)}\right) = \frac{1}{K} \sum_{\left(\mathbf{x}^{(n)}, y^{(n)}\right) \in \mathcal{N}_K\left(\mathbf{x}^{(q)}\right)} y^{(n)}.
$$
In short, we do the following (a sketch follows the list):

1. Calculate the distance from the test query point $\mathbf{x}^{(q)}$ to all points in the train dataset.
2. Sort the distances and store the $K$ nearest neighbors (data points).
3. Take the mean of the labels of these $K$ nearest neighbors and return the mean as the prediction.
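A minimal regression sketch under the same assumptions as before (our own helper, for illustration only): only the final step changes, from a vote to an average.

```python
import numpy as np

def knn_regress(X_train, y_train, x_query, k=3):
    """Predict a real-valued target as the mean of the k nearest neighbors' targets."""
    dists = np.linalg.norm(X_train - x_query, axis=1)   # distances to all training points
    nn_idx = np.argsort(dists)[:k]                      # indices of the k closest points
    return y_train[nn_idx].mean()                       # average their targets

X_train = np.array([[0.0], [1.0], [2.0], [3.0]])
y_train = np.array([0.0, 1.2, 1.9, 3.1])
print(knn_regress(X_train, y_train, np.array([1.4]), k=2))  # mean of 1.2 and 1.9 -> 1.55
```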
Summary#
The goal of the K-Nearest Neighbors algorithm, whether for classification or regression, is to predict the labels of unlabeled instances in a manner that is consistent with the underlying distribution of the labels in the neighborhood of each instance. It assumes that instances close in the feature space are likely to have similar labels.
The algorithm has no explicit training phase but relies on the entire training
set for prediction. Consequently, the complexity of the model grows linearly
with the size of the training set, making it computationally expensive for large
datasets. Moreover, the choice of $K$ and of the distance metric strongly influences the bias-variance trade-off of the resulting model.
KNN has been widely applied in various domains, including pattern recognition, anomaly detection, and recommendation systems. Its simplicity and flexibility to adapt to different types of data and problems make it a popular choice among practitioners. However, it may suffer from the curse of dimensionality and requires careful preprocessing and parameter tuning to perform optimally.
In summary, K-Nearest Neighbors provides an intuitive and straightforward way to perform classification (or regression) based on the principle that similar instances are likely to have similar outputs. Its non-parametric nature allows it to capture complex relationships without relying on a specific functional form, but it requires careful consideration of the hyperparameters and the underlying assumptions about the data.
Our Focus (K-Nearest Neighbors Classification)#
We will primarily focus on the K-Nearest Neighbors algorithm for classification in this chapter.
K-Nearest Neighbors Convergence#
Murphy wrote in his book the following:
The error rate of the K-Nearest Neighbors (KNN) classifier becomes within a factor of 2 of the Bayes error as the sample size $N \to \infty$ [Murphy, 2022].
For a more concrete proof, see CS4780.
Here’s why I think this is the case:
1. Convergence of Nearest Neighbors#
As the number of data points $N$ grows, the nearest neighbors of any query point $\mathbf{x}^{(q)}$ become arbitrarily close to it (with high probability), so their labels are effectively drawn from the true conditional distribution $\mathbb{P}\left(Y \mid X = \mathbf{x}^{(q)}\right)$.
2. Density Estimation#
KNN effectively acts as a non-parametric density estimator. By considering the
majority vote of the $K$ nearest neighbors, it implicitly estimates the local conditional class probabilities around the query point.
3. Bayes Classifier#
The Bayes classifier is the optimal classifier that assigns a label to a query
point based on the true conditional probability $\mathbb{P}(Y \mid X = \mathbf{x})$; its error rate, the Bayes error, is the lowest achievable by any classifier.
4. KNN Convergence to Bayes Classifier#
As $N \to \infty$ (and, for $K > 1$, as $K \to \infty$ with $K / N \to 0$), the KNN prediction at a query point converges to the prediction of the Bayes classifier at that point.
5. Factor of 2 Result#
The specific result that the asymptotic error rate of the nearest neighbor classifier is at most twice
the Bayes error rate can be proven rigorously using mathematical analysis. It
relies on the assumptions that the data is drawn independently from a stationary
distribution and that certain regularity conditions on the conditional distribution $\mathbb{P}(Y \mid X)$ hold.
In summary, the convergence of the KNN error rate to within a factor of 2 of the
Bayes error is a manifestation of the algorithm’s statistical consistency and
its ability to approximate the true underlying distribution of the data as the
sample size grows. It’s a powerful result that illustrates the theoretical
strength of KNN despite its simplicity. It also underscores the importance of
the choice of $K$ and of the amount of available data.
Hypothesis Space for K-Nearest Neighbors (KNN)#
The hypothesis space $\mathcal{H}$ of KNN is the set of all prediction functions that the algorithm can realize, given a training set, a choice of $K$, and a distance metric.
Definition for Classification#
Given a set of $N$ labeled training examples, the KNN hypothesis for classification takes the form

$$
h(\mathbf{x}) = \underset{c \in \mathcal{Y}}{\operatorname{argmax}} \sum_{\left(\mathbf{x}^{(n)}, y^{(n)}\right) \in \mathcal{N}_K(\mathbf{x})} \mathbb{1}\left\{y^{(n)} = c\right\}.
$$

Here:

- $\mathcal{X} \subseteq \mathbb{R}^{D}$ is the feature space.
- $\mathcal{Y} = \{1, \ldots, C\}$ is the set of possible labels (e.g., for a $C$-class classification problem).
- $\mathcal{N}_K(\mathbf{x})$ denotes the set of $K$ nearest neighbors of $\mathbf{x}$ in the training set.
- $\mathbb{1}\{\cdot\}$ is the indicator function.
This definition represents the class of functions that can be constructed using
the KNN algorithm, considering different choices of $K$, distance metric $d$, and training set $\mathcal{S}$.
General Definition and Properties#
The hypothesis space $\mathcal{H}$ is non-parametric: it has no fixed, finite set of parameters, every hypothesis in it is piecewise constant over regions of the feature space induced by the training examples, and its effective complexity grows with the size of the training set.
Summary#
Unlike K-Means, where the hypothesis space has finite cardinality, the
hypothesis space for KNN is more complex and depends on the continuous nature of
the feature space and the choice of $K$ and the distance metric.
In summary, the hypothesis space for KNN encapsulates the algorithm’s ability to classify data points based on local similarities and the principle that similar instances in the feature space are likely to have similar labels. Its richness and complexity mirror the inherent flexibility and adaptability of the KNN algorithm.
Loss, Cost and Objective Functions???#
The K-Nearest Neighbors (KNN) algorithm is a non-parametric method, meaning it doesn’t learn parameters from the training data through optimization. Instead, it relies on the entire training set to make predictions. Because of this, there is no specific cost, loss, or objective function that needs to be minimized during a training phase.
See this post in stackexchange for more info.
If you really want, there is no stopping you from defining a loss function for KNN to check the "goodness" of the model, but it is not necessary.
- Loss Function: The empirical risk can be defined using various loss functions $\ell$, such as the 0-1 loss for classification or the mean squared error for regression:

$$
\hat{\mathcal{R}}(h) = \frac{1}{N} \sum_{n=1}^{N} \ell\left(y^{(n)}, h\left(\mathbf{x}^{(n)}\right)\right).
$$

- Generalization: The performance of KNN on new data points can be studied through statistical learning theory. The true risk under the data-generating distribution $\mathbb{P}(X, Y)$ can be expressed as:

$$
\mathcal{R}(h) = \mathbb{E}_{(X, Y) \sim \mathbb{P}}\left[\ell(Y, h(X))\right].
$$

A small held-out estimate of the 0-1 risk is sketched below.
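For instance, the empirical 0-1 risk on a held-out validation set can be computed directly. The sketch below assumes the `knn_classify` helper defined in the Classification section; none of this is required to run KNN, it merely quantifies how well a particular choice of $K$ is doing.

```python
import numpy as np

def empirical_01_risk(X_train, y_train, X_val, y_val, k=3):
    """Fraction of validation points misclassified by the KNN rule (0-1 loss)."""
    # knn_classify is the brute-force helper sketched earlier in this chapter
    preds = np.array([knn_classify(X_train, y_train, x, k=k) for x in X_val])
    return float(np.mean(preds != y_val))
```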
Algorithm#
Remark 5 (Choice of K and Distance Metric)
The choice of $K$ and of the distance metric $d$ are hyperparameters of the algorithm; they are typically chosen by cross-validation and can significantly affect performance.
Algorithm 3 (K-Nearest Neighbors (KNN) Algorithm)
Given a set of labeled data points (training samples) $\mathcal{S}$
and a new unlabeled instance $\mathbf{x}^{(q)}$, proceed as follows:

1. Choose the Number of Neighbors (K): Select a value for $K$, which is the number of nearest neighbors to consider for classification.
2. Compute Distances: Calculate the distance between $\mathbf{x}^{(q)}$ and each data point in $\mathcal{S}$. The distance metric (e.g., Euclidean distance) can be chosen based on the problem domain.
3. Select K Nearest Neighbors: Identify the $K$ data points in $\mathcal{S}$ with the smallest distances to $\mathbf{x}^{(q)}$. This set of nearest neighbors is denoted as $\mathcal{N}_K\left(\mathbf{x}^{(q)}\right)$.
4. Compute the Predicted Label: Determine the predicted label for $\mathbf{x}^{(q)}$ based on the labels of its nearest neighbors. For classification, this can be done by taking a majority vote among the nearest neighbors:

$$
\hat{y}^{(q)} = \underset{c \in \mathcal{Y}}{\operatorname{argmax}} \sum_{\left(\mathbf{x}^{(n)}, y^{(n)}\right) \in \mathcal{N}_K\left(\mathbf{x}^{(q)}\right)} \mathbb{1}\left\{y^{(n)} = c\right\},
$$

where $\mathbb{1}\{\cdot\}$ is the indicator function.
The predicted label $\hat{y}^{(q)}$ is returned as the output of the algorithm.
This algorithm represents the standard operation of KNN for classification. If
used for regression, step 4 would involve computing a mean or weighted mean of
the target values of the $K$ nearest neighbors instead of a majority vote.
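In practice one rarely hand-rolls this loop; scikit-learn's `KNeighborsClassifier` implements the same procedure. A minimal usage sketch on synthetic data (the blob parameters are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
# Two Gaussian blobs as a toy binary classification problem
X = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(3.0, 1.0, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

knn = KNeighborsClassifier(n_neighbors=5)  # K = 5, Euclidean distance by default
knn.fit(X, y)                              # "training" just stores the data

print(knn.predict([[0.2, 0.1], [2.8, 3.1]]))  # majority-vote labels
print(knn.predict_proba([[1.5, 1.5]]))        # empirical neighbor-vote probabilities
```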
Time Complexity of Naive (Brute-Force) KNN#
- Euclidean Distance Calculation: The time complexity for calculating the Euclidean distance between two points of dimension $D$ is $\mathcal{O}(D)$, as it involves $D$ subtractions, $D$ squarings, and $D - 1$ additions, with one square root operation at the end.
- Distance Computation for All Training Examples: For each test sample, calculating the distance to all $N$ training examples takes $\mathcal{O}(ND)$.
- Sorting the Distances: Sorting the $N$ distances for each test sample takes $\mathcal{O}(N \log N)$.
- Calculating Majority Class: Determining the majority class based on the $K$ nearest neighbors takes $\mathcal{O}(K)$ time.
- Total Time Complexity for a Single Query: The overall complexity for one test sample is $\mathcal{O}(ND + N \log N + K)$, but since $K \leq N$, the time complexity is $\mathcal{O}(ND + N \log N)$.
- Total Time Complexity for Multiple Queries: If there are $M$ test samples, the final time complexity is $\mathcal{O}\left(M (ND + N \log N)\right)$.
Space Complexity of Naive (Brute-Force) KNN#
- Input Matrix: The space required for the training data is $\mathcal{O}(ND)$.
- Distance List: The space required for storing the distances to all training examples (for a single query) is $\mathcal{O}(N)$.
- Total Space Complexity: The overall space complexity is $\mathcal{O}(ND + N)$, which simplifies to $\mathcal{O}(ND)$ since $ND$ is usually the dominant term.
Summary#
- Time Complexity: $\mathcal{O}\left(M (ND + N \log N)\right)$ for $M$ test samples.
- Space Complexity: $\mathcal{O}(ND)$.
The naive K-NN’s time complexity is primarily influenced by the number of training examples, the dimensionality of the feature space, and the number of test samples. The space complexity is determined by the size of the training data. This detailed analysis aligns with the practical implementation of the brute-force K-NN method, highlighting the computational challenges that may arise with large datasets or high-dimensional feature spaces.
Other K-NN Methods and Their Time and Space Complexity#
I will not go into the details of these methods, but I will provide a brief overview of their time and space complexity. The table below summarizes the typical (commonly cited) complexities of the QuickSelect, Ball Tree, and KD-Tree methods.

| Method | Operation | Best Case Time | Average Case Time | Worst Case Time | Space Complexity | Description |
|---|---|---|---|---|---|---|
| QuickSelect | Finding the $K$ smallest distances | $\mathcal{O}(N)$ | $\mathcal{O}(N)$ | $\mathcal{O}(N^2)$ | $\mathcal{O}(1)$ extra | QuickSelect is used to find the $K$ smallest distances without fully sorting them. |
| Ball Tree | Build | $\mathcal{O}(DN \log N)$ | $\mathcal{O}(DN \log N)$ | $\mathcal{O}(DN \log N)$ | $\mathcal{O}(DN)$ | Building a Ball Tree involves partitioning data into hyper-spheres. |
| Ball Tree | Query | $\mathcal{O}(D \log N)$ | $\mathcal{O}(D \log N)$ | $\mathcal{O}(DN)$ | $\mathcal{O}(DN)$ | Querying a Ball Tree is usually logarithmic but can degrade to linear time in the worst case. |
| KD-Tree | Build | $\mathcal{O}(DN \log N)$ | $\mathcal{O}(DN \log N)$ | $\mathcal{O}(DN \log N)$ | $\mathcal{O}(DN)$ | Building a KD-Tree involves partitioning space along axis-aligned hyperplanes. |
| KD-Tree | Query | $\mathcal{O}(D \log N)$ | $\mathcal{O}(D \log N)$ | $\mathcal{O}(DN)$ | $\mathcal{O}(DN)$ | Querying a KD-Tree is usually logarithmic but can degrade to linear time in the worst case. |
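In scikit-learn these data structures are selected through the `algorithm` argument of the neighbor estimators, so switching between the brute-force, KD-Tree, and Ball Tree back-ends is a one-line change (toy data below, chosen only for illustration):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 3))
y = rng.integers(0, 2, size=1000)

for algo in ["brute", "kd_tree", "ball_tree"]:
    knn = KNeighborsClassifier(n_neighbors=5, algorithm=algo).fit(X, y)
    print(algo, knn.predict([[0.0, 0.0, 0.0]]))
```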
Example#
We illustrate the KNN classifier on a two-dimensional toy dataset below.

Fig. 12 Illustration of a K-Nearest Neighbors classifier.
K-Nearest Neighbors (KNN) and Voronoi Regions#
The K-Nearest Neighbors classifier classifies an unlabeled instance $\mathbf{x}^{(q)}$ by assigning it the most common label among its $K$ nearest neighbors in the training set.
Voronoi Regions for 1-NN#
Given a training set $\mathcal{S} = \left\{\left(\mathbf{x}^{(n)}, y^{(n)}\right)\right\}_{n=1}^{N}$, the Voronoi region (cell) of a training point $\mathbf{x}^{(n)}$ is defined as

$$
V^{(n)} \overset{\text{def}}{=} \left\{\mathbf{x} \in \mathcal{X} : d\left(\mathbf{x}, \mathbf{x}^{(n)}\right) \leq d\left(\mathbf{x}, \mathbf{x}^{(m)}\right) \text{ for all } m \neq n\right\}.
$$

The Voronoi regions partition the feature space $\mathcal{X}$ into $N$ cells, one per training point.
This definition says that the Voronoi region $V^{(n)}$ consists of all points in the feature space that are at least as close to $\mathbf{x}^{(n)}$ as to any other training point.
Characteristics of 1-NN Voronoi Regions#
When $K = 1$, the nearest neighbor classifier has the following characteristics:

- Delta Function Predictive Distribution: The predicted label for a query point $\mathbf{x}^{(q)}$ is exactly the label of its nearest neighbor, resulting in a delta function predictive distribution [Murphy, 2022].
- Voronoi Tessellation: The Voronoi regions create a partitioning of the feature space, with each region corresponding to a training point.
- Zero Training Error: The model will perfectly classify all training points, leading to zero training error.
- Overfitting: A 1-NN model may overfit the training data and perform poorly on unseen data.
Classification Using Voronoi Regions#
Given an unlabeled instance $\mathbf{x}^{(q)}$, the 1-NN classifier finds the Voronoi cell that contains $\mathbf{x}^{(q)}$ and assigns the label of the training point that generates that cell.
Example#
Let’s illustrate with an example from Probabilistic Machine Learning: An Introduction by Kevin Murphy [Murphy, 2022].
```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial import KDTree, Voronoi, voronoi_plot_2d

np.random.seed(42)
data = np.random.rand(25, 2)  # 25 random points in the unit square

# Voronoi tessellation of the training points
vor = Voronoi(data)

print("Using scipy.spatial.voronoi_plot_2d, wait...")
voronoi_plot_2d(vor)
xlim = plt.xlim()
ylim = plt.ylim()
plt.show()

print("Using scipy.spatial.KDTree, wait a few seconds...")
plt.figure()
tree = KDTree(data)
# Evaluate the nearest-neighbor index on a dense grid; coloring the grid by the
# index of the nearest training point reproduces the Voronoi cells.
x = np.linspace(xlim[0], xlim[1], 200)
y = np.linspace(ylim[0], ylim[1], 200)
xx, yy = np.meshgrid(x, y)
xy = np.c_[xx.ravel(), yy.ravel()]
plt.plot(data[:, 0], data[:, 1], "ko")
plt.pcolormesh(x, y, tree.query(xy)[1].reshape(200, 200), cmap="jet", shading="auto")
plt.show()
```
Extending to $K > 1$#
For $K > 1$, the prediction at a query point depends on its $K$ nearest training points rather than a single one, so the decision regions are no longer the simple Voronoi cells above; they are formed by intersections of higher-order Voronoi regions, and their boundaries become correspondingly more complex.
Geometrical Structure of K-Nearest Neighbors (KNN) Hypothesis Space#
The KNN algorithm’s hypothesis space reveals a rich geometrical structure,
defined through Voronoi tessellation. The space can be understood as the class
of all functions that are piecewise constant on the cells of the Voronoi tessellation induced by the training points.
Voronoi Tessellation in KNN#
For a collection of training examples $\left\{\mathbf{x}^{(n)}\right\}_{n=1}^{N}$, the Voronoi tessellation divides the feature space into $N$ cells, where the cell $V^{(n)}$ contains all points closer to $\mathbf{x}^{(n)}$ than to any other training example; on each cell, a 1-NN hypothesis is constant.
Connection to Previous Discussion#
This geometrical perspective connects to the earlier description of the KNN hypothesis space by emphasizing the piecewise constant nature of the KNN functions. It showcases how the Voronoi cells, defined by the training examples, form the underlying structure that drives the classification or regression decision.
The piecewise constant behavior in each Voronoi cell aligns with the KNN
algorithm’s logic, where the prediction for a query point is determined by the
majority label or average value among its $K$ nearest neighbors.
The hypothesis space for KNN, through the lens of Voronoi tessellation, offers a
profound understanding of the algorithm’s ability to adapt to local patterns in
the data. This geometrical view complements the decision-making perspective,
providing a comprehensive picture of the KNN algorithm’s richness, flexibility,
and complexity. It illustrates how the choice of $K$, together with the placement of the training examples, shapes the induced partition of the feature space.
Summary#
In summary, the concept of Voronoi regions applies to the KNN algorithm for the
special case of $K = 1$, where each training point induces one cell and the prediction is constant within each cell.
However, when $K > 1$, the decision regions are built from combinations of such cells, and the simple one-cell-per-training-point picture no longer applies directly.
The Curse of Dimensionality in K-Nearest Neighbors (K-NN)#
The effectiveness of the K-NN algorithm relies on the notion that similar points in the feature space likely share similar labels. This assumption, however, breaks down in high-dimensional spaces due to the phenomenon known as the “curse of dimensionality.” In high-dimensional spaces, points drawn from a probability distribution tend to be far apart, making the concept of closeness or similarity less meaningful.
Illustration Using Hyper-Cube#
A unit hypercube is a geometric shape in $D$-dimensional space whose sides all have length $1$.
In mathematical terms, the unit hypercube in $D$ dimensions is the set $[0, 1]^{D} = \{\mathbf{x} \in \mathbb{R}^{D} : 0 \leq x_j \leq 1 \text{ for all } j\}$.
Here, $D$ denotes the dimensionality of the feature space.
Consider the unit hyper-cube $[0, 1]^{D}$, with the training data distributed roughly uniformly inside it, as in the figure below.

Let $\ell$ be the edge length of the smallest hyper-cube, centered at a test point, that contains its $K$ nearest neighbors. If the data are roughly uniform, the fraction of points falling inside that small cube is approximately its volume, so $\ell^{D} \approx \frac{K}{N}$ and hence $\ell \approx \left(\frac{K}{N}\right)^{1 / D}$.
Given $\frac{K}{N} = 0.01$ (for example, $K = 10$ neighbors out of $N = 1000$ training samples), the table below shows how $\ell$ grows with the dimension $D$:
| Dimension ($D$) | Length ($\ell$) |
|---|---|
| 2 | 0.1 |
| 10 | 0.63 |
| 100 | 0.955 |
| 1000 | 0.9954 |
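The edge lengths in the table follow directly from $\ell \approx (K/N)^{1/D}$ with $K/N = 0.01$; a quick check:

```python
k_over_n = 0.01  # e.g. K = 10 neighbors out of N = 1000 samples
for D in [2, 10, 100, 1000]:
    ell = k_over_n ** (1 / D)  # edge length of the cube containing the K nearest neighbors
    print(f"D = {D:>4}:  l = {ell:.4f}")
# D =    2:  l = 0.1000
# D =   10:  l = 0.6310
# D =  100:  l = 0.9550
# D = 1000:  l = 0.9954
```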
Implications#
As $D$ grows, $\ell$ approaches $1$: the neighborhood containing the $K$ nearest neighbors spans almost the entire unit hyper-cube.
In other words, given a test point, its $K$ nearest neighbors in high dimensions are almost as far away as a randomly chosen training point, so "nearness" no longer carries much information about similarity.
Required Number of Data Points#
One might think of increasing the number of training samples $N$ until the neighborhoods become small again; however, keeping $\ell$ fixed requires $N \approx K / \ell^{D}$, which grows exponentially with the dimension $D$.
For dimensions greater than 100, this would require more data points than there are electrons in the universe.
Conclusion#
The curse of dimensionality presents a significant challenge for the K-NN algorithm in high-dimensional spaces. It undermines the foundational assumption of similarity and requires an impractical number of samples for effective learning. This phenomenon emphasizes the importance of dimensionality reduction and careful feature selection when applying K-NN to real-world high-dimensional problems.
See Curse of Dimensionality for more details.
Decision Boundary#
Definition of Decision Boundary#
The decision boundary in K-NN is a hypersurface that partitions the feature
space $\mathcal{X}$ into regions associated with different predicted class labels.
Mathematical Description#
Given the training set $\mathcal{S}$
and the label space $\mathcal{Y} = \{0, 1\}$ (binary case), the decision boundary is the set of points at which the two estimated class probabilities are equal:

$$
\mathcal{B} \overset{\text{def}}{=} \left\{\mathbf{x} \in \mathcal{X} : \mathbb{P}(Y = 0 \mid X = \mathbf{x}) = \mathbb{P}(Y = 1 \mid X = \mathbf{x})\right\},
$$

where $\mathbb{P}(Y = c \mid X = \mathbf{x})$ is the KNN estimate defined earlier, computed from the $K$ nearest neighbors of $\mathbf{x}$,
and $\mathbf{x}$ ranges over the feature space $\mathcal{X}$.
Properties of the Decision Boundary#
- Complexity: The decision boundary can be highly complex or smooth, depending on the choice of $K$. A small $K$ leads to a complex, jagged boundary that may capture noise (potentially overfitting), while a large $K$ leads to a smoother boundary that may oversimplify the underlying patterns (see the sketch after this list).
- Sensitivity to Noise: The decision boundary is sensitive to noise and outliers, especially for small $K$. Misclassified training examples near the boundary can distort it.
- Non-Linearity: The decision boundary in K-NN is not restricted to be linear. It can adapt to the underlying distribution of the data, capturing non-linear relationships.
- Local Decision Making: The decision boundary is determined by local information, specifically the $K$ nearest neighbors. This reflects the instance-based nature of K-NN, where decisions are made based on localized regions of the feature space.
- Influence of Distance Metric: The shape and position of the decision boundary are influenced by the chosen distance metric $d$. Different metrics may lead to different boundaries.
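One way to see these properties is to plot the decision regions for different values of $K$ on a two-dimensional toy dataset; a sketch using scikit-learn and Matplotlib (the data, grid resolution, and the two values of $K$ are arbitrary illustrative choices):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(1992)
X = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(2.0, 1.0, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, k in zip(axes, [1, 15]):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X, y)
    # Evaluate the classifier on a dense grid to reveal the decision regions
    xx, yy = np.meshgrid(np.linspace(-3, 5, 300), np.linspace(-3, 5, 300))
    zz = knn.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
    ax.contourf(xx, yy, zz, alpha=0.3)                      # jagged for K=1, smoother for K=15
    ax.scatter(X[:, 0], X[:, 1], c=y, edgecolor="k", s=20)  # training points
    ax.set_title(f"K = {k}")
plt.show()
```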
In summary, the decision boundary in K-NN is a fundamental concept that describes how the algorithm classifies instances in the feature space. Its properties are influenced by the choice of hyperparameters, the nature of the data, and the underlying assumptions of the algorithm. Understanding the decision boundary helps in interpreting the behavior of K-NN and in tuning its parameters for optimal performance.
Mathematical Description for $C$ Classes#
Given the same training set and label space as before, but now with $C > 2$ classes, the decision boundaries are defined pairwise: the boundary between classes $c$ and $c^{\prime}$ consists of the points where the two classes are jointly most probable and equally probable,

$$
\mathcal{B}_{c, c^{\prime}} \overset{\text{def}}{=} \left\{\mathbf{x} \in \mathcal{X} : \mathbb{P}(Y = c \mid X = \mathbf{x}) = \mathbb{P}(Y = c^{\prime} \mid X = \mathbf{x}) \geq \mathbb{P}(Y = c^{\prime\prime} \mid X = \mathbf{x}) \ \text{ for all } c^{\prime\prime} \in \mathcal{Y}\right\},
$$

where the class probability is defined as before:

$$
\mathbb{P}(Y = c \mid X = \mathbf{x}) \approx \frac{1}{K} \sum_{\left(\mathbf{x}^{(n)}, y^{(n)}\right) \in \mathcal{N}_K(\mathbf{x})} \mathbb{1}\left\{y^{(n)} = c\right\}.
$$
Properties and Characteristics#
- Multiple Boundaries: For $C$ classes, there will be multiple decision boundaries separating the regions corresponding to different classes.
- Higher Complexity: The geometry of the decision boundaries becomes more complex as the number of classes increases, especially if the classes are not well-separated.
- Intersecting Boundaries: Decision boundaries for different pairs of classes may intersect, creating a network of hypersurfaces that partitions the feature space.
- Local Decision Making: The decision boundaries are still determined by local information, specifically the $K$ nearest neighbors. The local structure of the data dictates how the boundaries are formed.
- Non-Linearity: As with the binary case, the decision boundaries in K-NN are not restricted to being linear. They can capture complex, non-linear relationships between classes.
Visualization#
Visualizing the decision boundaries for more than two classes can be challenging, especially in high-dimensional spaces. In two dimensions, the decision boundaries can be visualized as lines (or curves) that separate the plane into regions corresponding to different classes. In three dimensions, the boundaries become surfaces, and in higher dimensions, they are hypersurfaces.
Summary#
The generalization of the decision boundary concept to $C > 2$ classes follows naturally from the binary case: each pair of classes induces a boundary, and together these boundaries carve the feature space into class regions.
The decision boundary in a classification problem serves as a critical concept that separates the feature space into distinct regions corresponding to different classes. Intuitively, it’s like a frontier or a dividing line that helps the model decide which class a new observation belongs to based on its features. Let’s delve into why decision boundaries are useful and why having a tie on the boundary signifies something important.
Why Decision Boundaries Are Useful#
- Understanding Classification Behavior: The decision boundary provides a visual and mathematical representation of how the algorithm classifies different regions of the feature space. It helps in understanding how changes in features affect the predicted class.
- Identifying Class Boundaries: It demarcates where one class ends and another begins. This can be vital for understanding how well the model can distinguish between classes.
- Analyzing Model Complexity: The shape and structure of the decision boundary can give insights into the complexity of the model. A linear boundary may suggest a simple relationship between features and classes, while a highly non-linear, jagged boundary might indicate overfitting.
- Evaluating Sensitivity to Features: By examining the decision boundary, one can understand how sensitive the classification is to changes in different features. It helps in identifying which features have more influence on the classification.
- Assessing Multi-Class Dynamics: In multi-class problems, decision boundaries show how classes relate to each other in the feature space, revealing potential challenges in distinguishing some classes from others.
Why a Tie on the Boundary Signifies Something#
A tie on the decision boundary means that the algorithm is equally uncertain about assigning an observation on the boundary to either of the adjacent classes. Here’s why it’s significant:
- Critical Point of Decision Making: A tie signifies that the observation lies exactly on the point where the algorithm cannot favor one class over the other. It’s like standing right on the border between two countries.
- Sensitivity to Noise and Model Parameters: Observations on or near the decision boundary may be highly sensitive to noise in the data or changes in model parameters (like the distance metric or $K$ in K-NN). A small change might lead to a different classification.
- Highlighting Ambiguities: Ties on the decision boundary highlight areas where the model may struggle to make clear distinctions between classes. It can reveal regions of ambiguity that might need special attention, such as gathering more features or tuning hyperparameters.
- Guidance for Model Improvement: Understanding where and why ties occur on the decision boundary can guide further model development, including feature engineering, selection of distance metrics, or the choice of $K$ in K-NN.
Summary#
The decision boundary acts as a geographical map of how the classification algorithm interprets the feature space. Ties on this boundary reveal critical points where the model is most uncertain and sensitive. Understanding these aspects of the decision boundary can lead to a more intuitive grasp of the model’s behavior, better diagnostic insights, and guidance for model refinement and improvement.
Assumptions#
The K-Nearest Neighbors (K-NN) algorithm is relatively straightforward, but it operates under certain assumptions that can impact its performance. Here are the major assumptions:
1. Homogeneous Feature Scale#
The scale of the variables matters in K-NN. All features must be on a similar scale; otherwise, features with larger scales will unduly influence the distance metric. This emphasizes the need for normalization or standardization.
2. Relevance of Euclidean Distance#
The K-NN algorithm assumes that the Euclidean distance (or other chosen distance metrics) is a meaningful measure of similarity between instances. If this assumption does not hold, the algorithm may not perform well.
3. Equal Importance of Features#
Unless feature weighting is employed, K-NN treats all features as equally important in the similarity measure. If one or more irrelevant features are included in the model, the performance can be adversely affected.
4. Data Distribution and Density#
K-NN assumes that the local structure or density of the data is informative. This means that instances that are close to one another in the feature space are assumed to have similar outputs.
5. Smooth Decision Boundary#
The algorithm assumes that the decision boundary is reasonably smooth, especially for
larger values of $K$, since averaging over more neighbors smooths out local irregularities.
6. Independence of Observations#
K-NN assumes that the observations are independent of one another. If there are dependencies between observations, the algorithm might produce biased predictions.
7. Non-parametric Nature#
K-NN is a non-parametric method, meaning that it makes no explicit assumptions about the functional form of the transformation from features to output. While this adds to its flexibility, it also means that the algorithm relies heavily on the data itself, rather than any underlying parameters.
8. Noisy Data Sensitivity#
K-NN is sensitive to noisy data. Outliers or mislabeled examples can
significantly influence the decision boundary, especially for small values of $K$.
9. Curse of Dimensionality#
K-NN suffers from the curse of dimensionality. As the number of dimensions (features) increases, the distance between samples in the feature space becomes less meaningful, and more data are required to make reliable predictions.
10. Balanced Class Distribution#
In classification problems, K-NN assumes that the classes are relatively
balanced. If one class significantly outnumbers the others, the majority class
may dominate the voting process, leading to biased predictions. This can result
in the algorithm consistently predicting the majority class, particularly when
the value of $K$ is large.
Including this assumption completes the list, giving a comprehensive overview of the factors that can influence the performance of the K-NN algorithm.
From Wikipedia, with modification: examples of a more frequent class tend to dominate the prediction of a new example, because they tend to be common among the $K$ nearest neighbors due to their large number. This is intuitive: if our dataset contains 1 million samples, 99% of which share one label, and we use, say, $K = 11$ neighbors, then it is very likely that all 11 nearest neighbours carry the dominant class. One way to overcome this problem is to weight the classification, taking into account the distance from the test point to each of its $K$ nearest neighbors; a sketch follows.
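In scikit-learn, distance weighting is available through the `weights` argument; the sketch below contrasts it with the default uniform vote (both models are fit and queried with the usual `fit` / `predict` calls):

```python
from sklearn.neighbors import KNeighborsClassifier

# Uniform (default) vote: every one of the K neighbors counts equally,
# so a very frequent class can easily dominate.
knn_uniform = KNeighborsClassifier(n_neighbors=11, weights="uniform")

# Distance-weighted vote: each neighbor's vote is weighted by 1 / distance,
# so close neighbors matter more than far ones and rare classes near the
# query point are less easily drowned out.
knn_weighted = KNeighborsClassifier(n_neighbors=11, weights="distance")
```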
Conclusion#
The effectiveness of the K-NN algorithm relies on these assumptions. Violating
them can lead to poor performance. Proper preprocessing, such as
feature scaling and selection, along with a careful choice of hyperparameters like
$K$ and the distance metric, can mitigate these issues.
Summary#
Instance-Based Learning: KNN is an example of an instance-based learning algorithm. Rather than summarizing the training data into a compact model, it uses the entire training dataset to make predictions. The rationale is that similar input examples are likely to have similar output values.
Lazy Learning: KNN is often referred to as a "lazy learning" algorithm because it defers most of the computation until a query is made. When a query point $\mathbf{x}^{(q)}$ is presented, the algorithm calculates the distance from $\mathbf{x}^{(q)}$ to every example in $\mathcal{S}$ to find the $K$ nearest neighbors and predict the output.
Non-Parametric: The algorithm does not assume any specific form for the underlying distribution that generates the data. It makes no parametric assumptions, and thus the model complexity grows with the size of the data. The data itself serves as a representation of the underlying function.
Memory vs. Computation Trade-off: By storing the entire training dataset, KNN leverages memory to reduce computation during the learning phase. However, this leads to increased computational complexity during the prediction phase, where distances must be computed to all stored instances.
KNN for Classification and Regression: KNN can be employed for both classification and regression tasks.
Classification: It assigns the mode of the labels of the top $K$ nearest neighbors.
Regression: It assigns the mean of the labels of the top $K$ nearest neighbors.
Refer to the illustration named "classification_vs_regression" for a visual representation.
Nature of KNN:
Non-parametric: KNN does not make explicit assumptions about the functional form of the distribution of data. This allows it to capture complex relationships without relying on specific assumptions that might not hold.
Instance-based and Supervised: KNN memorizes the training instances and uses them for prediction, requiring supervision through labeled data.
Algorithm Breakdown: The detailed steps of the KNN algorithm can be found in the Algorithm section.
Computational Considerations: KNN requires scanning the entire training set for each query, making it computationally expensive. Various techniques, such as dimensionality reduction (e.g., PCA) or clustering, can alleviate this issue.
Bias and Variance Tradeoff:
Bias: High bias may cause underfitting due to erroneous assumptions.
Variance: High variance may lead to overfitting due to sensitivity to fluctuations in the training set.
Refer to the Wikipedia section on KNN for formal mathematical definitions.
Hyperparameters:
$K$: The number of neighbors; its selection is discussed in the Decision Boundary section.
Distance Metric $d$: Defines the measure of similarity between data points.
Handling Large Datasets: Techniques like dimensionality reduction, random sampling, and regional clustering can reduce computational cost.
Handling Missing Values: Missing values can be imputed using the average or majority class of the closest $K$ neighbors.
Notable Properties of KNN:
Data Dependency: More data generally implies better KNN, as it provides a richer representation of the underlying distribution.
Choice of $K$: $K$ is usually chosen as an odd number to avoid ties. A small $K$ leads to a complex decision surface, while a large $K$ smooths the surface.
Edge Cases: Special cases such as $K = 1$ and $K = N$ are explained in the Decision Boundary section.
Normalization: Feature scaling is essential since KNN relies on distance measures. See the Normalization section and the pipeline sketch below.
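Since KNN is distance-based, feature scaling is usually folded into the model itself; a common pattern is a scikit-learn pipeline (a sketch, one reasonable way among several):

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

# Standardize each feature to zero mean and unit variance before measuring
# distances, so features on large scales do not dominate the Euclidean metric.
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
# model.fit(X_train, y_train); model.predict(X_test)
```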
This summary encapsulates the essential aspects of the KNN algorithm, adhering to the notations and rigor established in previous sections. It serves as a concise and comprehensive review for those studying the algorithm, with references to more detailed explanations and visualizations.
References#
https://www.cs.cornell.edu/courses/cs4780/2018fa/lectures/lecturenote02_kNN.html
kevinzakka’s end-to-end KNN: primarily based on this reference.