Reimplemented the K-means algorithm from scratch, with improvements, to classify MNIST digits. Implemented K-means++ initialization, Euclidean-distance assignment, and centroid updates, plus an outlier-detection system. Achieved 78% accuracy and identified high-variance misclassified digits.
Project Report
Read the full report here: Clustering and Classification of Handwritten Digits Using the K-Means Algorithm (PDF)
Overview
This project implements and optimizes the K-means clustering algorithm to classify handwritten digits from the Modified National Institute of Standards and Technology (MNIST) database. Each 28 × 28 pixel image is flattened into a 784-dimensional vector, and the goal was to classify these vectors by forming clusters and calculating a representative centroid for each cluster. We replaced random centroid initialization with the K-means++ method to improve clustering performance, and we established a distance-based statistical threshold for outlier detection. The resulting algorithm achieved 78% classification accuracy on the test set and flagged 14 outliers.
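The clustering loop described above can be sketched in NumPy as follows. This is a minimal illustration rather than the project's actual code: the function names (`kmeans_pp_init`, `assign`, `kmeans`), the default iteration count, and the convergence check are assumptions made for this example.

```python
import numpy as np

def kmeans_pp_init(X, k, rng):
    """Pick k initial centroids from the rows of X using K-means++ seeding."""
    n = X.shape[0]
    centroids = [X[rng.integers(n)]]  # first centroid: uniform at random
    # Squared distance from every point to its nearest chosen centroid so far.
    d2 = np.sum((X - centroids[0]) ** 2, axis=1)
    for _ in range(1, k):
        total = d2.sum()
        # Sample in proportion to squared distance; fall back to uniform
        # sampling if every point coincides with a chosen centroid.
        probs = d2 / total if total > 0 else np.full(n, 1.0 / n)
        idx = rng.choice(n, p=probs)
        centroids.append(X[idx])
        d2 = np.minimum(d2, np.sum((X - X[idx]) ** 2, axis=1))
    return np.array(centroids, dtype=float)

def assign(X, centroids):
    """Return the nearest-centroid index and squared distance for each point."""
    # ||x - c||^2 = ||x||^2 - 2 x.c + ||c||^2 avoids building an (n, k, d) array.
    d2 = (
        (X ** 2).sum(axis=1)[:, None]
        - 2.0 * X @ centroids.T
        + (centroids ** 2).sum(axis=1)[None, :]
    )
    d2 = np.maximum(d2, 0.0)  # clip tiny negatives caused by round-off
    labels = d2.argmin(axis=1)
    return labels, d2[np.arange(X.shape[0]), labels]

def kmeans(X, k, n_iters=50, seed=0):
    """Run K-means; return (centroids, labels, cost)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    centroids = kmeans_pp_init(X, k, rng)
    for _ in range(n_iters):
        labels, _ = assign(X, centroids)
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                new_centroids[j] = members.mean(axis=0)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    labels, d2 = assign(X, centroids)
    # Cost: sum of squared distances from each point to its assigned centroid.
    return centroids, labels, d2.sum()
```

For MNIST, `X` would be an array of shape `(n_images, 784)`, and `k` would be at least 10 so that each digit can claim one or more clusters.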
Features
- Core K-means Implementation: Developed a full K-means algorithm to classify high-dimensional MNIST image vectors
- Centroid Initialization Optimization: Employed the K-means++ method to select initial centroids and improve overall clustering performance
- Statistical Outlier Detection: Implemented a distance-based system to identify data anomalies using a statistical threshold
- Parameter Tuning: Ran experiments to find the number of clusters and the number of iterations that minimize the cost function
- Performance and Analysis: Achieved a classification accuracy of 78% and analyzed sources of error, particularly for digits lacking closed borders
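The outlier-detection and accuracy features can be sketched as below. Both rules are assumptions made for illustration: the threshold is taken to be the mean point-to-centroid distance plus `n_std` standard deviations, and clusters are mapped to digits by majority vote over labeled training images. The report may define either step differently, and the names `flag_outliers` and `majority_vote_labels` are hypothetical.

```python
import numpy as np

def flag_outliers(X, centroids, labels, n_std=3.0):
    """Flag points unusually far from their assigned centroid.

    Assumed rule: distance > mean + n_std * std over all distances.
    """
    dists = np.linalg.norm(X - centroids[labels], axis=1)
    threshold = dists.mean() + n_std * dists.std()
    return np.flatnonzero(dists > threshold), threshold

def majority_vote_labels(cluster_ids, true_digits, k):
    """Map each cluster to the most common true digit among its members.

    Assumed labeling scheme; empty clusters are mapped to -1.
    """
    cluster_ids = np.asarray(cluster_ids)
    true_digits = np.asarray(true_digits, dtype=int)
    mapping = np.full(k, -1)
    for j in range(k):
        digits = true_digits[cluster_ids == j]
        if len(digits) > 0:
            mapping[j] = np.bincount(digits).argmax()
    return mapping
```

Given a cluster assignment for each test image, predicted digits are `mapping[cluster_ids]`, and accuracy is the fraction of predictions that match the true digits.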
Authors
- Nick Regas
- Lucas Selvik