As the name suggests, I hate the overrate of the use of machine learning to classifying MNIST data. Like, why do we need to use those fancy algorithm for this simple task? So, I propose to use the more traditional way that can be taught to first year student who are taking basic statistics. Introducing hypothesis testing classification algorithm.
This is just a metric learning. Here, we use Hotelling’s $T^2$ statistic to perform a $F$-test whether the mean difference between the new observation $x$ and known data $X_c$ is significant or not. If it is significant, then $x$ probably not in class $c$. We iterate all until we run out of class to iterate and we done. More technical detail would be
We got accuracy of 90.43% on 50% test data. Here is the visualization of what we got.

Here is also the test results for the others dataset
| Dataset | Accuracy | class |
|---|---|---|
| Fashion | 51.22% | 10 |
| Iris | 93.33% | 3 |
| Wisconsin | 76.14% | 2 |
Ok, I will let you evaluate time complexity of the said algorithm as your homework.
First and foremost, we need to focus on the theoretical justification on this method. At this point, this is a one-observation and one-sample test. We can’t do much with one-observation test as we can’t infers its distribution or its properties.
Also, the test statistic is a metric in a sense. Another approach that is relatively well-explored is to use the kernel method[1,2] where the whole goal is to find a difference between mean function value from two distributions. We can try that or go on the tangent line and try to find the non-metric version of the test statistic.
Idea that I have right now is to turn the decision tree into a series of hypothesis tests and use that as a non-metric alternative. Note that there is a need for non-metric test statistic as to classify things in a non-metric space, we need a non-metric test statistic. However, the behavior of the test in that space is needed to be explored.
Another idea is to turn the naive bayes into bayesian test. That is more staightforward. However, that should be well explored.
At this point of the work, it is just essentially try to find \(c_{x_0} = \arg\min_{\mu_c\in\boldsymbol{\mu}} (x_0 - \mu_c)^\top S_c^{-1}(x_0 - \mu_c)\) i.e., minimum Mahalanobis distance (in a sense). Other test statistic needed to be explored to make this a real work.
Code can be access here
[1] A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. Smola, ‘A Kernel Two-Sample Test’, Journal of Machine Learning Research, vol. 13, no. 25, pp. 723–773, 2012.
[2] A. Schrab, I. Kim, M. Albert, B. Laurent, B. Guedj, and A. Gretton, ‘MMD Aggregated Two-Sample Test’, Journal of Machine Learning Research, vol. 24, no. 194, pp. 1–81, 2023.