Handwritten Digit Recognition
2009/04/15
To have full control over preprocessing, we created our own dataset for handwritten digit recognition. The full preprocessing is described in technical report TR-2005-27 (see the citation below). The digits were contributed by Austrian university students as part of our 2005 lecture AI Methods for Data Analysis, which has sadly been discontinued. We scanned one page per student and automatically extracted the image data for the handwritten digits.
Please contact us if you want to set up something similar for your own lecture. We welcome contributions: with this approach, a single lecture of similar size would need around 5 years to build a dataset comparable in size to MNIST, and size seems to be the main determinant of error rate for SVM classifiers. Note that MNIST, because of its automated segmentation, has a segmentation error rate of around 1%, which makes quoted error rates below 1% quite hard to interpret.
The approach has the following advantages for you and your students.
- You can give each class their own data. Cheating is made impossible. ;-)
- Besides the pixel-based representation here, we also have code for OCR-like features. These are more amenable to feature selection, transformation and construction tasks.
- You are helping to build a resource to determine the influence of preprocessing on handwritten digit recognition.
Downloads
All files are in gzipped ARFF format (for WEKA). Please gunzip before use. Digits were downsampled to 16x16 pixels using a Mitchell filter with the blur parameter set to 2.5 (see the paper for more details). SMO -E 5 -C 10 -F gives a 6.46% error rate on these datasets (use -t train -T test). This set was contributed by students of the SS2005 class.
If you use this dataset, please cite one of the technical reports.
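If you would rather inspect the data outside WEKA, the gzipped ARFF files can also be read directly. The following is only a minimal sketch under some assumptions that are not stated above: that the files use the dense ARFF layout (one comma-separated row per digit after `@data`), that all pixel attributes are numeric, and that the class label is the last attribute. The function names `load_arff_gz` and `split_features_labels` are hypothetical helpers, not part of WEKA or of our code.

```python
import gzip
import io


def load_arff_gz(path_or_file):
    """Parse a gzipped, dense ARFF file into (attribute_names, rows).

    Assumptions (not confirmed by the dataset description): unquoted
    attribute names, no sparse '{...}' rows, no quoted values with commas.
    """
    names, rows = [], []
    in_data = False
    # gzip.open accepts a filename or a binary file object.
    with gzip.open(path_or_file, "rt", encoding="utf-8") as fh:
        for raw in fh:
            line = raw.strip()
            if not line or line.startswith("%"):  # blank line or ARFF comment
                continue
            if not in_data:
                lower = line.lower()
                if lower.startswith("@attribute"):
                    names.append(line.split()[1])
                elif lower.startswith("@data"):
                    in_data = True
                continue
            rows.append([value.strip() for value in line.split(",")])
    return names, rows


def split_features_labels(rows):
    """Assume the last column is the class label and the rest are pixels."""
    features = [[float(v) for v in row[:-1]] for row in rows]
    labels = [row[-1] for row in rows]
    return features, labels


if __name__ == "__main__":
    # Tiny in-memory stand-in for a real file, just to show the call pattern.
    sample = (
        "@relation demo\n"
        "@attribute p0 numeric\n"
        "@attribute p1 numeric\n"
        "@attribute class {0,1}\n"
        "@data\n"
        "0,255,1\n"
        "128,64,0\n"
    )
    packed = io.BytesIO(gzip.compress(sample.encode("utf-8")))
    names, rows = load_arff_gz(packed)
    X, y = split_features_labels(rows)
    print(names, X, y)
```

With a real download you would pass the file path instead of the in-memory buffer. For the 16x16 digits one would expect 256 pixel columns per row, but please check the ARFF header rather than relying on this sketch.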
Update
We also revisited some assumptions about machine learning in 2009 and found that state-of-the-art machine learning systems are just as brittle as their classical AI counterparts. Brittleness in this context means that their generalization performance over the whole task space (estimated using three distinct datasets) is very unsatisfactory: they are unable to recognize handwritten digits in general, and the models are very specific to each dataset. You can find the empirically well-founded argument in the paper. This might hold more generally as well, although that would be extremely hard to prove.
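To make the notion of brittleness concrete, one can train on each dataset in turn and test on every dataset, which yields a matrix of error rates. The sketch below illustrates only that evaluation scheme. It uses a toy nearest-centroid classifier as a stand-in (an assumption for brevity, not the SVM setup from the paper), and all function names and the `datasets` layout are hypothetical.

```python
def fit_centroids(X, y):
    """Toy classifier: store the mean feature vector of each class."""
    sums, counts = {}, {}
    for xs, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(xs))
        for i, v in enumerate(xs):
            acc[i] += v
        counts[label] = counts.get(label, 0) + 1
    return {lab: [s / counts[lab] for s in acc] for lab, acc in sums.items()}


def predict(centroids, xs):
    """Return the label whose centroid is closest in squared Euclidean distance."""
    def dist2(center):
        return sum((a - b) ** 2 for a, b in zip(xs, center))
    return min(centroids, key=lambda lab: dist2(centroids[lab]))


def error_rate(centroids, X, y):
    """Fraction of examples in (X, y) that the model misclassifies."""
    wrong = sum(predict(centroids, xs) != label for xs, label in zip(X, y))
    return wrong / len(y)


def cross_dataset_errors(datasets):
    """Train on each dataset's training split, test on every test split.

    `datasets` maps a name to {"train": (X, y), "test": (X, y)}.
    Returns {(train_name, test_name): error_rate}.
    """
    result = {}
    for train_name, train_set in datasets.items():
        model = fit_centroids(*train_set["train"])
        for test_name, test_set in datasets.items():
            result[(train_name, test_name)] = error_rate(model, *test_set["test"])
    return result
```

Entries where the training and test names match correspond to the usual within-dataset error rate. The other entries show how well a model transfers to data prepared differently, and a large gap between the two is what brittleness refers to here.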
2004/03/01
Lecture with exercise at the Institute for Medical Cybernetics and Artificial Intelligence at the Medical University of Vienna (2h/week lecture, 1h/week exercise)
2011/12/01
Seewald A.K.: On the Brittleness of Handwritten Digit Recognition Models. ISRN Machine Vision, vol. 2012, Article ID 834127, 10 pages, 2012. doi:10.5402/2012/834127.
2009/04/15
Seewald A.K.: On the Brittleness of Handwritten Digit Recognition Models. Technical Report, Seewald Solutions, Vienna, 2009.
2005/01/01
Seewald A.K.: Digits - A Dataset for Handwritten Digit Recognition. Technical Report, Austrian Research Institute for Artificial Intelligence, Vienna, TR-2005-27, 2005.