An essential step in the understanding of printed documents is their classification, i.e., grouping them according to the nature of the information they contain and their layout.
In this work we are concerned with the automatic classification of such documents. This task is usually accomplished by extracting a suitable set of low-level features from each document, which are then fed to a classifier.
The quality of the results depends primarily on the classifier, but it is also heavily influenced by the specific features used. In this work we focus on the feature extraction part and propose a method that characterizes each document based on the spatial density of black pixels and of image edges.
We assess our proposal on a real-world dataset composed of 560 invoices belonging to 68 different classes. These documents were digitized after their printed counterparts had been handled in a corporate environment, so they contain a substantial amount of noise, such as large stamps and handwritten signatures at unfortunate positions. We show that our proposal is accurate, even with a very small learning set.
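
To make the idea of density-based features concrete, the following is a minimal sketch and not the method evaluated in this work. It assumes a grayscale page given as nested lists of values in 0-255, a fixed binarization threshold, a regular grid of blocks, and a deliberately crude edge definition (a pixel whose binarized value differs from its right or bottom neighbour). The function names, the grid size, and the threshold are all hypothetical choices made for illustration.

```python
from typing import List, Sequence


def binarize(gray: Sequence[Sequence[int]], threshold: int = 128) -> List[List[int]]:
    """Map a grayscale image (0-255) to 1 for dark (black) pixels, 0 otherwise.

    The fixed threshold is an assumption made for this sketch.
    """
    return [[1 if value < threshold else 0 for value in row] for row in gray]


def edge_map(binary: Sequence[Sequence[int]]) -> List[List[int]]:
    """Mark a pixel as an edge if it differs from its right or bottom neighbour.

    This simple difference test stands in for a real edge detector.
    """
    height, width = len(binary), len(binary[0])
    edges = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            right = x + 1 < width and binary[y][x] != binary[y][x + 1]
            below = y + 1 < height and binary[y][x] != binary[y + 1][x]
            edges[y][x] = 1 if (right or below) else 0
    return edges


def block_densities(mask: Sequence[Sequence[int]], rows: int, cols: int) -> List[float]:
    """Fraction of set pixels in each cell of a rows x cols grid, in row-major order."""
    height, width = len(mask), len(mask[0])
    features = []
    for r in range(rows):
        y0, y1 = r * height // rows, (r + 1) * height // rows
        for c in range(cols):
            x0, x1 = c * width // cols, (c + 1) * width // cols
            area = (y1 - y0) * (x1 - x0)
            total = sum(sum(mask[y][x0:x1]) for y in range(y0, y1))
            features.append(total / area if area else 0.0)
    return features


def density_features(gray: Sequence[Sequence[int]], rows: int = 2, cols: int = 2,
                     threshold: int = 128) -> List[float]:
    """Concatenate black-pixel densities and edge densities into one feature vector."""
    binary = binarize(gray, threshold)
    return block_densities(binary, rows, cols) + block_densities(edge_map(binary), rows, cols)


# Toy 4x4 "page" with a black 2x2 square in the top-left corner.
page = [
    [0, 0, 255, 255],
    [0, 0, 255, 255],
    [255, 255, 255, 255],
    [255, 255, 255, 255],
]
print(density_features(page))  # [1.0, 0.0, 0.0, 0.0, 0.75, 0.0, 0.0, 0.0]
```

In this toy case the first four values are the black-pixel densities of the four blocks and the last four are the edge densities. A vector of this kind could then be passed to any standard classifier. On real scans, a proper edge detector and a finer grid would likely be preferable to the simplifications assumed here.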