Open World Classification of Printed Invoices

posted Jan 17, 2012, 4:52 AM by Eric Medvet   [ updated Dec 10, 2012, 6:26 AM ]
  • 10th ACM Symposium on Document Engineering (DocEng), 2010, Manchester (United Kingdom)
  • Enrico Sorio, Alberto Bartoli, Giorgio Davanzo, Eric Medvet
A key step in understanding printed documents is classifying them by the nature of the information they contain and by their layout. In this work we consider a dynamic scenario in which document classes are not known a priori and new classes can appear at any time. This open-world setting is both realistic and highly challenging. We use an SVM-based classifier that relies only on image-level features, and a nearest-neighbor approach for detecting new classes. We assess our proposal on a real-world dataset of 562 invoices belonging to 68 different classes. These documents were digitized after being handled in a corporate environment, so they are quite noisy: for example, they carry large stamps and handwritten signatures at unfortunate positions. The experimental results are highly promising.
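The new-class detection step can be illustrated with a minimal sketch. It is not the paper's implementation: the feature vectors, the Euclidean distance, the threshold value, and the names `classify_open_world` and `closed_set_classifier` are all assumptions made for illustration. When no closed-set classifier is supplied, the sketch falls back to the label of the nearest neighbor, which stands in for the SVM-based classifier mentioned in the abstract.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_neighbor(sample, training_set):
    """Return (distance, label) of the training item closest to `sample`.

    `training_set` is a list of (feature_vector, label) pairs.
    """
    return min(
        ((euclidean(sample, features), label) for features, label in training_set),
        key=lambda pair: pair[0],
    )


def classify_open_world(sample, training_set, threshold, closed_set_classifier=None):
    """Return a known class label, or None if `sample` looks like a new class.

    Assumption: a sample whose nearest known document is farther away than
    `threshold` is treated as belonging to a class not seen so far.
    """
    distance, nn_label = nearest_neighbor(sample, training_set)
    if distance > threshold:
        return None  # no known document is close enough: flag as new class
    if closed_set_classifier is not None:
        # Hypothetical hook where an SVM-based classifier would be called.
        return closed_set_classifier(sample)
    return nn_label  # stand-in for the closed-set classifier


# Toy feature vectors; real image-level features would have many dimensions.
training = [((0.0, 0.0), "A"), ((0.1, 0.0), "A"), ((5.0, 5.0), "B")]
known = classify_open_world((0.05, 0.1), training, threshold=1.0)
unseen = classify_open_world((10.0, 10.0), training, threshold=1.0)
print(known, unseen)
```

In practice the threshold would have to be tuned on held-out data, since it trades off missed new classes against known documents wrongly flagged as new.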