A Domain Knowledge-based Approach for Automatic Correction of Printed Invoices

posted Mar 27, 2012, 2:23 AM by Eric Medvet   [ updated Dec 10, 2012, 6:24 AM ]
  • IEEE International Conference on Information Society (iSociety), 2012, London (United Kingdom)
  • Enrico Sorio, Alberto Bartoli, Giorgio Davanzo, Eric Medvet
Although OCR technology is now commonplace, character recognition errors remain a problem, particularly in automated systems that extract information from printed documents. This paper proposes a method for automatically detecting and correcting OCR errors in an information extraction system. Our algorithm uses domain knowledge about likely character misrecognitions to propose corrections; it then exploits knowledge about the type of the extracted information to perform syntactic and semantic checks that validate the proposed corrections.
We assess our proposal on a real-world, highly challenging dataset composed of nearly 800 values extracted from approximately 100 commercial invoices, and we obtain very good results.
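To make the two-stage idea in the abstract concrete, here is a minimal Python sketch for a single field type, a date. It is an illustration under stated assumptions and not the algorithm from the paper: the confusion table `CONFUSIONS`, the `dd/mm/yyyy` format, the function names, and the rule of accepting a correction only when exactly one candidate survives are all hypothetical choices made for this example.

```python
import re
from datetime import datetime
from itertools import product

# Hypothetical confusion table (an assumption, not taken from the paper):
# maps a character as read by the OCR to the characters it may stand for.
CONFUSIONS = {
    "O": "0",
    "o": "0",
    "l": "1",
    "I": "1",
    "S": "5",
    "B": "8",
}

# Syntactic check for the assumed date format dd/mm/yyyy.
DATE_SYNTAX = re.compile(r"\d{2}/\d{2}/\d{4}")


def candidates(raw):
    """Yield every string obtained by keeping or substituting each character."""
    options = [ch + CONFUSIONS.get(ch, "") for ch in raw]
    for combo in product(*options):
        yield "".join(combo)


def is_valid_date(text):
    """Syntactic check (digit pattern), then semantic check (real calendar date)."""
    if not DATE_SYNTAX.fullmatch(text):
        return False
    try:
        datetime.strptime(text, "%d/%m/%Y")
    except ValueError:
        return False
    return True


def correct_date(raw):
    """Return the single candidate that passes both checks, or None.

    Returning None when zero or several candidates survive is a design
    choice of this sketch: the value is left for manual review.
    """
    valid = [c for c in candidates(raw) if is_valid_date(c)]
    return valid[0] if len(valid) == 1 else None


print(correct_date("l2/O3/2012"))  # 12/03/2012
print(correct_date("3l/O2/2012"))  # None: 31/02/2012 is not a real date
```

The second call shows why the semantic check matters: `31/02/2012` matches the digit pattern but is rejected because February has no 31st day. The candidate set grows exponentially with the number of confusable characters, so a practical system would need to bound or rank the candidates.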