Monday, April 09, 2007

Announcing the OCRopus Open Source OCR System



We're happy to announce the OCRopus OCR Project, a Google-sponsored project to develop advanced OCR technologies in the IUPR research group, headed by Prof. Thomas Breuel at the DFKI (German Research Center for Artificial Intelligence, Kaiserslautern, Germany).

The goal of the project is to advance the state of the art in optical character recognition and related technologies, and to deliver a high-quality OCR system suitable for document conversions, electronic libraries, vision-impaired users, historical document analysis, and general desktop use. In addition, we are structuring the system so that it will be easy for other researchers in the field to reuse.

The OCRopus engine is based on two research projects: a high-performance handwriting recognizer developed in the mid-1990s and deployed by the US Census Bureau, and novel high-performance layout analysis methods.

The project is expected to run for three years and support three Ph.D. students or postdocs. We are announcing a technology preview release of the software under the Apache license (English-only, combining the Tesseract character recognizer with IUPR layout analysis and language modeling tools); additional recognizers and functionality will follow in future releases.

The IUPR research group has extensive experience in OCR and related technologies, and will be basing the work on previous research and existing software in the area. Existing software components include high-performance handwriting recognition software that has received top evaluations by NIST and was deployed by the US Census Bureau, the recently open-sourced Tesseract OCR system, a separate Google project for probabilistic natural language modeling, and software for layout analysis and character recognition. The IUPR research group gratefully acknowledges funding by the German BMBF, the state of Rhineland-Palatinate, and other public and private partners (please see www.iupr.org for more details).

We are hoping for contributions from the open source community in areas such as adapting the system to additional languages, creating a GNOME desktop application, integrating with GNOME desktop search, web-based tools for proofing and training, language modeling, additional character recognition engines, and other useful tools and add-ons.

The project web page can be found at ocropus.org.

1 comment:

  1. Very cool. This method of picking up university projects and raising them to the state where they can be used by outsiders is an excellent lever.

    Google's asset is also its recognition as a brand, and those projects can use that as well, in more ways than one.

    Very smart on Google's part to do this.
