Thursday, July 12, 2012

Installing the tesseract OCR engine

To build the tesseract OCR engine from source on my machine, I had to jump through a few hoops, which I figure I might as well document here. Note: I am using Linux.

What I did:
  • Downloaded tesseract-3.01.tar.gz and tesseract-ocr-3.01.eng.tar.gz. Follow the instructions so that the English language files end up in the tessdata folder created when the tesseract tarball is extracted.
  • Downloaded and installed leptonica. This is listed as a dependency in the tesseract README, and was pretty straightforward: just a simple ./configure, make, make install.
  • To build tesseract itself, the standard ./configure, make, make install is not enough. You must first run ./autogen.sh. If you don't, you will get a cryptic error, something about a missing Makefile.in.
  • For autogen.sh to run, I had to sudo apt-get install autoconf automake libtool. This is not made obvious, and I had to dig around a bit to figure it out.
  • After autogen.sh ran to successful completion, ./configure, make, make install in the tesseract directory all worked fine.
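
The steps above can be collected into a rough shell sketch, mostly to make the order explicit: install the autotools packages, build leptonica, then run autogen.sh before configuring tesseract. The function names (run, build_autotools, build_tesseract), the DRY_RUN variable, and the default directory names are made up for illustration and are not part of tesseract or leptonica. Point the directories at wherever your tarballs were extracted. By default the sketch only prints the commands it would run.

```shell
#!/bin/sh
# Sketch of the build order described in this post.
# All function names, DRY_RUN, and the directory names are assumptions.

# Print a command; execute it only when DRY_RUN is explicitly set to 0.
run() {
    echo "+ $*"
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"
    fi
}

# Build and install an autotools project located in directory $1.
# If $2 is "autogen", run ./autogen.sh first so that configure and
# Makefile.in exist before ./configure is invoked.
build_autotools() {
    dir=$1
    bootstrap=${2:-}
    (
        # Only change directory for a real run; a dry run just prints.
        if [ "${DRY_RUN:-1}" = "0" ]; then
            cd "$dir" || exit 1
        fi
        echo "# in $dir"
        if [ "$bootstrap" = "autogen" ]; then
            run ./autogen.sh || exit 1
        fi
        run ./configure || exit 1
        run make || exit 1
        # Depending on the install prefix, this step may need root.
        run make install || exit 1
    )
}

# $1: extracted leptonica directory (placeholder default)
# $2: extracted tesseract directory (assumed default)
build_tesseract() {
    leptonica_dir=${1:-leptonica}
    tesseract_dir=${2:-tesseract-3.01}
    # Tools that autogen.sh needed on my machine.
    run sudo apt-get install autoconf automake libtool || return 1
    build_autotools "$leptonica_dir" || return 1
    build_autotools "$tesseract_dir" autogen || return 1
}
```

Calling build_tesseract with no DRY_RUN set prints the command list so you can check it first. Something like DRY_RUN=0 build_tesseract ./leptonica ./tesseract-3.01 would actually execute the commands. The language data step is left out on purpose, since it is just a matter of extracting the files into the tessdata folder.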