Tess4J

Tess4J is being developed and tested on Windows and Linux.

Instructions

The Tesseract and Leptonica DLLs (32- and 64-bit), language data for English, and sample images are bundled with the program. Additional language data packs for Tesseract should be decompressed and placed into the tessdata folder. The Windows native libraries were built with Visual Studio and therefore depend on the Visual C++ 2015-2022 Redistributable Packages.
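Before running OCR, a program can confirm that the language data it needs is actually in the tessdata folder. The following is a minimal sketch using only the Java standard library. The class name TessdataCheck and the helper hasLanguage are hypothetical and not part of the Tess4J API; the sketch assumes the usual Tesseract naming of language files as <lang>.traineddata (for example eng.traineddata) and a tessdata folder in the working directory.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class TessdataCheck {

    // Returns true if <tessdataDir>/<lang>.traineddata exists as a regular file.
    public static boolean hasLanguage(Path tessdataDir, String lang) {
        return Files.isRegularFile(tessdataDir.resolve(lang + ".traineddata"));
    }

    public static void main(String[] args) {
        // Assumption: the tessdata folder sits in the working directory
        // unless another location is given as the first argument.
        Path tessdata = Paths.get(args.length > 0 ? args[0] : "tessdata");
        String lang = args.length > 1 ? args[1] : "eng";

        if (hasLanguage(tessdata, lang)) {
            System.out.println("Found language data for '" + lang + "' in "
                    + tessdata.toAbsolutePath());
        } else {
            System.out.println("Missing " + lang + ".traineddata in "
                    + tessdata.toAbsolutePath());
        }
    }
}
```

Running it with a folder and a language code as arguments (for example, tessdata and deu) reports whether that language pack has been unpacked into the expected place.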

The Linux shared object library (libtesseract.so), the equivalent of the Windows DLL, can be installed or built from source by following the instructions in the Tesseract Wiki.

Tess4J can be built and unit tested using Apache Ant and JUnit. Unzip the source and run the following at the command line:

ant test

Notes: On platforms whose default charset is not UTF-8, the output text may have character encoding issues. With version 1.0, you may need to set the default character encoding for the program that calls Tess4J, either by passing the JVM the command-line option -Dfile.encoding=UTF8 or by setting the environment variable JAVA_TOOL_OPTIONS to -Dfile.encoding=UTF8. This is no longer needed in version 1.1 and later.
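Where the encoding question does come up, the JVM's default charset can be inspected at runtime, and text can be written with an explicit charset so that the result does not depend on the platform default. This is a minimal sketch using only the Java standard library; the class name EncodingCheck, the method roundTrip, and the sample string standing in for OCR output are hypothetical.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class EncodingCheck {

    // Writes text to a file as UTF-8 and reads it back as UTF-8,
    // independent of the platform's default charset.
    public static String roundTrip(String text, Path file) throws IOException {
        Files.write(file, text.getBytes(StandardCharsets.UTF_8));
        return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Show what the JVM considers its default encoding.
        System.out.println("Default charset: " + Charset.defaultCharset());
        System.out.println("file.encoding:   " + System.getProperty("file.encoding"));

        // Placeholder for text returned by an OCR call.
        String ocrText = "na\u00efve caf\u00e9";
        Path out = Files.createTempFile("ocr", ".txt");
        System.out.println("Round trip intact: " + ocrText.equals(roundTrip(ocrText, out)));
    }
}
```

Passing the charset explicitly when reading and writing keeps non-ASCII characters intact regardless of how the JVM was started.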

Support for PDF documents is available through PDFBox.

Images intended for OCR should have a resolution of at least 200 DPI, typically 300 DPI, and be 1 bpp (bit per pixel) monochrome or 8 bpp grayscale in uncompressed TIFF or PNG format. PNG files are usually smaller than other image formats yet retain high quality because PNG uses lossless compression; TIFF has the advantage that a single file can contain multiple images (pages).
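One way to bring an arbitrary image closer to these recommendations is to redraw it as 8 bpp grayscale and save it as PNG. The sketch below uses only java.awt and javax.imageio from the standard library; the class name OcrImagePrep and the helpers toGray and dpi are hypothetical, and the dpi helper assumes the physical width of the scanned page is known.

```java
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import javax.imageio.ImageIO;

public class OcrImagePrep {

    // Redraws the source image into an 8 bpp grayscale buffer of the same size.
    public static BufferedImage toGray(BufferedImage src) {
        BufferedImage gray = new BufferedImage(
                src.getWidth(), src.getHeight(), BufferedImage.TYPE_BYTE_GRAY);
        Graphics2D g = gray.createGraphics();
        try {
            g.drawImage(src, 0, 0, null);
        } finally {
            g.dispose();
        }
        return gray;
    }

    // Effective resolution, given the pixel width and the physical page width in inches.
    public static double dpi(int widthPixels, double widthInches) {
        return widthPixels / widthInches;
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.err.println("Usage: java OcrImagePrep <input-image> <output.png>");
            return;
        }
        BufferedImage src = ImageIO.read(new File(args[0]));
        if (src == null) {
            throw new IOException("Unsupported image format: " + args[0]);
        }
        BufferedImage gray = toGray(src);
        ImageIO.write(gray, "png", new File(args[1]));
        System.out.println("Bits per pixel: " + gray.getColorModel().getPixelSize());
    }
}
```

For example, a page 8.5 inches wide scanned at 2550 pixels across gives dpi(2550, 8.5) = 300, which matches the typical resolution mentioned above.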

