Tess4J is developed and tested on Windows and Linux.
Tesseract, Leptonica, and Ghostscript 32- and 64-bit DLLs, language data for
English, and sample images are bundled with the program.
Language data packs for Tesseract should be decompressed and placed into the
tessdata folder. The Windows native libraries were built with VS2015 and therefore
depend on the Visual C++ 2015 Redistributable Packages.
Notes: On platforms that do not have UTF-8 as their
default charset, the output text may have character encoding issues. You may need
to set the default character encoding for your program that calls Tess4J by supplying
the JVM with the command-line option
-Dfile.encoding=UTF8 or setting
the environment variable
for version 1.0. This is no longer needed for version 1.1 and later.
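Independently of the JVM default, a calling program can avoid encoding surprises by naming UTF-8 explicitly whenever it writes or reads recognized text. The following is a minimal sketch using only the Java standard library; the class name Utf8OutputSketch, its helper methods, and the sample string are illustrative assumptions and are not part of Tess4J.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of Tess4J: stores OCR text as UTF-8
// without relying on the platform default charset.
final class Utf8OutputSketch {

    /** Writes the given text to the target file encoded as UTF-8. */
    static void writeUtf8(Path target, String text) throws IOException {
        Files.write(target, text.getBytes(StandardCharsets.UTF_8));
    }

    /** Reads the whole file back, decoding it as UTF-8. */
    static String readUtf8(Path source) throws IOException {
        return new String(Files.readAllBytes(source), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Shows what the JVM would use if no charset were named explicitly.
        System.out.println("Default charset: " + Charset.defaultCharset());

        // Stand-in for text returned by an OCR call; Unicode escapes keep
        // this source file itself independent of the compiler's encoding.
        String recognized = "Gr\u00f6\u00dfe: 10 \u00b5m";

        Path out = Files.createTempFile("ocr-output", ".txt");
        try {
            writeUtf8(out, recognized);
            System.out.println("Round trip intact: " + readUtf8(out).equals(recognized));
        } finally {
            Files.delete(out);
        }
    }
}
```

Naming the charset at each I/O boundary keeps the result the same whether or not the JVM was started with a file.encoding option.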
Support for PDF documents is available through GPL Ghostscript, which should be installed and included in the system path.
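One rough way to check the path requirement is to scan the PATH entries for a Ghostscript executable. The sketch below uses only the Java standard library. The class GhostscriptPathCheckSketch is hypothetical, and the candidate file names (gs, gswin64c.exe, gswin32c.exe) are assumptions based on common Ghostscript installations; Tess4J may instead look for the Ghostscript shared library, so treat a hit as a hint rather than a guarantee.

```java
import java.io.File;
import java.nio.file.Files;
import java.nio.file.InvalidPathException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.regex.Pattern;

// Hypothetical diagnostic, not part of Tess4J.
final class GhostscriptPathCheckSketch {

    // Assumed executable names; adjust them for the local installation.
    static final List<String> CANDIDATE_NAMES =
            Arrays.asList("gs", "gswin64c.exe", "gswin32c.exe");

    /** Returns the first regular file with one of the given names found in pathValue. */
    static Optional<Path> findOnPath(String pathValue, List<String> names) {
        if (pathValue == null || pathValue.isEmpty()) {
            return Optional.empty();
        }
        for (String dir : pathValue.split(Pattern.quote(File.pathSeparator))) {
            if (dir.isEmpty()) {
                continue;
            }
            for (String name : names) {
                try {
                    Path candidate = Paths.get(dir, name);
                    if (Files.isRegularFile(candidate)) {
                        return Optional.of(candidate);
                    }
                } catch (InvalidPathException e) {
                    // Skip malformed PATH entries (for example, quoted directories).
                }
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        Optional<Path> hit = findOnPath(System.getenv("PATH"), CANDIDATE_NAMES);
        System.out.println(hit.map(p -> "Found: " + p)
                .orElse("No Ghostscript executable found on PATH"));
    }
}
```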
Images intended for OCR should have a resolution of at least 200 DPI, typically 300 DPI, and be stored as 1 bpp (bit per pixel) monochrome or 8 bpp grayscale in uncompressed TIFF or PNG format. PNG files are usually smaller than other image formats while keeping high quality, because PNG uses lossless data compression; TIFF has the advantage that a single file can contain multiple images (pages).
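As an illustration of these recommendations, the sketch below converts an image to 8 bpp grayscale, saves it as PNG, and computes the effective resolution of a scan from its pixel width and physical width. It uses only the Java standard library (java.awt.image and javax.imageio); the class OcrImagePrepSketch and its method names are hypothetical and not part of Tess4J. Note that this simple ImageIO.write call does not store resolution metadata in the PNG.

```java
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import javax.imageio.ImageIO;

// Hypothetical preprocessing helper, not part of Tess4J.
final class OcrImagePrepSketch {

    /** Returns an 8 bpp grayscale copy of the source image. */
    static BufferedImage toGray8(BufferedImage source) {
        BufferedImage gray = new BufferedImage(
                source.getWidth(), source.getHeight(), BufferedImage.TYPE_BYTE_GRAY);
        Graphics2D g = gray.createGraphics();
        try {
            // Drawing into a TYPE_BYTE_GRAY image performs the color conversion.
            g.drawImage(source, 0, 0, null);
        } finally {
            g.dispose();
        }
        return gray;
    }

    /** Effective resolution of a scan: pixels divided by physical size in inches. */
    static double effectiveDpi(int pixelWidth, double physicalWidthInches) {
        return pixelWidth / physicalWidthInches;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("Usage: java OcrImagePrepSketch <input-image> <output.png>");
            return;
        }
        BufferedImage input = ImageIO.read(new File(args[0]));
        if (input == null) {
            System.err.println("Unsupported or unreadable image: " + args[0]);
            return;
        }
        // PNG is lossless, so the grayscale pixels are stored unchanged.
        ImageIO.write(toGray8(input), "png", new File(args[1]));
    }
}
```

For example, a page 8.5 inches wide scanned to 2550 pixels across gives effectiveDpi(2550, 8.5) = 300, which matches the typical resolution mentioned above.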