Tess4J is being developed and tested on Windows and Linux.
Instructions
The 32- and 64-bit DLLs for Tesseract and Leptonica, language data for English,
and sample images are bundled with the program. Language
data packs for Tesseract should be decompressed and placed into
the tessdata folder. The Windows native libraries were
built with Visual Studio and therefore depend on the Visual C++ 2015-2022 Redistributable
Packages.
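To confirm that a language data pack landed in the right place, the tessdata folder can be inspected before running OCR. The following is a minimal sketch that uses only the JDK, not Tess4J itself. It assumes Tesseract's usual file naming, where each language is stored as <code>.traineddata (for example eng.traineddata); the class and method names are hypothetical.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class TessdataCheck {
    // Assumption: Tesseract language files are named <code>.traineddata.
    static final String SUFFIX = ".traineddata";

    // Returns the language code for a data file name, or null if the
    // name does not look like a language data file.
    static String languageCode(String fileName) {
        if (!fileName.endsWith(SUFFIX) || fileName.length() == SUFFIX.length()) {
            return null;
        }
        return fileName.substring(0, fileName.length() - SUFFIX.length());
    }

    // Lists the language codes found directly inside the given folder.
    // Returns an empty list if the folder is missing or unreadable.
    static List<String> installedLanguages(File tessdataDir) {
        List<String> langs = new ArrayList<>();
        File[] files = tessdataDir.listFiles();
        if (files == null) {
            return langs;
        }
        for (File f : files) {
            String code = f.isFile() ? languageCode(f.getName()) : null;
            if (code != null) {
                langs.add(code);
            }
        }
        Collections.sort(langs);
        return langs;
    }

    public static void main(String[] args) {
        // Defaults to a "tessdata" folder in the working directory.
        File dir = new File(args.length > 0 ? args[0] : "tessdata");
        System.out.println("Language data found in " + dir + ": " + installedLanguages(dir));
    }
}
```

An empty list usually means the data pack was unpacked into a different directory or is still compressed.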
The Linux shared object library (libtesseract.so), the equivalent of the DLL, can be installed or built from source by following the instructions in the Tesseract Wiki.
Tess4J can be built and unit tested using Apache Ant and JUnit. Unzip the source and run the following at the command line:
ant test
Notes: On platforms that do not have
UTF-8 as their default charset, the output text may have character
encoding issues. With version 1.0, you may need to set the default character
encoding for the program that calls Tess4J, either by supplying the JVM with the
command-line option -Dfile.encoding=UTF8 or by setting the
environment variable JAVA_TOOL_OPTIONS to -Dfile.encoding=UTF8.
This is no longer needed for version 1.1 and later.
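To see whether a given platform is affected, the JVM's default charset can be printed at startup. The sketch below uses only the JDK and does not call Tess4J; the class and method names are hypothetical. It also shows how raw bytes can be decoded explicitly as UTF-8, which avoids depending on the platform default at all.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetCheck {
    // Decodes bytes as UTF-8 regardless of the platform default charset.
    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // What the JVM will use when no charset is given explicitly.
        System.out.println("Default charset: " + Charset.defaultCharset());
        System.out.println("file.encoding:   " + System.getProperty("file.encoding"));

        // Round trip of a non-ASCII string ("Gruesse" with u-umlaut and sharp s),
        // written with Unicode escapes so the source file encoding does not matter.
        String text = "Gr\u00FC\u00DFe";
        byte[] raw = text.getBytes(StandardCharsets.UTF_8);
        System.out.println("Round trip intact: " + text.equals(decodeUtf8(raw)));
    }
}
```

Running it once with and once without -Dfile.encoding=UTF8 shows whether the option changes the reported default on that JVM.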
Support for PDF documents is available through PDFBox.
Images intended for OCR should have a resolution of at least 200 DPI, typically 300 DPI, and be stored as 1 bpp (bit per pixel) monochrome or 8 bpp grayscale in uncompressed TIFF or PNG format. PNG files are usually smaller than other image formats while keeping full quality, because PNG uses lossless data compression; TIFF has the advantage that a single file can contain multiple images (pages).
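As an illustration of these recommendations, the sketch below converts an arbitrary image to 8 bpp grayscale and writes it as PNG using only the JDK's ImageIO; it is one possible preprocessing step, not part of Tess4J. The class name, method names, and the page-size helper are hypothetical.

```java
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import javax.imageio.ImageIO;

public class OcrImagePrep {
    // Redraws the source image onto an 8 bpp grayscale canvas of the same size.
    static BufferedImage toGrayscale(BufferedImage src) {
        BufferedImage gray = new BufferedImage(
                src.getWidth(), src.getHeight(), BufferedImage.TYPE_BYTE_GRAY);
        Graphics2D g = gray.createGraphics();
        g.drawImage(src, 0, 0, null);
        g.dispose();
        return gray;
    }

    // Pixels needed along one side to scan the given length at the given DPI.
    static int pixelsFor(double inches, int dpi) {
        return (int) Math.ceil(inches * dpi);
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("Usage: java OcrImagePrep <input-image> <output.png>");
            return;
        }
        BufferedImage src = ImageIO.read(new File(args[0]));
        if (src == null) {
            System.err.println("Unsupported or unreadable image: " + args[0]);
            return;
        }
        // PNG is lossless, so the grayscale pixels are stored unchanged.
        ImageIO.write(toGrayscale(src), "png", new File(args[1]));
        // Example of the DPI arithmetic: a US Letter page (8.5 x 11 in) at 300 DPI.
        System.out.println("Letter page at 300 DPI: "
                + pixelsFor(8.5, 300) + " x " + pixelsFor(11, 300) + " pixels");
    }
}
```

The conversion changes only the color depth. It does not add resolution, so an image scanned below 200 DPI should be rescanned rather than upscaled.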