Class Tesseract

  • All Implemented Interfaces:
    ITesseract

    public class Tesseract
    extends java.lang.Object
    implements ITesseract
    An object layer on top of TessAPI that provides character recognition support for common image formats and for multi-page TIFF images, beyond the uncompressed, binary TIFF format supported by the Tesseract OCR engine. The extended capabilities are provided by the Java Advanced Imaging Image I/O Tools.

    Support for PDF documents is available through Ghost4J, a JNA wrapper for GPL Ghostscript, which should be installed and included in the system path. If Ghostscript is not available, PDFBox is used instead.

    Any program that uses the library will need to ensure that the required libraries (the .jar files for jna, jai-imageio, and ghost4j) are in its compile and run-time classpath.
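
As a quick sanity check for the classpath requirement above, the following sketch probes for one class per library using only the standard library. The probed class names (`net.sourceforge.tess4j.Tesseract`, `com.sun.jna.Native`) are assumptions about what the packaged jars contain, and `ClasspathCheck` is a hypothetical helper, not part of the library.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class ClasspathCheck {
    /** Returns true if the named class can be located (without initializing it). */
    static boolean isOnClasspath(String className) {
        try {
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Assumed representative class per library; adjust to the jars you ship.
        Map<String, String> probes = new LinkedHashMap<>();
        probes.put("tess4j", "net.sourceforge.tess4j.Tesseract");
        probes.put("jna", "com.sun.jna.Native");
        probes.forEach((lib, cls) ->
            System.out.println(lib + ": " + (isOnClasspath(cls) ? "found" : "MISSING")));
    }
}
```

Running this at startup gives a clearer message than a `NoClassDefFoundError` raised later during recognition.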
    • Constructor Summary

      Constructors 
      Constructor Description
      Tesseract()  
    • Method Summary

      Modifier and Type Method Description
      void createDocuments​(java.lang.String[] filenames, java.lang.String[] outputbases, java.util.List<ITesseract.RenderedFormat> formats)
      Creates documents for the given renderers.
      void createDocuments​(java.lang.String filename, java.lang.String outputbase, java.util.List<ITesseract.RenderedFormat> formats)
      Creates documents for the given renderers.
      java.util.List<OCRResult> createDocumentsWithResults​(java.awt.image.BufferedImage[] bis, java.lang.String[] filenames, java.lang.String[] outputbases, java.util.List<ITesseract.RenderedFormat> formats, int pageIteratorLevel)
      Creates documents with OCR results for given renderers at specified page iterator level.
      OCRResult createDocumentsWithResults​(java.awt.image.BufferedImage bi, java.lang.String filename, java.lang.String outputbase, java.util.List<ITesseract.RenderedFormat> formats, int pageIteratorLevel)
      Creates documents with OCR result for given renderers at specified page iterator level.
      java.util.List<OCRResult> createDocumentsWithResults​(java.lang.String[] filenames, java.lang.String[] outputbases, java.util.List<ITesseract.RenderedFormat> formats, int pageIteratorLevel)
      Creates documents with OCR results for given renderers at specified page iterator level.
      OCRResult createDocumentsWithResults​(java.lang.String filename, java.lang.String outputbase, java.util.List<ITesseract.RenderedFormat> formats, int pageIteratorLevel)
      Creates documents with OCR result for given renderers at specified page iterator level.
      protected void dispose()
      Releases all of the native resources used by this instance.
      java.lang.String doOCR​(int xsize, int ysize, java.nio.ByteBuffer buf, java.awt.Rectangle rect, int bpp)
      Performs OCR operation.
      java.lang.String doOCR​(int xsize, int ysize, java.nio.ByteBuffer buf, java.lang.String filename, java.awt.Rectangle rect, int bpp)
      Performs OCR operation.
      java.lang.String doOCR​(java.awt.image.BufferedImage bi)
      Performs OCR operation.
      java.lang.String doOCR​(java.awt.image.BufferedImage bi, java.awt.Rectangle rect)
      Performs OCR operation.
      java.lang.String doOCR​(java.io.File imageFile)
      Performs OCR operation.
      java.lang.String doOCR​(java.io.File inputFile, java.awt.Rectangle rect)
      Performs OCR operation.
      java.lang.String doOCR​(java.util.List<javax.imageio.IIOImage> imageList, java.awt.Rectangle rect)
      Performs OCR operation.
      java.lang.String doOCR​(java.util.List<javax.imageio.IIOImage> imageList, java.lang.String filename, java.awt.Rectangle rect)
      Performs OCR operation.
      protected TessAPI getAPI()
      Returns TessAPI object.
      protected ITessAPI.TessBaseAPI getHandle()
      Returns API handle.
      protected java.lang.String getOCRText​(java.lang.String filename, int pageNum)
      Gets recognized text.
      java.util.List<java.awt.Rectangle> getSegmentedRegions​(java.awt.image.BufferedImage bi, int pageIteratorLevel)
      Gets segmented regions at specified page iterator level.
      java.util.List<Word> getWords​(java.awt.image.BufferedImage bi, int pageIteratorLevel)
      Gets recognized words at specified page iterator level.
      protected void init()
      Initializes Tesseract engine.
      void setConfigs​(java.util.List<java.lang.String> configs)
      Sets configs to be passed to Tesseract's Init method.
      void setDatapath​(java.lang.String datapath)
      Sets path to tessdata.
      void setHocr​(boolean hocr)
      Enables or disables hOCR output.
      protected void setImage​(int xsize, int ysize, java.nio.ByteBuffer buf, java.awt.Rectangle rect, int bpp)
      Sets image to be processed.
      protected void setImage​(java.awt.image.RenderedImage image, java.awt.Rectangle rect)
      Sets image to be processed.
      void setLanguage​(java.lang.String language)
      Sets language for OCR.
      void setOcrEngineMode​(int ocrEngineMode)
      Sets OCR engine mode.
      void setPageSegMode​(int mode)
      Sets page segmentation mode.
      void setTessVariable​(java.lang.String key, java.lang.String value)
      Sets the value of a Tesseract internal parameter.
      protected void setTessVariables()
      Sets Tesseract's internal parameters.
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Constructor Detail

      • Tesseract

        public Tesseract()
    • Method Detail

      • getAPI

        protected TessAPI getAPI()
        Returns TessAPI object.
        Returns:
        api
      • setDatapath

        public void setDatapath​(java.lang.String datapath)
        Sets path to tessdata.
        Specified by:
        setDatapath in interface ITesseract
        Parameters:
        datapath - the tessdata path to set
      • setLanguage

        public void setLanguage​(java.lang.String language)
        Sets language for OCR.
        Specified by:
        setLanguage in interface ITesseract
        Parameters:
        language - the language code, which follows the ISO 639-3 standard.
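
To illustrate the three-letter codes expected by `setLanguage`, here is a small sketch that derives one from a `java.util.Locale`. Note the assumption: `Locale.getISO3Language()` returns ISO 639-2/T codes, which coincide with ISO 639-3 for many individual languages (such as English, German and French), but the names of installed language data files may differ, so check the result against the contents of your tessdata directory. `LanguageCodes` is a hypothetical helper.

```java
import java.util.Locale;

class LanguageCodes {
    /** Maps a Locale to its three-letter language code. */
    static String threeLetterCode(Locale locale) {
        return locale.getISO3Language();
    }

    public static void main(String[] args) {
        // With a Tesseract instance named tesseract, the code would be applied as:
        //   tesseract.setLanguage(threeLetterCode(Locale.GERMAN));
        System.out.println(threeLetterCode(Locale.ENGLISH));
        System.out.println(threeLetterCode(Locale.GERMAN));
    }
}
```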
      • setOcrEngineMode

        public void setOcrEngineMode​(int ocrEngineMode)
        Sets OCR engine mode.
        Specified by:
        setOcrEngineMode in interface ITesseract
        Parameters:
        ocrEngineMode - the OcrEngineMode to set
      • setPageSegMode

        public void setPageSegMode​(int mode)
        Sets page segmentation mode.
        Specified by:
        setPageSegMode in interface ITesseract
        Parameters:
        mode - the page segmentation mode to set
      • setHocr

        public void setHocr​(boolean hocr)
        Enables or disables hOCR output.
        Parameters:
        hocr - true to enable hOCR output, false to disable it
      • setTessVariable

        public void setTessVariable​(java.lang.String key,
                                    java.lang.String value)
        Sets the value of a Tesseract internal parameter.
        Specified by:
        setTessVariable in interface ITesseract
        Parameters:
        key - variable name, e.g., tessedit_create_hocr, tessedit_char_whitelist, etc.
        value - value for corresponding variable, e.g., "1", "0", "0123456789", etc.
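
Because both the key and the value are strings, boolean and numeric settings have to be encoded as text. The sketch below collects the two variables named above into a map; `TessVariables`, `flag` and `digitsOnly` are hypothetical names used only for illustration, and the loop that would apply them to a `Tesseract` instance is shown as a comment.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class TessVariables {
    /** Encodes a boolean as the "1" / "0" strings used for on/off parameters. */
    static String flag(boolean on) {
        return on ? "1" : "0";
    }

    /** Collects settings for a digits-only run, kept in insertion order. */
    static Map<String, String> digitsOnly(boolean hocr) {
        Map<String, String> vars = new LinkedHashMap<>();
        vars.put("tessedit_char_whitelist", "0123456789");
        vars.put("tessedit_create_hocr", flag(hocr));
        return vars;
    }

    public static void main(String[] args) {
        // With a Tesseract instance named tesseract, each entry would be applied as:
        //   tesseract.setTessVariable(key, value);
        digitsOnly(false).forEach((key, value) -> System.out.println(key + "=" + value));
    }
}
```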
      • setConfigs

        public void setConfigs​(java.util.List<java.lang.String> configs)
        Sets configs to be passed to Tesseract's Init method.
        Specified by:
        setConfigs in interface ITesseract
        Parameters:
        configs - list of config filenames, e.g., "digits", "bazaar", "quiet"
      • doOCR

        public java.lang.String doOCR​(java.io.File imageFile)
                               throws TesseractException
        Performs OCR operation.
        Specified by:
        doOCR in interface ITesseract
        Parameters:
        imageFile - an image file
        Returns:
        the recognized text
        Throws:
        TesseractException
      • doOCR

        public java.lang.String doOCR​(java.io.File inputFile,
                                      java.awt.Rectangle rect)
                               throws TesseractException
        Performs OCR operation.
        Specified by:
        doOCR in interface ITesseract
        Parameters:
        inputFile - an image file
        rect - the bounding rectangle that defines the region of the image to be recognized. A null rectangle or one of zero dimension indicates the whole image.
        Returns:
        the recognized text
        Throws:
        TesseractException
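
The rule for `rect` can be sketched as a small helper that resolves the region actually examined. This is an illustration only: it assumes "zero dimension" corresponds to `Rectangle.isEmpty()` (width or height not positive), and the clipping to the image bounds is an addition of this sketch, not documented library behavior. `RegionOfInterest` is a hypothetical name.

```java
import java.awt.Rectangle;

class RegionOfInterest {
    /** Resolves a possibly null or empty rectangle to a concrete region of the image. */
    static Rectangle effectiveRegion(Rectangle rect, int imageWidth, int imageHeight) {
        Rectangle whole = new Rectangle(0, 0, imageWidth, imageHeight);
        if (rect == null || rect.isEmpty()) {
            return whole; // null or zero-sized: use the whole image
        }
        return rect.intersection(whole); // clip to the image bounds (sketch-only choice)
    }

    public static void main(String[] args) {
        System.out.println(effectiveRegion(null, 100, 50));
        System.out.println(effectiveRegion(new Rectangle(10, 10, 200, 200), 100, 50));
    }
}
```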
      • doOCR

        public java.lang.String doOCR​(java.awt.image.BufferedImage bi)
                               throws TesseractException
        Performs OCR operation.
        Specified by:
        doOCR in interface ITesseract
        Parameters:
        bi - a buffered image
        Returns:
        the recognized text
        Throws:
        TesseractException
      • doOCR

        public java.lang.String doOCR​(java.awt.image.BufferedImage bi,
                                      java.awt.Rectangle rect)
                               throws TesseractException
        Performs OCR operation.
        Specified by:
        doOCR in interface ITesseract
        Parameters:
        bi - a buffered image
        rect - the bounding rectangle that defines the region of the image to be recognized. A null rectangle or one of zero dimension indicates the whole image.
        Returns:
        the recognized text
        Throws:
        TesseractException
      • doOCR

        public java.lang.String doOCR​(java.util.List<javax.imageio.IIOImage> imageList,
                                      java.awt.Rectangle rect)
                               throws TesseractException
        Performs OCR operation.
        Specified by:
        doOCR in interface ITesseract
        Parameters:
        imageList - a list of IIOImage objects
        rect - the bounding rectangle that defines the region of the image to be recognized. A null rectangle or one of zero dimension indicates the whole image.
        Returns:
        the recognized text
        Throws:
        TesseractException
      • doOCR

        public java.lang.String doOCR​(java.util.List<javax.imageio.IIOImage> imageList,
                                      java.lang.String filename,
                                      java.awt.Rectangle rect)
                               throws TesseractException
        Performs OCR operation.
        Specified by:
        doOCR in interface ITesseract
        Parameters:
        imageList - a list of IIOImage objects
        filename - input file name. Needed only for training and reading a UNLV zone file.
        rect - the bounding rectangle that defines the region of the image to be recognized. A null rectangle or one of zero dimension indicates the whole image.
        Returns:
        the recognized text
        Throws:
        TesseractException
      • doOCR

        public java.lang.String doOCR​(int xsize,
                                      int ysize,
                                      java.nio.ByteBuffer buf,
                                      java.awt.Rectangle rect,
                                      int bpp)
                               throws TesseractException
        Performs OCR operation. Use SetImage, (optionally) SetRectangle, and one or more of the Get*Text functions.
        Specified by:
        doOCR in interface ITesseract
        Parameters:
        xsize - width of image
        ysize - height of image
        buf - pixel data
        rect - the bounding rectangle that defines the region of the image to be recognized. A null rectangle or one of zero dimension indicates the whole image.
        bpp - bits per pixel, i.e. the bit depth of the image: 1 for binary bitmap, 8 for grayscale, and 24 for RGB color.
        Returns:
        the recognized text
        Throws:
        TesseractException
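
To make the raw-buffer parameters concrete, the sketch below prepares `xsize`, `ysize`, `buf` and `bpp` from an 8-bit grayscale `BufferedImage` using only the standard library. It assumes a freshly created `TYPE_BYTE_GRAY` image whose rows are tightly packed, and it copies the pixels into a direct buffer on the assumption that the native layer prefers one. `RawImageData` and its methods are hypothetical helpers.

```java
import java.awt.image.BufferedImage;
import java.awt.image.DataBufferByte;
import java.nio.ByteBuffer;

class RawImageData {
    /** Bytes needed for one row of pixels at the given bit depth (rounded up). */
    static int bytesPerLine(int xsize, int bpp) {
        return (xsize * bpp + 7) / 8;
    }

    /** Copies the pixels of an 8-bit grayscale image into a direct buffer. */
    static ByteBuffer grayPixels(BufferedImage gray) {
        if (gray.getType() != BufferedImage.TYPE_BYTE_GRAY) {
            throw new IllegalArgumentException("expected TYPE_BYTE_GRAY");
        }
        byte[] data = ((DataBufferByte) gray.getRaster().getDataBuffer()).getData();
        ByteBuffer buf = ByteBuffer.allocateDirect(data.length);
        buf.put(data);
        buf.flip(); // rewind so the consumer reads from the first pixel
        return buf;
    }

    public static void main(String[] args) {
        BufferedImage img = new BufferedImage(64, 32, BufferedImage.TYPE_BYTE_GRAY);
        ByteBuffer buf = grayPixels(img);
        // With a Tesseract instance named tesseract, the call would look like:
        //   tesseract.doOCR(img.getWidth(), img.getHeight(), buf, null, 8);
        System.out.println(buf.remaining() + " bytes, " + bytesPerLine(64, 8) + " per line");
    }
}
```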
      • doOCR

        public java.lang.String doOCR​(int xsize,
                                      int ysize,
                                      java.nio.ByteBuffer buf,
                                      java.lang.String filename,
                                      java.awt.Rectangle rect,
                                      int bpp)
                               throws TesseractException
        Performs OCR operation. Use SetImage, (optionally) SetRectangle, and one or more of the Get*Text functions.
        Specified by:
        doOCR in interface ITesseract
        Parameters:
        xsize - width of image
        ysize - height of image
        buf - pixel data
        filename - input file name. Needed only for training and reading a UNLV zone file.
        rect - the bounding rectangle that defines the region of the image to be recognized. A null rectangle or one of zero dimension indicates the whole image.
        bpp - bits per pixel, i.e. the bit depth of the image: 1 for binary bitmap, 8 for grayscale, and 24 for RGB color.
        Returns:
        the recognized text
        Throws:
        TesseractException
      • init

        protected void init()
        Initializes Tesseract engine.
      • setTessVariables

        protected void setTessVariables()
        Sets Tesseract's internal parameters.
      • setImage

        protected void setImage​(java.awt.image.RenderedImage image,
                                java.awt.Rectangle rect)
                         throws java.io.IOException
        Sets image to be processed.
        Parameters:
        image - a rendered image
        rect - region of interest
        Throws:
        java.io.IOException
      • setImage

        protected void setImage​(int xsize,
                                int ysize,
                                java.nio.ByteBuffer buf,
                                java.awt.Rectangle rect,
                                int bpp)
        Sets image to be processed.
        Parameters:
        xsize - width of image
        ysize - height of image
        buf - pixel data
        rect - the bounding rectangle that defines the region of the image to be recognized. A null rectangle or one of zero dimension indicates the whole image.
        bpp - bits per pixel, i.e. the bit depth of the image: 1 for binary bitmap, 8 for grayscale, and 24 for RGB color.
      • getOCRText

        protected java.lang.String getOCRText​(java.lang.String filename,
                                              int pageNum)
        Gets recognized text.
        Parameters:
        filename - input file name. Needed only for reading a UNLV zone file.
        pageNum - page number; needed for hocr paging.
        Returns:
        the recognized text
      • createDocuments

        public void createDocuments​(java.lang.String[] filenames,
                                    java.lang.String[] outputbases,
                                    java.util.List<ITesseract.RenderedFormat> formats)
                             throws TesseractException
        Creates documents for the given renderers.
        Specified by:
        createDocuments in interface ITesseract
        Parameters:
        filenames - array of input files
        outputbases - array of output filenames without extension
        formats - types of renderer
        Throws:
        TesseractException
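
Since `outputbases` are file names without an extension, one simple way to build them is to strip the extension from each input name. The sketch below does this with plain string handling; `OutputBases` is a hypothetical helper, and writing the outputs next to the inputs is a choice of this example rather than a requirement of the method.

```java
class OutputBases {
    /** Removes the final extension from a file name, leaving directories untouched. */
    static String stripExtension(String filename) {
        int slash = Math.max(filename.lastIndexOf('/'), filename.lastIndexOf('\\'));
        int dot = filename.lastIndexOf('.');
        // Only strip when the dot belongs to the file name and is not its first character.
        return dot > slash + 1 ? filename.substring(0, dot) : filename;
    }

    /** Builds one output base per input file, in the same order. */
    static String[] forInputs(String[] filenames) {
        String[] bases = new String[filenames.length];
        for (int i = 0; i < filenames.length; i++) {
            bases[i] = stripExtension(filenames[i]);
        }
        return bases;
    }

    public static void main(String[] args) {
        String[] inputs = {"scan/page1.tif", "scan/page2.png"};
        // With a Tesseract instance named tesseract and a list of formats:
        //   tesseract.createDocuments(inputs, forInputs(inputs), formats);
        for (String base : forInputs(inputs)) {
            System.out.println(base);
        }
    }
}
```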
      • getSegmentedRegions

        public java.util.List<java.awt.Rectangle> getSegmentedRegions​(java.awt.image.BufferedImage bi,
                                                                      int pageIteratorLevel)
                                                               throws TesseractException
        Gets segmented regions at specified page iterator level.
        Specified by:
        getSegmentedRegions in interface ITesseract
        Parameters:
        bi - input buffered image
        pageIteratorLevel - TessPageIteratorLevel enum
        Returns:
        list of Rectangle
        Throws:
        TesseractException
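
The returned rectangles are ordinary `java.awt.Rectangle` objects, so they can be post-processed with the standard library. As a minimal sketch, the helper below sorts a list of regions top to bottom, then left to right. The order in which the library returns regions is not stated here, and this simple sort is not a substitute for real layout analysis; `RegionOrder` is a hypothetical name.

```java
import java.awt.Rectangle;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class RegionOrder {
    /** Returns a copy of the regions sorted by top edge, then by left edge. */
    static List<Rectangle> topToBottom(List<Rectangle> regions) {
        List<Rectangle> sorted = new ArrayList<>(regions);
        sorted.sort(Comparator.comparingInt((Rectangle r) -> r.y)
                              .thenComparingInt(r -> r.x));
        return sorted;
    }

    public static void main(String[] args) {
        List<Rectangle> regions = new ArrayList<>();
        regions.add(new Rectangle(50, 40, 30, 10));
        regions.add(new Rectangle(10, 40, 30, 10));
        regions.add(new Rectangle(10, 5, 80, 10));
        // In practice the list would come from tesseract.getSegmentedRegions(bi, level).
        topToBottom(regions).forEach(System.out::println);
    }
}
```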
      • getWords

        public java.util.List<Word> getWords​(java.awt.image.BufferedImage bi,
                                             int pageIteratorLevel)
        Gets recognized words at specified page iterator level.
        Specified by:
        getWords in interface ITesseract
        Parameters:
        bi - input buffered image
        pageIteratorLevel - TessPageIteratorLevel enum
        Returns:
        list of Word
      • createDocumentsWithResults

        public OCRResult createDocumentsWithResults​(java.awt.image.BufferedImage bi,
                                                    java.lang.String filename,
                                                    java.lang.String outputbase,
                                                    java.util.List<ITesseract.RenderedFormat> formats,
                                                    int pageIteratorLevel)
                                             throws TesseractException
        Creates documents with OCR result for given renderers at specified page iterator level.
        Specified by:
        createDocumentsWithResults in interface ITesseract
        Parameters:
        bi - input buffered image
        filename - filename (optional)
        outputbase - output filename without extension
        formats - types of renderer
        pageIteratorLevel - TessPageIteratorLevel enum
        Returns:
        OCR result
        Throws:
        TesseractException
      • createDocumentsWithResults

        public java.util.List<OCRResult> createDocumentsWithResults​(java.awt.image.BufferedImage[] bis,
                                                                    java.lang.String[] filenames,
                                                                    java.lang.String[] outputbases,
                                                                    java.util.List<ITesseract.RenderedFormat> formats,
                                                                    int pageIteratorLevel)
                                                             throws TesseractException
        Creates documents with OCR results for given renderers at specified page iterator level.
        Specified by:
        createDocumentsWithResults in interface ITesseract
        Parameters:
        bis - array of input buffered images
        filenames - array of filenames
        outputbases - array of output filenames without extension
        formats - types of renderer
        pageIteratorLevel - TessPageIteratorLevel enum
        Returns:
        list of OCR results
        Throws:
        TesseractException
      • createDocumentsWithResults

        public OCRResult createDocumentsWithResults​(java.lang.String filename,
                                                    java.lang.String outputbase,
                                                    java.util.List<ITesseract.RenderedFormat> formats,
                                                    int pageIteratorLevel)
                                             throws TesseractException
        Creates documents with OCR result for given renderers at specified page iterator level.
        Specified by:
        createDocumentsWithResults in interface ITesseract
        Parameters:
        filename - input file
        outputbase - output filename without extension
        formats - types of renderer
        pageIteratorLevel - TessPageIteratorLevel enum
        Returns:
        OCR result
        Throws:
        TesseractException
      • createDocumentsWithResults

        public java.util.List<OCRResult> createDocumentsWithResults​(java.lang.String[] filenames,
                                                                    java.lang.String[] outputbases,
                                                                    java.util.List<ITesseract.RenderedFormat> formats,
                                                                    int pageIteratorLevel)
                                                             throws TesseractException
        Creates documents with OCR results for given renderers at specified page iterator level.
        Specified by:
        createDocumentsWithResults in interface ITesseract
        Parameters:
        filenames - array of input files
        outputbases - array of output filenames without extension
        formats - types of renderer
        pageIteratorLevel - TessPageIteratorLevel enum
        Returns:
        list of OCR results
        Throws:
        TesseractException
      • dispose

        protected void dispose()
        Releases all of the native resources used by this instance.