setLanguage not working in TesseractOCRParser for Apache Tika

I am trying to use the setLanguage method of TesseractOCRParser in Apache Tika in Java. When I pass any Indian language, such as Hindi, Marathi, Tamil, etc., it doesn't work and still displays data in English.

I am passing an image that has both English and Hindi.

```
TesseractOCRParser tesserparser = new TesseractOCRParser();
tesserparser.setTessdataPath("C:\\Program Files\\Tesseract-OCR\\tessdata");
tesserparser.setLanguage("hin");

BodyContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
ParseContext context = new ParseContext();
context.set(TesseractOCRParser.class, tesserparser);

// AutoDetectParser will examine the file type and invoke the OCR parser if it sees an image
AutoDetectParser parser = new AutoDetectParser();
parser.parse(input, handler, metadata, context);
```

My environment: Java 17.0.14, Maven 3.9.9, Tesseract 5.5.

Answer

When setLanguage on TesseractOCRParser is given an Indian language code such as hin (Hindi), mar (Marathi), or tam (Tamil) and the output still comes back in English, the usual suspects are missing language data, a wrong tessdata path, or a language setting that never reaches Tesseract. The points below cover each in turn.

Steps to Ensure Proper Language Detection and OCR

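Verify Tessdata Files: Ensure that the tessdata directory contains the language data files for the languages you want to recognize. For Hindi, the file should be named hin.traineddata (mar.traineddata for Marathi, tam.traineddata for Tamil). You can download the files from the tessdata, tessdata_best, or tessdata_fast repositories of the tesseract-ocr organization on GitHub. Running tesseract --list-langs on the command line shows which languages your installation can actually see.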
Correct Path Configuration: Make sure the path to the tessdata directory is correctly set. In your code, you have:
```
tesserparser.setTessdataPath("C:\\Program Files\\Tesseract-OCR\\tessdata");
```
Ensure that this path is accurate and that the tessdata directory contains the necessary .traineddata files.
Language Code: Verify that you are using the correct language code. For Hindi, the code is hin. Ensure that the language code matches the filename of the .traineddata file.
Combining Languages: If your image contains text in multiple languages (e.g., English and Hindi), you can combine language codes using the + symbol. For example:
```
tesserparser.setLanguage("eng+hin");
```
This tells Tesseract to recognize both English and Hindi text in the image.
Check Tesseract Version Compatibility: Ensure that the version of Tesseract you are using is compatible with the language data files. You mentioned using Tesseract 5.5, which should work with the current language data files; files built for much older Tesseract releases may not load correctly.
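
The file and language-code checks above can be scripted so that a missing .traineddata file shows up before any OCR runs. This is only a minimal sketch using the standard library: the class name TessdataCheck and the method missingLanguages are made up for illustration, and the directory is the one from the question.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Tika or Tesseract.
class TessdataCheck {

    /**
     * Returns the language codes in languageSpec (for example "eng+hin")
     * that have no matching <code>.traineddata file in tessdataDir.
     */
    static List<String> missingLanguages(Path tessdataDir, String languageSpec) {
        List<String> missing = new ArrayList<>();
        for (String code : languageSpec.split("\\+")) {
            if (code.isEmpty()) {
                continue; // ignore stray '+' separators
            }
            Path model = tessdataDir.resolve(code + ".traineddata");
            if (!Files.isRegularFile(model)) {
                missing.add(code);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Path taken from the question; adjust it to your installation.
        Path tessdataDir = Paths.get("C:\\Program Files\\Tesseract-OCR\\tessdata");
        List<String> missing = missingLanguages(tessdataDir, "eng+hin");
        if (missing.isEmpty()) {
            System.out.println("All language files found.");
        } else {
            System.out.println("Missing traineddata for: " + missing);
        }
    }
}
```

If hin appears in the missing list, the language setting is not the problem yet; the data file has to be in place first.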

Example Code

Here is an example of how you can modify your code. It assumes Tika 2.x or later. As far as I can tell, AutoDetectParser there uses its own TesseractOCRParser instance and does not pick up one placed in the ParseContext, so the language set on your tesserparser object never reaches Tesseract. Per-parse settings are passed as a TesseractOCRConfig instead. The tessdata location then comes from the Tesseract installation (or the TESSDATA_PREFIX environment variable) rather than from setTessdataPath:

```
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.ocr.TesseractOCRConfig;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

public class OCRExample {
    public static void main(String[] args) {
        // Per-parse OCR settings (assumption: Tika 2.x or later, where the OCR
        // parser reads a TesseractOCRConfig from the ParseContext)
        TesseractOCRConfig ocrConfig = new TesseractOCRConfig();
        ocrConfig.setLanguage("eng+hin"); // Combine languages if needed

        BodyContentHandler handler = new BodyContentHandler();
        Metadata metadata = new Metadata();
        ParseContext context = new ParseContext();
        context.set(TesseractOCRConfig.class, ocrConfig);

        // AutoDetectParser detects the file type and hands images to the OCR parser
        AutoDetectParser parser = new AutoDetectParser();

        try (InputStream input = new FileInputStream(new File("path/to/your/image.png"))) {
            parser.parse(input, handler, metadata, context);
            System.out.println("Extracted Text: " + handler.toString());
        } catch (IOException | SAXException | TikaException e) {
            e.printStackTrace();
        }
    }
}
```
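
To tell whether the extracted text really contains Hindi rather than English, or garbage that only looks wrong on screen, it can help to count characters by Unicode script. The following is a rough sketch using only the JDK; ScriptCheck and countScript are illustrative names, and the sample string merely stands in for handler.toString().

```java
import java.lang.Character.UnicodeScript;

// Hypothetical helper, not part of Tika.
class ScriptCheck {

    /** Counts the code points in text that belong to the given Unicode script. */
    static long countScript(String text, UnicodeScript script) {
        return text.codePoints()
                   .filter(cp -> UnicodeScript.of(cp) == script)
                   .count();
    }

    public static void main(String[] args) {
        // Stand-in for handler.toString(): "Hello" followed by a Hindi word
        String extracted = "Hello \u0928\u092E\u0938\u094D\u0924\u0947";
        System.out.println("Devanagari code points: "
                + countScript(extracted, UnicodeScript.DEVANAGARI));
        System.out.println("Latin code points: "
                + countScript(extracted, UnicodeScript.LATIN));
    }
}
```

Hindi and Marathi are both written in Devanagari; for Tamil, use UnicodeScript.TAMIL. A Devanagari count of zero on an image that clearly contains Hindi suggests the language setting or the language data is not being used.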

Image Quality: Ensure that the image quality is good. Blurry or low-resolution images can affect OCR accuracy.
Preprocessing: Consider preprocessing the image to enhance text recognition. This can include converting the image to grayscale, increasing contrast, or removing noise.
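Debugging: Print out the recognized text to verify if the OCR is working correctly. This can help identify if the issue is with the OCR process or the language data files. On Windows, the console may not render Devanagari or Tamil characters, so also write the text to a UTF-8 file before concluding that the wrong language came back.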
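
As one example of the preprocessing mentioned above, here is a small sketch that converts an image to grayscale with the JDK's own imaging classes before it is handed to the parser. The class name OcrPreprocess and the command-line arguments are made up for illustration. Whether grayscale alone helps depends on the image, so treat it as a starting point.

```java
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import javax.imageio.ImageIO;

// Hypothetical helper, not part of Tika or Tesseract.
class OcrPreprocess {

    /** Returns a grayscale copy of the source image with the same dimensions. */
    static BufferedImage toGrayscale(BufferedImage source) {
        BufferedImage gray = new BufferedImage(
                source.getWidth(), source.getHeight(), BufferedImage.TYPE_BYTE_GRAY);
        Graphics2D g = gray.createGraphics();
        try {
            // Drawing onto a TYPE_BYTE_GRAY image performs the colour conversion
            g.drawImage(source, 0, 0, null);
        } finally {
            g.dispose();
        }
        return gray;
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.err.println("usage: OcrPreprocess <input-image> <output.png>");
            return;
        }
        BufferedImage source = ImageIO.read(new File(args[0]));
        if (source == null) {
            System.err.println("Unsupported or unreadable image: " + args[0]);
            return;
        }
        ImageIO.write(toGrayscale(source), "png", new File(args[1]));
    }
}
```

The resulting PNG can then be passed to the parser in place of the original file.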

By following these steps, you should be able to correctly configure TesseractOCRParser to recognize text in Indian languages using Apache Tika.
