[jira] [Commented] (TIKA-2696) Support output of Tesseract OSD output for psm mode 0

JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/TIKA-2696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16645070#comment-16645070 ]

ASF GitHub Bot commented on TIKA-2696:

tballison commented on issue #246: TIKA-2696 Add support for OSD output, contributed by @4U6U57
URL: https://github.com/apache/tika/pull/246#issuecomment-428600569
   Here is the output from the freshly released Windows build (the June 2018 beta version threw an exception with -psm 0):
   tesseract v4.0.0-rc1.20181008
   Page number: 0
   Orientation in degrees: 0
   Rotate: 0
   Orientation confidence: 21.57
   Script: Latin
   Script confidence: 2.84

This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[hidden email]

> Support output of Tesseract OSD output for psm mode 0
> -----------------------------------------------------
>                 Key: TIKA-2696
>                 URL: https://issues.apache.org/jira/browse/TIKA-2696
>             Project: Tika
>          Issue Type: Improvement
>          Components: ocr
>            Reporter: August Valera
>            Priority: Minor
> TIKA-2357 added support for additional PSM (page segmentation modes) for Tesseract OCR, including mode 0, which is {{Orientation and script detection (OSD) only}}, meaning it does not perform OCR, just outputs orientation and script information.
> An example usage of mode 0:
> {code:java}
> $ tesseract infile.png outfile --psm 0 -l osd
> {code}
> In this mode, the usual {{outfile.txt}} is not created. Instead, and similar to other modes that run OSD in addition to extraction, the result is an {{outfile.osd}} file, like so:
> {code:java}
> Page 1
> Warning. Invalid resolution 0 dpi. Using 70 instead.
> Estimating resolution as 212
> Page number: 0
> Orientation in degrees: 0
> Rotate: 0
> Orientation confidence: 13.73
> Script: Latin
> Script confidence: 4.78
> {code}
> However, {{TesseractOCRParser#parse(...)}} is [coded|https://github.com/apache/tika/blob/master/tika-parsers/src/main/java/org/apache/tika/parser/ocr/TesseractOCRParser.java#L437] to read only the contents of {{outfile.txt}} (or {{outfile.hocr}}) in all modes, so in mode 0 the parser outputs nothing regardless of input.
> This is consistent with Tika's goal of outputting extracted text, but it defeats the intent of a user who expects OSD output.
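
For illustration, here is a minimal sketch of how the contents of an {{outfile.osd}} file like the one quoted above might be read into key/value pairs. This is not the code from the pull request: the class name OsdOutputSketch and the method parseOsd are hypothetical and not part of Tika. The sketch assumes every OSD field sits on its own "Key: value" line and skips anything else, such as the "Page 1" and resolution warning lines.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Tika or the pull request.
public class OsdOutputSketch {

    // Collects "Key: value" lines into an ordered map.
    // Lines without a colon, or with an empty key or value, are skipped.
    static Map<String, String> parseOsd(String osdText) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (String line : osdText.split("\\R")) {
            int colon = line.indexOf(':');
            if (colon <= 0) {
                continue; // not a "Key: value" line
            }
            String key = line.substring(0, colon).trim();
            String value = line.substring(colon + 1).trim();
            if (!key.isEmpty() && !value.isEmpty()) {
                fields.put(key, value);
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        // Sample text copied from the issue description.
        String sample = String.join("\n",
                "Page 1",
                "Warning. Invalid resolution 0 dpi. Using 70 instead.",
                "Estimating resolution as 212",
                "Page number: 0",
                "Orientation in degrees: 0",
                "Rotate: 0",
                "Orientation confidence: 13.73",
                "Script: Latin",
                "Script confidence: 4.78");
        parseOsd(sample).forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```

A caller could then pass on whichever fields it needs, for example the values stored under "Rotate" or "Script". How and where such values would be exposed is a design decision this sketch does not cover.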

This message was sent by Atlassian JIRA