[jira] [Commented] (TIKA-2696) Support output of Tesseract OSD output for psm mode 0

JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/TIKA-2696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16647098#comment-16647098 ]

ASF GitHub Bot commented on TIKA-2696:
--------------------------------------

4U6U57 commented on issue #246: TIKA-2696 Add support for OSD output, contributed by @4U6U57
URL: https://github.com/apache/tika/pull/246#issuecomment-429136781
 
 
   1. Interesting point with regard to Metadata; I think it makes sense to go that route. I started this project with the specific intention of supporting psm 0, which does not output any extracted content, but putting the OSD information in metadata would also benefit other modes that output it.
   2. I haven't had a chance to test this on older versions yet (I was planning to take a look over the weekend), but with `tesseract 4.0.0-beta.1` (the latest on Ubuntu 18.04) and psm 0, I get the OSD info on `stdout`, not `stderr`, when the output is specified as `stdout`, and in `$out.osd` when the output is `$out`. In all other modes I don't get OSD info.
   <details><summary>OSD Output Testing</summary>
   
   ```bash
   [avalera:~/test]
   $ ls
   test.jpg
   
   [avalera:~/test]
   $ tesseract test.jpg stdout --psm 0 -l osd
   Warning. Invalid resolution 0 dpi. Using 70 instead.
   Estimating resolution as 227
   Page number: 0
   Orientation in degrees: 0
   Rotate: 0
   Orientation confidence: 16.99
   Script: Latin
   Script confidence: 0.16
   
   [avalera:~/test]
   $ tesseract test.jpg stdout --psm 0 -l osd 2>/dev/null
   Page number: 0
   Orientation in degrees: 0
   Rotate: 0
   Orientation confidence: 16.99
   Script: Latin
   Script confidence: 0.16
   
   [avalera:~/test]
   $ tesseract test.jpg stdout --psm 1
   Warning. Invalid resolution 0 dpi. Using 70 instead.
   Estimating resolution as 227
   
   
   CHAPTER 5
   
   IN THE MIDDLE OF THE NIGHT
   
   Tue next day Aunt Hetty took Pam and Peter and
   their cousin Brock in the pony-cart to the sea,
   which was about three miles away. This was such
   fun that the three children forgot all about Cliff
   Castle for a day or two. And then something hap-
   pened that reminded them of it.
   
   It was something that happened in the middle
   of the night. Pam woke up and felt very thirsty.
   She remembered that Aunt Hetty had left a jug
   of water and a tumbler on the mantelpiece and
   she got up to get it.
   
   She stood at the window, drinking the water. It
   
   39
   
   
   [avalera:~/test]
   $ ls
   test.jpg
   
   [avalera:~/test]
   $ tesseract test.jpg outfile --psm 0 -l osd
   Tesseract Open Source OCR Engine v4.0.0-beta.1 with Leptonica
   Warning. Invalid resolution 0 dpi. Using 70 instead.
   Estimating resolution as 227
   
   [avalera:~/test]
   $ ls
   outfile.osd  test.jpg
   
   [avalera:~/test]
   $ rm outfile.osd
   
   [avalera:~/test]
   $ tesseract test.jpg outfile --psm 1
   Tesseract Open Source OCR Engine v4.0.0-beta.1 with Leptonica
   Warning. Invalid resolution 0 dpi. Using 70 instead.
   Estimating resolution as 227
   
   [avalera:~/test]
   $ ls
   outfile.txt  test.jpg
   ```
   
   </details>
   
   3. Same as the previous point: I haven't tested this, but judging from your two examples I'd expect little change beyond the addition of new keys.
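
To illustrate the metadata route from point 1, here is a minimal Python sketch of turning the OSD lines shown in the test output above into key/value pairs. It is not Tika code: the helper name `parse_osd` and the choice to keep Tesseract's labels as keys are assumptions made for illustration, and the list of known labels is taken only from the sample output in this comment.

```python
import re

# Labels observed in the `--psm 0` sample output above. Other Tesseract
# versions may print different labels; this whitelist is an assumption.
OSD_KEYS = {
    "Page number",
    "Orientation in degrees",
    "Rotate",
    "Orientation confidence",
    "Script",
    "Script confidence",
}

# Matches "Label: value" lines, tolerating surrounding whitespace.
OSD_LINE = re.compile(r"^\s*([^:]+):\s*(.+?)\s*$")


def parse_osd(text):
    """Hypothetical helper: collect known OSD fields from Tesseract output.

    Lines without a recognised label (warnings, banners, blank lines)
    are ignored. Values are kept as strings.
    """
    fields = {}
    for line in text.splitlines():
        match = OSD_LINE.match(line)
        if match and match.group(1).strip() in OSD_KEYS:
            fields[match.group(1).strip()] = match.group(2)
    return fields
```

Keeping the values as strings avoids guessing at types; a caller that wants numbers could convert `Orientation confidence` and similar fields itself.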

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[hidden email]


> Support output of Tesseract OSD output for psm mode 0
> -----------------------------------------------------
>
>                 Key: TIKA-2696
>                 URL: https://issues.apache.org/jira/browse/TIKA-2696
>             Project: Tika
>          Issue Type: Improvement
>          Components: ocr
>            Reporter: August Valera
>            Priority: Minor
>
> TIKA-2357 added support for additional PSM (page segmentation modes) for Tesseract OCR, including mode 0, which is {{Orientation and script detection (OSD) only}}, meaning it does not perform OCR, just outputs orientation and script information.
> An example usage of mode 0:
> {code:java}
> $ tesseract infile.png outfile --psm 0 -l osd
> {code}
> In this mode, the usual {{outfile.txt}} is not created. Instead, and similar to other modes that run OSD in addition to extraction, the result is an {{outfile.osd}} file, like so:
> {code:java}
> Page 1
> Warning. Invalid resolution 0 dpi. Using 70 instead.
> Estimating resolution as 212
> Page number: 0
> Orientation in degrees: 0
> Rotate: 0
> Orientation confidence: 13.73
> Script: Latin
> Script confidence: 4.78
> {code}
> However, {{TesseractOCRParser#parse(...)}} is [coded|https://github.com/apache/tika/blob/master/tika-parsers/src/main/java/org/apache/tika/parser/ocr/TesseractOCRParser.java#L437] to only read the contents of {{outfile.txt}} (alternatively {{outfile.hocr}}) in all modes, so mode 0 outputs nothing regardless of input.
> This is consistent with Tika's goal of outputting extracted text, but contrary to what a user requesting OSD output would expect.
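
As a rough illustration of the mismatch described in the issue, the following Python sketch picks the file Tesseract is expected to write for a given mode, based only on the behaviour reported here (`outfile.osd` for psm 0, `outfile.txt` or `outfile.hocr` otherwise). The function name `expected_output_file` and its `hocr` flag are hypothetical, and the assumption that psm 0 yields `.osd` even when hOCR is requested has not been verified.

```python
from pathlib import Path


def expected_output_file(base, psm, hocr=False):
    """Hypothetical helper: guess which file Tesseract writes for `base`.

    Based on the behaviour described in this issue:
      - psm 0 (OSD only) writes `<base>.osd` and no text file
      - other modes write `<base>.txt`, or `<base>.hocr` for hOCR output
    """
    if psm == 0:
        suffix = ".osd"   # assumption: applies regardless of the hocr flag
    elif hocr:
        suffix = ".hocr"
    else:
        suffix = ".txt"
    return Path(f"{base}{suffix}")
```

A parser that always reads `<base>.txt` would find nothing in mode 0, which matches the empty output described above.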



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)