[jira] Created: (TIKA-331) Windings font recognition in Tika parsing + spacing issue


JIRA jira@apache.org
Windings font recognition in Tika parsing + spacing issue
---------------------------------------------------------

                 Key: TIKA-331
                 URL: https://issues.apache.org/jira/browse/TIKA-331
             Project: Tika
          Issue Type: Wish
          Components: parser
    Affects Versions: 0.4
         Environment: Windows XP / Java JDK 1.6.0_15
            Reporter: MRIT64


I have PDF files that include some characters in the Wingdings font.
The Tika parser replaces them with Unicode characters that are unrelated to the originals and, in some cases, with alphabetic characters (which is expected given these characters' codes).
Would it be possible to improve the parsing and replace these characters with more accurate Unicode characters?
(see http://www.alanwood.net/demos/wingdings.html for possible correspondences).
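
One way to picture the requested behaviour is a remapping step applied to the extracted text. The sketch below is only an illustration in Python, not Tika or PDFBox code: the names `WINGDINGS_TO_UNICODE` and `remap_wingdings` are hypothetical, the table covers just four character codes, and it assumes the caller already knows which runs of text were set in Wingdings (plain extracted text does not carry that information). The full set of correspondences would have to come from a reference table such as the one linked above.

```python
# Partial, illustrative table: Wingdings character code -> Unicode character.
# Only four entries are listed; a real table would cover the whole font.
WINGDINGS_TO_UNICODE = {
    0x4A: "\u263A",  # 'J' -> WHITE SMILING FACE
    0x4C: "\u2639",  # 'L' -> WHITE FROWNING FACE
    0xF0: "\u21E8",  # 'ð' -> RIGHTWARDS WHITE ARROW
    0xFC: "\u2713",  # 'ü' -> CHECK MARK
}


def remap_wingdings(run):
    """Remap a run of text that is known to be set in Wingdings.

    Characters without an entry in the table are left unchanged.
    """
    return "".join(WINGDINGS_TO_UNICODE.get(ord(ch), ch) for ch in run)
```

With this table, a run containing only the character ð (code 0xF0) comes back as the rightwards white arrow, while characters that are not in the table pass through untouched.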

I will attach example files once this issue has been created (would it be possible to attach files directly when creating issues?)


--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (TIKA-331) Windings font recognition in Tika parsing + spacing issue

JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/TIKA-331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

MRIT64 updated TIKA-331:
------------------------

    Attachment: Parsing_Result1.txt
                test1.pdf

test1.pdf is a PDF file that includes Wingdings characters. Some are commonly used, others less frequently.

Parsing_Result1.txt is the text file produced by Tika.


[jira] Updated: (TIKA-331) Windings font recognition in Tika parsing + spacing issue

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/TIKA-331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

MRIT64 updated TIKA-331:
------------------------

    Attachment: Parsing_Result2.txt
                test2.pdf

Another example: the same Word source file converted to PDF with another tool, together with the Tika parsing result. The Wingdings characters are translated into different Unicode characters than in the previous example.


[jira] Commented: (TIKA-331) Windings font recognition in Tika parsing + spacing issue

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/TIKA-331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12782097#action_12782097 ]

MRIT64 commented on TIKA-331:
-----------------------------

Spacing issue
--------------------

Look at lines 10 and 11 in test2.pdf.
Look at lines 11 and 12 in the Tika parsing result (Parsing_Result2.txt):

ðLocalisation des zones de livraison et de stockage
ðLocalisation des zones dangereuses

There is no space between ð and Localisation (ð is Tika's translation of the Wingdings "Rightwards white arrow").

If you copy and paste lines 10 and 11 of test2.pdf into a Notepad window, you get:

ð Localisation des zones de livraison et de stockage
ð Localisation des zones dangereuses

...with a space between ð and Localisation.

In my case, the missing space in the Tika parsing result causes "ðLocalisation" to be treated as a single word in subsequent processing.
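
Until the extraction itself is fixed, a downstream step could re-insert the missing space. The following Python sketch is an illustration under stated assumptions, not part of Tika: `SYMBOL_CHARS` and `separate_leading_symbol` are hypothetical names, and the code assumes that ð only appears at the start of a line as a stand-in for the Wingdings arrow (it would wrongly split a genuine word beginning with ð, as in Icelandic).

```python
import re

# Assumption: characters that stand in for Wingdings bullets in the
# extracted text. Only 'ð' is listed, as seen in Parsing_Result2.txt.
SYMBOL_CHARS = "ð"

# A symbol at the start of a line, immediately followed by a
# non-whitespace character.
_LEADING_SYMBOL = re.compile(
    r"^([" + re.escape(SYMBOL_CHARS) + r"])(?=\S)", re.MULTILINE
)


def separate_leading_symbol(text):
    """Insert a space after a leading symbol that is glued to the next word."""
    return _LEADING_SYMBOL.sub(r"\1 ", text)
```

Applied to the two lines quoted above, this would turn "ðLocalisation" into "ð Localisation", and lines that already contain the space are left as they are.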

Regards


[jira] Commented: (TIKA-331) Windings font recognition in Tika parsing + spacing issue

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/TIKA-331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12782100#action_12782100 ]

Ken Krugler commented on TIKA-331:
----------------------------------

I believe this is an issue for the PDF parser (PDFBox) that Tika "wraps".

Please check https://issues.apache.org/jira/browse/PDFBOX to see if this is already filed, and if not, refile it there.

