[jira] Created: (TIKA-344) Charset hint in metadata

JIRA jira@apache.org
Charset hint in metadata
------------------------

                 Key: TIKA-344
                 URL: https://issues.apache.org/jira/browse/TIKA-344
             Project: Tika
          Issue Type: Improvement
          Components: parser
    Affects Versions: 0.6
            Reporter: Piotr B.
            Priority: Minor


It would be nice if TextParser and HtmlParser supported a Metadata.CONTENT_ENCODING hint.

In my application I always prefer that hint (if it is present) over the charset detector result, because the charset detector is often wrong on short inputs (even when match.confidence is 100), and I know that the hint, if present, is right in 99% of cases.

More generally, the user could be allowed to change the default behaviour by overriding a function F(hint, detectorResults) -> charset.
Another solution is to create some standard strategies and let the user choose one of them:
a) hint is most important
b) charset detector result is most important
c) create some heuristic using detectorResult.confidence, hint and maybe input length
Maybe the last, heuristic method would be good enough for most cases.
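
For illustration, the F(hint, detectorResults) -> charset idea and the three strategies above could be sketched as follows. This is only a sketch, not Tika API: DetectorResult, CharsetStrategy, CharsetStrategies and the threshold parameters are hypothetical names, and the detector confidence is assumed to be an integer from 0 to 100.

```java
import java.nio.charset.Charset;
import java.util.Optional;

/** Hypothetical detector output: a charset plus a 0-100 confidence score. */
final class DetectorResult {
    final Charset charset;
    final int confidence;

    DetectorResult(Charset charset, int confidence) {
        this.charset = charset;
        this.confidence = confidence;
    }
}

/** The overridable F(hint, detectorResult) -> charset function (hypothetical). */
interface CharsetStrategy {
    Charset choose(Optional<Charset> hint, DetectorResult detected, long inputLength);
}

final class CharsetStrategies {
    private CharsetStrategies() {}

    // a) the hint wins whenever it is present
    static final CharsetStrategy HINT_FIRST =
        (hint, detected, length) -> hint.orElse(detected.charset);

    // b) the detector result always wins
    static final CharsetStrategy DETECTOR_FIRST =
        (hint, detected, length) -> detected.charset;

    // c) heuristic: trust the detector over the hint only when the input is
    //    long enough and the reported confidence is high enough
    static CharsetStrategy heuristic(long minLength, int minConfidence) {
        return (hint, detected, length) -> {
            boolean detectorTrusted =
                length >= minLength && detected.confidence >= minConfidence;
            if (detectorTrusted || !hint.isPresent()) {
                return detected.charset;
            }
            return hint.get();
        };
    }
}
```

With a short input, a UTF-8 hint and a detector guess of ISO-8859-1 at confidence 100, both HINT_FIRST and heuristic(1000, 90) would return the hint, which is the behaviour asked for above.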

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (TIKA-344) Charset hint in metadata

JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/TIKA-344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12789806#action_12789806 ]

Ken Krugler commented on TIKA-344:
----------------------------------

It would be useful for various detectors of charset & language to be able to (a) use different metadata keys for their results, and (b) include a confidence level. That way you could have a top-level resolver that combined the results with all knowledge, including incoming hints, to pick the best result.
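
A rough sketch of such a top-level resolver is shown here. All names are assumptions made for illustration (CharsetCandidate, TopLevelCharsetResolver, the source keys and the per-source trust weights are not existing Tika classes or metadata keys), and confidence is assumed to lie between 0.0 and 1.0.

```java
import java.nio.charset.Charset;
import java.util.List;
import java.util.Map;
import java.util.Optional;

/** One charset guess, tagged with the (hypothetical) metadata key it came from. */
final class CharsetCandidate {
    final String sourceKey;
    final Charset charset;
    final double confidence; // 0.0 to 1.0, as reported by the source

    CharsetCandidate(String sourceKey, Charset charset, double confidence) {
        this.sourceKey = sourceKey;
        this.charset = charset;
        this.confidence = confidence;
    }
}

final class TopLevelCharsetResolver {
    private TopLevelCharsetResolver() {}

    /**
     * Scores each candidate as confidence * trust weight of its source and
     * returns the best one. Sources without a weight default to 1.0.
     * Returns empty when there are no candidates at all.
     */
    static Optional<CharsetCandidate> resolve(List<CharsetCandidate> candidates,
                                              Map<String, Double> sourceWeights) {
        CharsetCandidate best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (CharsetCandidate c : candidates) {
            double score = c.confidence * sourceWeights.getOrDefault(c.sourceKey, 1.0);
            if (score > bestScore) { // on ties the earlier candidate is kept
                best = c;
                bestScore = score;
            }
        }
        return Optional.ofNullable(best);
    }
}
```

Keeping each source under its own key means no detector overwrites another's result, and the weighting policy lives in one place instead of inside each parser.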

Though note that for HTML pages, there's a patch to use the charset found in meta tags, which is usually pretty good (and definitely better than the server response header charset or auto-detected charset). See https://issues.apache.org/jira/browse/TIKA-332, as well as:

https://issues.apache.org/jira/browse/TIKA-333

https://issues.apache.org/jira/browse/TIKA-334

https://issues.apache.org/jira/browse/TIKA-335

https://issues.apache.org/jira/browse/TIKA-341
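
To make the meta-tag approach concrete, here is a minimal sketch of pulling a charset name out of the start of an HTML document. It is not the code from the patches linked above: MetaCharsetSniffer and its regular expression are hypothetical, and the input is assumed to be the first few kilobytes of the page already decoded with an ASCII-compatible charset such as ISO-8859-1.

```java
import java.nio.charset.Charset;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MetaCharsetSniffer {
    private MetaCharsetSniffer() {}

    // Finds charset=NAME inside a <meta ...> tag. This covers both the
    // http-equiv="Content-Type" form and the shorter <meta charset="...">.
    private static final Pattern META_CHARSET = Pattern.compile(
        "<meta\\s[^>]*?charset\\s*=\\s*[\"']?([A-Za-z0-9._:-]+)",
        Pattern.CASE_INSENSITIVE);

    /** Returns the charset declared in the first matching meta tag, if any. */
    static Optional<Charset> sniff(String htmlPrefix) {
        Matcher m = META_CHARSET.matcher(htmlPrefix);
        if (!m.find()) {
            return Optional.empty();
        }
        try {
            return Optional.of(Charset.forName(m.group(1)));
        } catch (IllegalArgumentException e) {
            // illegal or unsupported charset name: treat as "no declaration"
            return Optional.empty();
        }
    }
}
```

A result from a sniffer like this could then be treated as one more hint, ranked above the server response header and the auto-detected charset as suggested above.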

[jira] Resolved: (TIKA-344) Charset hint in metadata

JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/TIKA-344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting resolved TIKA-344.
--------------------------------

    Resolution: Duplicate

Resolving this as a duplicate of all the related and more specific issues filed by Ken. It looks like after applying all his patches we've pretty much covered the use case expressed here.
