[jira] [Updated] (NUTCH-2421) parse-html to prioritize HTML5 charset definitions

Luís Filipe Nassif (Jira)

     [ https://issues.apache.org/jira/browse/NUTCH-2421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Nagel updated NUTCH-2421:
-----------------------------------
    Affects Version/s: 1.15

> parse-html to prioritize HTML5 charset definitions
> --------------------------------------------------
>
>                 Key: NUTCH-2421
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2421
>             Project: Nutch
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 1.15
>            Reporter: Laurent Hervaud
>            Priority: Minor
>
> NUTCH-1733 added support for HTML5 charset definitions.
> In some cases a web site declares multiple meta elements with different charsets:
>     <meta charset="utf-8">
>     <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
> (e.g. http://www.edga.fr/)
> In this case the second charset (iso-8859-1) is detected.
> Should the HTML5 charset definition be given priority instead?
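
To illustrate the proposed behaviour, here is a minimal sketch of a sniffer that checks the HTML5 form first and falls back to the http-equiv form only when no HTML5 declaration is present. This is not the parse-html code: the class name CharsetSniffer, the method sniff and both regular expressions are hypothetical and only meant to show the ordering.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: shows one way to give the
// HTML5 <meta charset> declaration priority over the http-equiv form.
final class CharsetSniffer {

    // HTML5 form: <meta charset="utf-8">
    private static final Pattern HTML5_META = Pattern.compile(
        "<meta\\s+charset\\s*=\\s*[\"']?([a-z][_\\-0-9a-z]*)",
        Pattern.CASE_INSENSITIVE);

    // Legacy form: <meta http-equiv="Content-Type" content="text/html; charset=...">
    private static final Pattern HTTP_EQUIV_META = Pattern.compile(
        "<meta\\s+[^>]*http-equiv\\s*=\\s*[\"']?content-type[\"']?"
            + "[^>]*charset\\s*=\\s*([a-z][_\\-0-9a-z]*)",
        Pattern.CASE_INSENSITIVE);

    // Returns the declared charset name, or null if none is found.
    // The HTML5 declaration wins regardless of where it appears in the head.
    static String sniff(String head) {
        Matcher html5 = HTML5_META.matcher(head);
        if (html5.find()) {
            return html5.group(1);
        }
        Matcher legacy = HTTP_EQUIV_META.matcher(head);
        if (legacy.find()) {
            return legacy.group(1);
        }
        return null;
    }
}
```

With the two meta elements from the report, CharsetSniffer.sniff(...) would return utf-8 in either order. A page that only carries the http-equiv declaration would still resolve to that charset.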



--
This message was sent by Atlassian Jira
(v8.3.4#803005)