[jira] Created: (NUTCH-91) empty encoding causes exception

[jira] Created: (NUTCH-91) empty encoding causes exception

Sebastian Nagel (Jira)
empty encoding causes exception
-------------------------------

         Key: NUTCH-91
         URL: http://issues.apache.org/jira/browse/NUTCH-91
     Project: Nutch
        Type: Bug
    Versions: 0.8-dev    
    Reporter: Michael Nebel


I found some sites where the header says: "Content-Type: text/html; charset=" (an empty charset value). This causes an exception in the HtmlParser. My suggestion:

Index: src/plugin/parse-html/src/java/org/apache/nutch/parse/html/HtmlParser.java
===================================================================
--- src/plugin/parse-html/src/java/org/apache/nutch/parse/html/HtmlParser.java  (revision 279397)
+++ src/plugin/parse-html/src/java/org/apache/nutch/parse/html/HtmlParser.java  (working copy)
@@ -120,7 +120,7 @@
       byte[] contentInOctets = content.getContent();
       InputSource input = new InputSource(new ByteArrayInputStream(contentInOctets));
       String encoding = StringUtil.parseCharacterEncoding(contentType);
-      if (encoding!=null) {
+      if (encoding!=null && !"".equals(encoding)) {
         metadata.put("OriginalCharEncoding", encoding);
         if ((encoding = StringUtil.resolveEncodingAlias(encoding)) != null) {
           metadata.put("CharEncodingForConversion", encoding);
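
For illustration only, here is a minimal standalone sketch of the same idea: treat an empty charset value the same as a missing one before it reaches any encoding lookup. The class CharsetGuard and the method charsetOrNull are hypothetical names made up for this example; they are not part of Nutch, and this is not how StringUtil.parseCharacterEncoding is implemented.

```java
import java.util.Locale;

public class CharsetGuard {

    /**
     * Hypothetical helper (not Nutch code): extracts the charset parameter
     * from a Content-Type header value. Returns null when the parameter is
     * missing or empty, so callers need only a single null check.
     */
    public static String charsetOrNull(String contentType) {
        if (contentType == null) {
            return null;
        }
        String key = "charset=";
        // Match the parameter name case-insensitively.
        int start = contentType.toLowerCase(Locale.ROOT).indexOf(key);
        if (start < 0) {
            return null;
        }
        String value = contentType.substring(start + key.length());
        // Cut off any further parameters after the charset value.
        int end = value.indexOf(';');
        if (end >= 0) {
            value = value.substring(0, end);
        }
        // Drop surrounding whitespace and optional quotes.
        value = value.trim().replace("\"", "");
        return value.isEmpty() ? null : value;
    }

    public static void main(String[] args) {
        System.out.println(charsetOrNull("text/html; charset="));      // prints null
        System.out.println(charsetOrNull("text/html; charset=UTF-8")); // prints UTF-8
    }
}
```

Normalizing the empty string to null in one place means every later check can stay a plain null test; the patch above instead adds the empty-string test at the call site, which is the smaller change.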


--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira

[jira] Closed: (NUTCH-91) empty encoding causes exception

Sebastian Nagel (Jira)
     [ http://issues.apache.org/jira/browse/NUTCH-91?page=all ]
     
Piotr Kosiorowski closed NUTCH-91:
----------------------------------

    Fix Version: 0.7.2-dev
                 0.8-dev
     Resolution: Fixed

Committed with a small extension. Thanks.

> empty encoding causes exception
> -------------------------------
>
>          Key: NUTCH-91
>          URL: http://issues.apache.org/jira/browse/NUTCH-91
>      Project: Nutch
>         Type: Bug
>     Versions: 0.8-dev
>     Reporter: Michael Nebel
>      Fix For: 0.7.2-dev, 0.8-dev

>
> I found some sites where the header says: "Content-Type: text/html; charset=" (an empty charset value). This causes an exception in the HtmlParser. My suggestion:
> Index: src/plugin/parse-html/src/java/org/apache/nutch/parse/html/HtmlParser.java
> ===================================================================
> --- src/plugin/parse-html/src/java/org/apache/nutch/parse/html/HtmlParser.java  (revision 279397)
> +++ src/plugin/parse-html/src/java/org/apache/nutch/parse/html/HtmlParser.java  (working copy)
> @@ -120,7 +120,7 @@
>        byte[] contentInOctets = content.getContent();
>        InputSource input = new InputSource(new ByteArrayInputStream(contentInOctets));
>        String encoding = StringUtil.parseCharacterEncoding(contentType);
> -      if (encoding!=null) {
> +      if (encoding!=null && !"".equals(encoding)) {
>          metadata.put("OriginalCharEncoding", encoding);
>          if ((encoding = StringUtil.resolveEncodingAlias(encoding)) != null) {
>            metadata.put("CharEncodingForConversion", encoding);
