[jira] Created: (NUTCH-57) text and html files unrecognized


[jira] Created: (NUTCH-57) text and html files unrecognized

Ayush Saxena (Jira)
text and html files unrecognized
--------------------------------

         Key: NUTCH-57
         URL: http://issues.apache.org/jira/browse/NUTCH-57
     Project: Nutch
        Type: Bug
  Components: indexer  
 Environment: Nutch 0.7Dev
    Reporter: Marc Delerue




While crawling:
http://XXX.XXX.XXX.XXX/yyyyy.txt org.apache.nutch.util.mime.MimeTypeException : invalid Sub Type plain
and
http://XXX.XXX.XXX.XXX/yyyyy.html org.apache.nutch.util.mime.MimeTypeException : invalid Sub Type html

The HTML and text files are fetched but not indexed.



--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira


[jira] Updated: (NUTCH-57) text and html files unrecognized

Ayush Saxena (Jira)
     [ http://issues.apache.org/jira/browse/NUTCH-57?page=all ]

Jerome Charron updated NUTCH-57:
--------------------------------

    Attachment: NUTCH-57-050509.patch

The problem was that optional Content-Type parameters were not removed from the subtype, so an exception was raised when the validity of the subtype was checked.

The patch:
* Removes the optional parameters from the content-type subtype.
* Adds unit tests to verify the fix.
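
To illustrate the kind of fix described above, here is a minimal sketch in Java. It is not the code from NUTCH-57-050509.patch and does not use the Nutch MimeType API; the class name ContentTypeSketch and its methods are hypothetical, and it assumes the optional parameters follow the first ';' in the content-type string.

```java
/**
 * Hypothetical helper for illustration only; not part of Nutch.
 * Shows how optional parameters can be dropped from a content type
 * before the subtype is validated.
 */
final class ContentTypeSketch {

    private ContentTypeSketch() {}

    /** Returns "type/subtype" with everything from the first ';' removed. */
    static String stripParameters(String contentType) {
        if (contentType == null) {
            throw new IllegalArgumentException("contentType must not be null");
        }
        int semicolon = contentType.indexOf(';');
        String bare = (semicolon >= 0)
                ? contentType.substring(0, semicolon)
                : contentType;
        return bare.trim();
    }

    /** Returns only the subtype, with optional parameters already removed. */
    static String subType(String contentType) {
        String bare = stripParameters(contentType);
        int slash = bare.indexOf('/');
        if (slash < 0 || slash == bare.length() - 1) {
            throw new IllegalArgumentException("no subtype in: " + contentType);
        }
        return bare.substring(slash + 1);
    }
}
```

With this sketch, ContentTypeSketch.subType("text/plain; charset=ISO-8859-1") yields "plain" rather than a subtype string that still carries the charset parameter, which is the sort of value a strict subtype check would reject.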




[jira] Closed: (NUTCH-57) text and html files unrecognized

Ayush Saxena (Jira)
     [ http://issues.apache.org/jira/browse/NUTCH-57?page=all ]
     
Andrzej Bialecki closed NUTCH-57:
----------------------------------

    Resolution: Fixed

Applied.

