Parser chokes on some documents

Parser chokes on some documents

Kyle Gabhart
I have a large number of documents on our intranet (about 1000) that are
indexed by nutch (version 0.6).  On about 1/3 of those documents I get
the following error:

050529 011245 fetch okay, but can't parse PATH_TO_FILE, reason: Content
truncated at 65536 bytes. Parser can't handle incomplete msword file.

The same happens on some PDF files.  Any ideas?

-KG


Re: Parser chokes on some documents

Sébastien LE CALLONNEC
Hi Kyle,


That message might help you:

http://www.mail-archive.com/nutch-user@.../msg00139.html


Regards,
Sebastien.

--- Kyle Gabhart <[hidden email]> wrote:

> I have a large number of documents on our intranet (about 1000) that are
> indexed by nutch (version 0.6).  On about 1/3 of those documents I get
> the following error:
>
> 050529 011245 fetch okay, but can't parse PATH_TO_FILE, reason: Content
> truncated at 65536 bytes. Parser can't handle incomplete msword file.
>
> The same happens on some PDF files.  Any ideas?
>
> -KG


       

       
               
Re: Parser chokes on some documents

quovadis
In reply to this post by Kyle Gabhart
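It's because of the maximum content size: anything over the limit
(65536 bytes in your log) gets truncated, and the parser can't handle
the incomplete file. Raise the content.limit values in your site
configuration file.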
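
A rough sketch of what such an override might look like. The property
names http.content.limit and file.content.limit are assumptions based on
the "content.limit" hint above, so check nutch-default.xml in your
install for the exact names. The 10 MB value is only illustrative. The
fragment would go inside the root element of your site configuration
file (commonly conf/nutch-site.xml):

```xml
<!-- Assumed property name: limit for documents fetched over HTTP.
     Value is in bytes; 10485760 (10 MB) is an illustrative choice. -->
<property>
  <name>http.content.limit</name>
  <value>10485760</value>
</property>

<!-- Assumed property name: limit for documents read via file: URLs. -->
<property>
  <name>file.content.limit</name>
  <value>10485760</value>
</property>
```

Documents that were already fetched in truncated form would presumably
need to be fetched again before they can be parsed successfully.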


On Tue, 31 May 2005 10:11:02 -0500
 "Kyle Gabhart" <[hidden email]> wrote:

> I have a large number of documents on our intranet (about
> 1000) that are indexed by nutch (version 0.6).  On about
> 1/3 of those documents I get the following error:
>
> 050529 011245 fetch okay, but can't parse PATH_TO_FILE,
> reason: Content truncated at 65536 bytes. Parser can't
> handle incomplete msword file.
> The same happens on some PDF files.  Any ideas?
>
> -KG
>
>
