[jira] Created: (NUTCH-168) setting http.content.limit to -1 seems to break text parsing on some files

JIRA jira@apache.org
setting http.content.limit to -1 seems to break text parsing on some files
--------------------------------------------------------------------------

         Key: NUTCH-168
         URL: http://issues.apache.org/jira/browse/NUTCH-168
     Project: Nutch
        Type: Bug
  Components: fetcher  
    Versions: 0.7    
 Environment: Windows 2000
java version "1.4.2_05"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.4.2_05-b04)
Java HotSpot(TM) Client VM (build 1.4.2_05-b04, mixed mode)
    Reporter: Jerry Russell


Setting http.content.limit to -1 (which is supposed to mean no limit) causes some pages not to be indexed. I have seen this with some PDFs and with one URL in particular. The steps to reproduce are below:

Reproduce:

  1) install fresh nutch-0.7
  2) configure urlfilters to allow any URL
  3) create urllist with only the following URL: http://www.circuitsonline.net/circuits/view/71
  4) perform a crawl with a depth of 1
  5) do segread and see that the content is there
  6) change the http.content.limit to -1 in nutch-default.xml
  7) repeat the crawl to a new directory
  8) do segread and see that the content is not there
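
For reference, the change in step 6 would look roughly like the fragment below. This is a sketch only: it assumes the usual name/value property layout of nutch-default.xml in 0.7, and the surrounding elements and the description text of the stock file are omitted rather than reproduced.

```xml
<!-- Sketch of the edited property in nutch-default.xml (layout assumed). -->
<property>
  <name>http.content.limit</name>
  <!-- -1 is intended to mean "no limit" on downloaded content length -->
  <value>-1</value>
</property>
```

Restoring the original value of this property and repeating the crawl into a fresh directory should bring back the behaviour seen in step 5.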

contact [hidden email] for more information.


--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira