OutOfMemoryError/Restarting Crawl/Indexing what has already been crawled


OutOfMemoryError/Restarting Crawl/Indexing what has already been crawled

Richard Braman
I have Nutch running on a Compaq DL 380 with 1 GB of RAM. It is not my
best machine, but I am only doing a limited crawl of about 52 URLs. When
I run the crawl with depth = 3 or even 6, it completes; when I run it at
depth 10, it runs out of memory.
 
2 questions
 
1. How do I restart the crawl?
I have seen the tutorial, which says
"

 Recover the pages already fetched and then restart the fetcher. You'll
need to create a file fetcher.done in the segment directory and then:
updatedb, generate and fetch. Assuming your index is at /index

% touch /index/segments/2005somesegment/fetcher.done

% bin/nutch updatedb /index/db/ /index/segments/2005somesegment/

% bin/nutch generate /index/db/ /index/segments/2005somesegment/

% bin/nutch fetch /index/segments/2005somesegment

All the pages that were not crawled will be re-generated for fetch. If
you fetched lots of pages, and don't want to have to re-fetch them
again, this is the best way.

",

but I have more than one segment. Do I need to do this only for the most
recent one, or for all of them?

2. How do I index what I have already crawled?
I have seen the indexing section in the tutorial, but when I run
bin/nutch invertlinks under Cygwin, it gives me:
Exception in thread "main" java.lang.NoClassDefFoundError: invertlinks
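
One possible reading of this error, offered as an assumption rather than a confirmed diagnosis: the bin/nutch launcher runs any word it does not recognise as a Java class name, so NoClassDefFoundError: invertlinks would suggest that the installed release has no invertlinks command. A minimal sketch for checking that, assuming the launcher dispatches with tests of the form "$COMMAND" = "name" (the helper name nutch_has_command is hypothetical, not part of Nutch):

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the launcher script names the given command.
# Assumption: the script dispatches with tests like [ "$COMMAND" = "invertlinks" ].
nutch_has_command() {
  script=$1   # path to the launcher, e.g. bin/nutch
  cmd=$2      # command name to look for, e.g. invertlinks
  # -F: match the text literally, so "$COMMAND" is not treated as a regex
  grep -Fq -- "\"\$COMMAND\" = \"$cmd\"" "$script"
}

# Example use (not executed here):
#   nutch_has_command bin/nutch invertlinks || echo "no invertlinks in this release"
```

Running bin/nutch with no arguments should also print the commands that the installed release supports.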
 
The fetcher exited with:
 
060302 165825 SEVERE error writing output:java.lang.OutOfMemoryError:
Java heap space
java.lang.OutOfMemoryError: Java heap space
Exception in thread "main" java.lang.RuntimeException: SEVERE error
logged.  Exiting fetcher.
 at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:354)
 at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:488)
 at org.apache.nutch.tools.CrawlTool.main(CrawlTool.java:140)
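
After a fetcher exit like the one above, the recovery sequence quoted from the tutorial could be wrapped per segment roughly as follows. This is only a sketch: NUTCH, DRY_RUN, run and recover_segment are illustrative names, not part of Nutch, and the command arguments are copied from the tutorial text as-is. By default it only prints the commands, so nothing is touched until DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch of the tutorial's recovery sequence for ONE segment.
# NUTCH, DRY_RUN, run and recover_segment are illustrative names (assumptions).
NUTCH=${NUTCH:-bin/nutch}
DRY_RUN=${DRY_RUN:-1}   # 1 = only print the commands, 0 = execute them

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = 1 ]; then
    echo "$*"
  else
    "$@"
  fi
}

# recover_segment DB SEGMENT
# Stops at the first step that fails.
recover_segment() {
  db=$1
  segment=$2
  run touch "$segment/fetcher.done" &&        # mark the interrupted fetch as done
  run "$NUTCH" updatedb "$db" "$segment" &&   # update the db from the fetched pages
  run "$NUTCH" generate "$db" "$segment" &&   # arguments as given in the tutorial
  run "$NUTCH" fetch "$segment"               # fetch what was not crawled
}
```

Called as recover_segment /index/db/ /index/segments/2005somesegment, it prints the four commands in order. Whether it has to be run for every segment or only for the newest one is exactly the open question in this post.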

 

Richard Braman
mailto:[hidden email]
561.748.4002 (voice)

http://www.taxcodesoftware.org
Free Open Source Tax Software

 

RE: OutOfMemoryError/Restarting Crawl/Indexing what has already been crawled

Richard Braman
I think this may be a bug.

-----Original Message-----
From: Richard Braman [mailto:[hidden email]]
Sent: Thursday, March 02, 2006 8:28 PM
To: [hidden email]
Subject: OutOfMemoryError/Restarting Crawl/Indexing what has already
been crawled




entrance point of Nutch search page

Michael Ji
Hi,

Which JSP file is the entry point for the Nutch search page?

I see that Nutch uses

search(Query query, int numHits, String dedupField,
String sortField, boolean reverse)

to get the search results, but I am not sure which JSP triggers this
function.

Is it in the Tomcat container?

Thanks,

Michael
