Help with "bin/nutch server 8081 crawl"


Help with "bin/nutch server 8081 crawl"

Monu Ogbe
Hello Team,

I am having a lot of fun evaluating 0.8-dev, and after following Stefan's and
the doc team's tutorials I have got everything working in both local and
multi-machine modes using Hadoop.

In single-machine mode, however, I have come unstuck trying to expose "nutch
server" on port 8081 so that I can eventually deploy multiple searchers.

In summary, the *-site.xml conf files, host folder, and search-servers.txt are
configured and the server is running on port 8081. However, when I perform a
search from the front-end webapp, errors appear in the server's console
output.

Here are the details:


Conf files hadoop-site.xml / nutch-site.xml contain:
===================================================

<property>
  <name>searcher.dir</name>
  <value>/usr/local/nutch/nutch-2006-03-02/monu-conf</value>
  <description>
  Path to root of index directories.  This directory is searched (in
  order) for either the file search-servers.txt, containing a list of
  distributed search servers, or the directory "index" containing
  merged indexes, or the directory "segments" containing segment
  indexes.
  </description>
</property>

search-servers.txt contains:
============================
[root@nutch1 monu-conf]# cat /usr/local/nutch/nutch-2006-03-02/monu-conf/search-servers.txt
193.203.244.233 8081

Issuing "bin/nutch server" on its own produces:
==============================================

DistributedSearch$Server <port> <index dir>

When "bin/nutch server" is started:
==================================

I usually use the relative path of the crawl directory, but the full path
works too, and the output below suggests that the server is looking for
crawldb, indexes, plugins, linkdb and segments in the right places.

# bin/nutch server 8081 /usr/local/nutch/nutch-2006-03-02/crawl

[root@nutch1 nutch-2006-03-02]# bin/nutch server 8081 /usr/local/nutch/nutch-2006-03-02/crawl
060306 191228 10 parsing jar:file:/usr/local/nutch/nutch-2006-03-02/lib/hadoop-0.1-dev.jar!/hadoop-default.xml
060306 191228 10 parsing file:/usr/local/nutch/nutch-2006-03-02/conf/nutch-default.xml
060306 191228 10 parsing file:/usr/local/nutch/nutch-2006-03-02/conf/nutch-site.xml
060306 191228 10 parsing file:/usr/local/nutch/nutch-2006-03-02/conf/hadoop-site.xml
060306 191228 10 opening indexes in /usr/local/nutch/nutch-2006-03-02/crawl/indexes
060306 191228 10 Plugins: looking in: /usr/local/nutch/nutch-2006-03-02/plugins
060306 191228 10 Plugin Auto-activation mode: [true]
060306 191228 10 Registered Plugins:
060306 191228 10        HTTP Framework (lib-http)
060306 191228 10        CyberNeko HTML Parser (lib-nekohtml)
060306 191228 10        URL Query Filter (query-url)
060306 191228 10        Site Query Filter (query-site)
060306 191228 10        Html Parse Plug-in (parse-html)
060306 191228 10        Http Protocol Plug-in (protocol-http)
060306 191228 10        the nutch core extension points (nutch-extensionpoints)
060306 191228 10        Basic Indexing Filter (index-basic)
060306 191228 10        Text Parse Plug-in (parse-text)
060306 191228 10        JavaScript Parser (parse-js)
060306 191228 10        Regex URL Filter (urlfilter-regex)
060306 191228 10        Basic Query Filter (query-basic)
060306 191228 10 Registered Extension-Points:
060306 191228 10        Nutch Protocol (org.apache.nutch.protocol.Protocol)
060306 191228 10        Nutch URL Filter (org.apache.nutch.net.URLFilter)
060306 191228 10        HTML Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
060306 191228 10        Nutch Online Search Results Clustering Plugin (org.apache.nutch.clustering.OnlineClusterer)
060306 191228 10        Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
060306 191228 10        Nutch Content Parser (org.apache.nutch.parse.Parser)
060306 191228 10        Ontology Model Loader (org.apache.nutch.ontology.Ontology)
060306 191228 10        Nutch Analysis (org.apache.nutch.analysis.NutchAnalyzer)
060306 191228 10        Nutch Query Filter (org.apache.nutch.searcher.QueryFilter)
060306 191228 10 opening segments in /usr/local/nutch/nutch-2006-03-02/crawl/segments
060306 191228 10 found resource common-terms.utf8 at file:/usr/local/nutch/nutch-2006-03-02/conf/common-terms.utf8
060306 191228 10 opening linkdb in /usr/local/nutch/nutch-2006-03-02/crawl/linkdb
060306 191228 11 Server listener on port 8081: starting
060306 191228 12 Server handler 0 on 8081: starting
060306 191228 13 Server handler 1 on 8081: starting
060306 191228 14 Server handler 2 on 8081: starting
060306 191228 15 Server handler 3 on 8081: starting
060306 191228 16 Server handler 4 on 8081: starting
060306 191228 17 Server handler 5 on 8081: starting
060306 191228 18 Server handler 6 on 8081: starting
060306 191228 19 Server handler 7 on 8081: starting
060306 191228 20 Server handler 8 on 8081: starting
060306 191228 21 Server handler 9 on 8081: starting

When a search is initiated in the webapp:
========================================

060306 191615 22 Server connection on port 8081 from 193.203.244.233: starting
060306 191615 12 Call: getSegmentNames()
060306 191615 12 Return: [Ljava.lang.String;@1e859c0
060306 191615 22 Server connection on port 8081 from 193.203.244.233 caught: java.lang.RuntimeException: java.lang.InstantiationException: org.apache.nutch.searcher.Query
java.lang.RuntimeException: java.lang.InstantiationException: org.apache.nutch.searcher.Query
        at org.apache.hadoop.io.WritableFactories.newInstance(WritableFactories.java:47)
        at org.apache.hadoop.io.ObjectWritable.readObject(ObjectWritable.java:230)
        at org.apache.hadoop.ipc.RPC$Invocation.readFields(RPC.java:88)
        at org.apache.hadoop.ipc.Server$Connection.run(Server.java:138)
Caused by: java.lang.InstantiationException: org.apache.nutch.searcher.Query
        at java.lang.Class.newInstance0(Unknown Source)
        at java.lang.Class.newInstance(Unknown Source)
        at org.apache.hadoop.io.WritableFactories.newInstance(WritableFactories.java:45)
        ... 3 more
060306 191615 22 Server connection on port 8081 from 193.203.244.233: exiting
060306 191625 23 Server connection on port 8081 from 193.203.244.233: starting
060306 191625 13 Call: getSegmentNames()
060306 191625 13 Return: [Ljava.lang.String;@1e859c0
060306 191635 12 Call: getSegmentNames()
060306 191635 12 Return: [Ljava.lang.String;@1e859c0
060306 191645 13 Call: getSegmentNames()
060306 191645 13 Return: [Ljava.lang.String;@1e859c0
060306 191655 16 Call: getSegmentNames()
060306 191655 16 Return: [Ljava.lang.String;@1e859c0
060306 191705 14 Call: getSegmentNames()
060306 191705 14 Return: [Ljava.lang.String;@1e859c0

Everything else has worked so well, and the self-same experiment works
fine under 0.7.1. Could this be a bug?

Can someone advise what to do?

Many thanks,

Monu Ogbe


Re: Help with "bin/nutch server 8081 crawl"

Doug Cutting
Monu Ogbe wrote:
> Caused by: java.lang.InstantiationException: org.apache.nutch.searcher.Query
>         at java.lang.Class.newInstance0(Unknown Source)
>         at java.lang.Class.newInstance(Unknown Source)
>         at org.apache.hadoop.io.WritableFactories.newInstance(WritableFactories.java:45)

It looks like Query no longer has a no-arg constructor, probably since
the patch which makes all Configurations non-static.  A no-arg
constructor is required in order to pass something via an RPC.  The fix
might be as simple as adding the no-arg constructor, but perhaps not,
since the query would then have a null configuration.  At a glance, the
query execution code doesn't appear to use the configuration, so this
might work...

Doug
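
To make the failure concrete, here is a small, self-contained Java sketch
(simplified stand-ins only, not the actual Hadoop or Nutch source) of the
reflective instantiation that the stack trace above goes through:
Class.newInstance() can only build objects that have an accessible no-arg
constructor, so a Query whose only constructor takes a Configuration fails
with InstantiationException.

// Illustration only: models WritableFactories-style reflective instantiation.
public class NoArgCtorDemo {

  // Stand-in for Query after the "non-static Configuration" change:
  // its only constructor takes a configuration object.
  static class QueryLike {
    private final Object conf;
    QueryLike(Object conf) { this.conf = conf; }
  }

  // Stand-in for Query with the proposed fix: a public no-arg constructor,
  // so reflection can create it and the configuration can be set afterwards.
  static class FixedQueryLike {
    private Object conf;
    public FixedQueryLike() { }
    public void setConf(Object conf) { this.conf = conf; }
  }

  // Mimics the factory: instantiate from the class alone; fields are filled in later.
  static Object newInstance(Class<?> c) throws Exception {
    return c.newInstance();
  }

  public static void main(String[] args) throws Exception {
    System.out.println(newInstance(FixedQueryLike.class));  // works
    System.out.println(newInstance(QueryLike.class));       // throws InstantiationException
  }
}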

Re: Help with "bin/nutch server 8081 crawl"

Andrzej Białecki-2
Doug Cutting wrote:

> Monu Ogbe wrote:
>> Caused by: java.lang.InstantiationException: org.apache.nutch.searcher.Query
>>         at java.lang.Class.newInstance0(Unknown Source)
>>         at java.lang.Class.newInstance(Unknown Source)
>>         at org.apache.hadoop.io.WritableFactories.newInstance(WritableFactories.java:45)
>
> It looks like Query no longer has a no-arg constructor, probably since
> the patch which makes all Configurations non-static.  A no-arg
> constructor is required in order to pass something via an RPC.  The
> fix might be as simple as adding the no-arg constructor, but perhaps
> not, since the query would then have a null configuration.  At a
> glance, the query execution code doesn't appear to use the
> configuration, so this might work...

Configuration is used in Clause.toString() to check which fields are raw.

Does RPC set the current Configuration when it instantiates objects that
implement Configurable? Perhaps it should, using the current JobConf.
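
As an aside, here is a small self-contained sketch of the injection idea raised
above (simplified names, with java.util.Properties standing in for Hadoop's
Configuration, so this is not the real API); the Server.java patch in the next
message does essentially this right after instantiating the parameter.

import java.util.Properties;

// Illustration only: create the RPC parameter reflectively, then hand it the
// server's configuration if it declares that it wants one.
public class ConfigurableInjectionDemo {

  interface Configurable {                    // models org.apache.hadoop.conf.Configurable
    void setConf(Properties conf);
    Properties getConf();
  }

  public static class QueryParam implements Configurable {
    private Properties conf;
    public QueryParam() { }                   // no-arg constructor, needed for reflection
    public void setConf(Properties conf) { this.conf = conf; }
    public Properties getConf() { return conf; }
  }

  // Models the patched parameter construction: instantiate, then inject the conf.
  static Object newParam(Class<?> paramClass, Properties serverConf) throws Exception {
    Object param = paramClass.newInstance();
    if (param instanceof Configurable) {
      ((Configurable) param).setConf(serverConf);
    }
    return param;
  }

  public static void main(String[] args) throws Exception {
    Properties conf = new Properties();
    conf.setProperty("searcher.dir", "/usr/local/nutch/nutch-2006-03-02/monu-conf");
    QueryParam q = (QueryParam) newParam(QueryParam.class, conf);
    System.out.println(q.getConf());          // the injected configuration, not null
  }
}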

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Help with "bin/nutch server 8081 crawl"

Marko Bauhardt-2
These two patches could fix the problem. The first is a Hadoop patch and the
other is a Nutch patch. I don't know whether I should create issues in the
Nutch and Hadoop JIRAs.
Anyway... here are the two patches.



Index: src/java/org/apache/hadoop/ipc/Server.java
===================================================================
--- src/java/org/apache/hadoop/ipc/Server.java (revision 383691)
+++ src/java/org/apache/hadoop/ipc/Server.java (working copy)
@@ -33,6 +33,7 @@
import java.util.logging.Level;
import org.apache.hadoop.util.LogFormatter;
+import org.apache.hadoop.conf.Configurable;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.UTF8;
@@ -59,6 +60,8 @@
    private LinkedList callQueue = new LinkedList(); // queued calls
    private Object callDequeued = new Object();     // used by wait/notify
+  private Configuration conf;
+
    /** A call queued for handling. */
    private static class Call {
      private int id;                               // the client's call id
@@ -234,6 +237,7 @@
      this.handlerCount = handlerCount;
      this.maxQueuedCalls = handlerCount;
      this.timeout = conf.getInt("ipc.client.timeout",10000);
+    this.conf = conf;
    }
    /** Sets the timeout used for network i/o. */
@@ -280,6 +284,9 @@
      Writable param;                               // construct param
      try {
        param = (Writable)paramClass.newInstance();
+      if(param instanceof Configurable) {
+        ((Configurable) param).setConf(conf);
+      }
      } catch (InstantiationException e) {
        throw new RuntimeException(e.toString());
      } catch (IllegalAccessException e) {




Index: src/java/org/apache/nutch/searcher/Query.java
===================================================================
--- src/java/org/apache/nutch/searcher/Query.java (revision 376518)
+++ src/java/org/apache/nutch/searcher/Query.java (working copy)
@@ -26,13 +26,14 @@
import java.util.logging.Logger;
import org.apache.hadoop.util.LogFormatter;
+import org.apache.hadoop.conf.Configurable;
import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.analysis.NutchAnalysis;
import org.apache.nutch.util.NutchConfiguration;
import org.apache.hadoop.io.Writable;
/** A Nutch query. */
-public final class Query implements Writable, Cloneable {
+public final class Query implements Writable, Cloneable, Configurable {
    public static final Logger LOG =
      LogFormatter.getLogger("org.apache.nutch.searcher.Query");
@@ -286,6 +287,8 @@
    public Query(Configuration conf) {
        this.conf = conf;
    }
+
+  public Query() { }
    /** Return all clauses. */
    public Clause[] getClauses() {
@@ -456,4 +459,12 @@
        System.out.println("Translated: " + new QueryFilters
(conf).filter(query));
      }
    }
+
+  public void setConf(Configuration arg0) {
+    this.conf = arg0;
+  }
+
+  public Configuration getConf() {
+    return this.conf;
+  }
}






Re: Help with "bin/nutch server 8081 crawl"

Andrzej Białecki-2
Marko Bauhardt wrote:
> These two patches could fix the problem. The first is a Hadoop patch and
> the other is a Nutch patch. I don't know whether I should create issues in
> the Nutch and Hadoop JIRAs.

No need to do this for Nutch, I'm fixing a similar issue in
ParseSegment, I will apply this fix too.

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Help with "bin/nutch server 8081 crawl"

sudhendra seshachala
Are there plans to apply these patches to the nightly build, or do we have
to apply them manually to our copies?

Thanks
Sudhi

Andrzej Bialecki <[hidden email]> wrote: Marko Bauhardt wrote:
> These two patches could fix the problem. The first is a Hadoop patch and
> the other is a Nutch patch. I don't know whether I should create issues in
> the Nutch and Hadoop JIRAs.

No need to do this for Nutch, I'm fixing a similar issue in
ParseSegment, I will apply this fix too.

  Sudhi Seshachala
  http://sudhilogs.blogspot.com/
   


               

Re: Help with "bin/nutch server 8081 crawl"

Andrzej Białecki-2
sudhendra seshachala wrote:
> Are there plans to apply these patches to the nightly build, or do we have
> to apply them manually to our copies?

Already applied; you will need to build the latest version of Hadoop to
catch this.

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com