Installing Nutch on Windows

Matt Pasiewicz
I'm a Nutch newbie. Can anyone out there point me to some good
documentation for installing Nutch on Windows?

I've got Tomcat and Cygwin up and running, but I've not been able to get
beyond that.
 
 
Matt Pasiewicz
EDUCAUSE
4772 Walnut St. - Suite 206
Boulder, CO  80301-2538  USA

http://www.educause.edu
http://blog.educause.edu/mpasiewicz
[hidden email]
(303) 544-5679
 
 

Re: Installing Nutch on Windows

em-13
You should be fine if you follow the procedures outlined in the
tutorial.
At what point in the installation are you stuck?





Re: Installing Nutch on Windows

Tim Archambault
I am interested in this thread as well. Thanks for the post.


RE: Installing Nutch on Windows

J B-2
Hi Matt,

I am quite new to Nutch myself, but I will try to help since I have
managed to install Nutch on my PC. Where exactly does it go wrong?

Jon

>From: "Matt Pasiewicz" <[hidden email]>
>Reply-To: [hidden email]
>To: <[hidden email]>
>Subject: Installing Nutch on Windows
>Date: Wed, 1 Jun 2005 11:44:32 -0600
>
>I'm a nutch newbie.  Can anyone out there point me to some good
>documentation for installing Nutch on windows?
>
>I've got Tomcat and cygwin up and running, but I've not been able to get
>beyond that.


RE: Installing Nutch on Windows

Matt Pasiewicz
Well, thanks to Jon's Cygwin explanation, I feel like I'm getting a
little closer, but now I'm running into the problem shown in the log
below. Cygwin seems to see the path to NUTCH_JAVA_HOME
(/cygdrive/c/PROGRA~1/java/jre1.5.0_03) just fine, but something is
still going wrong. Any ideas?
 

-----------------------------------

        NUTCH_JAVA_HOME: not found
        run java in /cygdrive/c/PROGRA~1/java/jre1.5.0_03
        050601 154453 parsing file:/C:/cygwin/nutch/conf/nutch-default.xml
        050601 154453 parsing file:/C:/cygwin/nutch/conf/crawl-tool.xml
        050601 154453 parsing file:/C:/cygwin/nutch/conf/nutch-site.xml
        050601 154453 No FS indicated, using default:local
        Exception in thread "main" java.lang.RuntimeException: crawl.text already exists.
                at org.apache.nutch.tools.CrawlTool.main(CrawlTool.java:121)



RE: Installing Nutch on Windows

J B-2
Matt,

I don't even set NUTCH_JAVA_HOME, since it only exists to override the
normal JAVA_HOME. If you unset NUTCH_JAVA_HOME altogether, Nutch should
fall back to JAVA_HOME, which is enough.

The error at the bottom of the log,

>050601 154453 No FS indicated, using default:local
>Exception in thread "main" java.lang.RuntimeException: crawl.text already exists.

also suggests that the crawl directory from a previous run (crawl.text)
is still there. Remove or rename it before starting the crawl again.
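
The two suggestions above can be sketched as a couple of shell commands.
This is only a sketch: it assumes a Cygwin bash prompt whose current
directory is the Nutch install directory, and it takes the directory
name crawl.text from the log quoted above.

```shell
# Sketch only: run from the Nutch install directory in a Cygwin bash shell.

# Drop the override so that only JAVA_HOME is left for the launcher to use.
unset NUTCH_JAVA_HOME
echo "JAVA_HOME is: ${JAVA_HOME:-<not set>}"

# Remove the output of the earlier run. The name crawl.text comes from the log.
# Warning: this deletes the directory and everything inside it.
rm -rf crawl.text
```

If the old crawl data might still be useful, renaming the directory
with mv instead of deleting it is the safer choice.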

Again, I am new to this so I could be very wrong...

Jon




>From: "Matt Pasiewicz" <[hidden email]>
>Reply-To: [hidden email]
>To: <[hidden email]>
>Subject: RE: Installing Nutch on Windows
>Date: Wed, 1 Jun 2005 15:52:06 -0600
>
>Well, thanks to Jon's  Cygwin explanation, I
>feel like I'm getting a little closer, but now I'm getting a bit of a
>prob from the log below.  Cygwin seems to see the path to
>NUTCH_JAVA_HOME (/cygdrive/c/PROGRA~1/java/jre1.5.0_03) just fine, but
>something seems to be going wrong.  Any ideas?


RE: Installing Nutch on Windows

Matt Pasiewicz
Ah, yes, I'm inching ever closer now.  
Here is what I'm getting now.

--------------------------------------

        run java in /cygdrive/c/PROGRA~1/java/jre1.5.0_03
        050601 160524 parsing file:/C:/cygwin/nutch/conf/nutch-default.xml
        050601 160524 parsing file:/C:/cygwin/nutch/conf/crawl-tool.xml
        050601 160524 parsing file:/C:/cygwin/nutch/conf/nutch-site.xml
        050601 160524 No FS indicated, using default:local
        050601 160524 crawl started in: crawl.text
        050601 160524 rootUrlFile = urls
        050601 160524 threads = 10
        050601 160524 depth = 3
        050601 160525 Created webdb at LocalFS,C:\cygwin\nutch\crawl.text\db
        050601 160525 Starting URL processing
        050601 160525 Plugins: looking in: C:\cygwin\nutch\plugins
        050601 160525 not including: C:\cygwin\nutch\plugins\clustering-carrot2
        050601 160525 not including: C:\cygwin\nutch\plugins\creativecommons
        050601 160525 parsing: C:\cygwin\nutch\plugins\index-basic\plugin.xml
        050601 160525 impl: point=net.nutch.indexer.IndexingFilter class=net.nutch.indexer.basic.BasicIndexingFilter
        050601 160525 not including: C:\cygwin\nutch\plugins\index-more
        050601 160525 not including: C:\cygwin\nutch\plugins\language-identifier
        050601 160525 not including: C:\cygwin\nutch\plugins\ontology
        050601 160525 not including: C:\cygwin\nutch\plugins\parse-ext
        050601 160525 parsing: C:\cygwin\nutch\plugins\parse-html\plugin.xml
        050601 160525 impl: point=net.nutch.parse.Parser class=net.nutch.parse.html.HtmlParser
        050601 160525 not including: C:\cygwin\nutch\plugins\parse-mp3
        050601 160525 not including: C:\cygwin\nutch\plugins\parse-msword
        050601 160525 not including: C:\cygwin\nutch\plugins\parse-pdf
        050601 160525 not including: C:\cygwin\nutch\plugins\parse-rtf
        050601 160525 parsing: C:\cygwin\nutch\plugins\parse-text\plugin.xml
        050601 160525 impl: point=net.nutch.parse.Parser class=net.nutch.parse.text.TextParser
        050601 160525 not including: C:\cygwin\nutch\plugins\protocol-file
        050601 160525 not including: C:\cygwin\nutch\plugins\protocol-ftp
        050601 160525 parsing: C:\cygwin\nutch\plugins\protocol-http\plugin.xml
        050601 160525 impl: point=net.nutch.protocol.Protocol class=net.nutch.protocol.http.Http
        050601 160525 parsing: C:\cygwin\nutch\plugins\query-basic\plugin.xml
        050601 160525 impl: point=net.nutch.searcher.QueryFilter class=net.nutch.searcher.basic.BasicQueryFilter
        050601 160525 not including: C:\cygwin\nutch\plugins\query-more
        050601 160525 parsing: C:\cygwin\nutch\plugins\query-site\plugin.xml
        050601 160525 impl: point=net.nutch.indexer.IndexingFilter class=net.nutch.searcher.site.SiteIndexingFilter
        050601 160525 impl: point=net.nutch.searcher.QueryFilter class=net.nutch.searcher.site.SiteQueryFilter
        050601 160525 parsing: C:\cygwin\nutch\plugins\query-url\plugin.xml
        050601 160525 impl: point=net.nutch.searcher.QueryFilter class=net.nutch.searcher.url.URLQueryFilter
        050601 160525 not including: C:\cygwin\nutch\plugins\urlfilter-prefix
        050601 160525 not including: C:\cygwin\nutch\plugins\urlfilter-regex
        Exception in thread "main" java.lang.ExceptionInInitializerError
                at org.apache.nutch.db.WebDBInjector.addPage(WebDBInjector.java:437)
                at org.apache.nutch.db.WebDBInjector.injectURLFile(WebDBInjector.java:378)
                at org.apache.nutch.db.WebDBInjector.main(WebDBInjector.java:535)
                at org.apache.nutch.tools.CrawlTool.main(CrawlTool.java:134)
Caused by: java.lang.RuntimeException: org.apache.nutch.net.URLFilter not found.
        at org.apache.nutch.net.URLFilters.<clinit>(URLFilters.java:44)
        ... 4 more
 


RE: Installing Nutch on Windows

Chirag Chaman
You don’t have the plugins in there.

- Newer versions of Nutch require the plugins to be compiled.
- To be certain, also change the plugins path in the conf file to the
absolute path where the plugins are located.
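
One rough way to check the first point is to look for plugin directories
that contain a plugin.xml but no jar file. The sketch below rests on an
assumption that is not confirmed in this thread: that a compiled plugin
has its jar sitting next to its plugin.xml. The function name is made up
for illustration, and the path in the example call is the plugin
directory shown in the log.

```shell
# Sketch only: counts plugin directories that hold a plugin.xml but no jar.
# Assumption: a compiled plugin keeps its jar next to its plugin.xml.
count_plugins_without_jar() {
  plugin_dir="$1"
  missing=0
  for dir in "$plugin_dir"/*/; do
    # Skip anything that is not a plugin directory.
    [ -f "${dir}plugin.xml" ] || continue
    if ! ls "${dir}"*.jar >/dev/null 2>&1; then
      echo "no jar found in ${dir}" >&2
      missing=$((missing + 1))
    fi
  done
  # Print the count on stdout so callers can capture it.
  echo "$missing"
}

# Example call; adjust the path to your own install.
count_plugins_without_jar "/cygdrive/c/cygwin/nutch/plugins"
```

A count above zero would be consistent with plugins that have not been
built yet, though it does not prove it.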

CC-

