Nutch Result NULL


Nutch Result NULL

Michael Ji
Hi there,

I attached my Catalina output.

thanks,

Michael,

Note: forwarded message attached.


Jul 17, 2005 8:41:54 AM org.apache.coyote.http11.Http11Protocol init
INFO: Initializing Coyote HTTP/1.1 on http-8888
Starting service Tomcat-Standalone
Apache Tomcat/4.1.31
The scratchDir you specified: /home/fji/SE/tomcat4/work/Standalone/localhost/examples is unusable.
Jul 17, 2005 8:41:56 AM org.apache.struts.util.PropertyMessageResources <init>
INFO: Initializing, config='org.apache.struts.util.LocalStrings', returnNull=true
Jul 17, 2005 8:41:56 AM org.apache.struts.util.PropertyMessageResources <init>
INFO: Initializing, config='org.apache.struts.action.ActionResources', returnNull=true
Jul 17, 2005 8:41:56 AM org.apache.struts.util.PropertyMessageResources <init>
INFO: Initializing, config='org.apache.webapp.admin.ApplicationResources', returnNull=true
The scratchDir you specified: /home/fji/SE/tomcat4/work/Standalone/localhost/admin is unusable.
The scratchDir you specified: /home/fji/SE/tomcat4/work/Standalone/localhost/manager is unusable.
The scratchDir you specified: /home/fji/SE/tomcat4/work/Standalone/localhost/_ is unusable.
The scratchDir you specified: /home/fji/SE/tomcat4/work/Standalone/localhost/tomcat-docs is unusable.
The scratchDir you specified: /home/fji/SE/tomcat4/work/Standalone/localhost/webdav is unusable.
Jul 17, 2005 8:41:59 AM org.apache.coyote.http11.Http11Protocol start
INFO: Starting Coyote HTTP/1.1 on http-8888
Jul 17, 2005 8:41:59 AM org.apache.jk.common.ChannelSocket init
INFO: JK2: ajp13 listening on /0.0.0.0:8009
Jul 17, 2005 8:41:59 AM org.apache.jk.server.JkMain start
INFO: Jk running ID=0 time=1/110  config=/home/fji/SE/tomcat4/conf/jk2.properties
050717 084206 parsing file:/home/fji/SE/tomcat4/webapps/ROOT/WEB-INF/classes/nutch-default.xml
050717 084206 parsing file:/home/fji/SE/tomcat4/webapps/ROOT/WEB-INF/classes/nutch-site.xml
050717 084206 Plugins: looking in: /home/fji/SE/tomcat4/webapps/ROOT/WEB-INF/classes/plugins
050717 084206 not including: /home/fji/SE/tomcat4/webapps/ROOT/WEB-INF/classes/plugins/clustering-carrot2
050717 084206 not including: /home/fji/SE/tomcat4/webapps/ROOT/WEB-INF/classes/plugins/creativecommons
050717 084206 parsing: /home/fji/SE/tomcat4/webapps/ROOT/WEB-INF/classes/plugins/index-basic/plugin.xml
050717 084206 impl: point=org.apache.nutch.indexer.IndexingFilter class=org.apache.nutch.indexer.basic.BasicIndexingFilter
050717 084206 not including: /home/fji/SE/tomcat4/webapps/ROOT/WEB-INF/classes/plugins/index-more
050717 084206 not including: /home/fji/SE/tomcat4/webapps/ROOT/WEB-INF/classes/plugins/language-identifier
050717 084206 not including: /home/fji/SE/tomcat4/webapps/ROOT/WEB-INF/classes/plugins/ontology
050717 084206 not including: /home/fji/SE/tomcat4/webapps/ROOT/WEB-INF/classes/plugins/parse-ext
050717 084206 parsing: /home/fji/SE/tomcat4/webapps/ROOT/WEB-INF/classes/plugins/parse-html/plugin.xml
050717 084206 impl: point=org.apache.nutch.parse.Parser class=org.apache.nutch.parse.html.HtmlParser
050717 084206 parsing: /home/fji/SE/tomcat4/webapps/ROOT/WEB-INF/classes/plugins/parse-js/plugin.xml
050717 084206 impl: point=org.apache.nutch.parse.Parser class=org.apache.nutch.parse.js.JSParseFilter
050717 084206 impl: point=org.apache.nutch.parse.HtmlParseFilter class=org.apache.nutch.parse.js.JSParseFilter
050717 084206 not including: /home/fji/SE/tomcat4/webapps/ROOT/WEB-INF/classes/plugins/parse-msword
050717 084206 not including: /home/fji/SE/tomcat4/webapps/ROOT/WEB-INF/classes/plugins/parse-pdf
050717 084206 parsing: /home/fji/SE/tomcat4/webapps/ROOT/WEB-INF/classes/plugins/parse-text/plugin.xml
050717 084206 impl: point=org.apache.nutch.parse.Parser class=org.apache.nutch.parse.text.TextParser
050717 084206 not including: /home/fji/SE/tomcat4/webapps/ROOT/WEB-INF/classes/plugins/protocol-file
050717 084206 not including: /home/fji/SE/tomcat4/webapps/ROOT/WEB-INF/classes/plugins/protocol-ftp
050717 084206 not including: /home/fji/SE/tomcat4/webapps/ROOT/WEB-INF/classes/plugins/protocol-http
050717 084206 parsing: /home/fji/SE/tomcat4/webapps/ROOT/WEB-INF/classes/plugins/protocol-httpclient/plugin.xml
050717 084206 impl: point=org.apache.nutch.protocol.Protocol class=org.apache.nutch.protocol.httpclient.Http
050717 084206 impl: point=org.apache.nutch.protocol.Protocol class=org.apache.nutch.protocol.httpclient.Http
050717 084206 parsing: /home/fji/SE/tomcat4/webapps/ROOT/WEB-INF/classes/plugins/query-basic/plugin.xml
050717 084206 impl: point=org.apache.nutch.searcher.QueryFilter class=org.apache.nutch.searcher.basic.BasicQueryFilter
050717 084206 not including: /home/fji/SE/tomcat4/webapps/ROOT/WEB-INF/classes/plugins/query-more
050717 084206 parsing: /home/fji/SE/tomcat4/webapps/ROOT/WEB-INF/classes/plugins/query-site/plugin.xml
050717 084206 impl: point=org.apache.nutch.searcher.QueryFilter class=org.apache.nutch.searcher.site.SiteQueryFilter
050717 084206 parsing: /home/fji/SE/tomcat4/webapps/ROOT/WEB-INF/classes/plugins/query-url/plugin.xml
050717 084206 impl: point=org.apache.nutch.searcher.QueryFilter class=org.apache.nutch.searcher.url.URLQueryFilter
050717 084206 not including: /home/fji/SE/tomcat4/webapps/ROOT/WEB-INF/classes/plugins/urlfilter-prefix
050717 084206 parsing: /home/fji/SE/tomcat4/webapps/ROOT/WEB-INF/classes/plugins/urlfilter-regex/plugin.xml
050717 084206 impl: point=org.apache.nutch.net.URLFilter class=org.apache.nutch.net.RegexURLFilter
050717 084206 11 creating new bean
050717 084206 11 opening segment indexes in /home/fji/SE/tomcat4/segments
050717 084207 11 query request from 127.0.0.1
050717 084207 11 query: commer
050717 084207 11 searching for 20 raw hits
050717 084207 11 total hits: 0
Stopping service Tomcat-Standalone



Nutch returning NULL Result

Michael Ji
Hi there,

I am a newcomer to the Nutch world.

After installing Nutch and Tomcat on my Linux box, I
tried to crawl a single URL, using the command:
"
bin/nutch crawl url2 -dir crawl2 -depth 3 >&
crawl2.log
"

My url2 is a plain text file containing
"http://www.nutch.org/", and I did change
"urlfilter.txt".

But after crawling, I checked the crawl log, and it
seems it didn't fetch anything:
"
050717 083918 DONE indexing segment 20050717083916:
total 0 records in 0.19 s (NaN rec/s).
"

And the search returns NULL results in the web UI.

Any suggestion would be very helpful.

thanks,

Michael,

FYI, I attached the Catalina log for the search hit
(the same output shown in the message above).


search result

Michael Ji
In reply to this post by Michael Ji
Hi there:

I have a question about crawling depth vs. search
results. I've attached part of my log:

"
050722 181508 fetching
http://www.committemuse.com/content/committees.asp
:
:
050722 181508 fetching
:
050722 181508 status: segment 20050722181440, 100
pages, 4 errors, 1952888 bytes, 26204 ms
"

And I can see the segment in my Tomcat box.

But when I search for a specific word from that page,
it returns 0 hits.

Is that because the page is written in ASP?

thanks,

Michael,




Re: search result

Fredrik Andersson-2-2
Hi Michael.

Have you indexed the crawl/segment? Easy to forget sometimes :) Also,
check crawler-tools.xml (or whatever it's called), so that ASP pages
aren't blocked or anything. The Nutch crawler doesn't handle URL
parameters (committees.asp?viewPerson=Ji) by default; I guess that
could be an issue as well. No errors or funny stuff in the logs?
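
Concretely, something like this (segment name taken from your log;
<crawldir> is a placeholder for your crawl directory):

bin/nutch index <crawldir>/segments/20050722181440
grep -r "committees" <crawldir>/segments/

The first should complain if the segment is already indexed; the
second is just a crude check that the word made it into the data.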

Fredrik


Re: search result

Michael Ji
Hi Fredrik:

Actually, I used the nutch crawl command as follows:
"
bin/nutch crawl urls -dir crawl-s -depth 1 >&
crawl-s.log
"
I assume I don't need to run the index step explicitly
after the crawl. Is that right?

My sample crawl doesn't go deep; it stops at the home
page of the URL.

I guess -depth defines how far the crawl follows links
out from the initial page, is that right?

One thing I found: /segments/ ends up with the same
number of subdirectories (all timestamped) as the
-depth parameter.
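
For example (directory name from my run; the listing
is my reconstruction):
"
ls crawl-s/segments/
20050722181440
"
i.e., with -depth 1 there is exactly one timestamped
segment directory.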

thanks a lot,

Michael,


Re: search result

Fredrik Andersson-2-2
No, I think you're right that indexing is done automatically after
intranet crawls. Just try "bin/nutch index yourSegment"; if it says
'index.done exists already', then, well, you get the point. I don't
know what platform you're using, but try doing a "grep -r <some
text in your crawled site> *". The grep command should match on both
your segment data and the binary index that has been built.
I have run into a similar problem, where the web search does not
work but a manual search using the IndexSearcher class does.
Also, try opening your index in the LUKE program if you haven't
already. It's a very handy tool for validating and test-searching
your data.
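
(If you haven't used Luke before: it's a single jar that you launch
and then point at your index directory from its open-index dialog;
the jar name varies by release, but roughly:

java -jar luke.jar
)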

Good luck,
Fredrik


Re: search result

Michael Ji
Hi Fredrik:

Does the command
"
bin/nutch crawl * -dir * -depth d
"
only work for an intranet, i.e., can it only fetch
within one particular domain?

I want to do global fetching, but restricted to a
limited list of websites, so should I use the command
set of
"bin/nutch admin db -create
...
"
instead?
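
For concreteness, the whole-web sequence I have in
mind (my rough reconstruction from the tutorial; flags
and order approximate):
"
bin/nutch admin db -create
bin/nutch inject db -urlfile urls
bin/nutch generate db segments
s=`ls -d segments/2* | tail -1`
bin/nutch fetch $s
bin/nutch updatedb db $s
bin/nutch index $s
"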

In any case, I will try your "bin/nutch index"
command.

thanks,

Michael,


Re: search result

Michael Ji
In reply to this post by Fredrik Andersson-2-2
Hi Fredrik:

After crawling in Nutch, I copied the segments to the
root of Tomcat.

I wonder if I need to do the same thing for the index
and db directories.

thanks,

Michael,


search depth

Michael Ji
In reply to this post by Fredrik Andersson-2-2

Hi there,

I found some weird behavior in Nutch when I use depth
= 10.

For example,
"
bin/nutch crawl url1 -dir crawl2 -depth 10 >&
crawl2.log
"

I did a crawl of one single website, with depth 10.
It seems it can't finish.

There are only 7 subdirectories under segments/; if
it had finished successfully, there should be 10.

And I checked the log: it ends while fetching a page
and doesn't go on.

Is there a depth limitation in Nutch?

thanks,

Michael,



               

Re: search result

Roger Dunk
In reply to this post by Michael Ji
Michael,

You DON'T need to copy the segments or db to the root of Tomcat, but
you DO need to start Tomcat from the directory directly above the
segments directory (or from the crawl directory, if you've done an
intranet crawl).

E.g., if you have /usr/local/nutch/segments, you might type:
> cd /usr/local/nutch
> /usr/local/tomcat/bin/catalina start
to start Tomcat.

It's all explained in the tutorial at
http://lucene.apache.org/nutch/tutorial.html
Just follow it step by step and you should be OK.

Cheers...
Roger


fetching behavior of Nutch

Michael Ji
In reply to this post by Michael Ji
Hi there,

1)
I ran several test crawls fetching pages from two
websites, with a fetching depth of 10.

After checking the log files, I found that the number
of links actually fetched differs greatly between the
two sites.

On one site with lots of news, only the first two
depth rounds ran well, fetching only 5 links; the
actual number of links on that site is far beyond
that.

The other site fetched through all 10 rounds and
pulled in hundreds of links.

Has anyone had a similar experience? Should I adjust
the configuration files in /conf/?

2)
Also, in the Nutch /conf/ directory I found several
configuration files. Actually, I only modified
crawl-urlfilter.txt to make it accept all URLs (*.*).

Is that proper?
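
(Concretely, by "accept all" I mean the filter ends
with a single catch-all accept rule, something like:
"
+.
"
in place of the default per-domain pattern; this is
my reconstruction of the change.)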

I really haven't touched the other conf files. Is
there a guideline for how to use them?

thanks,

Michael,




Nutch's intranet VS internet crawling

Michael Ji

I wonder if there is any difference between these two.
Or must an intranet crawl name the intranet site
explicitly in crawl-urlfilter.txt under /conf?

thanks,

Michael,



fetcher blocked

em-13
I'm using 0.7 from a few weeks ago.
I was fetching 204,456 pages.

Nutch segread tells me that:
"segments\20050723140812 is corrupt, using only 207126 entries."

Here's what I get with Ctrl-Break:

Full thread dump Java HotSpot(TM) Client VM (1.4.2_08-b03 mixed mode):

"MultiThreadedHttpConnectionManager cleanup" daemon prio=5 tid=0x032605e8
nid=0x1034 in Object.wait() [3b0f000..3b0fd8c]
        at java.lang.Object.wait(Native Method)
        at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:111)
        - locked <0x150622f0> (a java.lang.ref.ReferenceQueue$Lock)
        at
org.apache.commons.httpclient.MultiThreadedHttpConnectionManager$ReferenceQu
eueThread.run(MultiThreadedHttpConnectionManager.java:1100)

"fetcher8" prio=5 tid=0x03230888 nid=0x1d04 in Object.wait()
[354f000..354fd8c]
        at java.lang.Object.wait(Native Method)
        at
org.apache.commons.httpclient.MultiThreadedHttpConnectionManager.doGetConnec
tion(MultiThreadedHttpConnectionManager.java:509)
        - locked <0x15063090> (a
org.apache.commons.httpclient.MultiThreadedHttpConnectionManager$ConnectionP
ool)
        at
org.apache.commons.httpclient.MultiThreadedHttpConnectionManager.getConnecti
onWithTimeout(MultiThreadedHttpConnectionManager.java:394)
        at
org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDir
ector.java:152)
        at
org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:393)
        at
org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:324)
        at
org.apache.nutch.protocol.httpclient.HttpResponse.<init>(HttpResponse.java:7
6)
        at
org.apache.nutch.protocol.httpclient.Http.getProtocolOutput(Http.java:213)
        at
org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:135)

"Signal Dispatcher" daemon prio=10 tid=0x009ff8a8 nid=0xfdc waiting on
condition [0..0]

"Finalizer" daemon prio=9 tid=0x009fcd20 nid=0x900 in Object.wait()
[2f7f000..2f7fd8c]
        at java.lang.Object.wait(Native Method)
        at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:111)
        - locked <0x14ee5b08> (a java.lang.ref.ReferenceQueue$Lock)
        at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:127)
        at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:159)

"Reference Handler" daemon prio=10 tid=0x009fb9a0 nid=0xeb4 in Object.wait()
[2f3f000..2f3fd8c]
        at java.lang.Object.wait(Native Method)
        at java.lang.Object.wait(Object.java:429)
        at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:115)
        - locked <0x14ee5b70> (a java.lang.ref.Reference$Lock)

"main" prio=5 tid=0x000366e0 nid=0x10d0 waiting on condition [7f000..7fc38]
        at java.lang.Thread.sleep(Native Method)
        at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:342)
        at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:479)

"VM Thread" prio=5 tid=0x00a3b720 nid=0x1230 runnable

"VM Periodic Task Thread" prio=10 tid=0x00a3d360 nid=0xae0 waiting on
condition
"Suspend Checker Thread" prio=10 tid=0x009feeb0 nid=0x27c runnable




RE: fetching behavior of Nutch

Howie Wang
In reply to this post by Michael Ji
There are probably two settings you'll need to tweak
in nutch-default.xml.

http.content.limit -- by default it's 64K; if the page is
larger than that, Nutch essentially truncates the file.
You could be missing lots of links that appear later in
the page.

max.outlinks.per.page -- by default it's 100. You might
want to increase this, since on pages with something like
a nested navigation sidebar with tons of links, it won't
get any links from the main part of the page.
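
For example, overriding both (the values here are only
illustrative; the same <property> blocks can also go in
nutch-site.xml, which takes precedence over
nutch-default.xml):

<property>
  <name>http.content.limit</name>
  <value>262144</value>
</property>

<property>
  <name>max.outlinks.per.page</name>
  <value>1000</value>
</property>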

The *.xml files are fairly descriptive, so just reading
through them can be pretty helpful. I don't know if there
is a full guide to the config files.

Howie




RE: fetching behavior of Nutch

Michael Ji
Thanks Howie, that helps guide me.

Michael,


Re: Nutch's intranet VS internet crawling

Jack.Tang
In reply to this post by Michael Ji
Hi Michael

I think the difference is that some information in an
intranet crawl is private, while internet content is
public.

/Jack



--
Keep Discovering ... ...
http://www.jroller.com/page/jmars

http.max.delays

Michael Ji
In reply to this post by Michael Ji
Hi there:

I checked the log file and found that some site links
hit the error "Exceeded http.max.delays: retry later".

I changed the corresponding value in the conf file,
nutch-default.xml; I changed it to 300, but it seems
that's still not enough. Will that affect crawling
performance?
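
For reference, the property block I edited looks like
this (300 is just the value I tried):

<property>
  <name>http.max.delays</name>
  <value>300</value>
</property>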

Any idea?

thanks,

Michael

__________________________________________________
Do You Yahoo!?
Tired of spam?  Yahoo! Mail has the best spam protection around
http://mail.yahoo.com