why TOTAL urls: 1

why TOTAL urls: 1

Olive g
Hello everyone,

I am also running a distributed crawl on 0.8.0 (some dev version), and somehow
the stats always return TOTAL urls as 1 when I crawl sites such as
www.yahoo.com. My filter file allows everything. What might be the problem?
There was no obvious error in the log files, and the job completed
successfully.
Thank you

060308 064420  map 0%
060308 064427  map 100%
060308 064433  reduce 100%
060308 064433 Job complete: job_ljydgp
060308 064434 parsing file:/root/nutch/conf/nutch-default.xml
060308 064434 parsing file:/root/nutch/conf/nutch-site.xml
060308 064436 Statistics for CrawlDb: /user/root/crawl-20060307224144/crawldb
060308 064436 TOTAL urls:       1
060308 064436 avg score:        1.0
060308 064436 max score:        1.0
060308 064436 min score:        1.0
060308 064436 retry 0:  1
060308 064436 status 2 (DB_fetched):    1
060308 064437 CrawlDb statistics: done
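
For anyone comparing these numbers across several runs, here is a minimal sketch that turns the statistics lines above into a dictionary. It assumes the log layout shown in this post (date, time, then `key: value`); the function name `parse_crawldb_stats` is made up for illustration and is not part of Nutch.

```python
import re

# Log excerpt copied from the post above.
SAMPLE_LOG = """\
060308 064436 Statistics for CrawlDb: /user/root/crawl-20060307224144/crawldb
060308 064436 TOTAL urls:       1
060308 064436 avg score:        1.0
060308 064436 max score:        1.0
060308 064436 min score:        1.0
060308 064436 retry 0:  1
060308 064436 status 2 (DB_fetched):    1
060308 064437 CrawlDb statistics: done
"""

# Assumed layout: "<yymmdd> <hhmmss> <key>: <value>".
LINE_RE = re.compile(r"^\d{6}\s+\d{6}\s+(?P<key>.+?):\s+(?P<value>\S+)\s*$")


def parse_crawldb_stats(log_text):
    """Return a dict mapping each statistics key to its value.

    Numeric values become int or float; anything else stays a string.
    Lines that do not match the assumed layout are skipped.
    """
    stats = {}
    for line in log_text.splitlines():
        match = LINE_RE.match(line)
        if match is None:
            continue
        key, value = match.group("key"), match.group("value")
        try:
            stats[key] = float(value) if "." in value else int(value)
        except ValueError:
            stats[key] = value
    return stats


if __name__ == "__main__":
    for key, value in parse_crawldb_stats(SAMPLE_LOG).items():
        print(f"{key!r}: {value!r}")
```

Here a `TOTAL urls` value equal to the `status 2 (DB_fetched)` count means every known URL has been fetched and no new links were added to the CrawlDb.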


Re: why TOTAL urls: 1

Stefan Groschupf-2
I guess yahoo.com has a robots.txt that blocks crawling of the whole site.
Also check the crawl depth you use.
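
One way to test the robots.txt guess outside of Nutch is a small script like the sketch below, which uses Python's standard `urllib.robotparser`. The robots.txt content, the agent name `MyCrawler`, and the example.com URLs are invented for illustration; they are not what yahoo.com actually serves, so paste in the real file and your configured agent name to check your own case.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
EXAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # "MyCrawler" is a placeholder agent name.
    for url in ("http://example.com/index.html",
                "http://example.com/private/page.html"):
        print(url, is_allowed(EXAMPLE_ROBOTS, "MyCrawler", url))
```

If the start page itself is allowed but the stats still show a single URL, the crawl depth is the next thing to look at.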


On 08.03.2006 at 17:53, Olive g wrote:

> Hello everyone,
>
> I am also running a distributed crawl on 0.8.0 (some dev version), and
> somehow the stats always return TOTAL urls as 1 when I crawl sites such
> as www.yahoo.com. My filter file allows everything. What might be the
> problem? There was no obvious error in the log files, and the job
> completed successfully.
> Thank you
>
> 060308 064420  map 0%
> 060308 064427  map 100%
> 060308 064433  reduce 100%
> 060308 064433 Job complete: job_ljydgp
> 060308 064434 parsing file:/root/nutch/conf/nutch-default.xml
> 060308 064434 parsing file:/root/nutch/conf/nutch-site.xml
> 060308 064436 Statistics for CrawlDb: /user/root/crawl-20060307224144/crawldb
> 060308 064436 TOTAL urls:       1
> 060308 064436 avg score:        1.0
> 060308 064436 max score:        1.0
> 060308 064436 min score:        1.0
> 060308 064436 retry 0:  1
> 060308 064436 status 2 (DB_fetched):    1
> 060308 064437 CrawlDb statistics: done

---------------------------------------------------------------
company:        http://www.media-style.com
forum:        http://www.text-mining.org
blog:            http://www.find23.net