Crawl www.yahoo.com using nutch 0.9

Meryl Silverburgh
I am trying to set up Nutch 0.9 to crawl www.yahoo.com.
I am using this command: "bin/nutch crawl urls -dir crawl -depth 3".

But after the command finishes, no links have been fetched.

Is there something I need to set up before www.yahoo.com can be crawled?
Thanks for any help. I have struggled with this problem for days.
I have also tried Nutch 0.8.1 and I get the same problem. I am
able to crawl www.cnn.com with the same setup.


Here is the output:
crawl started in: crawl
rootUrlDir = urls
threads = 10
depth = 3
Injector: starting
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: crawl/segments/20070416230326
Generator: filtering: false
Generator: topN: 2147483647
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: starting
Fetcher: segment: crawl/segments/20070416230326
Fetcher: threads: 10
fetching http://www.yahoo.com/
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: crawl/crawldb
CrawlDb update: segments: [crawl/segments/20070416230326]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: crawl/segments/20070416230338
Generator: filtering: false
Generator: topN: 2147483647
Generator: jobtracker is 'local', generating exactly one partition.
Generator: 0 records selected for fetching, exiting ...
Stopping at depth=1 - no more URLs to fetch.
LinkDb: starting
LinkDb: linkdb: crawl/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: adding segment: crawl/segments/20070416230326
LinkDb: done
Indexer: starting
Indexer: linkdb: crawl/linkdb
Indexer: adding segment: crawl/segments/20070416230326
 Indexing [http://www.yahoo.com/] with analyzer org.apache.nutch.analysis.NutchDocumentAnalyzer@e64686 (null)
Optimizing index.
merging segments _ram_0 (1 docs) into _0 (1 docs)
Indexer: done
Dedup: starting
Dedup: adding indexes in: crawl/indexes
Dedup: done
merging indexes to: crawl/index
Adding crawl/indexes/part-00000
done merging
crawl finished: crawl
CrawlDb topN: starting (topN=25, min=0.0)
CrawlDb db: crawl/crawldb
CrawlDb topN: collecting topN scores.
CrawlDb topN: done
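
For anyone reading a log like the one above, the two lines that matter most are the "fetching" lines and the "Stopping at depth=..." line. Here only the seed page was fetched and the second generate round selected 0 records, which suggests no outlinks from the seed made it into the crawldb. A minimal sketch in Python that pulls those two facts out of a saved log; the function name `summarize_crawl_log` and the shortened sample text are illustrative assumptions, not part of Nutch:

```python
import re

def summarize_crawl_log(log_text):
    """Return (list of fetched URLs, depth at which the crawl stopped or None)."""
    # Lines of the form "fetching <url>" are printed once per fetched URL.
    fetched = re.findall(r"^fetching (\S+)$", log_text, flags=re.MULTILINE)
    # "Stopping at depth=N" appears when a generate round selects nothing.
    stop = re.search(r"^Stopping at depth=(\d+)", log_text, flags=re.MULTILINE)
    stopped_at = int(stop.group(1)) if stop else None
    return fetched, stopped_at

# Shortened excerpt of the output posted above.
sample = """\
Fetcher: threads: 10
fetching http://www.yahoo.com/
Fetcher: done
Generator: 0 records selected for fetching, exiting ...
Stopping at depth=1 - no more URLs to fetch.
"""

fetched, stopped_at = summarize_crawl_log(sample)
print("fetched:", fetched)
print("stopped at depth:", stopped_at)
```

If the fetched list contains only the seed and the crawl stops at depth 1, the next thing to check is why the seed page's links were not added, for example the URL filter rules.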

Re: Crawl www.yahoo.com using nutch 0.9

Tanmoy Kumar Mukherjee
Did you change the regular expression in url-filter.txt? That
could be the only problem.


Tanmoy

Re: Crawl www.yahoo.com using nutch 0.9

Meryl Silverburgh
On 4/18/07, Tanmoy Kumar Mukherjee <[hidden email]> wrote:
> did u change the regular expression in the url-filter.txt???? That
> could be the only problem.
>
>

Yes, I did. I have this in my crawl-urlfilter.txt:

# accept hosts in MY.DOMAIN.NAME
+^http://([a-zA-Z0-9]*\.)*(cnn.com|yahoo.com)/
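
The rule can be checked outside Nutch to see which URLs it would accept. This is only a rough sketch in Python: it assumes the filter applies the pattern with search-style matching, and that Python's `re` treats this particular pattern the same way Java's regex engine does. The helper name `accepted` and the sample URLs are made up for illustration.

```python
import re

# Pattern copied from the crawl-urlfilter.txt rule above, without the leading "+".
RULE = re.compile(r"^http://([a-zA-Z0-9]*\.)*(cnn.com|yahoo.com)/")

def accepted(url):
    """Return True if the accept rule matches the URL."""
    return RULE.search(url) is not None

# Illustrative URLs, not taken from the crawl.
samples = [
    "http://www.yahoo.com/",    # the seed URL
    "http://news.yahoo.com/x",  # a subdomain with a path
    "http://www.yahoo.com",     # no trailing slash
    "https://www.yahoo.com/",   # https instead of http
]

for url in samples:
    print(url, accepted(url))
```

Under these assumptions the seed URL passes the rule, while URLs without a trailing slash after the host, or with an https scheme, do not. The dots in "cnn.com" and "yahoo.com" are unescaped, so they match any character; writing them as "\." would be stricter.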



> Tanmoy
>

Re: Crawl www.yahoo.com using nutch 0.9

Tanmoy Kumar Mukherjee
Instead of the complete expression, just try http://yahoo.com

----- Original Message -----
From: Meryl Silverburgh <[hidden email]>
Date: Thursday, April 19, 2007 8:34 am
Subject: Re: Crawl www.yahoo.com using nutch 0.9
To: [hidden email]

> On 4/18/07, Tanmoy Kumar Mukherjee <[hidden email]> wrote:
> > did u change the regular expression in the url-filter.txt???? That
> > could be the only problem.
> >
> >
>
> Yes. I did. I have this in my crawl-urlfilter.txt
>
> # accept hosts in MY.DOMAIN.NAME
> +^http://([a-zA-Z0-9]*\.)*(cnn.com|yahoo.com)/
>
>
>
> > Tanmoy
> >
>