Not able to crawl local file system: need help


Not able to crawl local file system: need help

Garnier

Nutch experts:

Here’s the problem:

1. Downloaded Nutch 0.9 from the site.
2. Modified the required files to crawl on Linux.
3. The HTTP crawl was successful and an index was created.
4. Modified the files to run a local filesystem crawl.
5. Googled and found the following links:

http://wiki.apache.org/nutch/FAQ#head-c721b23b43b15885f5ea7d8da62c1c40a37878e6
http://www.folge2.de/tp/search/a/crawling-the-local-filesystem-with-nutch
6. Modified the files as mentioned there (the changes are sketched below).
7. The crawl fails with the error below.
(The config file format seems to be fine, but I am not able to debug the error. I went through CHANGES.txt, which says the following has already been fixed:
53. NUTCH-384 - Protocol-file plugin does not allow the parse plugins
    framework to operate properly (Heiko Dietze via mattmann))
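
For context, the edits those links describe amount to roughly the following (a sketch of the typical setup rather than my exact files; the default plugin list shipped with Nutch 0.9 may differ slightly):

In conf/nutch-site.xml, make sure protocol-file is part of plugin.includes (this property replaces the default list from nutch-default.xml rather than extending it):

  <!-- illustrative value: enable the file protocol plugin -->
  <property>
    <name>plugin.includes</name>
    <value>protocol-file|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  </property>

In conf/crawl-urlfilter.txt, stop skipping file: URLs and accept the directory being indexed:

  # default rule that skips file:, ftp: and mailto: URLs -- comment it out
  # -^(file|ftp|mailto):
  # accept everything under the target directory
  +^file:///hm/garnier/TOOLTOX/Builds/user-docs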


I am not sure why the local crawl fails; may I request help from the experts?

Regards,
Garnier

Error:
crawl started in: crawl
rootUrlDir = urls
threads = 10
depth = 3
topN = 1000
Injector: starting
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: crawl/segments/20080228114148
Generator: filtering: false
Generator: topN: 1000
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: starting
Fetcher: segment: crawl/segments/20080228114148
Fetcher: threads: 10
fetching file:///hm/garnier/TOOLTOX/Builds/user-docs
fetch of file:///hm/garnier/TOOLTOX/Builds/user-docs failed with: org.apache.nutch.protocol.ProtocolNotFound: protocol not found for url=file
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: crawl/crawldb
CrawlDb update: segments: [crawl/segments/20080228114148]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: crawl/segments/20080228114155
Generator: filtering: false
Generator: topN: 1000
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: starting
Fetcher: segment: crawl/segments/20080228114155
Fetcher: threads: 10
fetching file:///hm/garnier/TOOLTOX/Builds/user-docs
fetch of file:///hm/garnier/TOOLTOX/Builds/user-docs failed with: org.apache.nutch.protocol.ProtocolNotFound: protocol not found for url=file
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: crawl/crawldb
CrawlDb update: segments: [crawl/segments/20080228114155]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: crawl/segments/20080228114201
Generator: filtering: false
Generator: topN: 1000
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: starting
Fetcher: segment: crawl/segments/20080228114201
Fetcher: threads: 10
fetching file:///hm/garnier/TOOLTOX/Builds/user-docs
fetch of file:///hm/garnier/TOOLTOX/Builds/user-docs failed with: org.apache.nutch.protocol.ProtocolNotFound: protocol not found for url=file
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: crawl/crawldb
CrawlDb update: segments: [crawl/segments/20080228114201]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: done
LinkDb: starting
LinkDb: linkdb: crawl/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: adding segment: crawl/segments/20080228114148
LinkDb: adding segment: crawl/segments/20080228114155
LinkDb: adding segment: crawl/segments/20080228114201
LinkDb: done
Indexer: starting
Indexer: linkdb: crawl/linkdb
Indexer: adding segment: crawl/segments/20080228114148
Indexer: adding segment: crawl/segments/20080228114155
Indexer: adding segment: crawl/segments/20080228114201
Optimizing index.
Indexer: done
Dedup: starting
Dedup: adding indexes in: crawl/indexes
Exception in thread "main" java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
        at org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java:439)
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:135)

Re: Not able to crawl local file system: need help

Ismael Hasan Romero
Hello. I think your problem might be the same one exposed (and answered) here:

http://www.nabble.com/Exception-in-DeleteDuplicates.java-td14781941.html

One piece of advice: when you hit an error, try googling just the error message; answers are easy to find that way if somebody else has already run into the same problem.
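
Also, looking at your log, the root cause appears to be the repeated "ProtocolNotFound: protocol not found for url=file" lines: nothing is fetched, which would leave the index under crawl/indexes empty, and an empty index is a common reason for the final dedup job failing like that. It is worth double-checking that protocol-file really appears in the plugin.includes value of conf/nutch-site.xml, and you can confirm that nothing was fetched with (assuming the standard bin/nutch script):

  bin/nutch readdb crawl/crawldb -stats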

Good luck!

