Nutch not indexing all fetched sites

dominik81
Hi,

With nutch-2008-06-26_04-01-58 I'm trying to index a few pages from the Microsoft Support knowledge base. I put the URLs into a file called 'urlall', which looks like this:


http://support.microsoft.com/kb/317507/en-us
http://support.microsoft.com/kb/295115/en-us
http://support.microsoft.com/kb/295117/en-us
http://support.microsoft.com/kb/840701/en-us
http://support.microsoft.com/kb/924611/en-us
http://support.microsoft.com/kb/158509/en-us
http://support.microsoft.com/kb/259258/en-us
http://support.microsoft.com/kb/287070/en-us

I want to index only those 8 pages. I run the following command to crawl them:

bin/nutch crawl /Users/dominik/Documents/MastersThesis/nutch/urls -dir /Users/dominik/ApplicationFolders/nutch-2008-06-26_04-01-58/crawl -depth 1 -topN 100 -threads 100

When the crawl finishes, only 5 of the 8 pages are indexed. Can you tell me why, or what I need to change so that all URLs from 'urlall' get indexed?
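
In case it's useful, here is how the seed file could be sanity-checked (a minimal sketch, assuming a Unix shell, and assuming 'urlall' sits inside the urls directory passed to the crawl command above):

# Report the detected file type; a plain URL list should show as ASCII text
file /Users/dominik/Documents/MastersThesis/nutch/urls/urlall
# Count lines that do not start with "http"; apart from blank lines this should be 0
grep -cv '^http' /Users/dominik/Documents/MastersThesis/nutch/urls/urlall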

Thank you!


Here's the output from the crawl command:

crawl started in: /Users/dominik/ApplicationFolders/nutch-2008-06-26_04-01-58/crawl
rootUrlDir = /Users/dominik/Documents/MastersThesis/nutch/urls
threads = 100
depth = 1
topN = 100
Injector: starting
Injector: crawlDb: /Users/dominik/ApplicationFolders/nutch-2008-06-26_04-01-58/crawl/crawldb
Injector: urlDir: /Users/dominik/Documents/MastersThesis/nutch/urls
Injector: Converting injected urls to crawl db entries.
Skipping {\rtf1\ansi\ansicpg1252\cocoartf949\cocoasubrtf330:java.net.MalformedURLException: no protocol: {\rtf1\ansi\ansicpg1252\cocoartf949\cocoasubrtf330
Skipping {\fonttbl\f0\fswiss\fcharset0 Helvetica;}:java.net.MalformedURLException: no protocol: {\fonttbl\f0\fswiss\fcharset0 Helvetica;}
Skipping {\colortbl;\red255\green255\blue255;}:java.net.MalformedURLException: no protocol: {\colortbl;\red255\green255\blue255;}
Skipping \paperw11900\paperh16840\margl1440\margr1440\vieww9000\viewh8400\viewkind0:java.net.MalformedURLException: no protocol: \paperw11900\paperh16840\margl1440\margr1440\vieww9000\viewh8400\viewkind0
Skipping \pard\tx566\tx1133\tx1700\tx2267\tx2834\tx3401\tx3968\tx4535\tx5102\tx5669\tx6236\tx6803\ql\qnatural\pardirnatural:java.net.MalformedURLException: no protocol: \pard\tx566\tx1133\tx1700\tx2267\tx2834\tx3401\tx3968\tx4535\tx5102\tx5669\tx6236\tx6803\ql\qnatural\pardirnatural
Skipping \f0\fs24 \cf0 \:java.net.MalformedURLException: no protocol: \f0\fs24 \cf0 \
Injector: Merging injected urls into crawl db.
Injector: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: /Users/dominik/ApplicationFolders/nutch-2008-06-26_04-01-58/crawl/segments/20080705120652
Generator: filtering: true
Generator: topN: 100
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: starting
Fetcher: segment: /Users/dominik/ApplicationFolders/nutch-2008-06-26_04-01-58/crawl/segments/20080705120652
Fetcher: threads: 100
fetching http://support.microsoft.com/kb/259258/en-us\
fetching http://support.microsoft.com/kb/317507/en-us\
fetching http://support.microsoft.com/kb/295117/en-us\
fetching http://support.microsoft.com/kb/158509/en-us\
fetching http://support.microsoft.com/kb/295115/en-us\
fetching http://support.microsoft.com/kb/287070/en-us}
fetching http://support.microsoft.com/kb/840701/en-us\
fetching http://support.microsoft.com/kb/924611/en-us\
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: /Users/dominik/ApplicationFolders/nutch-2008-06-26_04-01-58/crawl/crawldb
CrawlDb update: segments: [/Users/dominik/ApplicationFolders/nutch-2008-06-26_04-01-58/crawl/segments/20080705120652]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: done
LinkDb: starting
LinkDb: linkdb: /Users/dominik/ApplicationFolders/nutch-2008-06-26_04-01-58/crawl/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: adding segment: file:/Users/dominik/ApplicationFolders/nutch-2008-06-26_04-01-58/crawl/segments/20080705120652
LinkDb: done
Indexer: starting
Indexer: linkdb: /Users/dominik/ApplicationFolders/nutch-2008-06-26_04-01-58/crawl/linkdb
Indexer: adding segment: file:/Users/dominik/ApplicationFolders/nutch-2008-06-26_04-01-58/crawl/segments/20080705120652
IFD [Thread-151]: setInfoStream deletionPolicy=org.apache.lucene.index.KeepOnlyLastCommitDeletionPolicy@79f5f7
IW 0 [Thread-151]: setInfoStream: dir=org.apache.lucene.store.FSDirectory@/private/tmp/hadoop-dominik/mapred/local/index/_-173514222 autoCommit=true mergePolicy=org.apache.lucene.index.LogByteSizeMergePolicy@596e13 mergeScheduler=org.apache.lucene.index.ConcurrentMergeScheduler@49d560 ramBufferSizeMB=16.0 maxBuffereDocs=50 maxBuffereDeleteTerms=-1 maxFieldLength=10000 index=
 Indexing [http://support.microsoft.com/kb/287070/en-us}] with analyzer org.apache.nutch.analysis.NutchDocumentAnalyzer@9f6a5e (null)
 Indexing [http://support.microsoft.com/kb/295115/en-us\] with analyzer org.apache.nutch.analysis.NutchDocumentAnalyzer@9f6a5e (null)
 Indexing [http://support.microsoft.com/kb/317507/en-us\] with analyzer org.apache.nutch.analysis.NutchDocumentAnalyzer@9f6a5e (null)
 Indexing [http://support.microsoft.com/kb/840701/en-us\] with analyzer org.apache.nutch.analysis.NutchDocumentAnalyzer@9f6a5e (null)
 Indexing [http://support.microsoft.com/kb/924611/en-us\] with analyzer org.apache.nutch.analysis.NutchDocumentAnalyzer@9f6a5e (null)
Optimizing index.
IW 0 [Thread-151]: optimize: index now
IW 0 [Thread-151]:   flush: segment=_0 docStoreSegment=_0 docStoreOffset=0 flushDocs=true flushDeletes=false flushDocStores=true numDocs=5 numBufDelTerms=0
IW 0 [Thread-151]:   index before flush
flush postings as segment _0 numDocs=5

closeDocStore: 2 files to flush to segment _0

oldRAMSize=76608 newFlushedSize=30888 docs/MB=169.738 new/old=40.32%
IW 0 [Thread-151]: checkpoint: wrote segments file "segments_2"
IFD [Thread-151]: now checkpoint "segments_2" [1 segments ; isCommit = true]
IFD [Thread-151]: deleteCommits: now remove commit "segments_1"
IFD [Thread-151]: delete "segments_1"
IW 0 [Thread-151]: LMP: findMerges: 1 segments
IW 0 [Thread-151]: LMP:   level -1.0 to 2.6517506: 1 segments
IW 0 [Thread-151]: CMS: now merge
IW 0 [Thread-151]: CMS:   index: _0:C5
IW 0 [Thread-151]: CMS:   no more merges pending; now return
IW 0 [Thread-151]: CMS: now merge
IW 0 [Thread-151]: CMS:   index: _0:C5
IW 0 [Thread-151]: CMS:   no more merges pending; now return
IW 0 [Thread-151]: now flush at close
IW 0 [Thread-151]:   flush: segment=null docStoreSegment=null docStoreOffset=0 flushDocs=false flushDeletes=false flushDocStores=false numDocs=0 numBufDelTerms=0
IW 0 [Thread-151]:   index before flush _0:C5
IW 0 [Thread-151]: at close: _0:C5
Indexer: done
Dedup: starting
Dedup: adding indexes in: /Users/dominik/ApplicationFolders/nutch-2008-06-26_04-01-58/crawl/indexes
Dedup: done
merging indexes to: /Users/dominik/ApplicationFolders/nutch-2008-06-26_04-01-58/crawl/index
Adding file:/Users/dominik/ApplicationFolders/nutch-2008-06-26_04-01-58/crawl/indexes/part-00000
done merging
crawl finished: /Users/dominik/ApplicationFolders/nutch-2008-06-26_04-01-58/crawl
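
If it helps with the diagnosis, I can also post the crawldb statistics; as far as I know the readdb tool ships with the same build (command sketched against the paths above):

bin/nutch readdb /Users/dominik/ApplicationFolders/nutch-2008-06-26_04-01-58/crawl/crawldb -stats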