[jira] Created: (NUTCH-177) Default installation seems to produce working entity of nutch

[jira] Created: (NUTCH-177) Default installation seems to produce working entity of nutch

JIRA jira@apache.org
Default installation seems to produce working entity of nutch
-------------------------------------------------------------

         Key: NUTCH-177
         URL: http://issues.apache.org/jira/browse/NUTCH-177
     Project: Nutch
        Type: Bug
    Versions: 0.7.1    
 Environment: Linux SUSE 9.3
    Reporter: Matthias Günter
    Priority: Minor


I downloaded 0.7.1 and installed it.
Then I changed crawl-urlfilter.txt to allow apache.org.
Then I added a urllist.txt and ran a crawl.
The URL appears to have been ignored ("Added 0 pages" in the log below), even though it matches the rule in crawl-urlfilter.txt.
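
One way to check a seed URL against the filter outside of a crawl is a small stand-alone sketch like the one below. It is not Nutch code: the class UrlFilterCheck and its methods are hypothetical, and the rule lines are assumptions written in the usual +/- regex style, not the contents of the attached crawl-urlfilter.txt. It assumes the regex filter applies rules top to bottom, lets the first matching rule decide, and rejects a URL that matches no rule.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class UrlFilterCheck {
    /** One rule: '+' accepts, '-' rejects; the regex is searched anywhere in the URL. */
    static final class Rule {
        final boolean accept;
        final Pattern pattern;

        Rule(boolean accept, String regex) {
            this.accept = accept;
            this.pattern = Pattern.compile(regex);
        }
    }

    /** Parses rule lines, skipping blank lines and '#' comments. */
    static List<Rule> parse(List<String> lines) {
        List<Rule> rules = new ArrayList<>();
        for (String line : lines) {
            String s = line.trim();
            if (s.isEmpty() || s.startsWith("#")) {
                continue;
            }
            char sign = s.charAt(0);
            if (sign != '+' && sign != '-') {
                throw new IllegalArgumentException("bad rule: " + line);
            }
            rules.add(new Rule(sign == '+', s.substring(1)));
        }
        return rules;
    }

    /** The first matching rule decides; a URL that matches no rule is rejected. */
    static boolean accepts(List<Rule> rules, String url) {
        for (Rule r : rules) {
            if (r.pattern.matcher(url).find()) {
                return r.accept;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Hypothetical rules for illustration only, not the attached file.
        List<Rule> rules = parse(List.of(
            "# skip non-http schemes and query-like URLs",
            "-^(file|ftp|mailto):",
            "-[?*!@=]",
            "+^http://([a-z0-9]*\\.)*apache.org/",
            "-."));
        for (String url : List.of(
                "http://lucene.apache.org/nutch/",
                "http://apache.org",
                "http://www.example.com/")) {
            System.out.println(url + " -> " + accepts(rules, url));
        }
    }
}
```

With these assumed rules, only the first URL is accepted. The second is rejected because the accept pattern requires a "/" after the host name, which shows how a seed that looks like it matches can still be filtered out. Whether that is what happened in this report cannot be told from the log alone.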

guenter@deimos:~/workspace/lucene/nutch-0.7.1/bin> sh ./nutch crawl ../../urllist.txt
060115 141534 parsing file:/home/guenter/workspace/lucene/nutch-0.7.1/conf/nutch-default.xml
060115 141534 parsing file:/home/guenter/workspace/lucene/nutch-0.7.1/conf/crawl-tool.xml
060115 141534 parsing file:/home/guenter/workspace/lucene/nutch-0.7.1/conf/nutch-site.xml
060115 141534 No FS indicated, using default:local
060115 141534 crawl started in: crawl-20060115141534
060115 141534 rootUrlFile = ../../urllist.txt
060115 141534 threads = 10
060115 141534 depth = 5
060115 141535 Created webdb at LocalFS,/home/guenter/workspace/lucene/nutch-0.7.1/bin/crawl-20060115141534/db
060115 141535 Starting URL processing
060115 141535 Plugins: looking in: /home/guenter/workspace/lucene/nutch-0.7.1/plugins
060115 141535 not including: /home/guenter/workspace/lucene/nutch-0.7.1/plugins/query-more
060115 141535 parsing: /home/guenter/workspace/lucene/nutch-0.7.1/plugins/query-site/plugin.xml
060115 141535 impl: point=org.apache.nutch.searcher.QueryFilter class=org.apache.nutch.searcher.site.SiteQueryFilter
060115 141535 parsing: /home/guenter/workspace/lucene/nutch-0.7.1/plugins/parse-html/plugin.xml
060115 141535 impl: point=org.apache.nutch.parse.Parser class=org.apache.nutch.parse.html.HtmlParser
060115 141535 parsing: /home/guenter/workspace/lucene/nutch-0.7.1/plugins/parse-text/plugin.xml
060115 141535 impl: point=org.apache.nutch.parse.Parser class=org.apache.nutch.parse.text.TextParser
060115 141535 not including: /home/guenter/workspace/lucene/nutch-0.7.1/plugins/parse-ext
060115 141535 not including: /home/guenter/workspace/lucene/nutch-0.7.1/plugins/parse-pdf
060115 141535 not including: /home/guenter/workspace/lucene/nutch-0.7.1/plugins/parse-rss
060115 141535 parsing: /home/guenter/workspace/lucene/nutch-0.7.1/plugins/query-basic/plugin.xml
060115 141535 impl: point=org.apache.nutch.searcher.QueryFilter class=org.apache.nutch.searcher.basic.BasicQueryFilter
060115 141535 not including: /home/guenter/workspace/lucene/nutch-0.7.1/plugins/index-more
060115 141535 not including: /home/guenter/workspace/lucene/nutch-0.7.1/plugins/parse-js
060115 141535 parsing: /home/guenter/workspace/lucene/nutch-0.7.1/plugins/urlfilter-regex/plugin.xml
060115 141535 impl: point=org.apache.nutch.net.URLFilter class=org.apache.nutch.net.RegexURLFilter
060115 141535 not including: /home/guenter/workspace/lucene/nutch-0.7.1/plugins/protocol-ftp
060115 141535 not including: /home/guenter/workspace/lucene/nutch-0.7.1/plugins/parse-msword
060115 141535 not including: /home/guenter/workspace/lucene/nutch-0.7.1/plugins/creativecommons
060115 141535 not including: /home/guenter/workspace/lucene/nutch-0.7.1/plugins/ontology
060115 141535 parsing: /home/guenter/workspace/lucene/nutch-0.7.1/plugins/nutch-extensionpoints/plugin.xml
060115 141535 not including: /home/guenter/workspace/lucene/nutch-0.7.1/plugins/protocol-file
060115 141535 parsing: /home/guenter/workspace/lucene/nutch-0.7.1/plugins/protocol-http/plugin.xml
060115 141535 impl: point=org.apache.nutch.protocol.Protocol class=org.apache.nutch.protocol.http.Http
060115 141535 not including: /home/guenter/workspace/lucene/nutch-0.7.1/plugins/clustering-carrot2
060115 141535 not including: /home/guenter/workspace/lucene/nutch-0.7.1/plugins/language-identifier
060115 141535 not including: /home/guenter/workspace/lucene/nutch-0.7.1/plugins/urlfilter-prefix
060115 141535 parsing: /home/guenter/workspace/lucene/nutch-0.7.1/plugins/query-url/plugin.xml
060115 141535 impl: point=org.apache.nutch.searcher.QueryFilter class=org.apache.nutch.searcher.url.URLQueryFilter
060115 141535 parsing: /home/guenter/workspace/lucene/nutch-0.7.1/plugins/index-basic/plugin.xml
060115 141535 impl: point=org.apache.nutch.indexer.IndexingFilter class=org.apache.nutch.indexer.basic.BasicIndexingFilter
060115 141535 not including: /home/guenter/workspace/lucene/nutch-0.7.1/plugins/protocol-httpclient
060115 141535 found resource crawl-urlfilter.txt at file:/home/guenter/workspace/lucene/nutch-0.7.1/conf/crawl-urlfilter.txt
..060115 141535 Added 0 pages
060115 141535 FetchListTool started
060115 141535 Overall processing: Sorted 0 entries in 0.0 seconds.
060115 141535 Overall processing: Sorted NaN entries/second
060115 141535 FetchListTool completed
060115 141536 logging at INFO
060115 141537 Updating /home/guenter/workspace/lucene/nutch-0.7.1/bin/crawl-20060115141534/db
060115 141537 Updating for /home/guenter/workspace/lucene/nutch-0.7.1/bin/crawl-20060115141534/segments/20060115141535
060115 141537 Finishing update
060115 141537 Update finished
060115 141537 FetchListTool started
060115 141537 Overall processing: Sorted 0 entries in 0.0 seconds.
060115 141537 Overall processing: Sorted NaN entries/second
060115 141537 FetchListTool completed
060115 141537 logging at INFO
060115 141538 Updating /home/guenter/workspace/lucene/nutch-0.7.1/bin/crawl-20060115141534/db
060115 141538 Updating for /home/guenter/workspace/lucene/nutch-0.7.1/bin/crawl-20060115141534/segments/20060115141537
060115 141538 Finishing update
060115 141538 Update finished
060115 141538 FetchListTool started
060115 141538 Overall processing: Sorted 0 entries in 0.0 seconds.
060115 141538 Overall processing: Sorted NaN entries/second
060115 141538 FetchListTool completed
060115 141538 logging at INFO
060115 141539 Updating /home/guenter/workspace/lucene/nutch-0.7.1/bin/crawl-20060115141534/db
060115 141539 Updating for /home/guenter/workspace/lucene/nutch-0.7.1/bin/crawl-20060115141534/segments/20060115141538
060115 141539 Finishing update
060115 141539 Update finished
060115 141539 FetchListTool started
060115 141540 Overall processing: Sorted 0 entries in 0.0 seconds.
060115 141540 Overall processing: Sorted NaN entries/second
060115 141540 FetchListTool completed
060115 141540 logging at INFO
060115 141541 Updating /home/guenter/workspace/lucene/nutch-0.7.1/bin/crawl-20060115141534/db
060115 141541 Updating for /home/guenter/workspace/lucene/nutch-0.7.1/bin/crawl-20060115141534/segments/20060115141539
060115 141541 Finishing update
060115 141541 Update finished
060115 141541 FetchListTool started
060115 141541 Overall processing: Sorted 0 entries in 0.0 seconds.
060115 141541 Overall processing: Sorted NaN entries/second
060115 141541 FetchListTool completed
060115 141541 logging at INFO
060115 141542 Updating /home/guenter/workspace/lucene/nutch-0.7.1/bin/crawl-20060115141534/db
060115 141542 Updating for /home/guenter/workspace/lucene/nutch-0.7.1/bin/crawl-20060115141534/segments/20060115141541
060115 141542 Finishing update
060115 141542 Update finished
060115 141542 Updating /home/guenter/workspace/lucene/nutch-0.7.1/bin/crawl-20060115141534/segments from /home/guenter/workspace/lucene/nutch-0.7.1/bin/crawl-20060115141534/db
060115 141542  reading /home/guenter/workspace/lucene/nutch-0.7.1/bin/crawl-20060115141534/segments/20060115141535
060115 141542  reading /home/guenter/workspace/lucene/nutch-0.7.1/bin/crawl-20060115141534/segments/20060115141537
060115 141542  reading /home/guenter/workspace/lucene/nutch-0.7.1/bin/crawl-20060115141534/segments/20060115141538
060115 141542  reading /home/guenter/workspace/lucene/nutch-0.7.1/bin/crawl-20060115141534/segments/20060115141539
060115 141542  reading /home/guenter/workspace/lucene/nutch-0.7.1/bin/crawl-20060115141534/segments/20060115141541
060115 141542 Sorting pages by url...
060115 141542 Getting updated scores and anchors from db...
060115 141542 Sorting updates by segment...
060115 141542 Updating segments...
060115 141542 Done updating /home/guenter/workspace/lucene/nutch-0.7.1/bin/crawl-20060115141534/segments from /home/guenter/workspace/lucene/nutch-0.7.1/bin/crawl-20060115141534/db
060115 141542 indexing segment: /home/guenter/workspace/lucene/nutch-0.7.1/bin/crawl-20060115141534/segments/20060115141535
060115 141542 * Opening segment 20060115141535
060115 141542 * Indexing segment 20060115141535
060115 141542 * Optimizing index...
060115 141542 * Moving index to NFS if needed...
060115 141542 DONE indexing segment 20060115141535: total 0 records in 0.035 s (NaN rec/s).
060115 141543 done indexing
060115 141543 indexing segment: /home/guenter/workspace/lucene/nutch-0.7.1/bin/crawl-20060115141534/segments/20060115141537
060115 141543 * Opening segment 20060115141537
060115 141543 * Indexing segment 20060115141537
060115 141543 * Optimizing index...
060115 141543 * Moving index to NFS if needed...
060115 141543 DONE indexing segment 20060115141537: total 0 records in 0.076 s (NaN rec/s).
060115 141543 done indexing
060115 141543 indexing segment: /home/guenter/workspace/lucene/nutch-0.7.1/bin/crawl-20060115141534/segments/20060115141538
060115 141543 * Opening segment 20060115141538
060115 141543 * Indexing segment 20060115141538
060115 141543 * Optimizing index...
060115 141543 * Moving index to NFS if needed...
060115 141543 DONE indexing segment 20060115141538: total 0 records in 0.012 s (NaN rec/s).
060115 141543 done indexing
060115 141543 indexing segment: /home/guenter/workspace/lucene/nutch-0.7.1/bin/crawl-20060115141534/segments/20060115141539
060115 141543 * Opening segment 20060115141539
060115 141543 * Indexing segment 20060115141539
060115 141543 * Optimizing index...
060115 141543 * Moving index to NFS if needed...
060115 141543 DONE indexing segment 20060115141539: total 0 records in 0.013 s (NaN rec/s).
060115 141543 done indexing
060115 141543 indexing segment: /home/guenter/workspace/lucene/nutch-0.7.1/bin/crawl-20060115141534/segments/20060115141541
060115 141543 * Opening segment 20060115141541
060115 141543 * Indexing segment 20060115141541
060115 141543 * Optimizing index...
060115 141543 * Moving index to NFS if needed...
060115 141543 DONE indexing segment 20060115141541: total 0 records in 0.02 s (NaN rec/s).
060115 141543 done indexing
060115 141543 Reading url hashes...
060115 141543 Sorting url hashes...
060115 141543 Deleting url duplicates...
060115 141543 Deleted 0 url duplicates.
060115 141543 Reading content hashes...
060115 141543 Sorting content hashes...
060115 141543 Deleting content duplicates...
060115 141543 Deleted 0 content duplicates.
060115 141543 Duplicate deletion complete locally.  Now returning to NFS...
060115 141543 DeleteDuplicates complete
060115 141543 Merging segment indexes...
060115 141543 crawl finished: crawl-20060115141534
guenter@deimos:~/workspace/lucene/nutch-0.7.1/bin>  

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira

[jira] Updated: (NUTCH-177) Default installation seems to produce working entity of nutch

JIRA jira@apache.org
     [ http://issues.apache.org/jira/browse/NUTCH-177?page=all ]

Matthias Günter updated NUTCH-177:
----------------------------------

    Attachment: crawl-urlfilter.txt

The crawl-urlfilter.txt with the change for apache.org

> Default installation seems to produce working entity of nutch
> -------------------------------------------------------------
>
>          Key: NUTCH-177
>          URL: http://issues.apache.org/jira/browse/NUTCH-177
>      Project: Nutch
>         Type: Bug
>     Versions: 0.7.1
>  Environment: Linux SUSE 9.3
>     Reporter: Matthias Günter
>     Priority: Minor
>  Attachments: crawl-urlfilter.txt, urllist.txt
>

[jira] Updated: (NUTCH-177) Default installation seems to produce working entity of nutch

JIRA jira@apache.org
     [ http://issues.apache.org/jira/browse/NUTCH-177?page=all ]

Matthias Günter updated NUTCH-177:
----------------------------------

    Attachment: urllist.txt

The URL list used.

> 060115 141543 * Opening segment 20060115141539
> 060115 141543 * Indexing segment 20060115141539
> 060115 141543 * Optimizing index...
> 060115 141543 * Moving index to NFS if needed...
> 060115 141543 DONE indexing segment 20060115141539: total 0 records in 0.013 s (NaN rec/s).
> 060115 141543 done indexing
> 060115 141543 indexing segment: /home/guenter/workspace/lucene/nutch-0.7.1/bin/crawl-20060115141534/segments/20060115141541
> 060115 141543 * Opening segment 20060115141541
> 060115 141543 * Indexing segment 20060115141541
> 060115 141543 * Optimizing index...
> 060115 141543 * Moving index to NFS if needed...
> 060115 141543 DONE indexing segment 20060115141541: total 0 records in 0.02 s (NaN rec/s).
> 060115 141543 done indexing
> 060115 141543 Reading url hashes...
> 060115 141543 Sorting url hashes...
> 060115 141543 Deleting url duplicates...
> 060115 141543 Deleted 0 url duplicates.
> 060115 141543 Reading content hashes...
> 060115 141543 Sorting content hashes...
> 060115 141543 Deleting content duplicates...
> 060115 141543 Deleted 0 content duplicates.
> 060115 141543 Duplicate deletion complete locally.  Now returning to NFS...
> 060115 141543 DeleteDuplicates complete
> 060115 141543 Merging segment indexes...
> 060115 141543 crawl finished: crawl-20060115141534
> guenter@deimos:~/workspace/lucene/nutch-0.7.1/bin>  

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira


[jira] Resolved: (NUTCH-177) Default installation seems to produce working entity of nutch

JIRA jira@apache.org
     [ http://issues.apache.org/jira/browse/NUTCH-177?page=all ]
     
Doug Cutting resolved NUTCH-177:
--------------------------------

    Fix Version: 0.8-dev
     Resolution: Fixed

The problem is that your seed URL does not end in a slash, yet your URL filter requires one.  This is fixed in 0.8-dev (aka trunk): URLs are now normalized before filtering, and normalization adds a slash after the hostname.
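
The mismatch can be reproduced outside Nutch with a small sketch. Everything concrete here is an assumption for illustration: the attached crawl-urlfilter.txt and urllist.txt are not reproduced in this thread, so the accept rule and the seed URL below are hypothetical, and `normalize` is a stand-in rather than Nutch's own normalizer.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Hypothetical accept rule in the style of a crawl-urlfilter.txt "+" line.
# Note that the pattern ends in "/", so a bare hostname cannot match it.
ACCEPT = re.compile(r"^http://([a-z0-9]*\.)*apache.org/")


def normalize(url: str) -> str:
    """Give a hostname-only URL an explicit "/" path (illustrative only)."""
    parts = urlsplit(url)
    if parts.path == "":
        parts = parts._replace(path="/")
    return urlunsplit(parts)


def accepted(url: str) -> bool:
    """Return True if the URL passes the hypothetical accept rule."""
    return ACCEPT.search(url) is not None


seed = "http://www.apache.org"      # assumed seed, no trailing slash

print(accepted(seed))               # filter applied to the raw seed
print(accepted(normalize(seed)))    # filter applied after normalization
```

Under these assumptions the raw seed is rejected and the normalized one is accepted, which would be consistent with the "Added 0 pages" line in the log. On 0.7.1, adding the trailing slash to the seed URL by hand should have the same effect.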

> Default installation seems to produce working entity of nutch
> -------------------------------------------------------------
>
>          Key: NUTCH-177
>          URL: http://issues.apache.org/jira/browse/NUTCH-177
>      Project: Nutch
>         Type: Bug
>     Versions: 0.7.1
>  Environment: Linux SUSE 9.3
>     Reporter: Matthias Günter
>     Priority: Minor
>      Fix For: 0.8-dev
>  Attachments: crawl-urlfilter.txt, urllist.txt
>
> I downloaded 0.7.1 and installed it.
> Then I changed crawl-urlfilter.txt for apache.org.
> Then I added a urllist.txt and tried scanning.
> Apparently the URL was ignored, even though it matched the rule in crawl-urlfilter.txt.
> guenter@deimos:~/workspace/lucene/nutch-0.7.1/bin> sh ./nutch crawl ../../urllist.txt
