dedup error, help me!!!

Ensheng Wang
I got an error when I ran nutch dedup. Please help!
  My segments are about 20 GB. Is that too big, or is some of the data bad?
  I don't know which it is.
  Thanks a lot!
   
   
  [wangensh@pc110 crawl]$ nutch dedup segments/ tmp
060511 013951 parsing file:/home/wangensh/nutch-0.7.2/conf/nutch-default.xml
060511 013951 parsing file:/home/wangensh/nutch-0.7.2/conf/nutch-site.xml
060511 013951 No FS indicated, using default:local
060511 013951 Clearing old deletions in segments/20060507135947/index(segments/20060507135947/index)
060511 013951 Clearing old deletions in segments/20060501150328/index(segments/20060501150328/index)
060511 013951 Clearing old deletions in segments/20060501232333/index(segments/20060501232333/index)
060511 013951 Clearing old deletions in segments/20060502161811/index(segments/20060502161811/index)
060511 013951 Clearing old deletions in segments/20060427204139/index(segments/20060427204139/index)
060511 013951 Clearing old deletions in segments/20060502215251/index(segments/20060502215251/index)
060511 013951 Clearing old deletions in segments/20060428074316/index(segments/20060428074316/index)
060511 013951 Clearing old deletions in segments/20060428153029/index(segments/20060428153029/index)
060511 013951 Clearing old deletions in segments/20060428235858/index(segments/20060428235858/index)
060511 013951 Clearing old deletions in segments/20060429051429/index(segments/20060429051429/index)
060511 013951 Clearing old deletions in segments/20060503043601/index(segments/20060503043601/index)
060511 013951 Clearing old deletions in segments/20060429113057/index(segments/20060429113057/index)
060511 013951 Clearing old deletions in segments/20060429180029/index(segments/20060429180029/index)
060511 013951 Clearing old deletions in segments/20060430010104/index(segments/20060430010104/index)
060511 013951 Clearing old deletions in segments/20060430055919/index(segments/20060430055919/index)
060511 013951 Clearing old deletions in segments/20060430111242/index(segments/20060430111242/index)
060511 013951 Clearing old deletions in segments/20060430201343/index(segments/20060430201343/index)
060511 013951 Clearing old deletions in segments/20060501025132/index(segments/20060501025132/index)
060511 013951 Clearing old deletions in segments/20060503125346/index(segments/20060503125346/index)
060511 013951 Clearing old deletions in segments/20060503185355/index(segments/20060503185355/index)
060511 013951 Clearing old deletions in segments/20060504001824/index(segments/20060504001824/index)
060511 013951 Clearing old deletions in segments/20060504091608/index(segments/20060504091608/index)
060511 013951 Clearing old deletions in segments/20060504174715/index(segments/20060504174715/index)
060511 013951 Clearing old deletions in segments/20060505012951/index(segments/20060505012951/index)
060511 013951 Clearing old deletions in segments/20060505110206/index(segments/20060505110206/index)
060511 013951 Clearing old deletions in segments/20060505171002/index(segments/20060505171002/index)
060511 013951 Clearing old deletions in segments/20060506001003/index(segments/20060506001003/index)
060511 013951 Clearing old deletions in segments/20060507144825/index(segments/20060507144825/index)
060511 013951 Reading url hashes...
060511 014026 Sorting url hashes...
060511 014032 Deleting url duplicates...
060511 014033 Deleted 147805 url duplicates.
060511 014033 Reading content hashes...
Exception in thread "Main Thread" java.lang.RuntimeException: Not a hex character: g
        at org.apache.nutch.io.MD5Hash.charToNibble(MD5Hash.java:194)
        at org.apache.nutch.io.MD5Hash.setDigest(MD5Hash.java:180)
        at org.apache.nutch.indexer.DeleteDuplicates$1.updateHash(DeleteDuplicates.java:163)
        at org.apache.nutch.indexer.DeleteDuplicates.computeHashes(DeleteDuplicates.java:226)
        at org.apache.nutch.indexer.DeleteDuplicates.deleteContentDuplicates(DeleteDuplicates.java:160)
        at org.apache.nutch.indexer.DeleteDuplicates.main(DeleteDuplicates.java:350)
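
Reading the trace, the failure seems to happen while a stored content hash (a hex string) is converted back into MD5 bytes, and one character ('g') is not a hex digit. To show what I think is going on, here is a minimal sketch of that kind of conversion. It is my assumption of the general technique, not the actual Nutch source: only the name charToNibble and the error text come from the trace, while the class HexDigestCheck and the methods parseDigest and isValidDigest are hypothetical.

```java
public class HexDigestCheck {

    // Convert one hex character to its 4-bit value; reject anything else.
    // (Assumed behaviour, modelled on the error message in the trace.)
    static int charToNibble(char c) {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'a' && c <= 'f') return 10 + (c - 'a');
        if (c >= 'A' && c <= 'F') return 10 + (c - 'A');
        throw new RuntimeException("Not a hex character: " + c);
    }

    // Parse a 32-character hex string into a 16-byte MD5 digest.
    static byte[] parseDigest(String hex) {
        if (hex.length() != 32) {
            throw new IllegalArgumentException("Expected 32 hex characters, got " + hex.length());
        }
        byte[] digest = new byte[16];
        for (int i = 0; i < 16; i++) {
            int hi = charToNibble(hex.charAt(2 * i));
            int lo = charToNibble(hex.charAt(2 * i + 1));
            digest[i] = (byte) ((hi << 4) | lo);
        }
        return digest;
    }

    // Hypothetical helper: true if the string parses as an MD5 hex digest.
    static boolean isValidDigest(String hex) {
        try {
            parseDigest(hex);
            return true;
        } catch (RuntimeException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // A well-formed digest, then the same string with 'g' as its first character.
        System.out.println(isValidDigest("d41d8cd98f00b204e9800998ecf8427e"));
        System.out.println(isValidDigest("g41d8cd98f00b204e9800998ecf8427e"));
    }
}
```

If the real code works roughly like this, then a single document whose stored hash contains a non-hex character would be enough to stop dedup, regardless of how large the segments are.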


               