Map-reduce based SegmentReader

radu mateescu
Hello,
 
Attached is the simplified version of SegmentReader using map-reduce.
 
Syntax: ./nutch org.apache.nutch.crawl.SegmentReader segment
 
It creates a segdump directory under the segment directory. This directory holds the individual dump files along with one large file obtained by concatenating them. The large file is named by the segment.dump.filename property (default: dump).
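
For readers without the attachment, here is a rough sketch of what the concatenation step could look like. It is not the code from SegmentReader.java: it uses plain java.nio on a local directory instead of the Hadoop file system, a java.util.Properties object stands in for the Nutch configuration, and the "part-" prefix for the individual dump files is an assumption. The class and method names are made up for illustration.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Properties;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper, not part of Nutch.
public class DumpConcatSketch {

    // Resolve the output file name, falling back to "dump" when the property is unset.
    static String dumpFileName(Properties conf) {
        return conf.getProperty("segment.dump.filename", "dump");
    }

    // Append every "part-*" file in segdumpDir, in name order, to a single
    // file in the same directory and return the path of that file.
    // Assumption: the individual dump files are named "part-...".
    static Path concatenate(Path segdumpDir, Properties conf) throws IOException {
        Path target = segdumpDir.resolve(dumpFileName(conf));
        List<Path> parts;
        try (Stream<Path> files = Files.list(segdumpDir)) {
            parts = files
                .filter(p -> p.getFileName().toString().startsWith("part-"))
                .sorted()
                .collect(Collectors.toList());
        }
        try (OutputStream out = Files.newOutputStream(target)) {
            for (Path part : parts) {
                Files.copy(part, out); // stream the piece into the combined file
            }
        }
        return target;
    }
}
```

Calling DumpConcatSketch.concatenate(dir, new Properties()) on a directory holding part-00000 and part-00001 would write their contents, in that order, to a file named dump in the same directory.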
 
The structure of each dumped record is:
Recno::
CrawlDatum::
Content::
ParseData::
ParseText::
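
As an illustration of the record layout above, the following sketch writes one record with the five labels in the listed order. Only the labels and their order come from this message. Putting each value on the same line as its label, ending each record with a blank line, and using plain strings in place of the Nutch CrawlDatum, Content, ParseData and ParseText objects are all assumptions, and the class name is hypothetical.

```java
// Hypothetical formatter, not the code from the attachment.
public class SegmentRecordSketch {

    // Build the text of one dumped record. The string parameters stand in
    // for the textual form of the corresponding Nutch objects.
    static String formatRecord(long recno, String crawlDatum, String content,
                               String parseData, String parseText) {
        StringBuilder sb = new StringBuilder();
        sb.append("Recno:: ").append(recno).append('\n');
        sb.append("CrawlDatum:: ").append(crawlDatum).append('\n');
        sb.append("Content:: ").append(content).append('\n');
        sb.append("ParseData:: ").append(parseData).append('\n');
        sb.append("ParseText:: ").append(parseText).append('\n');
        sb.append('\n'); // assumed separator between records
        return sb.toString();
    }
}
```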
 
Comments are welcome.
 
Thanks,
Radu

Attachment: SegmentReader.java (7K)