After taking some time to look into the Nutch source
code (v0.7.1), I noticed that the current file format
for storing page content may not be very efficient.
If I understand correctly, to retrieve the content of
a page with a given docID, say 20, the code checks the
"index" file first; since the default indexInterval is
128, it starts from entry 0 and loops up to 20 times,
reading and comparing the header data of each entry
along the way.
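To make the cost concrete, here is a simplified model (in Python, not Nutch's actual Java code) of a lookup in an indexed sequential file: a sparse index records every 128th entry, so a fetch seeks to the nearest indexed entry at or below the target docID and then scans records one by one, comparing headers.

```python
# Simplified model of Nutch-style indexed sequential lookup.
# All names here are illustrative, not taken from the Nutch source.

INDEX_INTERVAL = 128  # Nutch's default indexInterval

def build(records):
    """records: list of (doc_id, content) in ascending doc_id order.
    Returns (data, sparse_index); the sparse index keeps the position
    of every INDEX_INTERVAL-th record only."""
    sparse_index = {}
    for pos, (doc_id, _) in enumerate(records):
        if pos % INDEX_INTERVAL == 0:
            sparse_index[doc_id] = pos
    return records, sparse_index

def lookup(data, sparse_index, target_id):
    """Find the nearest indexed entry <= target_id, then scan forward,
    reading and comparing each record header -- the per-record work
    described above."""
    start = 0
    for doc_id, pos in sparse_index.items():
        if doc_id <= target_id:
            start = max(start, pos)
    for pos in range(start, len(data)):
        doc_id, content = data[pos]
        if doc_id == target_id:
            return content
        if doc_id > target_id:
            break
    return None
```

With this model, fetching docID 20 starts at index entry 0 (the only indexed entry below 128) and touches 21 records before finding the target.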
IMO, this way of loading the content is not very
efficient. Here are my questions and suggestions:
* Why not store each page in a separate file whose
file name is its docID, with each file compressed
with gzip? This would create lots of small files,
but loading them would be faster.
* If files need to be appended to, would it be better
to use the same fixed size (6 KB, for example) for each
page? If a page is larger than 6 KB, the rest of its
content would be stored in another file. The searching
would be faster.