I have a lot of questions about QBE in Nutch!
First, does Nutch support QBE (query by example)
through the More Like This function, and is that function actually implemented?
If so, can anyone briefly explain
the algorithm behind it, i.e. how the similarity
between web pages is computed?
The one used by Google is described in this paper:
http://citeseer.ist.psu.edu/dean99finding.html
However, Google only shows 31 similar pages. (Why 31? I still
don’t have an authoritative explanation for that number;
probably it is meant to give a concise, relevant answer
instead of ranking thousands of query results.)
It also does so only for well-known sites, which are supposed to
have non-obscure content (sites like nytimes.com, cnn.com, google.com),
rather than for personal web pages or other less popular web pages.
Is there any benchmark that tests state-of-the-art
web page similarity functions?