I found a setting that solves my first problem: setting
db.ignore.internal.links to false keeps all the links within a web site.
I still couldn't find any clue about the second one, though. Why does the Nutch
page analysis module compute contributionForOutlinkers? There is nothing like
this in the usual PageRank algorithm. Any ideas about this? I am forwarding the
first mail I sent to nutch-user.
Let's say we have a site with a diamond-like link structure. There are 4 pages:
r (root), 1, 2, and 3. r has outlinks to 1 and 2, and both 1 and 2 have
outlinks to 3. When we crawl this site, the webdb ignores the link
from 2 to 3. In the end there are only 3 links in the db: 2 from r pointing to 1
and 2, and one from 1 to 3.
This will surely affect the PageRank calculations. Is this a bug, or am I
misunderstanding something?
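
To make the effect concrete, here is a minimal sketch of textbook PageRank
(power iteration) run on the diamond site, once with all 4 links and once with
the 3 links that end up in the db. This is not Nutch's DistributedAnalysisTool
code. The damping factor of 0.85, the iteration count, and spreading the rank
of pages without outlinks evenly over all pages are my own assumptions, and
the function and variable names are made up for illustration.

```python
def pagerank(links, pages, damping=0.85, iterations=100):
    """Textbook PageRank by power iteration (assumed variant, not Nutch's)."""
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    # Outlink lists per page, built from (source, destination) pairs.
    outlinks = {p: [dst for src, dst in links if src == p] for p in pages}
    for _ in range(iterations):
        # Assumption: pages with no outlinks spread their rank over all pages.
        dangling = sum(rank[p] for p in pages if not outlinks[p])
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n
                    for p in pages}
        # Each page splits its rank equally among the pages it links to.
        for p in pages:
            for dst in outlinks[p]:
                new_rank[dst] += damping * rank[p] / len(outlinks[p])
        rank = new_rank
    return rank


pages = ["r", "1", "2", "3"]
full_links = [("r", "1"), ("r", "2"), ("1", "3"), ("2", "3")]
stored_links = [("r", "1"), ("r", "2"), ("1", "3")]  # link 2 -> 3 is missing

full = pagerank(full_links, pages)
stored = pagerank(stored_links, pages)
for p in pages:
    print(p, round(full[p], 4), round(stored[p], 4))
```

Under these assumptions page 3 gets a noticeably lower score when the link
from 2 to 3 is missing, because page 2 then has no outlinks and its rank is
spread over the whole site instead of going to 3.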
Also, in the link analysis module (DistributedAnalysisTool.java) there are
some extra score contributions named contributionForOutlinkers. This
contribution considers links to pages that also have links to other
pages. I couldn't find any reference to this way of calculating PageRank in the
literature. The basic PageRank calculation considers only the outlinks. Nutch's
way of calculating will produce different scores from the basic PageRank
calculation. So what is the use of the contribution for outlinkers? Do you have
any ideas or references that explain this?