Fwd: links in db and pagerank calculation

orkunt.sabuncu
Hi,

I found a setting that solves my first problem. Setting
db.ignore.internal.links to false makes Nutch store all the links within a
web site.

I still couldn't find any clue about the second one. Why does the Nutch page
analysis module compute contributionForOutlinkers? There is nothing like this
in the usual PageRank algorithm. Any ideas about this? I am forwarding the
first mail I sent to nutch-user.

Thanks in advance,
-orkunt.

----------  Forwarded Message  ----------

Subject: links in db and pagerank calculation
Date: Monday 11 July 2005 11:17
From: Orkunt Sabuncu <[hidden email]>
To: [hidden email]

Hi,

Let's say we have a site with a diamond-like link structure. There are 4
pages: r (root), 1, 2, and 3. r has outlinks to 1 and 2, and both 1 and 2
have outlinks to 3. When we crawl this site, the webdb ignores the link from
2 to 3. At the end there are only 3 links in the db: two from r, pointing to
1 and 2, and one from 1 to 3.

This will surely affect the PageRank calculations. Is this a bug, or am I
misunderstanding something?
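
To make the effect concrete, here is a minimal sketch in Python (not Nutch code) that runs a textbook PageRank power iteration on the diamond site twice: once with all 4 links and once with the 3 links that ended up in the db. The function name, the damping factor of 0.85, and the even redistribution of rank from pages without outlinks are assumptions of this sketch, not something taken from Nutch.

```python
def pagerank(links, pages, damping=0.85, iterations=100):
    """Textbook PageRank by power iteration over (source, target) pairs."""
    n = len(pages)
    outlinks = {p: [t for s, t in links if s == p] for p in pages}
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Assumption of this sketch: rank held by pages without outlinks
        # is spread evenly over all pages.
        dangling = sum(rank[p] for p in pages if not outlinks[p])
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n
                    for p in pages}
        # Each page passes its rank on, split equally among its outlinks.
        for p in pages:
            for target in outlinks[p]:
                new_rank[target] += damping * rank[p] / len(outlinks[p])
        rank = new_rank
    return rank


pages = ["r", "1", "2", "3"]
# The diamond as it exists on the site: 4 links.
full_links = [("r", "1"), ("r", "2"), ("1", "3"), ("2", "3")]
# The links found in the db after the crawl: the link 2 -> 3 is missing.
stored_links = [("r", "1"), ("r", "2"), ("1", "3")]

full = pagerank(full_links, pages)
stored = pagerank(stored_links, pages)

print("missing:", set(full_links) - set(stored_links))
for p in pages:
    print(p, round(full[p], 4), round(stored[p], 4))
```

Under these assumptions page 3 ends up with a lower score when the link from 2 to 3 is missing, so the two link sets do not give the same ranking input.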

Also, in the link analysis module (DistributedAnalysisTool.java) there are
some extra score contributions named contributionForOutlinkers. This
contribution considers the links to pages which also have links to other
pages. I couldn't find references to this way of calculating PageRank in the
literature. The basic PageRank calculation considers only the outlinks.
Nutch's calculation will produce scores different from those of the basic
PageRank calculation. So, what's the use of the contribution for outlinkers?
Do you have any ideas or references that explain this?
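
For comparison, the basic PageRank update as it is commonly stated in the literature (normalized form) can be written as below. The notation is an assumption of this note and does not come from the Nutch source: d is the damping factor, N the number of pages, In(p) the set of pages linking to p, and L(q) the number of outlinks of q.

```latex
PR(p) = \frac{1 - d}{N} + d \sum_{q \in \mathrm{In}(p)} \frac{PR(q)}{L(q)}
```

In this form a page's score depends only on the scores of the pages linking to it, each divided by that page's outlink count. There is no term that credits a page for the pages it links to.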

I am using Nutch 0.6.

Thanks,
-orkunt.

-------------------------------------------------------