OPICScoringFilter

Marko Bauhardt-2
Hi all,
I read the paper
http://www2003.org/cdrom/papers/refereed/p007/p7-abiteboul.html
and am trying to understand its implementation in Nutch (the
opic-scoring plugin).
I looked at the code of the OPICScoringFilter. I understand the
method distributeScoreToOutlink: the current page distributes its
score (its cash) to its outlinks.

But I cannot find the algorithm behind the method updateDbScore in the
paper. I think this method sets the history of a page? If so:
The paper says the history of a page is:
oldHistory + scoreOfCurrentPage
But in the code, the score (or history?) of a page is set to:
oldScore + sumOfScoreOfAllInlinks.
OK, this sounds logical: the score of a page is the sum of the scores
of its inlinks. But I cannot find this algorithm in the paper.
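
To make the two steps in question concrete, here is a minimal sketch. It is not the Nutch code: the class OpicCashSketch and both method names are hypothetical, and it assumes a page splits its cash equally among its outlinks.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical, simplified model; NOT the actual Nutch ScoringFilter API.
class OpicCashSketch {

    /** Cash that a page passes to each outlink (assumes an equal split). */
    static float cashPerOutlink(float pageScore, int outlinkCount) {
        if (outlinkCount <= 0) {
            return 0.0f; // a page without outlinks passes nothing on
        }
        return pageScore / outlinkCount;
    }

    /** New score as described in the post: oldScore + sum of inlink contributions. */
    static float updatedScore(float oldScore, List<Float> inlinkContributions) {
        float sum = 0.0f;
        for (float contribution : inlinkContributions) {
            sum += contribution;
        }
        return oldScore + sum;
    }

    public static void main(String[] args) {
        // A page with score 1.0 and four outlinks gives each outlink the same share.
        float share = cashPerOutlink(1.0f, 4);
        // A target page with old score 0.5 receives that share from two inlinks.
        float score = updatedScore(0.5f, Arrays.asList(share, share));
        System.out.println("share=" + share + " newScore=" + score);
    }
}
```

Seen this way, the sum over inlinks looks like cash arriving at the page, which is why it is unclear whether the stored value plays the role of the paper's cash or of its history.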

A second point is the indexerScore method. This method returns
pow(scoreOfPage, scorePower). But in the paper the relevance of a
page is: (historyOfPage + scoreOfPage) / (historyOfAllPages + 1)
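
For comparison, the two formulas can be put side by side in a small sketch. The class OpicRelevanceSketch and its method names are hypothetical, and the paper formula is read here as (history + score) divided by (total history + 1), which is an assumption about the intended grouping.

```java
// Hypothetical helper for comparing the two formulas; not part of Nutch.
class OpicRelevanceSketch {

    /** What the post describes for indexerScore: the score raised to scorePower. */
    static double powerScore(double scoreOfPage, double scorePower) {
        return Math.pow(scoreOfPage, scorePower);
    }

    /** Paper-style estimate, read as (history + score) / (total history + 1). */
    static double paperEstimate(double historyOfPage, double scoreOfPage,
                                double historyOfAllPages) {
        return (historyOfPage + scoreOfPage) / (historyOfAllPages + 1.0);
    }

    public static void main(String[] args) {
        // The first depends only on the page's own score; the second is
        // normalized by the history accumulated over all pages.
        System.out.println(powerScore(4.0, 0.5));
        System.out.println(paperEstimate(3.0, 1.0, 7.0));
    }
}
```

The first value needs no global state at all, while the second cannot be computed without a total over all pages.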

Is the implementation (the opic-scoring plugin) a different version of
the OPIC algorithm, or am I overlooking something? I think I must read
the paper again ;-).

Thanks for any clarification.

Marko



Re: OPICScoringFilter

kkrugler
Hi Marko,

>I read the paper
>http://www2003.org/cdrom/papers/refereed/p007/p7-abiteboul.html
>and am trying to understand its implementation in Nutch (the
>opic-scoring plugin).
>I looked at the code of the OPICScoringFilter. I understand the
>method distributeScoreToOutlink: the current page distributes its
>score (its cash) to its outlinks.
>
>But I cannot find the algorithm behind the method updateDbScore in the
>paper. I think this method sets the history of a page? If so:
>The paper says the history of a page is:
>oldHistory + scoreOfCurrentPage
>But in the code, the score (or history?) of a page is set to:
>oldScore + sumOfScoreOfAllInlinks.
>OK, this sounds logical: the score of a page is the sum of the scores
>of its inlinks. But I cannot find this algorithm in the paper.
>
>A second point is the indexerScore method. This method returns
>pow(scoreOfPage, scorePower). But in the paper the relevance of a
>page is: (historyOfPage + scoreOfPage) / (historyOfAllPages + 1)
>
>Is the implementation (the opic-scoring plugin) a different version of
>the OPIC algorithm, or am I overlooking something? I think I must read
>the paper again ;-).

The current Nutch implementation of OPIC is really only appropriate
for optimized crawling of pages. As you note, the algorithm described
in the Adaptive OPIC paper that you reference differs in a number of
ways from what's currently in Nutch, primarily:

a. A per-page "history" cash value and time of update.

b. One special virtual page that every page links to.

c. One global score (cash).

There are also a number of changes to how page scores are calculated.
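
The three elements listed above can be sketched roughly as follows. This is not Nutch code and not a full Adaptive OPIC implementation: the time-windowed history from the paper is left out and only the update time is stored. The class AdaptiveOpicSketch and all of its members are hypothetical names, and treating the global value as the running total of all history is an assumption.

```java
import java.util.Arrays;

// Hypothetical sketch of an OPIC-style visit step; NOT the Nutch implementation.
class AdaptiveOpicSketch {
    final double[] cash;       // current cash per page; the last slot is the virtual page
    final double[] history;    // (a) per-page accumulated cash
    final long[] lastUpdate;   // (a) time of the last history update per page
    final int[][] outlinks;    // outlinks of the real pages, as page indices
    final int virtualPage;     // (b) index of the single virtual page
    double globalHistory;      // (c) one global total of all history

    AdaptiveOpicSketch(int[][] outlinks) {
        int n = outlinks.length;
        this.outlinks = outlinks;
        this.virtualPage = n;
        this.cash = new double[n + 1];
        this.history = new double[n + 1];
        this.lastUpdate = new long[n + 1];
        Arrays.fill(cash, 1.0 / (n + 1)); // start with the cash spread evenly
    }

    /** Visit a page: record its cash as history, then hand the cash on. */
    void visit(int page, long now) {
        double c = cash[page];
        history[page] += c;
        lastUpdate[page] = now;
        globalHistory += c;
        cash[page] = 0.0;
        if (page == virtualPage) {
            // The virtual page links back to every real page.
            for (int i = 0; i < virtualPage; i++) {
                cash[i] += c / virtualPage;
            }
        } else {
            int[] out = outlinks[page];
            double share = c / (out.length + 1); // +1: every page links to the virtual page
            for (int target : out) {
                cash[target] += share;
            }
            cash[virtualPage] += share;
        }
    }

    /** Importance estimate: (history + cash) / (global history + 1). */
    double importance(int page) {
        return (history[page] + cash[page]) / (globalHistory + 1.0);
    }

    public static void main(String[] args) {
        // Two real pages that link to each other, plus the virtual page.
        AdaptiveOpicSketch sketch = new AdaptiveOpicSketch(new int[][] {{1}, {0}});
        sketch.visit(0, 1L);
        System.out.println(sketch.importance(0));
    }
}
```

Because the virtual page receives a share from every page and links back to all of them, cash cannot get stuck on pages without outlinks, and the total cash in the system stays constant.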

-- Ken
--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"Find Code, Find Answers"