[jira] Created: (NUTCH-230) OPIC score for outlinks should be based on # of valid links, not total # of links.

Michael Gibney (Jira)
OPIC score for outlinks should be based on # of valid links, not total # of links.
----------------------------------------------------------------------------------

         Key: NUTCH-230
         URL: http://issues.apache.org/jira/browse/NUTCH-230
     Project: Nutch
        Type: Improvement
    Versions: 0.8-dev    
    Reporter: Ken Krugler
    Priority: Minor


In ParseOutputFormat.java, the write() method currently divides the page score by the # of outlinks:

          score /= links.length;

It then loops over the links, and any that pass the normalize/filter gauntlet get added to the crawl output.

But this means that any filtered links result in some amount of the page's OPIC score being "lost".

For Nutch 0.7, I built a list of valid (post-filter) links, and then used that to determine the per-link OPIC score, after which I iterated over the list, adding entries to the crawl output.
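
The difference between the two approaches can be sketched roughly as follows. This is only an illustration, not the actual ParseOutputFormat code: the class name, the method names, and the `Predicate<String>` standing in for the normalize/filter chain are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

public class OutlinkScoreSketch {

    // Behaviour described above: divide by the total number of outlinks.
    // Links dropped later by the filter take their share of the score with them.
    public static float perLinkScoreAllLinks(float pageScore, String[] links) {
        return links.length == 0 ? 0f : pageScore / links.length;
    }

    // Proposed behaviour: filter first, then divide by the number of
    // links that survived.
    public static float perLinkScoreValidLinks(float pageScore, String[] links,
                                               Predicate<String> accept) {
        List<String> valid = validLinks(links, accept);
        return valid.isEmpty() ? 0f : pageScore / valid.size();
    }

    // Hypothetical stand-in for the normalize/filter step.
    public static List<String> validLinks(String[] links, Predicate<String> accept) {
        List<String> valid = new ArrayList<>();
        for (String link : links) {
            if (accept.test(link)) {
                valid.add(link);
            }
        }
        return valid;
    }

    public static void main(String[] args) {
        String[] links = {
            "http://a.example/", "http://b.example/?q=1",
            "http://c.example/?q=2", "http://d.example/"
        };
        Predicate<String> noQueries = url -> !url.contains("?");
        System.out.println(perLinkScoreAllLinks(1.0f, links));              // prints 0.25
        System.out.println(perLinkScoreValidLinks(1.0f, links, noQueries)); // prints 0.5
    }
}
```

With four outlinks of which two are filtered, each surviving link receives 0.25 under the first scheme and 0.5 under the second.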

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira

[jira] Commented: (NUTCH-230) OPIC score for outlinks should be based on # of valid links, not total # of links.

Michael Gibney (Jira)
    [ http://issues.apache.org/jira/browse/NUTCH-230?page=comments#action_12370356 ]

Andrzej Bialecki  commented on NUTCH-230:
-----------------------------------------

Hmmm, this is a deeply philosophical question... Should you spread out the OPIC score to all links that a page sports, or just to the links that you are interested in? Which option is closer to the real meaning of the OPIC score?

Let's consider this argument: the OPIC score is a "cash value", and it represents an intrinsic value of a page, or its usefulness. If a page contains useless links, it should lose some "cash" over those links, i.e. because of them the value of the page and its outlinks should be lowered. That's the effect we achieve in the current code.

On the other hand, if we were to change the calculation the way you propose, pages with a lot of bad links would heavily promote those few good links that they have. This seems to contradict the idea of OPIC, which is that "good" pages should promote all outlinked pages. If we follow your proposal, bad pages would promote more aggressively than good pages...

[jira] Commented: (NUTCH-230) OPIC score for outlinks should be based on # of valid links, not total # of links.

Michael Gibney (Jira)
    [ http://issues.apache.org/jira/browse/NUTCH-230?page=comments#action_12370381 ]

Doug Cutting commented on NUTCH-230:
------------------------------------

Andrzej, that's true if we think links that are filtered are bad links, but if we instead think of them as non-links then this fix is correct.

I don't have a strong intuition about which is best.  Perhaps we should make it configurable, and let folks experiment?

Ken, do you see a marked improvement in scores when you make this change?  Can you provide some examples of cases where it makes a difference?


[jira] Commented: (NUTCH-230) OPIC score for outlinks should be based on # of valid links, not total # of links.

Michael Gibney (Jira)
    [ http://issues.apache.org/jira/browse/NUTCH-230?page=comments#action_12370424 ]

Ken Krugler commented on NUTCH-230:
-----------------------------------

So Doug beat me to this comment :)

I was going to describe the two cases we'd run into...

1. There's a great page, but most of the links are queries, and we currently skip them. So they aren't "bad" links, just links that we don't yet handle. And thus the value of the page gets diluted, because the few non-query links get very low OPIC scores "given" to the pages they reference.

2. There's a great blog post, but spam software added bogus links to adult sites. We blacklist them, but as with #1, the pages referenced by good links on the page suffer the consequences.

The way I think about the OPIC score is that the set of pages we've fetched so far has an energy level (sum of each page score), and OPIC redistributes this energy to better account for link info when determining page fetch order. So the current code effectively loses some of this energy via bad links.
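
To make the "lost energy" point concrete, here is a small sketch of how much of a page's score is actually passed on under each scheme. The class and method names are hypothetical, and the numbers are invented for illustration.

```java
public class OpicEnergySketch {

    // Score passed on when the page score is divided by ALL links
    // but only the valid (post-filter) links are emitted.
    public static double distributedDividingByAll(double pageScore, int totalLinks, int validLinks) {
        if (totalLinks == 0) {
            return 0.0;
        }
        return (pageScore / totalLinks) * validLinks;
    }

    // Score passed on when the page score is divided by the valid links only.
    // Up to floating-point rounding, this is the whole page score.
    public static double distributedDividingByValid(double pageScore, int validLinks) {
        if (validLinks == 0) {
            return 0.0;
        }
        return (pageScore / validLinks) * validLinks;
    }

    public static void main(String[] args) {
        double pageScore = 1.0;
        int totalLinks = 8;  // e.g. a page whose links are mostly skipped queries
        int validLinks = 2;

        double byAll = distributedDividingByAll(pageScore, totalLinks, validLinks);
        double byValid = distributedDividingByValid(pageScore, validLinks);

        System.out.println("passed on (divide by all):   " + byAll);               // 0.25
        System.out.println("lost      (divide by all):   " + (pageScore - byAll)); // 0.75
        System.out.println("passed on (divide by valid): " + byValid);             // 1.0
    }
}
```

In this example three quarters of the page's score never reaches any page when the divisor is the total link count.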

Anyway, I was also going to propose a config setting if Andrzej or others felt strongly that pages should be penalized for filtered links. Otherwise always using the count of "approved" (maybe that's a better term than good/bad) links to divide up the page score makes sense to me.

[jira] Commented: (NUTCH-230) OPIC score for outlinks should be based on # of valid links, not total # of links.

Michael Gibney (Jira)
    [ http://issues.apache.org/jira/browse/NUTCH-230?page=comments#action_12370426 ]

Andrzej Bialecki  commented on NUTCH-230:
-----------------------------------------

Yes, these are good examples. I'll prepare a patch to make this a boolean setting: if false (the default), the calculation stays as it is now; if true, filtered-out links won't count.
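
A boolean switch of this kind might look roughly like the sketch below. It does not reproduce the attached patch: `java.util.Properties` stands in for the Nutch configuration object, and the property name `example.score.ignore.filtered.links` is invented purely for this example.

```java
import java.util.Properties;

public class ConfigurableDivisorSketch {

    // Hypothetical property name, made up for this sketch.
    static final String IGNORE_FILTERED_KEY = "example.score.ignore.filtered.links";

    // false (default): divide by all outlinks, as the existing code does.
    // true: divide only by the links that passed normalization/filtering.
    public static int divisor(Properties conf, int totalLinks, int validLinks) {
        boolean ignoreFiltered =
            Boolean.parseBoolean(conf.getProperty(IGNORE_FILTERED_KEY, "false"));
        return ignoreFiltered ? validLinks : totalLinks;
    }

    public static float perLinkScore(Properties conf, float pageScore,
                                     int totalLinks, int validLinks) {
        int d = divisor(conf, totalLinks, validLinks);
        return d == 0 ? 0f : pageScore / d; // guard against pages with no usable links
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        System.out.println(perLinkScore(conf, 1.0f, 4, 2)); // default: prints 0.25

        conf.setProperty(IGNORE_FILTERED_KEY, "true");
        System.out.println(perLinkScore(conf, 1.0f, 4, 2)); // prints 0.5
    }
}
```

Keeping the default at false preserves the existing scores for anyone who does not opt in.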

[jira] Updated: (NUTCH-230) OPIC score for outlinks should be based on # of valid links, not total # of links.

Michael Gibney (Jira)
     [ http://issues.apache.org/jira/browse/NUTCH-230?page=all ]

Andrzej Bialecki  updated NUTCH-230:
------------------------------------

    Attachment: patch.txt

Please review this patch; if it's OK, I'll commit it.

[jira] Closed: (NUTCH-230) OPIC score for outlinks should be based on # of valid links, not total # of links.

Michael Gibney (Jira)
     [ http://issues.apache.org/jira/browse/NUTCH-230?page=all ]
     
Andrzej Bialecki  closed NUTCH-230:
-----------------------------------

    Resolution: Fixed

Patch applied.
