Missing pages & anchor text

Missing pages & anchor text

Doug Cook
Hi, folks,

I have just started digging into relevance issues with Nutch, and I'm running into some mysteries. Before I dig too deep, I wanted to check to see if these were known issues (a quick search of the email archives and of JIRA didn't turn up anything). I'm running 0.8 with a handful of patches.

I'm frequently finding root pages of sites missing from my index, despite the fact that they have been fetched. In my admittedly short investigation I have found two classes of cases:

1. Root URL is not a redirect, but there is a root-level index.html page. The index.html page is in the index, but the root page is not. Unfortunately, most of the anchor text points to the root page, not the /index.html page, and the anchor text has gone "missing" along with its associated page, so relevance is poor.

2. Root URL is a redirect to another page. Again, this other page is in the index, but the root page, along with its anchor text, has gone "missing."

I have a deduped index. Both of these cases could result from dedup throwing out the wrong URL, i.e. the one with more anchor text, although one might expect dedup to merge the two anchor texts (at least in the case of pages which commonly normalize to the same URL, e.g. / and /index.html).

The second case might result from the root URL somehow being normalized to its redirect target (which would itself be incorrect), but in that case I would expect the anchor text to also be attached to the redirect target, and it is not.

I'm about to rebuild with no deduping and see what I find.

Thanks for your help & comments-

Doug

Re: Missing pages & anchor text

Stefan Groschupf
Hi Doug,
I'm pretty sure that your problem is related to the deduping of your index.
In general, the hash of the content of a page is used as the key for the dedup tool.
We also ran into the forwarding problem in another case:
https://issues.apache.org/jira/browse/NUTCH-353
So maybe we should think about a general solution to the forwarding problem.
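
Just to illustrate the keying, here is a rough standalone sketch I made up (class and method names are mine, not the actual dedup code): two URLs whose raw content digests to the same value collide on the same key, and only one of them survives.

    import java.security.MessageDigest;
    import java.util.HashMap;
    import java.util.Map;

    public class ContentHashDedup {

      // pages grouped by an MD5 digest of their raw content;
      // byte-identical pages collide on the same key
      static final Map<String, String> byDigest = new HashMap<String, String>();

      static String digest(byte[] content) throws Exception {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        StringBuilder hex = new StringBuilder();
        for (byte b : md5.digest(content)) hex.append(String.format("%02x", b));
        return hex.toString();
      }

      // returns the URL already holding this content, or null if it is new
      static String addPage(String url, byte[] content) throws Exception {
        String key = digest(content);
        String existing = byDigest.get(key);
        if (existing == null) byDigest.put(key, url);
        return existing;
      }

      public static void main(String[] args) throws Exception {
        byte[] page = "<html>same bytes</html>".getBytes("UTF-8");
        System.out.println(addPage("http://www.x.com/", page));           // null
        System.out.println(addPage("http://www.x.com/index.html", page)); // http://www.x.com/
      }
    }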

Greetings,
Stefan


On 28.08.2006, at 11:33, Doug Cook wrote:

> [...]

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
101tec Inc.
Menlo Park, California
http://www.101tec.com




Re: Missing pages & anchor text

Doug Cook
Hi Stefan,

Yes, you're right. The index built without deduping does not have the first instance of the problem (though of course, it's also filled with duplicates, so it has other problems). It still shows the problems with missing redirects, though this could be something else (will investigate that next).

A little digging has turned up more information:

1) Dedup throws away duplicate content matches, deciding which one to keep based upon score. This leads it to dump the wrong page; for example:

http://www.x.com/
    score: 1.2
http://www.x.com/index.html
    score: 1.8

I see two problems.

First, there is clearly a scoring problem (possibly my fault somehow; could this have resulted from my failing to build the index properly?). The root page actually has 9 inlinks; the index.html page has none. I can't see anything that would warrant the index.html getting a higher score, even were these actually different pages. Seems like this could be related to the problems you've already discovered. One (perhaps just short term?) possibility would be to use the inbound linkcount for deciding which page becomes the "canonical" version of a duplicate set, since this is probably more stable than the scores.

Second, these are in fact the same page. Regardless of which page "wins" by score, dedup should actually merge the two entries since this is a safe normalization, given that we know the content fingerprints are the same. The anchor texts and the scores should be combined. We can't necessarily do this for the general dedup case -- a page shouldn't necessarily benefit just because there are multiple copies of it -- though even there we may be able to combine some anchor text. But in this case these are not multiple copies; they are the same page.
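
Something like this toy sketch is what I have in mind (the names and types are mine; obviously the real fix lives in the dedup code, which I haven't read in detail yet). The canonical URL is picked by inbound anchor count, as suggested above, and the loser's anchors and score are folded into the winner rather than dropped:

    import java.util.ArrayList;
    import java.util.List;

    public class DupMerge {

      static class Entry {
        String url;
        float score;
        List<String> anchors = new ArrayList<String>();
        Entry(String url, float score) { this.url = url; this.score = score; }
      }

      // merge two entries known to be the *same* page (same content
      // fingerprint, URLs that normalize together, e.g. / and /index.html)
      static Entry merge(Entry a, Entry b) {
        // pick the canonical URL by inbound anchor count, which should be
        // more stable than the (currently suspect) scores
        Entry winner = a.anchors.size() >= b.anchors.size() ? a : b;
        Entry loser = (winner == a) ? b : a;
        winner.anchors.addAll(loser.anchors); // keep ALL the anchor text
        winner.score += loser.score;          // and combine the scores
        return winner;
      }

      public static void main(String[] args) {
        Entry root = new Entry("http://www.x.com/", 1.2f);
        root.anchors.add("X Corp");
        root.anchors.add("x.com");
        Entry index = new Entry("http://www.x.com/index.html", 1.8f);
        index.anchors.add("home");
        Entry kept = merge(root, index);
        System.out.println(kept.url + ": " + kept.anchors.size() + " anchors");
        // -> http://www.x.com/: 3 anchors
      }
    }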

In any case, we should work hard not to lose anchor text unless it is completely justified (e.g. for spam). For relevance purposes, anchor text is more important than any other page feature, score included. And especially in our world of small, focused crawls, it is a precious, scarce resource.

Thoughts? Comments?

-Doug

Stefan Groschupf-2 wrote:
> [...]



Re: Missing pages & anchor text

Doug Cook
I'm thinking I should file issues on the following-

1. The scoring bug. Not sure what to file here, since such things are hard to pin down. But defining an "inversion" as
        score(hostname/(index|default|home).(html|jsp|asp|cfm|etc)) > score(hostname)
on a ~2.5Mdoc database, where I have about 8100 such pairs, 6558 were inversions and only 1585 were "okay" (a sketch of the check appears after this list). Is this likely to be correct behavior for OPIC scores? Is this a likely manifestation of a known bug? It doesn't seem correct, but then, it's early and I still need more coffee ;-) In any case, this causes the "wrong" versions of the pages to be selected most of the time during dedup, and I've lost >6500 of the most important, most anchor-text-rich pages in my index -- a significant relevance issue.

2. When "duplicates" really refer to the same page (e.g. X/ vs. X/index.html) , entries should be merged. Really, these are just after-the-fact normalizations, but they are a class of normalizations which can't be done without comparing page fingerprints, since they are not true for all web servers.

3. Redirects. The index keeps the redirect target, but marks the source as unfetched. This is unfortunate behavior, at least for the class of redirects where www.x.com redirects to www.x.com/y, which, like the above combination of issues, causes the root pages, and thus much of the important anchor text, to be dropped from the index. This seems related to, if not the same as, NUTCH-273 (https://issues.apache.org/jira/browse/NUTCH-273). I was simply planning to add these comments to that issue, unless someone hollers.
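
For what it's worth, the inversion count in (1) came from a check that amounts to the following sketch (the URLs and scores here are placeholders; in practice I read the scores out of my crawl db):

    import java.util.HashMap;
    import java.util.Map;
    import java.util.regex.Pattern;

    public class InversionCount {

      // matches root-level "home page" paths like /index.html, /default.asp, ...
      static final Pattern HOME =
          Pattern.compile("^https?://[^/]+/(index|default|home)\\.(html?|jsp|asp|cfm)$");

      public static void main(String[] args) {
        // url -> score; placeholder data standing in for the crawl db
        Map<String, Float> scores = new HashMap<String, Float>();
        scores.put("http://www.x.com/", 1.2f);
        scores.put("http://www.x.com/index.html", 1.8f);

        int inversions = 0, okay = 0;
        for (Map.Entry<String, Float> e : scores.entrySet()) {
          if (!HOME.matcher(e.getKey()).matches()) continue;
          // the paired root URL: everything up to and including the last '/'
          String root = e.getKey().substring(0, e.getKey().lastIndexOf('/') + 1);
          Float rootScore = scores.get(root);
          if (rootScore == null) continue; // no paired root page fetched
          if (e.getValue() > rootScore) inversions++; else okay++;
        }
        System.out.println(inversions + " inversions, " + okay + " okay");
      }
    }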

Any comments or thoughts before I file the above issues?

For all of the cases where we ignore/drop pages, we should think about what happens to the inbound anchor text. We should work very, very hard to keep all the anchor text we have; it's by far the most important page feature for relevance.

-doug

Doug Cook wrote:
> [...]



Re: Missing pages & anchor text

Andrzej Białecki-2
Doug Cook wrote:

> I'm thinking I should file issues on the following-
>
> 1. The scoring bug. Not sure what to file here, since such things are hard
> to pin down. But defining an "inversion" as
>         score(hostname/(index|default|home).(html|jsp|asp|cfm|etc)) >
> score(hostname)
> on a ~2.5Mdoc database, where I have about 8100 such pairs, 6558 were
> inversions and only 1585 were "okay." Is this likely to be correct behavior
> for OPIC scores? Is this a likely manifestation of a known bug? It doesn't
> seem correct, but then, it's early and I still need more coffee ;-) In any
> case, this causes the "wrong" versions of the pages to be selected most of
> the time during dedup, and I've lost >6500 of the most important, most
> anchor-text-rich pages in my index -- a significant relevance issue.
>  

The default scoring-opic is admittedly buggy (even if the original algorithm is suitable for page scoring, which is not obvious at all). However, the inversion problem that you see may stem from the way these sites are interlinked - perhaps there really are a lot of inlinks pointing to sub-pages instead of the roots of the sites?

Anyway, if you feel that shorter urls should get a higher score, then
you can add a scoring filter to the chain, and in it boost the score
based on the url length.
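
The boost itself could be as simple as this freestanding sketch (just the idea; the actual scoring filter plugin interface has more to it than this, and the damping constant is made up):

    public class UrlDepthBoost {

      // Favor shallow URLs: count '/' characters in the path part and damp
      // the score as the path gets deeper, so http://host/ keeps a higher
      // score than http://host/a/b/page.html, all else being equal.
      static float boost(String url, float score) {
        int schemeEnd = url.indexOf("://");
        int pathStart = url.indexOf('/', schemeEnd + 3);
        int depth = 0;
        if (pathStart >= 0) {
          for (int i = pathStart; i < url.length(); i++) {
            if (url.charAt(i) == '/') depth++;
          }
        }
        int extra = depth > 1 ? depth - 1 : 0; // root page has depth 1
        return score / (1.0f + 0.5f * extra);
      }

      public static void main(String[] args) {
        System.out.println(boost("http://www.x.com/", 1.8f));              // 1.8
        System.out.println(boost("http://www.x.com/a/b/page.html", 1.8f)); // 0.9
      }
    }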

> 2. When "duplicates" really refer to the same page (e.g. X/ vs.
> X/index.html), entries should be merged. Really, these are just
> after-the-fact normalizations, but they are a class of normalizations which
> can't be done without comparing page fingerprints, since they are not true
> for all web servers.
>  

This should already happen when you run DeleteDuplicates (dedup). Dedup selects pages with the same fingerprint, and then retains only the newest version if the URLs are the same, OR the version with the shorter URL if the URLs are different.
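
In sketch form, the rule amounts to this (my own rendering of it, not the DeleteDuplicates code itself):

    public class DedupPreference {

      static class Version {
        String url;
        long fetchTime;
        Version(String url, long fetchTime) { this.url = url; this.fetchTime = fetchTime; }
      }

      // Among two versions with the same content fingerprint:
      // same URL -> keep the newest fetch; different URLs -> keep the shorter URL.
      static Version keep(Version a, Version b) {
        if (a.url.equals(b.url))
          return a.fetchTime >= b.fetchTime ? a : b;
        return a.url.length() <= b.url.length() ? a : b;
      }

      public static void main(String[] args) {
        Version root  = new Version("http://www.x.com/", 100L);
        Version index = new Version("http://www.x.com/index.html", 200L);
        System.out.println(keep(root, index).url); // shorter URL wins
      }
    }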


> 3. Redirects. The index keeps the redirect target, but marks the source as
> unfetched. This is unfortunate behavior, at least for the class of redirects
> where www.x.com redirects to www.x.com/y, which, like the above combination
> of issues, causes the root pages, and thus much of the important anchor
> text, to be dropped from the index. This seems related to, if not the same
> as, NUTCH-273 (https://issues.apache.org/jira/browse/NUTCH-273). I was
> simply planning to add these comments to that issue, unless someone hollers.
>  

Yes, as I indicated in that issue, pages we are redirected from should be marked as GONE, and definitely should be marked as fetched. Please add your comments if any aspect of what you just said is still missing from that issue.

> For all of the cases where we ignore/drop pages, we should think about what
> happens to the inbound anchor text. We should work very very hard to keep
> all the anchor text we have, it's by far the most important page feature for
> relevance.
>  

Agreed. This may not be so easy in some cases, due to the way Nutch works at the moment, but we should then discuss how to refactor it to support this.

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Missing pages & anchor text

Doug Cook
Hi, Andrzej.

Thanks for the quick response!

> Andrzej Bialecki wrote:
> Doug Cook wrote:
> > I'm thinking I should file issues on the following-
> >
> > 1. The scoring bug. Not sure what to file here, since such things are hard
> > to pin down. But defining an "inversion" as
> >         score(hostname/(index|default|home).(html|jsp|asp|cfm|etc)) >
> > score(hostname)
> > on a ~2.5Mdoc database, where I have about 8100 such pairs, 6558 were
> > inversions and only 1585 were "okay." Is this likely to be correct behavior
> > for OPIC scores? Is this a likely manifestation of a known bug? It doesn't
> > seem correct, but then, it's early and I still need more coffee ;-) In any
> > case, this causes the "wrong" versions of the pages to be selected most of
> > the time during dedup, and I've lost >6500 of the most important, most
> > anchor-text-rich pages in my index -- a significant relevance issue.
> >  
>
> The default scoring-opic is admittedly buggy (even if the original
> algorithm is suitable for page scoring, which is not obvious at all).
> However, the inversion problem that you see may stem from the way these
> sites are interlinked - perhaps there really is a lot of inlinks
> pointing to sub-pages instead of roots of the sites?

I thought of that, but at least on a cursory examination, the root pages I looked at had more inbound anchor text, which leads me to believe that they have more (at least external) links. I'll investigate further and let you know what I find.

> Anyway, if you feel that shorter urls should get a higher score, then
> you can add a scoring filter to the chain, and in it boost the score
> based on the url length.

I'm not sure that "shorter URLs" is necessarily the right way to do it. Within a host, that probably works fairly well. But imagine a host X and its mirror X' - one of these two will generally be the "canonical" form of the hostname, and it may or may not be the one with the shorter name. The more linked-to version is probably the right one. Though perhaps the way to solve that is to think about it as a normalization problem, and build a fast "mirror table" into the normalizer. And as we dedup, if we see a lot of duplicate sets of the form:
    X/blahblah
    X'/blahblah
then we identify (X,X') as "mirror candidates" and put them in the mirror table with a hypothesis for which is the canonical version. Then we never have to dedup them again, and all of the anchor text issues are solved as well.
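
Roughly the bookkeeping I'm imagining (entirely hypothetical: the threshold and the shorter-host heuristic are made up, and linkcounts might well be a better tie-breaker):

    import java.util.HashMap;
    import java.util.Map;

    public class MirrorTable {

      // (hostA|hostB) -> number of duplicate sets seen with that host pair
      static final Map<String, Integer> pairCounts = new HashMap<String, Integer>();
      // confirmed mirror host -> canonical host
      static final Map<String, String> mirrors = new HashMap<String, String>();
      static final int THRESHOLD = 10; // made-up promotion threshold

      // called whenever dedup finds X/path and X'/path with identical content
      static void observeDuplicate(String hostA, String hostB) {
        // order the pair so (a,b) and (b,a) count together
        String key = hostA.compareTo(hostB) <= 0 ? hostA + "|" + hostB
                                                 : hostB + "|" + hostA;
        int n = pairCounts.containsKey(key) ? pairCounts.get(key) + 1 : 1;
        pairCounts.put(key, n);
        if (n == THRESHOLD) {
          // hypothesis: the shorter host name is canonical (could instead
          // use whichever host attracts more inbound links)
          String canonical = hostA.length() <= hostB.length() ? hostA : hostB;
          String mirror = canonical.equals(hostA) ? hostB : hostA;
          mirrors.put(mirror, canonical);
        }
      }

      // normalizer hook: rewrite mirror hosts to their canonical form
      static String normalizeHost(String host) {
        String canonical = mirrors.get(host);
        return canonical != null ? canonical : host;
      }

      public static void main(String[] args) {
        for (int i = 0; i < 10; i++) observeDuplicate("www.x.com", "mirror.x.org");
        System.out.println(normalizeHost("mirror.x.org")); // -> www.x.com
      }
    }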

> > 2. When "duplicates" really refer to the same page (e.g. X/ vs.
> > X/index.html), entries should be merged. Really, these are just
> > after-the-fact normalizations, but they are a class of normalizations which
> > can't be done without comparing page fingerprints, since they are not true
> > for all web servers.
> >  
>
> This should already happen when you run DeleteDuplicates (dedup). Dedup
> selects pages with the same fingerprint, and then retains only the newest
> version if the URLs are the same, OR the version with the shorter URL if
> the URLs are different.

I'm not sure I follow you. I thought that dedup used the score -- in which case www.x.com/index.html will win out over www.x.com/ when one of the aforementioned "score inversions" takes place. And I also thought that the "losing" URL was simply dropped, thus effectively losing its anchor text. What I meant when I said "merged" above was that the anchor text from the "losing" version of the URL is effectively merged into that of the "winning" URL when the two are found to be not just copies of the same document, but actually the same document, so that no anchor text is lost. To give a concrete example:

pre-dedup:
http://www.x.com/
    12 inbound links & anchor text
http://www.x.com/index.html
    3 inbound links & anchor text

post-dedup (ideally):
http://www.x.com/
    15 inbound links & anchor text

post-dedup (currently):
http://www.x.com/index.html
    3 inbound links & anchor text

Something more or less identical to this should also happen in the (fairly common) case where the root page is a redirect to an "internal home page" (as, for example, on http://www.diageo.com/; see below in the redirect discussion). We may also want to do something like this for the case of site mirrors -- handling these as normalizations would automatically do this.

Please pardon me if I'm misunderstanding -- I'm just going from the behavior I see and the documentation & code comments; I haven't yet done a detailed read-through of the code!

> > 3. Redirects. The index keeps the redirect target, but marks the source as
> > unfetched. This is unfortunate behavior, at least for the class of redirects
> > where www.x.com redirects to www.x.com/y, which, like the above combination
> > of issues, causes the root pages, and thus much of the important anchor
> > text, to be dropped from the index. This seems related to, if not the same
> > as, NUTCH-273 (https://issues.apache.org/jira/browse/NUTCH-273). I was
> > simply planning to add these comments to that issue, unless someone hollers.
> >  
>
> Yes, as I indicated in that issue, pages we are redirected from should
> be marked as GONE, and definitely should be marked as fetched. Please
> add your comments if any aspect of what you just said is still missing
> from that issue.

A redirect origin should not necessarily be considered GONE. In many cases, the redirect origin is the "canonical version" of the page, and the target is the "transitory version," as with most internal root-page redirects (see the Diageo example above). We should keep those versions of the page. If a user searches for Diageo, they expect to get www.diageo.com, not some long complicated subpage URL.

Again, just to clarify what is happening here, I'm seeing something like:

In my crawl/link databases:
    http://www.diageo.com/
        30 anchor text strings
        Marked as UNFETCHED because it is a redirect
    http://www.diageo.com/en-row/homepage.htm
        3 anchor text strings
        Marked as FETCHED

In an IDEAL index:
    http://www.diageo.com/
    (and maybe http://www.diageo.com/en-row/homepage.htm indexed as an alias of this)
        33 inbound anchor text strings

In my CURRENT index:
    http://www.diageo.com/en-row/homepage.htm
        3 anchor text strings

This is obviously not optimal for relevance!

I like Doug C's -- oh shoot, too many Doug Cs around here! ;-) -- Doug Cutting's idea (on NUTCH-273) that we want to remember all of the redirects to a given page at index time. We should also remember all of the metadata/anchor text for those pages, and then we should make an intelligent decision at index time about which anchor text to include, and even what URL(s) to call this within the index. Thus we could arrive at the "ideal" index above.
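
In sketch form, an index-time pass over the remembered redirects might look like this (completely hypothetical code; the shortest-URL preference for the displayed URL and the anchor merging are the points I care about):

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class RedirectAliases {

      // redirect source -> final target, remembered at fetch time
      static final Map<String, String> redirects = new HashMap<String, String>();
      // url -> inbound anchor text
      static final Map<String, List<String>> anchors = new HashMap<String, List<String>>();

      static void addAnchor(String url, String text) {
        List<String> list = anchors.get(url);
        if (list == null) { list = new ArrayList<String>(); anchors.put(url, list); }
        list.add(text);
      }

      // At index time: index the target once, under the "best" URL, with the
      // anchor text of the target AND of all redirect sources pointing at it.
      static void indexWithAliases(String target) {
        String display = target;
        List<String> merged = new ArrayList<String>();
        if (anchors.containsKey(target)) merged.addAll(anchors.get(target));
        for (Map.Entry<String, String> r : redirects.entrySet()) {
          if (!r.getValue().equals(target)) continue;
          String source = r.getKey();
          if (anchors.containsKey(source)) merged.addAll(anchors.get(source));
          // prefer the root/shortest URL as the one users see
          if (source.length() < display.length()) display = source;
        }
        System.out.println("index " + display + " with " + merged.size() + " anchors");
      }

      public static void main(String[] args) {
        redirects.put("http://www.diageo.com/",
                      "http://www.diageo.com/en-row/homepage.htm");
        for (int i = 0; i < 30; i++) addAnchor("http://www.diageo.com/", "Diageo");
        for (int i = 0; i < 3; i++)
          addAnchor("http://www.diageo.com/en-row/homepage.htm", "homepage");
        indexWithAliases("http://www.diageo.com/en-row/homepage.htm");
        // -> index http://www.diageo.com/ with 33 anchors
      }
    }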

I'm struggling to get the most out of the meager anchor text in my relatively small index. Handling dups, mirrors, and redirects in a way that allows us to use all of the anchor text will be a significant relevance boost.  Thanks for listening to my rant -- and apologies again for any misunderstandings I may have, I'm getting up the (steep) Nutch learning curve.

Doug