redirect treatment

redirect treatment

waterwheel
How are redirects listed in version 0.7?  If the crawler finds a link like:
www.domain.com/?code.aspx&redirect=445454
and that link redirects through to www.another-domain.com, which of
those two links will show up in nutch?

(I'm wondering if I can use nutch to crawl sites with a lot of
redirects, and still end up with the correct redirected domain in the
listings).

Re: redirect treatment

Dennis Kubes
Protocol-level redirects (ASP redirects), meaning the server sends a
3xx redirect response code, work correctly in Nutch 0.8 dev.  Nutch
processes the target as a completely new page.  If you are doing ASP
forwards, I believe the original page
(www.domain.com/?code.aspx&redirect=445454) would be the URL that shows
up in the search, because Nutch doesn't know what is going on behind the
scenes in the ASP code.  It knows only the URL and the content received.

As of right now in 0.8 dev, meta-level redirects (meta refresh tags)
don't work correctly.  They did in 0.7, but I don't think that
functionality has been ported.

Dennis

Re: redirect treatment

waterwheel
Perhaps a point of clarification - I'm assuming that the
www.domain.com/?code.asp&redirect=444 actually sends a redirect header
to the new page.  In that case (I don't know enough about protocols
personally to be sure) it seems that nutch would have to recognize that
it's being redirected and refetch at the new location.  Am I correct?  
And if so, wouldn't nutch then index and display the new, redirected page?

I'm using version 0.7, btw.

thanks,
Glenn


Re: redirect treatment

Dennis Kubes
There are three kinds of "redirects".  One is where the server forwards
to a different page behind the scenes and returns that page's output.
This is usually called a forward.  Two is where the server sends a
redirect code (usually in the 300 range).  The browser then requests the
page it was redirected to.  This is usually called a protocol redirect,
or just a redirect in JSP and ASP terms.  Three is where the page has a
meta-refresh tag in its HTML head.  This is known as a content redirect
or a meta redirect.  Here the client doesn't get a redirect code in the
HTTP response, but after a certain amount of time it will request the
page named in the url section of the meta-refresh tag.
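
To make the three cases concrete, here is a minimal Python sketch of what a crawler can observe from a single response.  It is not Nutch code: the function name `classify_response`, the plain-dict `headers` argument keyed by "Location", and the regular expressions are all assumptions made for illustration.

```python
import re

# Hypothetical helper, not part of Nutch. It only looks at what a client
# can see: the status code, the response headers, and the returned HTML.
META_REFRESH = re.compile(
    r'<meta[^>]+http-equiv\s*=\s*["\']?refresh["\']?[^>]*>',
    re.IGNORECASE,
)
REFRESH_URL = re.compile(r'url\s*=\s*["\']?([^"\'>\s;]+)', re.IGNORECASE)


def classify_response(status, headers, body):
    """Return (kind, target); kind is 'protocol', 'meta' or 'none'."""
    # Protocol redirect: a 3xx status plus a Location header.
    if 300 <= status < 400 and "Location" in headers:
        return "protocol", headers["Location"]
    # Meta redirect: a <meta http-equiv="refresh"> tag in the HTML.
    tag = META_REFRESH.search(body)
    if tag:
        url = REFRESH_URL.search(tag.group(0))
        if url:
            return "meta", url.group(1)
    # Forward (or an ordinary page): the client sees nothing special.
    return "none", None
```

Note that a server-side forward falls into the 'none' case, which matches the point above: from the outside it looks like any other page.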

If (www.domain.com/?code.asp&redirect=444) does a forward, then Nutch
doesn't know anything about it and will just index the content returned
under the original URL.  If it sends a protocol redirect, then Nutch
goes and requests the new page and will index the new page under the new
URL.  Nutch will follow redirects up to http.redirect.max times.  So if
the redirect page redirects again, Nutch will follow that one as well, up
to the max number of times.  If the URL variable "redirect" is used to
populate a meta-refresh tag, then as of right now Nutch won't follow the
redirect.  I think it fails with a NullPointerException right now.

The meta-refresh was working in 0.7.2 but is broken in 0.8.  Andrzej
Bialecki said he was looking into fixing it.  Hope this helps you
understand what is happening with the fetch.
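
As a rough illustration of the "follow up to http.redirect.max times" behaviour, here is a small Python sketch.  It is not how Nutch implements it: `follow_redirects` and the `fetch` callable are hypothetical, and the default of 3 is only an example value, not a claim about Nutch's configured default.

```python
def follow_redirects(start_url, fetch, redirect_max=3):
    """Return the URL that would end up indexed, or None if the chain
    has more than redirect_max protocol redirects.

    fetch(url) is a hypothetical callable: it returns the redirect
    target for url, or None when url returns content directly.
    """
    url = start_url
    # One initial fetch plus at most redirect_max follow-up fetches.
    for _ in range(redirect_max + 1):
        target = fetch(url)
        if target is None:
            return url  # content found; index it under this URL
        url = target
    return None  # too many redirects; give up on this chain
```

For example, with a chain a -> b -> c described by a dict, `follow_redirects("a", {"a": "b", "b": "c"}.get)` ends at "c", while the same call with `redirect_max=1` gives up.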

Dennis

RE: redirect treatment

Dalton, Jeffery
In reply to this post by waterwheel
Let's see if I understand this.  First, let's focus on the protocol
redirect (3xx): Nutch goes and requests the new page under the new URL.
If so, I believe there are cases where that is less desirable and a
more nuanced treatment is needed.  Please see this recent SE Watch
article on the topic:
http://blog.searchenginewatch.com/blog/050801-130330

It references Yahoo's redirect policy for Slurp:
http://help.yahoo.com/help/us/ysearch/slurp/slurp-11.html

Yahoo treats a meta-refresh (which it refers to as a "forward")
differently depending on the delay time: like a 302 when the delay is
short, and like a 301 when the delay is long.

Nutch's policy differs from Yahoo's in that Nutch treats all 3xx codes
and meta-refreshes identically, always indexing the content under the
target URL.  Yahoo is more selective.  There are several cases where
Yahoo indexes the target content under the source URL (like a pointer):
on-site 302s and short, on-site meta-refreshes.  We might want to
consider modifying Nutch to follow these conventions being set by Yahoo
(and to some extent, Google).  The rationale behind all of this is that
most people prefer, and link to, shorter (homepage) URLs rather than
longer URLs (try the IEEE homepage).
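
To make the proposal concrete, here is a small Python sketch of the rule as described above.  It is only my reading of the policy, not Yahoo's or Nutch's actual code: `index_url`, `same_site`, comparing hostnames to decide "on-site", and the one-second cutoff for a "short" refresh are all assumptions, and every case not named above simply falls back to the target URL.

```python
from urllib.parse import urlsplit

# Hypothetical cutoff between a "short" and a "long" meta-refresh delay.
SHORT_REFRESH_SECONDS = 1


def same_site(a, b):
    # Assumption: "on-site" means the same hostname. Expects absolute
    # URLs with a scheme, e.g. "http://www.domain.com/".
    return urlsplit(a).hostname == urlsplit(b).hostname


def index_url(source, target, kind, refresh_delay=0):
    """Pick the URL to index the target content under.

    kind is '301', '302' or 'meta'; refresh_delay (seconds) is only
    used for 'meta'.
    """
    if kind == "meta":
        # Short refreshes are treated like a 302, long ones like a 301.
        kind = "302" if refresh_delay <= SHORT_REFRESH_SECONDS else "301"
    if kind == "302" and same_site(source, target):
        return source  # on-site temporary redirect: keep the short URL
    return target  # everything else: index under the target URL
```

Under this sketch a homepage that 302s to a long on-site URL stays listed under the homepage URL, while the cross-domain redirect from the original question is still listed under www.another-domain.com.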

Thoughts?

- Jeff

Re: redirect treatment

Andrzej Białecki-2
In reply to this post by Dennis Kubes
Dennis Kubes wrote:
> The meta-refresh was working in 0.7.2 but is broken in 0.8.  Andrzej
> Bialecki said he was looking into fixing it.  Hope this helps you
> understand what is happening with the fetch.

It should work again, please test revisions later than r393297. (Note:
I'm away till 24th, so I may not respond before that).

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com