Http Max Delays


Http Max Delays

Christophe Noel
Hello,

When I'm fetching, I get far too many HTTP timeouts with the default
Nutch parameters.

Does anyone have tips to improve this?

Thanks very much.

Christophe Noël.
www.cetic.be

=====

org.apache.nutch.protocol.RetryLater: Exceeded http.max.delays: retry later.
        at org.apache.nutch.protocol.httpclient.Http.blockAddr(Http.java:133)
        at org.apache.nutch.protocol.httpclient.Http.getProtocolOutput(Http.java:201)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:135)
org.apache.nutch.protocol.RetryLater: Exceeded http.max.delays: retry later.
        at org.apache.nutch.protocol.httpclient.Http.blockAddr(Http.java:133)
        at org.apache.nutch.protocol.httpclient.Http.getProtocolOutput(Http.java:201)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:135)

ranking algorithm

em-13
Is there a chance that the ranking algorithm in Analyze would give higher
value to a subpage than the root domain page?

For example:
http://abc.com  <- 34.432
http://abc.com/something.html <- 50


Is the above scenario possible, or does nutch always rank root pages
highest?

Regards,
EM


Re: ranking algorithm

Fredrik Andersson-2-2
Nutch determines the pages' scores from the number of inbound links
and the authority value of those links. HITS-ish algorithm. If a
sub-level page has more inbound links and/or more important ones,
it'll outscore the front page, which usually has a high score. A nice
solution would be to modify the weighting step, where
internal/external links are weighted to a score, and add consideration
of the depth of the page as well... that should be relatively
painless. You could also hack the segment manually after you created
it (SegmentReader/SegmentWriter), giving fake inbound links to certain
pages or just modifying their score.

Anywho, maybe that gives you an idea or two. I have pretty poor
knowledge of the actual ranking algorithm, so perhaps someone will
come up with better suggestions...

Fredrik

On 7/28/05, EM <[hidden email]> wrote:

> Is there a chance that the ranking algorithm in Analyze would give higher
> value to a subpage than the root domain page?
>
> For example:
> http://abc.com  <- 34.432
> http://abc.com/something.html <- 50
>
>
> Is the above scenario possible, or does nutch always rank root pages
> highest?
>
> Regards,
> EM
>
>
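Fredrik's description (a page's score comes from the number of inbound links and the authority of the pages behind them) is, as he says, HITS-ish. A minimal sketch of one HITS-style update, for illustration only: this is not Nutch's actual code, and the class and method names are made up.

```java
// One iteration of a HITS-style "hubs and authorities" update.
// links[i] lists the pages that page i links to.
public class HitsSketch {

    // Returns {newHub, newAuth} after a single update pass.
    public static double[][] iterate(int[][] links, double[] hub, double[] auth) {
        int n = hub.length;
        double[] newAuth = new double[n];
        double[] newHub = new double[n];
        // Authority: sum of the hub scores of the pages linking in.
        for (int i = 0; i < n; i++)
            for (int target : links[i])
                newAuth[target] += hub[i];
        // Hub: sum of the authority scores of the pages linked to.
        for (int i = 0; i < n; i++)
            for (int target : links[i])
                newHub[i] += newAuth[target];
        normalize(newAuth);
        normalize(newHub);
        return new double[][] { newHub, newAuth };
    }

    // Scale a vector to unit length so scores stay comparable across iterations.
    static void normalize(double[] v) {
        double s = 0;
        for (double x : v) s += x * x;
        s = Math.sqrt(s);
        if (s > 0) for (int i = 0; i < v.length; i++) v[i] /= s;
    }
}
```

The point relevant to EM's question: nothing in the update cares whether a URL is a root page or a subpage, so a heavily-linked subpage can outscore its front page.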

Re: ranking algorithm

Jay Pound
Is there a whitepaper on the Nutch ranking algorithm, or some in-depth
info on it anywhere?
Thanks,
-J



Re: ranking algorithm

Fredrik Andersson-2-2
It's open source, there's your in-depth info! : )

Kleinberg's original report, "Authoritative sources in a hyperlinked
environment", can be downloaded at
http://www.cs.cornell.edu/home/kleinber/ . It has been tuned up by
various people since it was released, but the principles of "hubs" and
"authorities" are still the same.

Fredrik

On 7/28/05, Jay Pound <[hidden email]> wrote:
> is there a whitepaper on the algorithm for nutch, or some in-depth info on
> it anywhere?
> Thanks,
> -J
>
>
>

Re: Re: ranking algorithm

Massimo Miccoli
Hi,

One clarification: without the Analyze step, a page's score is
calculated only from its inlink count.

Fredrik Andersson ha scritto:

>It's open source, there's your in-depth info! : )
>
>Kleinbergs original report "Authoritative sources in a hyperlinked
>environment" can be downloaded at
>http://www.cs.cornell.edu/home/kleinber/ . It has been tuned up by
>various people since it was released, but the principle of "hubs" and
>"authorities" are still the same.
>
>Fredrik
>
>On 7/28/05, Jay Pound <[hidden email]> wrote:
>  
>
>>is there a whitepaper on the algorithm for nutch, or some in-depth info on
>>it anywhere?
>>Thanks,
>>-J
>>
>>
>>
>>    
>>
>
>

RE: ranking algorithm

Andrey Ilinykh
In reply to this post by em-13
A bit off-topic: the Nutch ranking algorithm uses both score and
nextScore. Can anyone explain why we need nextScore?
Thank you,
  Andrey

-----Original Message-----
From: Fredrik Andersson [mailto:[hidden email]]
Sent: Thursday, July 28, 2005 7:42 AM
To: [hidden email]
Subject: Re: ranking algorithm


It's open source, there's your in-depth info! : )

Kleinbergs original report "Authoritative sources in a hyperlinked
environment" can be downloaded at
http://www.cs.cornell.edu/home/kleinber/ . It has been tuned up by
various people since it was released, but the principle of "hubs" and
"authorities" are still the same.

Fredrik

On 7/28/05, Jay Pound <[hidden email]> wrote:
> is there a whitepaper on the algorithm for nutch, or some in-depth info on
> it anywhere?
> Thanks,
> -J
>
>
>

Re: ranking algorithm

Piotr Kosiorowski
nextScore holds intermediate values during the PageRank calculation.
Regards
Piotr

On 7/28/05, Andrey Ilinykh <[hidden email]> wrote:

> A little bit offtopic. Nutch ranking algorithm uses score and nextScore. Who
> can explain why we need nextScore?
> Thank you,
>   Andrey
>
> -----Original Message-----
> From: Fredrik Andersson [mailto:[hidden email]]
> Sent: Thursday, July 28, 2005 7:42 AM
> To: [hidden email]
> Subject: Re: ranking algorithm
>
>
> It's open source, there's your in-depth info! : )
>
> Kleinbergs original report "Authoritative sources in a hyperlinked
> environment" can be downloaded at
> http://www.cs.cornell.edu/home/kleinber/ . It has been tuned up by
> various people since it was released, but the principle of "hubs" and
> "authorities" are still the same.
>
> Fredrik
>
> On 7/28/05, Jay Pound <[hidden email]> wrote:
> > is there a whitepaper on the algorithm for nutch, or some in-depth info on
> > it anywhere?
> > Thanks,
> > -J
> >
> >
> >
>
>
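The score/nextScore pair is the classic double-buffering of an iterative link-analysis pass: every new score must be computed from the *previous* iteration's scores, so updates are written into nextScore and only swapped in once the whole pass is done. A minimal PageRank-style sketch of the idea, illustrative only and not Nutch's actual code:

```java
// One PageRank-style iteration using a separate nextScore buffer.
// links[i] lists the pages that page i links to.
public class PageRankSketch {

    public static double[] iterate(int[][] links, double[] score, double damping) {
        int n = score.length;
        double[] nextScore = new double[n];
        // Base probability of a random jump to any page.
        java.util.Arrays.fill(nextScore, (1.0 - damping) / n);
        for (int i = 0; i < n; i++) {
            if (links[i].length == 0) continue;
            // Split this page's OLD score evenly among its outlinks.
            double share = damping * score[i] / links[i].length;
            for (int target : links[i])
                nextScore[target] += share;
        }
        // Caller swaps the buffers: score = nextScore for the next iteration.
        return nextScore;
    }
}
```

If the update wrote into score directly, pages processed later in the pass would see a mixture of old and new values, and the result would depend on iteration order.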

Re: Http Max Delays

Drew Farris
In reply to this post by Christophe Noel
By any chance are you crawling many pages stored on a single server or
a small number of servers? If so, take a look at:

http://www.mail-archive.com/nutch-developers%40lists.sourceforge.net/msg04414.html
http://www.mail-archive.com/nutch-developers%40lists.sourceforge.net/msg04427.html

On 7/27/05, Christophe Noel <[hidden email]> wrote:

> Hello,
>
> When I'm fetching , I really have too many Http Timeout with default
> nutch parameters.
>
> Does anyone have tips to improve that point ?
>
> Thanks very much.
>
> Christophe Noël.
> www.cetic.be
>
> =====
>
> org.apache.nutch.protocol.RetryLater: Exceeded http.max.delays: retry later.
>         at org.apache.nutch.protocol.httpclient.Http.blockAddr(Http.java:133)
>         at org.apache.nutch.protocol.httpclient.Http.getProtocolOutput(Http.java:201)
>         at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:135)
> org.apache.nutch.protocol.RetryLater: Exceeded http.max.delays: retry later.
>         at org.apache.nutch.protocol.httpclient.Http.blockAddr(Http.java:133)
>         at org.apache.nutch.protocol.httpclient.Http.getProtocolOutput(Http.java:201)
>         at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:135)
>

Re: Http Max Delays

Michael Ji
I ran into that problem before; after I increased the http.timeout and
http.max.delays values to roughly 100 times the default settings, the
problem went away.

You can check the defaults in nutch-default.xml and override them in
nutch-site.xml.

Michael,
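Concretely, the override means adding property entries to nutch-site.xml (inside the config file's root element). A hedged sketch: the property names come from nutch-default.xml, but the values below are only examples to illustrate the mechanism, not tuned recommendations.

```xml
<!-- nutch-site.xml: overrides take precedence over nutch-default.xml -->
<property>
  <name>http.timeout</name>
  <!-- milliseconds to wait for an HTTP response (example value) -->
  <value>30000</value>
</property>
<property>
  <name>http.max.delays</name>
  <!-- how many times a fetcher thread will wait on a busy host
       before giving up with RetryLater (example value) -->
  <value>100</value>
</property>
```

Note the trade-off: raising these makes the fetcher more patient with slow or busy hosts, at the cost of slower overall fetch throughput.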

--- Drew Farris <[hidden email]> wrote:

> By any chance are you crawling many pages stored on a single server or
> small number of servers? If so, take a look at:
>
> http://www.mail-archive.com/nutch-developers%40lists.sourceforge.net/msg04414.html
> http://www.mail-archive.com/nutch-developers%40lists.sourceforge.net/msg04427.html
>
> On 7/27/05, Christophe Noel <[hidden email]> wrote:
> > Hello,
> >
> > When I'm fetching, I really have too many Http Timeout with default
> > nutch parameters.
> >
> > Does anyone have tips to improve that point?
> >
> > Thanks very much.
> >
> > Christophe Noël.
> > www.cetic.be
> >
> > [...]



               