Fetcher2's delay between successive requests


Fetcher2's delay between successive requests

Doğacan Güney-3
Hi all,

I have been working on Fetcher2 code lately and I came across this
particular code (in FetchItemQueue.getFetchItem) that I didn't quite
understand:

public FetchItem getFetchItem() {
  ...
  long last = endTime.get() + (maxThreads > 1 ? crawlDelay : minCrawlDelay);
  ...
}

Now, the 'default' politeness behaviour should be 1 thread per host,
delaying n seconds between successive requests to that host,
right? But won't this code wait only minCrawlDelay (which, by default,
is 0) if maxThreads == 1?

I also did not understand why there is a maxThreads check at all. Each
individual thread should wait the crawl delay before making another
request to the same host. Am I missing something here?
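
To make the question concrete, here is a tiny standalone sketch (not
the actual FetchItemQueue code, just that ternary in isolation, with
made-up values) of which delay gets picked:

public class DelaySketch {
  public static void main(String[] args) {
    long crawlDelay = 5000;    // e.g. fetcher.server.delay of 5 seconds
    long minCrawlDelay = 0;    // e.g. fetcher.server.min.delay, 0 by default
    long endTime = System.currentTimeMillis();

    for (int maxThreads : new int[] {1, 2}) {
      long last = endTime + (maxThreads > 1 ? crawlDelay : minCrawlDelay);
      System.out.println("maxThreads=" + maxThreads + " -> wait "
          + (last - endTime) + " ms before the next request");
    }
    // maxThreads=1 -> wait 0 ms (no politeness delay at all)
    // maxThreads=2 -> wait 5000 ms
  }
}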

--
Doğacan Güney

Re: Fetcher2's delay between successive requests

Doğacan Güney-3
I have discovered another bug in Fetcher2. The lib-http plugin checks
Protocol.CHECK_{BLOCKING,ROBOTS} (which resolve to the strings
protocol.plugin.check.{blocking,robots}) to see if it should handle
blocking or not.

But Fetcher2 sets http.plugin.check.{blocking,robots} (notice the
protocol/http difference) to false to indicate that lib-http shouldn't
handle blocking internally. Because the keys don't match, when you use
Fetcher2, lib-http still tries to handle blocking itself, which makes
Fetcher2 much less useful.
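
To illustrate, here is a minimal sketch with a plain Hadoop
Configuration (the 'true' fallbacks are my assumption about
lib-http's defaults when the key is unset):

import org.apache.hadoop.conf.Configuration;

public class KeyMismatchSketch {
  public static void main(String[] args) {
    Configuration conf = new Configuration();

    // Fetcher2 sets these, intending to turn lib-http's own blocking off:
    conf.set("http.plugin.check.blocking", "false");
    conf.set("http.plugin.check.robots", "false");

    // ...but lib-http reads the protocol.* keys, so it never sees them
    // and falls back to its defaults:
    System.out.println(conf.getBoolean("protocol.plugin.check.blocking", true));
    System.out.println(conf.getBoolean("protocol.plugin.check.robots", true));
    // both print "true", so lib-http keeps blocking internally
  }
}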

I am not sending a patch for this yet because I first want to get some
feedback on the first bug.

--
Doğacan Güney

Re: Fetcher2's delay between successive requests

Andrzej Białecki-2
In reply to this post by Doğacan Güney-3
Doğacan Güney wrote:

> Hi all,
>
> I have been working on Fetcher2 code lately and I came across this
> particular code (in FetchItemQueue.getFetchItem) that I didn't quite
> understand:
>
> public FetchItem getFetchItem() {
>  ...
>  long last = endTime.get() + (maxThreads > 1 ? crawlDelay : minCrawlDelay);
>  ...
> }
>
> Now, the 'default' politeness behaviour should be 1 thread per host,
> delaying n seconds between successive requests to that host,
> right? But won't this code wait only minCrawlDelay (which, by default,
> is 0) if maxThreads == 1?

Yes, that was the intended behavior - normally, you should never use
more than 1 thread per host unless you have explicit permission to
do so.

If multiple threads make requests to the same host, then the crawl delay
parameter loses its usual meaning - see the details in the comments to
NUTCH-385. However, the sensible thing to do is to still provide a way
to limit the maximum rate of requests, and this is what the
minCrawlDelay parameter is for.
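
In other words, the intended policy is the following (just a sketch of
the intent):

long delay;
if (maxThreads > 1) {
  // Multiple threads per host: crawlDelay loses its usual meaning, but
  // minCrawlDelay still caps the overall request rate to the host.
  delay = minCrawlDelay;
} else {
  // The polite default: one thread per host, with the full crawlDelay
  // between successive requests to that host.
  delay = crawlDelay;
}
long last = endTime.get() + delay;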


>
> I also did not understand why there is a maxThreads check at all. Each
> individual thread should wait the crawl delay before making another
> request to the same host. Am I missing something here?


See the ASCII-art graphs and comments in NUTCH-385 - what you
describe is likely not what users expect.

Although this JIRA issue is still open, the Fetcher2 code tries to
implement this middle ground solution.

--
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: Fetcher2's delay between successive requests

Andrzej Białecki-2
In reply to this post by Doğacan Güney-3
Doğacan Güney wrote:

> I have discovered another bug in Fetcher2. The lib-http plugin checks
> Protocol.CHECK_{BLOCKING,ROBOTS} (which resolve to the strings
> protocol.plugin.check.{blocking,robots}) to see if it should handle
> blocking or not.
>
> But Fetcher2 sets http.plugin.check.{blocking,robots} (notice the
> protocol/http difference) to false to indicate that lib-http shouldn't
> handle blocking internally. Because the keys don't match, when you use
> Fetcher2, lib-http still tries to handle blocking itself, which makes
> Fetcher2 much less useful.
>

This is definitely a bug.


--
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: Fetcher2's delay between successive requests

Doğacan Güney-3
In reply to this post by Andrzej Białecki-2
On 4/24/07, Andrzej Bialecki <[hidden email]> wrote:

> Doğacan Güney wrote:
> > Hi all,
> >
> > I have been working on Fetcher2 code lately and I came across this
> > particular code (in FetchItemQueue.getFetchItem) that I didn't quite
> > understand:
> >
> > public FetchItem getFetchItem() {
> >  ...
> >  long last = endTime.get() + (maxThreads > 1 ? crawlDelay : minCrawlDelay);
> >  ...
> > }
> >
> > Now, the 'default' politeness behaviour should be 1 thread per host,
> > delaying n seconds between successive requests to that host,
> > right? But won't this code wait only minCrawlDelay (which, by default,
> > is 0) if maxThreads == 1?
>
> Yes, that was the intended behavior - normally, you should never use
> more than 1 thread per host unless you have explicit permission to
> do so.
>
> If multiple threads make requests to the same host, then the crawl delay
> parameter loses its usual meaning - see the details in the comments to
> NUTCH-385. However, the sensible thing to do is to still provide a way
> to limit the maximum rate of requests, and this is what the
> minCrawlDelay parameter is for.

I don't get it. The code seems to do exactly the opposite of what you
are saying. If maxThreads == 1, then maxThreads > 1 is false, and thus
the expression evaluates to minCrawlDelay, not crawlDelay. Shouldn't
the expression be (maxThreads > 1 ? minCrawlDelay : crawlDelay)?
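
I.e., keeping the rest of getFetchItem as it is, the line would become:

// Swapped operands: multiple threads per host -> minCrawlDelay as a
// rate cap, a single thread per host -> the full crawlDelay.
long last = endTime.get() + (maxThreads > 1 ? minCrawlDelay : crawlDelay);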

>
>
> >
> > I also did not understand why there is a maxThreads check at all. Each
> > individual thread should wait the crawl delay before making another
> > request to the same host. Am I missing something here?
>
>
> See the ASCII-art graphs and comments in NUTCH-385 - what you
> describe is likely not what users expect.
>
> Although this JIRA issue is still open, the Fetcher2 code tries to
> implement this middle ground solution.

OK. I guess this approach is good enough.

>
> --
> Best regards,
> Andrzej Bialecki     <><
>   ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>
>


--
Doğacan Güney

Re: Fetcher2's delay between successive requests

Andrzej Białecki-2
Doğacan Güney wrote:

> I don't get it. The code seems to do exactly the opposite of what you
> are saying. If maxThreads == 1 then maxThreads > 1 is false thus the
> expression evaluates to minCrawlDelay not crawlDelay. Shouldn't the
> expression be (maxThreads > 1 ? minCrawlDelay : crawlDelay) ?

Yep, you're right - it's a bug. However, the reasoning that I presented
still holds; it's just the implementation that doesn't get it ;)


--
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: Fetcher2's delay between successive requests

Doğacan Güney-3
On 4/24/07, Andrzej Bialecki <[hidden email]> wrote:

> Doğacan Güney wrote:
>
> > I don't get it. The code seems to do exactly the opposite of what you
> > are saying. If maxThreads == 1, then maxThreads > 1 is false, and thus
> > the expression evaluates to minCrawlDelay, not crawlDelay. Shouldn't
> > the expression be (maxThreads > 1 ? minCrawlDelay : crawlDelay)?
>
> Yep, you're right - it's a bug. However, the reasoning that I presented
> still holds; it's just the implementation that doesn't get it ;)
>

Heh, OK :). I opened an issue for these bugs (NUTCH-474) and attached a patch.

>
> --
> Best regards,
> Andrzej Bialecki     <><
>   ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>
>


--
Doğacan Güney