Two possible extensions

Guenter, Matthias
Hi
Would it be of interest to the project to have an extension of the crawl tool that allows:
- shaping the bandwidth used (inbound)
- keeping the number of requests per second within a certain limit
- scheduling these limits differently for working hours and night-time (see the sketch below)
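
To make the idea concrete, here is a minimal sketch of how the per-second limit and the day/night schedule could fit together (the class and the example hours are hypothetical, nothing from the Nutch sources):

import java.util.Calendar;

/**
 * Hypothetical throttle: caps requests per second, with different caps
 * for working hours and night-time. Bandwidth shaping could be layered
 * on the same idea by counting bytes instead of requests.
 */
public class ScheduledThrottle {
    private final double dayRps;    // max requests/sec during working hours
    private final double nightRps;  // max requests/sec at night
    private long lastRequest = 0L;

    public ScheduledThrottle(double dayRps, double nightRps) {
        this.dayRps = dayRps;
        this.nightRps = nightRps;
    }

    /** Blocks until the next request is allowed under the current cap. */
    public synchronized void acquire() throws InterruptedException {
        int hour = Calendar.getInstance().get(Calendar.HOUR_OF_DAY);
        // Example schedule: 08:00-18:00 counts as working hours.
        double rps = (hour >= 8 && hour < 18) ? dayRps : nightRps;
        long minIntervalMs = (long) (1000.0 / rps);
        long wait = lastRequest + minIntervalMs - System.currentTimeMillis();
        if (wait > 0) Thread.sleep(wait);
        lastRequest = System.currentTimeMillis();
    }
}

A fetcher thread would call acquire() before each request.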

And an extension that crawls only file:/http: URLs which have changed after a given date.
Something like: sh ./nutch crawl -changedafter="2006-01-04"?

The code could be delivered end of April as part of a student project.

Kind regards

Matthias Günter

Re: Two possible extensions

Andrzej Białecki
Guenter, Matthias wrote:
> Hi
> Would it be of interest to the project to have an extension of the crawl tool that allows:
> - shaping the bandwidth used (inbound)
> - keeping the number of requests per second within a certain limit
> - scheduling these limits differently for working hours and night-time
>  

I'm assuming we are talking about the SVN trunk/ (other branches are in
maintenance mode only, no new features). With the current trunk/ being
based on map-reduce, I think this would require something like a central
"lock manager" - this would come in very handy for other plugins, too. E.g.
the protocol plugins currently don't split the fetchlists (i.e. fetching
is performed by a single task) because they have no way to coordinate
access to target hosts among distributed fetching tasks.
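
Just to illustrate, the lease API could be as simple as this (a
hypothetical interface, nothing like it exists in trunk/ yet):

/**
 * Hypothetical central lock manager: a distributed fetcher task would
 * lease a host before fetching from it, so that politeness limits hold
 * across all tasks instead of only within a single task.
 */
public interface HostLockManager {
    /** Try to lease the host for leaseMillis; false if another task holds it. */
    boolean tryLock(String host, String taskId, long leaseMillis);

    /** Release the lease once this task is done fetching from the host. */
    void unlock(String host, String taskId);
}

A task that fails to get the lease would simply requeue the URL and
move on to another host.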

> And an extension that crawls only file:/http: URLs which have changed after a given date.
>  

Please see the code in NUTCH-61.
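
For the http: case the underlying mechanism is the If-Modified-Since
header (a 304 response means the page is unchanged); for file: URLs
it's the local timestamp. A minimal sketch, independent of the
NUTCH-61 code (the class name is made up):

import java.io.File;
import java.net.HttpURLConnection;
import java.net.URL;
import java.text.SimpleDateFormat;

/** Hypothetical -changedafter check: true if the URL changed after the cutoff. */
public class ChangedAfterFilter {
    private final long cutoff;

    public ChangedAfterFilter(String yyyyMmDd) throws Exception {
        cutoff = new SimpleDateFormat("yyyy-MM-dd").parse(yyyyMmDd).getTime();
    }

    public boolean shouldFetch(String url) throws Exception {
        if (url.startsWith("file:")) {
            // Local files carry their modification time directly.
            return new File(new URL(url).getPath()).lastModified() > cutoff;
        }
        // For http:, ask the server; 304 Not Modified means skip the URL.
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setRequestMethod("HEAD");
        conn.setIfModifiedSince(cutoff);
        return conn.getResponseCode() != HttpURLConnection.HTTP_NOT_MODIFIED;
    }
}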

> Something like: sh ./nutch crawl -changedafter="2006-01-04"?
>
> The code could be delivered end of April as part of a student project.
>  

Certainly, it sounds interesting. However, I think it's essential for
acceptance by the community, and for general usefulness, that this be
coordinated with the existing efforts and discussed on the mailing lists.

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: Two possible extensions

Stefan Groschupf
Hi.
Check the mail archive; some of these things were already discussed,
and I guess people already have some code/plans, but it is not yet
part of the sources.
In any case, such contributions are very welcome from my point of view.

Stefan


On 24.01.2006 at 11:08, Guenter, Matthias wrote:

> Hi
> Would it be of interest to the project to have an extension of the
> crawl tool that allows:
> - shaping the bandwidth used (inbound)
> - keeping the number of requests per second within a certain limit
> - scheduling these limits differently for working hours and night-time
>
> And an extension that crawls only file:/http: URLs which have
> changed after a given date.
> Something like: sh ./nutch crawl -changedafter="2006-01-04"?
>
> The code could be delivered end of April as part of a student project.
>
> Kind regards
>
> Matthias Günter
>
>