Droids crawler

Andrzej Białecki-2
Hi all,

In light of the discussion about the future of Nutch, I'd like to draw
your attention to Droids - a small crawler framework that uses Spring
for extensibility.

http://people.apache.org/~thorsten/droids/

Are there any lessons there that we could learn?

--
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: Droids crawler

Dennis Kubes-2
Interesting.  Worth a deeper look, I think.  One of the keys to a
new version of Nutch would be crawler extensibility.

Dennis

Andrzej Bialecki wrote:

> Hi all,
>
> In light of the discussion about the future of Nutch, I'd like to draw
> your attention to Droids - a small crawler framework that uses Spring
> for extensibility.
>
> http://people.apache.org/~thorsten/droids/
>
> Are there any lessons there that we could learn?
>

Re: Droids crawler

Rafael Turk
Droids looks great!

 It seems to me that it can even be considered (once stable) as a possible replacement for Nutch's own crawler...

[]s
Rafael

On Fri, Sep 12, 2008 at 12:38 PM, Dennis Kubes <[hidden email]> wrote:
Interesting.  Worth a deeper look I think.  I think one of the keys to a new version of nutch would be crawler extensibility.

Dennis


Andrzej Bialecki wrote:
Hi all,

In light of the discussion about the future of Nutch, I'd like to draw your attention to Droids - a small crawler framework that uses Spring for extensibility.

http://people.apache.org/~thorsten/droids/

Are there any lessons there that we could learn?



good crawler - droids

Rakesh Singh-2
http://people.apache.org/~thorsten/droids/

--- On Tue, 9/16/08, Rafael Turk <[hidden email]> wrote:
From: Rafael Turk <[hidden email]>
Subject: Re: Droids crawler
To: [hidden email]
Date: Tuesday, September 16, 2008, 5:36 PM



Re: Droids crawler

Doğacan Güney-3
In reply to this post by Dennis Kubes-2
On Fri, Sep 12, 2008 at 5:38 PM, Dennis Kubes <[hidden email]> wrote:
> Interesting.  Worth a deeper look I think.  I think one of the keys to a new
> version of nutch would be crawler extensibility.
>

I agree. So let's start a discussion then. What is missing from Nutch's crawler?
What does Droids do that we don't?

> Dennis
>
> Andrzej Bialecki wrote:
>>
>> Hi all,
>>
>> In light of the discussion about the future of Nutch, I'd like to draw your
>> attention to Droids - a small crawler framework that uses Spring for
>> extensibility.
>>
>> http://people.apache.org/~thorsten/droids/
>>
>> Are there any lessons there that we could learn?
>>
>



--
Doğacan Güney

Re: Droids crawler

thorsten
In reply to this post by Rafael Turk
On Tue, 2008-09-16 at 22:36 -0200, Rafael Turk wrote:
> Droids looks great!

:) cheers.

>
>  It seems to me that it can even be considered (once stable) as a
> possible replacement for Nutch's own crawler...
>

Droids is not designed for one particular use case; it is a framework: take
what you need, do what you want.

I am currently working on https://issues.apache.org/jira/browse/LABS-144.
Once that is done, the packaging structure will look like this:
- org.apache.droids.core.jar -> no superfluous dependencies
- org.apache.droids.dynamics.jar -> current CLI, Spring and
  cocoon-configurator
- org.apache.droids.droid.helloCrawler.jar -> droids-specific code
- org.apache.droids.droid.indexerCrawler.jar -> droids-specific code
...
- org.apache.droids.plugin.protocol.http.jar -> plugin-specific code
- org.apache.droids.plugin.protocol.parser.tika.jar -> plugin-specific
  code
- org.apache.droids.plugin.handler.save.jar -> plugin-specific code

The core of Droids is very small, providing only the API and some
abstract implementations to ease the creation of a specific robot. The
core has only logging dependencies.

The full range of configuration options is provided by the dynamics
package. Spring lets you inject beans into your classes without the need
to pass a config object all over the place (one thing that I do not like
about Nutch at the moment, since it reduces the reusability of the
different classes). The cocoon-configurator provides dynamic registry support:
http://cocoon.apache.org/subprojects/configuration/1.0/spring-configurator/1.0/1400_1_1.html
"...Especially with the spring-configurator, beans can be added
dynamically just by dropping a jar into the class-path..."
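
A minimal plain-Java sketch of the difference between handing a config object
to every class and injecting only what a class needs. All names here
(CrawlConfig, ConfigBoundFetcher, InjectedFetcher, the "fetcher.timeout" key)
are hypothetical and are not taken from the Droids or Nutch code base; no
Spring is involved, the wiring is done by hand the way a container would do it.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical names throughout; this is not Droids or Nutch code.

// Style 1: a shared config object that every class has to know about.
final class CrawlConfig {
    private final Map<String, String> values = new HashMap<>();

    void set(String key, String value) { values.put(key, value); }

    int getInt(String key, int fallback) {
        String raw = values.get(key);
        return raw == null ? fallback : Integer.parseInt(raw);
    }
}

final class ConfigBoundFetcher {
    private final int timeoutMillis;

    // Reusing this class elsewhere means bringing CrawlConfig and its key names along.
    ConfigBoundFetcher(CrawlConfig config) {
        this.timeoutMillis = config.getInt("fetcher.timeout", 10_000);
    }

    int timeoutMillis() { return timeoutMillis; }
}

// Style 2: the class only exposes what it needs; whoever assembles the
// application (Spring, or plain code as below) supplies the value.
final class InjectedFetcher {
    private int timeoutMillis = 10_000;

    void setTimeoutMillis(int timeoutMillis) { this.timeoutMillis = timeoutMillis; }

    int timeoutMillis() { return timeoutMillis; }
}

final class InjectionDemo {
    public static void main(String[] args) {
        CrawlConfig config = new CrawlConfig();
        config.set("fetcher.timeout", "5000");
        ConfigBoundFetcher bound = new ConfigBoundFetcher(config);

        InjectedFetcher injected = new InjectedFetcher();
        injected.setTimeoutMillis(5000); // roughly what a bean definition would do

        System.out.println(bound.timeoutMillis() + " " + injected.timeoutMillis());
    }
}
```

The second class has no dependency on any configuration type, which is the
reusability argument made above.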

This is also how plugins register with the core. So far Droids
offers the following plugins:

      * Queue: the data structure where the different tasks wait for
        service. There is interest in a queue implementation for
        Droids that is built on top of Hadoop.
      * Protocol: the protocol interface is a wrapper that hides the
        underlying implementation of the communication at protocol
        level. Oleg announced that he wants to work on an enhanced HTTP
        protocol implementation for Droids.
      * Parser -> Apache Tika: the parser component is just a wrapper
        for Tika, since it offers everything we need. No need to
        duplicate the effort. The parser component parses different
        input types to SAX events.
      * Handler: a component that uses the original stream and/or the
        parse (ContentHandler coming from Tika) and the URL to invoke
        arbitrary business logic on the objects. Unlike the other
        components, several different handlers can be applied to the
        same stream/parse. There is a plugin to index with Solr, one
        to save the stream directly, ...
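
How the four component types listed above might fit together can be sketched
in a few lines of plain Java. The interfaces and method signatures below are
assumptions made for illustration and may differ from the actual Droids API;
in particular the parser here returns a String, whereas the Tika-based one is
described as producing SAX events. An in-memory map stands in for the network
so the sketch runs as is.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Queue;

// Hypothetical interfaces for illustration; the real Droids API may differ.

interface Protocol {   // hides how a URL is actually fetched
    byte[] load(String url);
}

interface Parser {     // stands in for the Tika wrapper
    String parse(byte[] raw);
}

interface Handler {    // arbitrary business logic per fetched document
    void handle(String url, byte[] raw, String parsed);
}

final class SketchCrawler {
    private final Queue<String> queue = new ArrayDeque<>(); // tasks waiting for service
    private final Protocol protocol;
    private final Parser parser;
    private final List<Handler> handlers;                   // several handlers per document

    SketchCrawler(Protocol protocol, Parser parser, List<Handler> handlers) {
        this.protocol = protocol;
        this.parser = parser;
        this.handlers = handlers;
    }

    void enqueue(String url) { queue.add(url); }

    // Drains the queue and returns the number of URLs processed.
    int run() {
        int processed = 0;
        String url;
        while ((url = queue.poll()) != null) {
            byte[] raw = protocol.load(url);
            String parsed = parser.parse(raw);
            for (Handler handler : handlers) {
                handler.handle(url, raw, parsed);
            }
            processed++;
        }
        return processed;
    }
}

final class SketchCrawlerDemo {
    public static void main(String[] args) {
        // In-memory "web" so the sketch runs without network access.
        Map<String, String> pages = Map.of("mem://a", "hello", "mem://b", "droids");
        Protocol protocol = url -> pages.get(url).getBytes(StandardCharsets.UTF_8);
        Parser parser = raw -> new String(raw, StandardCharsets.UTF_8).toUpperCase(Locale.ROOT);

        List<String> saved = new ArrayList<>();
        List<String> indexed = new ArrayList<>();
        Handler save = (url, raw, parsed) -> saved.add(url + ":" + raw.length);
        Handler index = (url, raw, parsed) -> indexed.add(parsed);

        SketchCrawler crawler = new SketchCrawler(protocol, parser, List.of(save, index));
        crawler.enqueue("mem://a");
        crawler.enqueue("mem://b");
        System.out.println(crawler.run() + " " + saved + " " + indexed);
    }
}
```

Swapping the queue for a distributed one, or the protocol for a different HTTP
client, would only touch the corresponding component in a design like this.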

One can build a simple Nutch crawler very quickly by using the existing
plugins and developing Nutch-specific ones. This would definitely
benefit both projects, since it is all about reusing existing code in the
most efficient way. Let the Tika people worry about enhancing the parser
support, Mahout/Hadoop about the queue, ... you get the idea, no?

Configuration is very flexible because you can use either POJOs or Spring
to configure your crawler. Oleg Kalnichevski provided a SimpleRuntime
that is independent of Spring, so one is free to create either a very
specific robot or a flexibly configurable one. Droids is really glad to
have him on the team and I invite everybody here to join too. BTW, all
Apache committers have write access to the code base.

BTW, Droids has recently submitted a proposal for incubation and is very
open to any feedback, help and cooperation.
http://markmail.org/message/km23arivpk4t4kdt

salu2

--
Thorsten Scherler                                 thorsten.at.apache.org
Open Source Java                      consulting, training and solutions


Re: Droids crawler

Otis Gospodnetic-2-2
Hi,

Just found this email in my Nutch folder... and as I was reading it I was thinking "Got to ask Dennis if he/they will do the Nutch-Droids integration" when I saw Dennis' name below.  So, Dennis, is Droids on the roadmap for you?

Thanks,
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



From: Thorsten Scherler <[hidden email]>
To: [hidden email]
Sent: Friday, September 26, 2008 7:40:23 PM
Subject: Re: Droids crawler



Re: Droids crawler

Dennis Kubes-2
Ah, so many things to do, so little time.  I am currently in Saudi Arabia
talking about Nutch and Hadoop.  I have been meaning to take a look at
Droids, but I haven't gotten to it yet.  My schedule is clearing up though, so I
should have a better answer for you in the next week or so.  I
definitely want to move towards more flexibility in Nutch crawling and I
think Droids may be the answer to that.  I just don't know yet.

Dennis

Otis Gospodnetic wrote:

> Hi,
>
> Just found this email in my Nutch folder... and as I was reading it I was
> thinking "Got to ask Dennis if he/they will do the Nutch-Droids
> integration" when I saw Dennis' name below.  So, Dennis, is Droids on
> the roadmap for you?
>
> Thanks,
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>