crawler feed?

rubdabadub
Hi:

Are there any relatively stand-alone crawlers that are
suitable/customizable for Solr? Has anyone done any trials? I have
seen some discussion about the Cocoon crawler. Was that successful?

Regards

Re: crawler feed?

Thorsten Scherler-3
On Wed, 2007-02-07 at 11:09 +0100, rubdabadub wrote:
> Hi:
>
> Are there any relatively stand-alone crawlers that are
> suitable/customizable for Solr? Has anyone done any trials? I have
> seen some discussion about the Cocoon crawler. Was that successful?

http://wiki.apache.org/solr/SolrForrest

I am using this approach in a custom project that is Cocoon-based, and it is
working very well. However, Cocoon's crawler is not standalone; it uses
the Cocoon CLI. I am using the solr/forrest plugin for the commit and for
dispatching the update. The indexing transformation in the plugin is a
wee bit different than the one in my project, since I needed to extract
more information from the documents to create better filters.

However, since the Cocoon CLI is not in 2.2 (cocoon-trunk) anymore and
Forrest uses it as its main component, I am keen to write a simple
crawler that could be reused for Cocoon, Forrest, Solr, Nutch, ...

I may start something pretty soon (I guess I will open a project in
Apache Labs) and will keep this list informed. My idea is to write a
simple crawler which could be easily extended by plugins. So if a
project/app needs special processing for a crawled URL, one could write a
plugin to implement the functionality. A Solr plugin for this crawler
would be very simple: basically, it would parse e.g. the HTML page and
dispatch an update command for the extracted fields. I think one
should try to reuse as much code from Nutch as possible for this parsing.
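
Concretely, such a dispatch could boil down to something like the
following minimal sketch, which POSTs an <add> document and a <commit/>
to Solr's XML update handler; the update URL and field names here are
assumptions, not taken from the plugin:

    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class SolrUpdateSketch {
        // POST one XML message to the update handler (URL is an assumption).
        static void post(String xml) throws Exception {
            URL url = new URL("http://localhost:8983/solr/update");
            HttpURLConnection con = (HttpURLConnection) url.openConnection();
            con.setRequestMethod("POST");
            con.setDoOutput(true);
            con.setRequestProperty("Content-Type", "text/xml; charset=UTF-8");
            try (OutputStream out = con.getOutputStream()) {
                out.write(xml.getBytes("UTF-8"));
            }
            if (con.getResponseCode() != 200) {
                throw new IllegalStateException("update failed: "
                        + con.getResponseCode());
            }
        }

        public static void main(String[] args) throws Exception {
            // Hypothetical fields extracted from a crawled page.
            post("<add><doc>"
                    + "<field name=\"id\">http://example.org/page.html</field>"
                    + "<field name=\"title\">Example page</field>"
                    + "</doc></add>");
            post("<commit/>");
        }
    }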

If somebody is interested in such a standalone crawler project, I
welcome any help, ideas, suggestions, feedback and/or questions.

salu2
--
Thorsten Scherler                       thorsten.at.apache.org
Open Source Java & XML      consulting, training and solutions


Re: crawler feed?

rubdabadub
Thorsten:

Thank you very much for the update.

On 2/7/07, Thorsten Scherler <[hidden email]> wrote:

> I may start something pretty soon (I guess I will open a project in
> Apache Labs) and will keep this list informed. My idea is to write a
> simple crawler which could be easily extended by plugins. So if a
> project/app needs special processing for a crawled URL, one could write a
> plugin to implement the functionality. A Solr plugin for this crawler
> would be very simple: basically, it would parse e.g. the HTML page and
> dispatch an update command for the extracted fields. I think one
> should try to reuse as much code from Nutch as possible for this parsing.

I have seen some discussion regarding the Nutch crawler. I think a standalone
crawler would be more desirable; as you pointed out, one could extend such a
crawler via plugins. It seems difficult to "rip out" the Nutch crawler as a
standalone crawler, no? Because you would want as much of the "same code
base" as possible, no? I also think such a crawler is interesting in the
vertical search engine space. So Nutch 0.7 could be a good target, no?

Regards

Re: crawler feed?

Sami Siren-2
In reply to this post by rubdabadub
rubdabadub wrote:
> Hi:
>
> Are there any relatively stand-alone crawlers that are
> suitable/customizable for Solr? Has anyone done any trials? I have
> seen some discussion about the Cocoon crawler. Was that successful?

There's also an integration path available for Nutch [1] that I plan to
integrate after 0.9.0 is out.

--
 Sami Siren

[1]http://blog.foofactory.fi/2007/02/online-indexing-integrating-nutch-with.html

Re: crawler feed?

rubdabadub
This is really interesting. You mean to say I could give the patch a
try now, i.e. the patch in the blog post? :-)

I am looking forward to it. I hope it will be standalone, i.e. you
won't need "the whole Nutch" to get a standalone crawler working. I
am not sure if this is how you planned it.

Regards


Re: crawler feed?

rubdabadub
Hi:

Just want to say that my tiny experiment with Sami's Solr/Nutch
integration worked! :-) Super thanks for the pointer. Which leads me
to write the following:

It would be great if I could use this in my current project. This way
I could eliminate my current Python-based aggregator/crawler, which is
used to submit docs to Solr. That solution works, but the crawler is
not as robust as I wanted it to be. As far as I understand, SOLR-20
seems to be good to go for trunk, no?

So I am lobbying for SOLR-20 :-)
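
For illustration, submitting docs through a Java client of this kind
could look roughly like the sketch below; the class and method names are
assumptions about such a client's shape, not necessarily what the
SOLR-20 patch ships:

    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class ClientSketch {
        public static void main(String[] args) throws Exception {
            // Assumed Solr base URL.
            CommonsHttpSolrServer server =
                    new CommonsHttpSolrServer("http://localhost:8983/solr");
            // Hypothetical document produced by the aggregator/crawler.
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "http://example.org/feed-item");
            doc.addField("title", "Example feed item");
            server.add(doc);
            server.commit();
        }
    }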

Cheers



Re: crawler feed?

thorsten
In reply to this post by Sami Siren-2
On Wed, 2007-02-07 at 18:03 +0200, Sami Siren wrote:
> rubdabadub wrote:
> > Hi:
> >
> > Are there any relatively stand-alone crawlers that are
> > suitable/customizable for Solr? Has anyone done any trials? I have
> > seen some discussion about the Cocoon crawler. Was that successful?
>
> There's also an integration path available for Nutch [1] that I plan to
> integrate after 0.9.0 is out.

Sounds very nice, I just finished reading it. Thanks.

Today I submitted a proposal for an Apache Labs project called Apache
Droids.

http://mail-archives.apache.org/mod_mbox/labs-labs/200702.mbox/browser

The basic idea is to create a flexible crawler framework. The core should be
a simple crawler which could be easily extended by plugins. So if a
project/app needs special processing for a crawled URL, one could write a
plugin to implement the functionality.

salu2

> [1]http://blog.foofactory.fi/2007/02/online-indexing-integrating-nutch-with.html
--
Thorsten Scherler                                 thorsten.at.apache.org
Open Source Java & XML                consulting, training and solutions


Re: crawler feed?

rubdabadub
Thorsten:

First of all, I read your lab idea with great interest, as I am in need
of such a crawler. However, there are certain things that I would like to
discuss. I am not sure which forum will be appropriate for this, but I
will do my idea shooting here first; please tell me where I should
post further comments.

A vertical search engine that focuses on a specific set of data, i.e.
one that uses Solr, for example, because it provides maximum field
flexibility, would greatly benefit from such a crawler. I.e. the next big
Technorati or the next big event-finding solution could use your crawler
to crawl feeds using a feed plugin (maybe Nutch plugins) or scrape websites
for event info using some XPath/XQuery stuff (personally I think XPath is
a pain in the a... :-)
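
To make the scraping idea concrete, here is a hedged sketch using the
JDK's built-in XPath support; the page URL and the markup the expression
expects are invented for illustration:

    import javax.xml.parsers.DocumentBuilderFactory;
    import javax.xml.xpath.XPath;
    import javax.xml.xpath.XPathConstants;
    import javax.xml.xpath.XPathFactory;
    import org.w3c.dom.Document;
    import org.w3c.dom.NodeList;

    public class EventScrapeSketch {
        public static void main(String[] args) throws Exception {
            // Assumes the crawled page is well-formed XHTML; real-world
            // HTML would first need a tidying parser.
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse("http://example.org/events.html");
            XPath xpath = XPathFactory.newInstance().newXPath();
            // Invented markup: each event title sits in <div class="event"><h2>.
            NodeList titles = (NodeList) xpath.evaluate(
                    "//div[@class='event']/h2", doc, XPathConstants.NODESET);
            for (int i = 0; i < titles.getLength(); i++) {
                System.out.println(titles.item(i).getTextContent());
            }
        }
    }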

What I worry about are the issues that have to do with:

- updating crawls
- how many threads per host
- scale etc.

All the maintainer's headaches! I know you will use as much code as
you can from Nutch and are not planning to re-invent the wheel. But
wouldn't it be much easier to jump into Sami's idea, make it better
and more stand-alone, and still benefit from the Nutch community? I
wonder, wouldn't it be easier to pursue a route where the Nutch crawler
becomes a standalone crawler? I read a post about it on the list.

I would like to hear more about how your plan will evolve in terms of
Droids, and why not join forces with Sami and co.?

Regards


Re: crawler feed?

Otis Gospodnetic-2
In reply to this post by rubdabadub
Alright, this is good! (R2D2s) You may want to post this to nutch-dev. I was kind of asking for and about this the other day on nutch-(user?)...

Otis


[Droids] Re: crawler feed?

thorsten
In reply to this post by rubdabadub
On Thu, 2007-02-08 at 14:40 +0100, rubdabadub wrote:
> Thorsten:
>
> First of all, I read your lab idea with great interest, as I am in need
> of such a crawler. However, there are certain things that I would like to
> discuss. I am not sure which forum will be appropriate for this, but I
> will do my idea shooting here first; please tell me where I should
> post further comments.

Since it is not an official lab project yet, I am unsure myself, but I
think we should discuss details on [hidden email]. Please reply on
the labs mailing list.

>
> A vertical search engine that focuses on a specific set of data, i.e.
> one that uses Solr, for example, because it provides maximum field
> flexibility, would greatly benefit from such a crawler. I.e. the next big
> Technorati or the next big event-finding solution could use your crawler
> to crawl feeds using a feed plugin (maybe Nutch plugins) or scrape websites
> for event info using some XPath/XQuery stuff (personally I think XPath is
> a pain in the a... :-)

These, like you pointed out, are surely some use cases for the crawler in
combination with plugins.

Another is a wget-like crawl that an application can use to export a
static site (e.g. from a CMS).
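
A minimal sketch of what such an export could do per fetched URL,
mirroring it to disk wget-style; the path-mapping convention here is
made up:

    import java.io.InputStream;
    import java.net.URL;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.StandardCopyOption;

    public class ExportSketch {
        // Mirror one fetched URL to a local file, wget-style.
        static void export(String url, Path root) throws Exception {
            URL u = new URL(url);
            // Made-up convention: host/path, with index.html for "/".
            String path = u.getPath().isEmpty() || u.getPath().endsWith("/")
                    ? u.getPath() + "index.html"
                    : u.getPath();
            if (path.startsWith("/")) {
                path = path.substring(1);
            }
            Path target = root.resolve(u.getHost()).resolve(path);
            Files.createDirectories(target.getParent());
            try (InputStream in = u.openStream()) {
                Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
            }
        }
    }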

>
> What I worry about are the issues that have to do with:
>
> - updating crawls

Actually, if you look at the crawl alone, there is no difference between
an update crawl and any other crawl.

> - how many threads per host

That should be configurable.

> - scale etc.

You mean a crawl cluster?

>
> All the maintainer's headaches!

That is why Droids is a labs proposal.

http://labs.apache.org/bylaws.html

All Apache committers have write access, and when a lab is promoted, the
files are moved over to the incubation area.

> I know you will use as much code as
> you can from Nutch and are not planning to re-invent the wheel. But
> wouldn't it be much easier to jump into Sami's idea, make it better
> and more stand-alone, and still benefit from the Nutch community?

I will start a thread on nutch-dev and see whether or not it is possible
to extract the crawler from the core, but the main idea is to keep
Droids simple.

Imagine something like the following pseudocode:

    public void crawl(String url) {
        // resolve the stream
        InputStream stream = new URL(url).openStream();
        // look up the plugin that is registered for this kind of stream
        Plugin plugin = lookupPlugin(stream);
        // extract the links (link pattern matcher)
        Link[] links = plugin.extractLinks(stream);
        // match pattern plugins for storing/excluding links
        links = plugin.handleLinks(links);
        // pass the stream to the plugin for further processing
        // (a real implementation would have to buffer the stream,
        // since extractLinks has already consumed it)
        plugin.main(stream);
    }
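
The pseudocode implies roughly the following plugin contract. A hedged
sketch, with every name provisional (nothing here is an existing API):

    import java.io.InputStream;

    // Provisional contract implied by the crawl() sketch above,
    // not an existing Droids API.
    public interface Plugin {
        // Extract the outgoing links from the fetched content.
        Link[] extractLinks(InputStream stream) throws Exception;

        // Filter links: keep the ones to follow, drop the excluded.
        Link[] handleLinks(Link[] links);

        // Plugin-specific processing, e.g. dispatching a Solr update.
        void main(InputStream stream) throws Exception;
    }

    // Minimal link holder used by the sketch.
    class Link {
        private final String url;
        Link(String url) { this.url = url; }
        String getUrl() { return url; }
    }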


> I
> wonder, wouldn't it be easier to pursue a route where the Nutch crawler
> becomes a standalone crawler? I read a post about it on the list.
>

Can you provide some links to get some background information? TIA.

> I would like to hear more about how your plan will evolve in terms of
> Droids, and why not join forces with Sami and co.?

I am more familiar with Solr than with Nutch, I have to admit.

Like I said, all committers have write access on Droids, and everybody is
welcome to join the effort. Who knows, maybe the first droid will be a
standalone Nutch crawler with plugin extension points, if some Nutch
committers join the lab.
 
Thanks, rubdabadub, for your feedback.

salu2

--
Thorsten Scherler                                 thorsten.at.apache.org
Open Source Java & XML                consulting, training and solutions