crawling / data aggregation - is nutch the right tool?

crawling / data aggregation - is nutch the right tool?

no spam-11
I'm trying to crawl several sites, aggregate the data, and provide Lucene-based
search.  I want the Lucene index to contain only a small subset of the data,
i.e., just the contents of a few tags.  I see that Nutch provides the crawling
infrastructure and scales really nicely.  I just don't have great insight into
how I can tie into the part that extracts text from HTML.

Apache Droids seems to be built for a task like this, but I wonder whether I'd
end up spending a lot of time writing the infrastructure to handle the main
task of crawling.

Thanks,
Mark

Re: crawling / data aggregation - is nutch the right tool?

Subhojit Roy
Hi,

I have used Nutch for quite a while now and have encountered requirements
similar to the ones you mention, i.e., extracting specific text from HTML to
be indexed. I have not looked at Droids yet.

We developed code that extracts specific text from HTML based on its "id", or
specific text from meta-tag contents, and integrated it with Nutch. Example:
we needed to extract specific text from certain div tags only. This works
reasonably well now, though the code took us some time to develop and
integrate.
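As a standalone illustration of the idea only (this is not the Nutch-integrated
code described above; it assumes the jsoup library and a hypothetical div id),
selecting a div by id or class can look like this:

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;

    public class DivExtract {
        public static void main(String[] args) {
            String html = "<html><body>"
                    + "<div id='description'>Blue widget, 3kg</div>"
                    + "<div class='nav'>ignore me</div>"
                    + "</body></html>";
            Document doc = Jsoup.parse(html);
            // select by id; doc.select("div.entry").text() would select by class
            String text = doc.select("div#description").text();
            System.out.println(text);  // prints: Blue widget, 3kg
        }
    }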

Because Nutch is a complete open-source crawler that includes advanced scoring
algorithms such as OPIC (the scoring-opic plugin), moving away from Nutch was
never an option for us.

Thanks,
-sroy

--
Subhojit Roy
Profound Technologies
(Search Solutions based on Open Source)
email: [hidden email]
http://www.profound.in

Re: crawling / data aggregation - is nutch the right tool?

Otis Gospodnetic-2-2
In reply to this post by no spam-11
Droids is much simpler if all you want to do is a little bit of crawling.
Nutch is built to scale to many millions of web pages.
If you need to crawl just a few sites, I'd suggest Droids.

Otis
--
Sematext is hiring -- http://sematext.com/about/jobs.html?mls
Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR

Re: crawling / data aggregation - is nutch the right tool?

Subhojit Roy
Hi,

Is it possible to selectively choose content from crawled pages using Droids?
Does it have a good HTML parser built in? Can one specify an "id" or a "class"
and ensure that the content within that tag gets included or excluded?

Thanks,
-sroy


Re: crawling / data aggregation - is nutch the right tool?

no spam-11
In reply to this post by Otis Gospodnetic-2-2
For now I only need to crawl hundreds of pages; previously I wrote things from
scratch in Perl.  I want something that lets me get started quickly and allows
for scale in the future.  I like that Droids is a framework where I only have
to do minimal work to get started.  Apache Tika is its framework for parsing,
and it looks right for the job.  That parsing layer is the part I have a hard
time evaluating in Nutch.  Some of what I have read on the mailing list
suggests it's still not all that easy to do selective extraction with Nutch.
Am I wrong?

Mark

Re: crawling / data aggregation - is nutch the right tool?

Subhojit Roy
Apache Tika is integrated with Nutch. All you need to do is specify the
formats (among those supported by Tika and Nutch) that you would like to
index, in the configuration file nutch-site.xml under plugin.includes
(e.g., parse-pdf). I have used that to extract text from PDF, doc files,
etc. It works quite easily.
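For example, a minimal sketch of the relevant nutch-site.xml property (the
exact plugin list varies by setup; start from the value in nutch-default.xml
and add parse-pdf):

    <property>
      <name>plugin.includes</name>
      <!-- illustrative, trimmed list; parse-pdf enables PDF text extraction -->
      <value>protocol-http|urlfilter-regex|parse-(text|html|pdf)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic</value>
    </property>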

The hard part with Nutch is extracting _selective_ portions of a crawled HTML
page. Example: you would like only the portion of the HTML page inside the div
with id="description" to be included in the index, and the rest of the HTML to
be ignored. That's where it gets difficult with Nutch.
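The usual extension point for this is an HtmlParseFilter plugin. A minimal
sketch (the filter interface is as in Nutch 1.0; the target div id and the DOM
walk are illustrative, and rewriting the ParseResult text is left out):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.nutch.parse.HTMLMetaTags;
    import org.apache.nutch.parse.HtmlParseFilter;
    import org.apache.nutch.parse.ParseResult;
    import org.apache.nutch.protocol.Content;
    import org.w3c.dom.DocumentFragment;
    import org.w3c.dom.Element;
    import org.w3c.dom.Node;
    import org.w3c.dom.NodeList;

    public class DivParseFilter implements HtmlParseFilter {
        private Configuration conf;

        public ParseResult filter(Content content, ParseResult parseResult,
                                  HTMLMetaTags metaTags, DocumentFragment doc) {
            StringBuilder kept = new StringBuilder();
            collect(doc, "description", kept);  // hypothetical target div id
            // a real plugin would replace the parse text with kept.toString()
            return parseResult;
        }

        // depth-first walk that collects text from matching divs
        private void collect(Node node, String id, StringBuilder out) {
            if (node instanceof Element
                    && "div".equalsIgnoreCase(node.getNodeName())
                    && id.equals(((Element) node).getAttribute("id"))) {
                out.append(node.getTextContent()).append(' ');
                return;
            }
            NodeList kids = node.getChildNodes();
            for (int i = 0; i < kids.getLength(); i++) {
                collect(kids.item(i), id, out);
            }
        }

        public void setConf(Configuration conf) { this.conf = conf; }
        public Configuration getConf() { return conf; }
    }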

I have code (for Nutch 1.0), already integrated with Nutch, that extracts a
div tag with a specified id. If that is something you would like to use, I can
send it to you.

Thanks,
-sroy


Re: crawling / data aggregation - is nutch the right tool?

no spam-11
This is exactly what I want to do: extract a selective portion. I'd love to
see that code example and how it's wired up.

Thanks,
Mark

Re: crawling / data aggregation - is nutch the right tool?

no spam-11
In reply to this post by Subhojit Roy
This was a great write-up by Andrzej Bialecki about the future of Nutch; for
small crawls he summed it up as follows:

- Nutch is too complex and too heavy for those who need to crawl up to a few
thousand pages. Now that the Droids project exists, it's probably not worth
the effort to attempt a complete re-design of Nutch so that it fits the needs
of this group. Nutch is based on map-reduce, and it's not likely we would want
to change that, so there will always be significant overhead for small jobs.
I'm not saying we should not make Nutch easier to use, but for small crawls
Nutch is overkill.

Re: crawling / data aggregation - is nutch the right tool?

Subhojit Roy
In reply to this post by no spam-11
Hi,

This functionality is developed as a plugin (pretty crude as of now, but we are planning to make it more configurable soon). The name of the plugin is "productdiv".

Note that at this point the plugin hardcodes both the URL of the site being crawled/indexed and the "class" of the div tags, so you will need to manually change the URL and class names before recompiling it. Also note that this plugin _includes_ selected divs in the "content" field of the index; excluding divs with class="xxx" is not part of this plugin.
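For context, a Nutch plugin like this also ships with a plugin.xml descriptor.
A hedged sketch (this assumes the filter hooks HTML parsing via the
HtmlParseFilter extension point; the ids and names mirror the description
above and may differ from the actual zip):

    <plugin id="productdiv" name="Product Div Parse Filter"
            version="1.0.0" provider-name="profound.in">
      <runtime>
        <library name="productdiv.jar">
          <export name="*"/>
        </library>
      </runtime>
      <extension id="org.apache.nutch.parse.productdiv"
                 name="ProductDivParser"
                 point="org.apache.nutch.parse.HtmlParseFilter">
        <implementation id="RecommendedParser"
                        class="org.apache.nutch.parse.productdiv.RecommendedParser"/>
      </extension>
    </plugin>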

Here are the instructions:

1) Unzip and extract the attached productdiv.zip file in the nutch-1.0/src/plugins directory.

2) Add the plugin to the build in nutch-1.0/src/plugins/build.xml.
    Example: <ant dir="productdiv" target="deploy"/> and <ant dir="productdiv" target="clean"/>

3) In the plugin file org/apache/nutch/parse/productdiv/RecommendedParser.java, do the following:

a) Hardcode the names of the div classes whose contents you would like to include in the content field of the index (e.g., "entry" and "message readable" are the div class names in this sample code).

b) Also hardcode the top-level URL that is being crawled/indexed. Example: http://www.xyz.com/forums and http://www.xyz.com/blogs.

Lines 63 to 68:

    String divID = "";
    String str_arr_div[] = {"message readable", "entry"};

    if (address.toString().startsWith("http://www.xyz.com/blog/"))
        divID = "entry";             // name of the div class that must be included
    if (address.toString().startsWith("http://www.xyz.com/forums/"))
        divID = "message readable";  // name of the div class that must be included

4) Comment out the following lines in the Nutch source file nutch-1.0/src/plugins/index-basic/src/java/...../BasicIndexingFilter.java:

line 74:

    doc.add("content", parse.getText());

lines 116-117:

    LuceneWriter.addFieldOptions("content", LuceneWriter.STORE.NO,
        LuceneWriter.INDEX.TOKENIZED, conf);
 
5) Recompile Nutch 1.0 using ant as usual.

6) If the plugin has compiled successfully, you will see the corresponding jar file in the $HOME/build/plugins directory.

7) Enable the plugin by adding its name to the plugin.includes property in the $HOME/conf/nutch-site.xml file (a sketch follows this list).

8) Recrawl the URLs http://www.xyz.com/forums and http://www.xyz.com/blog. You should see content from all pages, but only from the divs whose class names are "entry" and "message readable".
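For step 7, the nutch-site.xml change amounts to something like this sketch
(keep the rest of the value in line with your existing plugin list):

    <property>
      <name>plugin.includes</name>
      <value>protocol-http|urlfilter-regex|parse-(text|html)|productdiv|index-basic|query-(basic|site|url)|summary-basic|scoring-opic</value>
    </property>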

Let me know if you have any questions.

Thanks,
-sroy



Attachment: productdiv.zip (15K)