Using Nutch to crawl and use it as input to Solr

Kumar Krishnasami
Hi All,

I am trying to decide if I could use Nutch for a project I am working on
with the following requirements:

1. I need to build the ability to search a bunch of URLs.
2. These URLs are given to me, and there is no need to crawl links from
or to them.
3. From time to time, new URLs will be added to the original set.
I need to update the index as soon as a new URL is added.
4. There is no need to rank these URLs based on outside links, etc.

Based on these requirements, it seems that most of Nutch's capabilities
(crawling, Hadoop, etc.) would be overkill for this project.
There is no need for a linkdb, etc.

Because of this, I am thinking that I could use Solr with some other
component to feed it the appropriate data. If I use Solr, I would
need a mechanism to fetch those URLs and convert them to the format in
which Solr expects data to be sent. Can I use Nutch for this by using
just the Fetcher and building something that would convert the HTML into
the appropriate XML format for Solr? Is there something else that anyone
here is aware of that I could use?
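
To make the "fetch and convert" step concrete, here is a minimal sketch of what such a component might look like using only the Python standard library. It is not how Nutch or Droids do it. The field names (id, title, text) and the update URL are assumptions that depend on the Solr schema and installation, and the helper names are made up for illustration. The XML produced follows Solr's <add><doc><field name="..."> update message format.

```python
import urllib.request
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the <title> and the visible text, skipping script/style."""

    def __init__(self):
        super().__init__()
        self.title_parts = []
        self.text_parts = []
        self._in_title = False
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._skip_depth:
            return
        if self._in_title:
            self.title_parts.append(data)
        elif data.strip():
            self.text_parts.append(data.strip())


def html_to_fields(url, html):
    """Turn one fetched page into a dict of (assumed) Solr field values."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return {
        "id": url,  # assumption: the URL doubles as the unique key
        "title": "".join(parser.title_parts).strip(),
        "text": " ".join(parser.text_parts),
    }


def build_add_xml(docs):
    """Build a Solr <add> update message; ElementTree handles escaping."""
    add = ET.Element("add")
    for fields in docs:
        doc = ET.SubElement(add, "doc")
        for name, value in fields.items():
            ET.SubElement(doc, "field", name=name).text = value
    return ET.tostring(add, encoding="unicode")


def post_to_solr(update_url, xml_body):
    """POST an XML update message to Solr's update handler."""
    req = urllib.request.Request(
        update_url,
        data=xml_body.encode("utf-8"),
        headers={"Content-Type": "text/xml; charset=utf-8"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status


def index_urls(urls, update_url):
    """Fetch each URL, send the documents to Solr, then commit."""
    docs = []
    for url in urls:
        with urllib.request.urlopen(url) as resp:
            html = resp.read().decode("utf-8", errors="replace")
        docs.append(html_to_fields(url, html))
    post_to_solr(update_url, build_add_xml(docs))
    post_to_solr(update_url, "<commit/>")
```

When a new URL arrives, calling index_urls(["http://example.com/new-page"], "http://localhost:8983/solr/update") would add just that page and commit. Both URLs here are placeholders. Because no links are followed, nothing like a linkdb is involved.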

I am just starting out with Nutch and Solr and any help would be greatly
appreciated.

Thanks,
Kumar.
Re: Using Nutch to crawl and use it as input to Solr

Otis Gospodnetic
Use Droids to crawl.  It already has hooks to index crawled content with Solr, e.g.
http://search-lucene.com/c?id=Droids:/droids-solr/src/main/java/org/apache/droids/solr/SolrHandler.java||solr


Otis
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Hadoop ecosystem search :: http://search-hadoop.com/



----- Original Message ----

> From: Kumar Krishnasami <[hidden email]>
> To: [hidden email]
> Sent: Sat, January 23, 2010 2:27:58 AM
> Subject: Using Nutch to crawl and use it as input to Solr