Merging different crawls into a single index?

McCallie,David


Hello,

First, let me thank all the developers who have created Nutch -- it is
wonderful and elegant code.

Second, a simple question:

I am using "bin/nutch crawl" to crawl and index two separate sites: one
is an http site, and the second is a network file system. These two
crawls have completely different URL seed files, and different
crawl-urlfilter.txt files.  When the two crawls are done, I'd like to
merge the indexes into a single index for the webapp to search.  How
should I do this?  I tried using "bin/nutch merge" to simply merge the
index directories into a third directory.  This created a valid Lucene
Index (verified with Luke) but it won't work with the search.jsp in the
webapp. I assume that I need to merge the crawldb and linkdb as well,
but I can't see how to do this.

Thanks in advance,

--david





CONFIDENTIALITY NOTICE

This message and any included attachments
are from Cerner Corporation and are intended
only for the addressee. The information
contained in this message is confidential and
may constitute inside or non-public information
under international, federal, or state
securities laws. Unauthorized forwarding,
printing, copying, distribution, or use of such
information is strictly prohibited and may be
unlawful. If you are not the addressee, please
promptly delete this message and notify the
sender of the delivery error by e-mail or you
may call Cerner's corporate offices in Kansas
City, Missouri, U.S.A at (+1) (816)221-1024.
---------------------------------------- --
Re: Merging different crawls into a single index?

Stefan Groschupf-2
David,
you don't need to merge the crawldb and linkdb. It is true that you need
to provide a linkdb, but it is only used for some detail information. I
personally remove this feature from the JSPs.
Merging the indexes will work; it is just a question of where you store
the index, how you name the folder, and that you provide at least a
dummy linkdb.
I'm not sure what the name of the merged index folder should be. I guess
"index", but you can look at the NutchBean init methods to verify this.
HTH
Stefan
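
The layout described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not a confirmed recipe: the folder name "index" is the guess from the reply, the per-crawl "indexes" subfolder and the argument order of "bin/nutch merge" (output first, then inputs) are assumptions, and the helper names plan_merged_search_dir and prepare are hypothetical. Whether an empty directory is enough as a dummy linkdb is not confirmed by this thread.

```python
from pathlib import Path


def plan_merged_search_dir(search_dir, crawl_dirs, nutch="bin/nutch"):
    """Return (merge_command, linkdb_path) for one merged search directory.

    Assumptions, none confirmed by the thread:
    - each crawl directory keeps its own indexes under "indexes";
    - the merged index folder is called "index";
    - "bin/nutch merge" takes the output index first, then the inputs.
    """
    search_dir = Path(search_dir)
    merged_index = search_dir / "index"
    linkdb = search_dir / "linkdb"
    inputs = [str(Path(crawl) / "indexes") for crawl in crawl_dirs]
    command = [nutch, "merge", str(merged_index), *inputs]
    return command, linkdb


def prepare(search_dir, crawl_dirs):
    """Create the placeholder linkdb folder and return the merge command."""
    command, linkdb = plan_merged_search_dir(search_dir, crawl_dirs)
    # Placeholder only: this stands in for the "dummy linkdb" from the reply.
    linkdb.mkdir(parents=True, exist_ok=True)
    return command
```

Calling prepare("search", ["crawl-http", "crawl-files"]) returns the argument list for the merge step without running it, so the command can be inspected before it is passed to something like subprocess.run. The webapp would then have to be pointed at the "search" folder; how that is configured should be checked against the NutchBean init methods, as suggested above.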

On 05.02.2006 at 04:54, McCallie,David wrote:

>
>
> Hello,
>
> First, let me thank all the developers who have created Nutch -- it is
> wonderful and elegant code.
>
> Second, a simple question:
>
> I am using "bin/nutch crawl" to crawl and index two separate sites: one
> is an http site, and the second is a network file system. These two
> crawls have completely different URL seed files, and different
> crawl-urlfilter.txt files.  When the two crawls are done, I'd like to
> merge the indexes into a single index for the webapp to search.  How
> should I do this?  I tried using "bin/nutch merge" to simply merge the
> index directories into a third directory.  This created a valid Lucene
> Index (verified with Luke) but it won't work with the search.jsp in the
> webapp.   I assume that I need to merge the crawldb and linkdb as well,
> but I can't see how to do this?
>
> Thanks in advance,
>
> --david

---------------------------------------------------------------
company:        http://www.media-style.com
forum:        http://www.text-mining.org
blog:            http://www.find23.net

