Multiple indexes on a single server instance.

TJ Roberts
I have five different indexes each with their own special configuration.  I would like to be able to switch between the different indexes dynamically on a single instance of nutch running on jakarta-tomcat.  Is this possible, or do I have to run five instances of nutch, one for each index?

               
Re: Multiple indexes on a single server instance.

Stefan Groschupf-2
I'm not sure what you are planning to do, but you can simply switch a
symbolic link on your disk, driven by a cron job, to change the index
at a given time.
You may need to touch web.xml to restart the searcher.
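
A minimal sketch of the symlink switch described above, written in Python. The function name `switch_index` and any paths passed to it are hypothetical. Whether touching web.xml actually makes Tomcat reload the searcher depends on the deployment, so treat that step as an assumption.

```python
import os
from pathlib import Path
from typing import Optional


def switch_index(link: Path, target: Path, touch: Optional[Path] = None) -> None:
    """Repoint the symlink `link` at `target`, then optionally touch a file."""
    tmp = link.with_name(link.name + ".tmp")
    if tmp.is_symlink() or tmp.exists():
        tmp.unlink()              # clear a leftover from an interrupted run
    os.symlink(target, tmp)       # build the new link beside the old one
    os.replace(tmp, link)         # rename it over the old link in one step
    if touch is not None:
        os.utime(touch, None)     # bump the mtime, like the `touch` command
```

A cron job could call `switch_index` with the directory the searcher reads as `link` and the index to activate as `target`. Renaming a prepared link over the old one means readers never see a missing link.
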
If you want to search different kinds of indexes at the same time, I
suggest merging the indexes and giving each one a key field.
For example, add a field named "indexName" to each of your indexes and
put A, B or C into it as the value.
Then you can merge your indexes. At runtime you just need a query
filter that appends indexName:A or indexName:B to the query string.
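
The query-string side of this idea can be sketched as follows. This is only an illustration in Python: it assumes Lucene's classic query parser syntax (`+`, `-`, `OR`), which is not necessarily what Nutch's own query parser accepts, and the helper name `restrict_to_indexes` is hypothetical. It does no escaping of special characters.

```python
def restrict_to_indexes(query, include, exclude=(), field="indexName"):
    """Append an index restriction to a Lucene-style query string."""
    if not include:
        raise ValueError("at least one index must be included")
    # The user's query and the index restriction are both required clauses.
    wanted = " OR ".join(f"{field}:{name}" for name in include)
    parts = [f"+({query})", f"+({wanted})"]
    # Excluded indexes become prohibited clauses.
    parts += [f"-{field}:{name}" for name in exclude]
    return " ".join(parts)
```

For example, `restrict_to_indexes("apache nutch", ["A", "B"], exclude=["C"])` returns `+(apache nutch) +(indexName:A OR indexName:B) -indexName:C`.
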

Does this somehow help to solve your problem?
Stefan

On 23.05.2006 at 15:26, TJ Roberts wrote:

> I have five different indexes each with their own special  
> configuration.  I would like to be able to switch between the  
> different indexes dynamically on a single instance of nutch running  
> on jakarta-tomcat.  Is this possible, or do I have to run five  
> instances of nutch, one for each index?

Re: Multiple indexes on a single server instance.

sudhendra seshachala
I am experiencing a similar problem. Here is what I have done.
I have a different parse plugin for each site (I have 3 sites to crawl and fetch data from), but I capture the data in a common format that I call "datarepository".
I have one index plugin that indexes the data repository and one query plugin that queries it, so I don't have to run multiple instances. I run just one instance of the search engine.
However, the parse configuration is different for each site, so I run a separate crawler for each site. Then I index and merge all of them. So far the results are good, if not "WOW".
I still have to figure out a way of ranking the pages. For example, I would like to be able to apply ranking based on the data repository. Let me know if I was clear...
   
  Thanks
  Sudhi Seshachala
  http://sudhilogs.blogspot.com/
   



Re: Multiple indexes on a single server instance.

Stefan Neufeind
In reply to this post by Stefan Groschupf-2
I ran into a similar question myself a while ago. Imagine
company A, company B and company C, each wanting "their own"
search engine. At the same time there might be a need for a "special"
search engine that covers content from companies A and B but not C.
I think that's where your suggestion with the indexName field comes
into play, right?
a) How would you "extend" your indexes by one field before merging them?
Is there a small tool to add a field to an index?
b) Do you always have to merge the indexes, or could you use some
feature of "distributed" Nutch to search multiple indexes? I ask
because that would let you use multiple, possibly huge indexes that
could each be updated separately without having to merge them again.

Another point I took from the original question:
how would it be possible to have an OpenSearch interface for multiple
indexes running on one single Tomcat instance? I think the author asked
whether you could install separate copies side by side, with
different searcher.dir settings in their nutch-site.xml.
With your suggestion, I understand that a plugin similar to "query-more"
could be written to allow searching on "indexName" (as you
suggested) as well, right? With this, would it also be possible to ask
for "indexName = A or B, but not C"?

  Stefan


Re: Multiple indexes on a single server instance.

Stefan Neufeind
In reply to this post by sudhendra seshachala
sudhendra seshachala wrote:
>   I still have to figure a way of ranking the page. For example I would like to be able to apply ranking on the data repository. Let me know If I was clear...

Hi,

Not sure if I got your last point right, but this just came to my
mind:
It would be nice to be able to have something like
"If it's from indexA, give it 100 extra points; if it's from indexB,
give it 50 extra points", or "if it's from indexA, give it 20% extra
weight".
But I don't believe this is easily doable. Or is it?

I have a similar problem with languages: I want to give priority to
documents in German and English, but still list documents in other
languages somewhere after those results. So I'd need to be able to give
"extra points" on a per-language basis, based on the indexed
language field, right?


Regards,
 Stefan


Re: Multiple indexes on a single server instance.

sudhendra seshachala

Yes, you nailed it. I am not sure if it is doable; I am still trying to figure that out.
My problem is that I capture the same or similar data from all sites, so I should be able to apply those extra points.
   

Re: Multiple indexes on a single server instance.

Andrzej Białecki-2
In reply to this post by Stefan Neufeind
Stefan Neufeind wrote:

> Hi,
>
> not sure if I got you right with your last point, but it just came to my
> mind:
> It would be nice to be able to have something like
> "If it's from indexA, give it 100 extra-points - if from indexB give it
> 50 extra-points". Or some "if indexA give it 20% extra-weight" or so.
> But I don't believe this is easily doable. Or is it?
>
> I got a similar problem with languages: give priority to documents in
> German and English. But somewhere after those results also list
> documents in other languages. So I'd need to be able to give
> "extra-points" on a "per-language"-basis, based on the indexed
> language-field, right?
>  


This is not only doable, but fairly easy - just add these fields to the
index through a custom IndexingFilter plugin, and then implement a
corresponding QueryPlugin that will expand your query appropriately -
this "prioritization" that you describe is equivalent to adding a
non-required and non-prohibited clause to a Lucene query. Please see how
it's done in the existing index-more/query-more and
index-basic/query-basic plugins.
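
To illustrate why a non-required, non-prohibited clause acts as a priority boost, here is a toy model in Python. It is not Lucene's scoring formula and not Nutch plugin code: the `Clause` class, the additive scores and the sample documents are all assumptions made for the sketch.

```python
from dataclasses import dataclass


@dataclass
class Clause:
    field: str
    value: str
    required: bool = False    # a document must match this clause
    prohibited: bool = False  # a document must not match this clause
    boost: float = 1.0        # weight added when the clause matches


def score(doc, clauses):
    """Return the score of `doc`, or None if the clauses filter it out."""
    total = 0.0
    for c in clauses:
        hit = c.value in doc.get(c.field, ())
        if (c.prohibited and hit) or (c.required and not hit):
            return None
        if hit:
            total += c.boost  # optional clauses only ever add to the score
    return total


def rank(docs, clauses):
    """Return the ids of matching documents, best score first."""
    scored = [(score(d, clauses), name) for name, d in docs.items()]
    kept = [(s, name) for s, name in scored if s is not None]
    return [name for s, name in sorted(kept, key=lambda p: -p[0])]


docs = {
    "d1": {"content": ["nutch", "search"], "lang": ["de"]},
    "d2": {"content": ["nutch"], "lang": ["fr"]},
    "d3": {"content": ["lucene"], "lang": ["en"]},
}
clauses = [
    Clause("content", "nutch", required=True),
    Clause("lang", "de", boost=2.0),  # optional: prefer German
    Clause("lang", "en", boost=2.0),  # optional: prefer English
]
```

With these inputs `rank(docs, clauses)` puts the German document `d1` ahead of the French document `d2`, which is still listed, while `d3` is dropped because it fails the required clause.
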

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: Multiple indexes on a single server instance.

sudhendra seshachala
Yes, that is what I am trying, but for some reason it is not working.
Do these field names have to be lowercase only?
   
 


Re: Multiple indexes on a single server instance.

Andrzej Białecki-2
sudhendra seshachala wrote:
> Yes. That is wha I am trying. But for some reason it is not working..
>   Does these fields should be lower case only. ?
>  


Preferably. If you use the default NutchDocumentAnalyzer it will
lowercase field names, so a field name containing uppercase letters
won't get any match.


Re: Multiple indexes on a single server instance.

ravi chintakunta
See my thread

http://www.mail-archive.com/nutch-user@.../msg03014.html

where I have modified NutchBean to dynamically pick up the indexes that
have to be searched. The web page has a checkbox for each index, and
thus these indexes can be searched in any combination.
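
The general idea of searching a user-selected subset of indexes and merging the hits can be sketched like this. It is not the NutchBean modification from the linked thread, whose code is not shown here. The in-memory index layout, the precomputed scores and the name `search_selected` are assumptions for illustration only.

```python
import heapq


def search_selected(indexes, selected, term, limit=10):
    """Search only the chosen indexes and merge their hits by score.

    `indexes` maps an index id to a list of (url, terms, score) tuples,
    a stand-in for real on-disk indexes.
    """
    hits = []
    for index_id in selected:
        if index_id not in indexes:
            raise KeyError(f"unknown index: {index_id}")
        for url, terms, score in indexes[index_id]:
            if term in terms:
                hits.append((score, index_id, url))
    # Best-scoring hits first, regardless of which index they came from.
    return heapq.nlargest(limit, hits)
```

Each ticked checkbox would contribute one id to `selected`. In a real setup the scores from separately built indexes may not be directly comparable, which this sketch ignores.
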

- Ravi Chintakunta



On 5/31/06, Andrzej Bialecki <[hidden email]> wrote:

> sudhendra seshachala wrote:
> > Yes. That is wha I am trying. But for some reason it is not working..
> >   Does these fields should be lower case only. ?
> >
>
>
> Preferably. If you use the default NutchDocumentAnalyzer it will
> lowercase field names, so you won't get any match.

Re: Multiple indexes on a single server instance.

Stefan Neufeind
Sounds like others might have use for that as well. Can you
provide a clean patch set, maybe? How about a "multi-index" plugin that
parses an XML file to find the paths of the allowed indexes, like

<indexes>
  <index>
    <id>index1</id>
    <path>/data/something/index1</path>
  </index>
  <index>
    <id>index2</id>
    <path>/data/somethingelse</path>
  </index>
</indexes>
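
Reading a file of this shape takes only a few lines. The sketch below uses Python's standard library and assumes exactly the element names shown above; the function name `load_index_paths` is hypothetical, and nothing like it exists in Nutch as far as this thread shows.

```python
import xml.etree.ElementTree as ET


def load_index_paths(xml_text):
    """Map each <id> to its <path>, rejecting duplicates and gaps."""
    paths = {}
    for node in ET.fromstring(xml_text).findall("index"):
        index_id = (node.findtext("id") or "").strip()
        path = (node.findtext("path") or "").strip()
        if not index_id or not path:
            raise ValueError("each <index> needs an <id> and a <path>")
        if index_id in paths:
            raise ValueError(f"duplicate index id: {index_id}")
        paths[index_id] = path
    return paths
```

Applied to the example above, it returns a dictionary mapping `index1` to `/data/something/index1` and `index2` to `/data/somethingelse`.
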


Regards,
 Stefan


Re: Multiple indexes on a single server instance.

Witney, Ernest
I would love to see something like that. We are currently trying to
figure out how to convert our htdig solution over to Nutch. We use many
different indexes and then merge them into searchable groups. We need
to group by region.

see http://plantfacts.osu.edu/web

We started with phantom, then went to htdig, and now we wish to use
Nutch. We have Nutch searching well; I just don't have any background
in Java to do my own tweaks in the code.

-Bud

On Jun 6, 2006, at 4:53 PM, Stefan Neufeind wrote:

> Sounds like others might have use for that as well possibly. Can you
> provide a clean patchset, maybe? How about an "multi-index"-plugin
> which parses a xml-file to find the paths to allow indexes [...]