Infrastructure for large Lucene index

Infrastructure for large Lucene index

Slava Imeshev

I am dealing with a pretty challenging task, so I thought it would be
a good idea to ask the community before I re-invent any wheels of my own.

I have a Lucene index that is going to grow to 100GB soon. This index
is going to be read very aggressively (tens of millions of requests
per day) with some occasional updates (10 batches per day).

The idea is to split load between multiple server nodes running Lucene
on *nix while accessing the same index that is shared across the network.

I am wondering if it's a good idea and/or if there are any recommendations
regarding selecting/tweaking network configuration (software+hardware)
for an index of this size.

Thank you.

Slava Imeshev


RE: Infrastructure for large Lucene index

james-17
Hi Slava,

We currently do this across many machines for
http://www.FreePatentsOnline.com.  Our indexes are, in aggregate across our
various collections, even larger than you need.  We use a remote
ParallelMultiSearcher, with some custom modifications (and we are in the
process of making more) to allow more robust handling of many processes at
once and integration of the responses from various sub-indexes.  This works
fine on commodity hardware, and you will be I/O bound, so get multiple drives
in each machine.
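
A minimal sketch of this kind of setup, using the Lucene 2.x-era RMI
classes (RemoteSearchable and ParallelMultiSearcher were removed from
later Lucene versions; the paths, host names, and query here are
illustrative assumptions):

    // SearchNode.java -- each search node exports a local index over RMI.
    import java.rmi.Naming;
    import java.rmi.registry.LocateRegistry;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.RemoteSearchable;

    public class SearchNode {
        public static void main(String[] args) throws Exception {
            IndexSearcher local = new IndexSearcher("/indexes/shard0");
            LocateRegistry.createRegistry(1099);
            Naming.rebind("//localhost/shard0", new RemoteSearchable(local));
        }
    }

    // FrontEnd.java -- the front end looks up each shard and fans the
    // query out to all of them in parallel, merging the results.
    import java.rmi.Naming;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.*;

    public class FrontEnd {
        public static void main(String[] args) throws Exception {
            Searchable[] shards = {
                (Searchable) Naming.lookup("//node1/shard0"),
                (Searchable) Naming.lookup("//node2/shard1"),
            };
            ParallelMultiSearcher searcher = new ParallelMultiSearcher(shards);
            Hits hits = searcher.search(new TermQuery(new Term("text", "patent")));
            System.out.println("total hits: " + hits.length());
        }
    }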

Out of curiosity, what project are you working on?  That's a lot of hits!

Sincerely,
James Ryley, Ph.D.
www.FreePatentsOnline.com




RE: Infrastructure for large Lucene index

james-17
In reply to this post by Slava Imeshev
I may have misinterpreted your email in my initial response.  Are you saying
you want nodes (presumably for more CPUs) that all access the same shared
index (on Network Attached Storage, presumably)?

If so, I think you are going to have read and write performance issues
unless you are using some SERIOUS storage system.  If you aren't already
committed to the hardware configuration you seem to be describing, I would
go with commodity hardware and split the index across the machines -- data
locality is going to be important.

Sincerely,
James Ryley, Ph.D.
www.FreePatentsOnline.com




Re: Infrastructure for large Lucene index

Doug Cutting
In reply to this post by james-17
James wrote:
> We use a remote
> ParallelMultiSearcher, with some custom modifications (and we are in the
> process of making more) to allow more robust handling of many processes at
> once and integration of the responses from various sub-indexes.

Can you share these modifications with others?  If so, please attach
them to an issue in Jira.

Thanks,

Doug

RE: Infrastructure for large Lucene index

james-17
Hi Doug,

I want to say yes -- I think it would only be appropriate, given the great
tools you have given us for free.  Let me check with the powers that be
here, and then get the code into a more polished form.  We hope to have it
really enterprise-ready over the next couple of months.

Sincerely,
James



RE: Infrastructure for large Lucene index

Slava Imeshev
In reply to this post by james-17
James,

--- James <[hidden email]> wrote:
> We currently do this across many machines for
> http://www.FreePatentsOnline.com.  Our indexes are, in aggregate across our
> various collections, even larger than you need.  We use a remote
> ParallelMultiSearcher, with some custom modifications (and we are in the
> process of making more) to allow more robust handling of many processes at
> once and integration of the responses from various sub-indexes.

I am not sure ParallelMultiSearcher is going to help here, because we have
a large uniform index, not a set of collections.


Regards,

Slava Imeshev


Re: Infrastructure for large Lucene index

Yonik Seeley-2
In reply to this post by james-17
On 10/6/06, James <[hidden email]> wrote:
> Our indexes are, in aggregate across our
> various collections, even larger than you need.  We use a remote
> ParallelMultiSearcher, with some custom modifications (and we are in the
> process of making more)

I'm looking into adding some form of distributed search to Solr.
The main problem I see with directly using ParallelMultiSearcher is a
lack of high availability features.

If the index is broken into multiple "shards" then we need multiple
copies of each shard, and some way of load balancing and failing over
amongst copies of shards.
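
One minimal shape for that failover, sketched against the same RMI-era
Lucene API (illustrative only -- this is not Solr code, and the URL
names are assumptions):

    import java.rmi.Naming;
    import org.apache.lucene.search.Searchable;

    // Each logical shard knows the addresses of all of its copies; on a
    // failed lookup we fall through to the next copy, rotating the start
    // index for crude load balancing across replicas.
    public class ShardReplicas {
        private final String[] urls;  // e.g. {"//node1/shard0", "//node5/shard0"}
        private int next = 0;

        public ShardReplicas(String... urls) {
            this.urls = urls;
        }

        public synchronized Searchable lookup() throws Exception {
            Exception last = null;
            for (int i = 0; i < urls.length; i++) {
                String url = urls[(next + i) % urls.length];
                try {
                    Searchable s = (Searchable) Naming.lookup(url);
                    next = (next + i + 1) % urls.length;
                    return s;
                } catch (Exception e) {
                    last = e;  // this copy is down; try the next one
                }
            }
            throw last;
        }
    }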

-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search server

RE: Infrastructure for large Lucene index

james-17
In reply to this post by Slava Imeshev
We have both.  Although we have multiple collections, our largest
collections are still way too big for one machine.  You have to come up with
a scheme to randomly split documents across multiple servers (randomly, so
that word-frequency skew doesn't mean one index is getting pounded while
others sit inactive for certain searches).
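
A sketch of one such routing scheme -- hash a stable document key so
that every incremental batch spreads the same way (the shard count and
key choice here are assumptions):

    // Route each document to a shard by hashing a stable key such as its
    // URL or patent number.  As long as the key has nothing to do with
    // searchable terms, term frequencies spread roughly evenly.
    public class ShardRouter {
        private final int numShards;

        public ShardRouter(int numShards) {
            this.numShards = numShards;
        }

        public int shardFor(String docKey) {
            int h = docKey.hashCode();
            return (h & 0x7fffffff) % numShards;  // mask the sign bit
        }
    }

    // Usage: new ShardRouter(8).shardFor("US6789012") always yields the
    // same shard in [0, 8), so re-indexed documents land where they were.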

Sincerely,
James



RE: Infrastructure for large Lucene index

james-17
In reply to this post by Yonik Seeley-2
> If the index is broken into multiple "shards" then we need multiple
> copies of each shard, and some way of load balancing and failing over
> amongst copies of shards.

Yep.  Unfortunately it's not simple, but those are all pieces of what we are
currently in the process of implementing.

Sincerely,
James Ryley, Ph.D.
www.FreePatentsOnline.com



RE: Infrastructure for large Lucene index

Slava Imeshev
In reply to this post by james-17
--- James <[hidden email]> wrote:
> I may have misinterpreted your email in my initial response.  Are you saying
> you want nodes (presumably for more CPUs) that all access the same shared
> index (on Network Attached Storage, presumably)?

Yes, that's right.

> If so, I think you are going to have read and write performance issues
> unless you are using some SERIOUS storage system.  

Yes, that's what I am trying to figure out -- how serious it needs to be.

> If you aren't already
> committed to the hardware configuration you seem to be describing,

I am not.

> I would go with commodity hardware and split the index across the
> machines -- data locality is going to be important.

This is understood, but that is not going to work for searching for "cat dog"
when the "cat" is in one index and the "dog" in another.

Slava



RE: Infrastructure for large Lucene index

james-17
You don't separate the index by having "cat" in one and "dog" in another.
You separate it by document, so that both indexes contain cat and dog, but
each index is smaller, meaning that response time is greatly improved.



RE: Infrastructure for large Lucene index

Slava Imeshev
--- James <[hidden email]> wrote:
> You don't separate the index by having "cat" in one and "dog" in another.
> You separate it by document, so that both indexes contain cat and dog, but
> each index is smaller, meaning that response time is greatly improved.

I think I oversimplified the problem with this example.

Slava



RE: Infrastructure for large Lucene index

Slava Imeshev
In reply to this post by james-17
--- James <[hidden email]> wrote:
> We have both.  Although we have multiple collections, our largest
> collections are still way too big for one machine.  You have to come up with
> a scheme to randomly split documents across multiple servers (randomly so
> that word frequency issues hopefully don't mean one index is getting pounded
> while others are inactive for certain searches).

Yes, this is a valid concern.

Slava




RE: Infrastructure for large Lucene index

Slava Imeshev
In reply to this post by james-17
--- James <[hidden email]> wrote:
> > If the index is broken into multiple "shards" then we need multiple copies
> of each shard, and some way of loadbalancing and failing over amongst copies
> of shards.
>
> Yep.  Unfortunately it's not simple, but those are all pieces of what we are
> currently in the process of implementing.

The problem is that over time indexes develop a "personality", and term
frequencies can vary significantly from index to index....

Slava





RE: Infrastructure for large Lucene index

james-17
Agreed.  For example, with patents we have to be concerned about
technology-related terms that are more prominent in certain time periods.  I
think a good random assignment scheme addresses most such problems, but
worst case you can always redo the indexes entirely if they get too
non-random.

Sincerely,
James



Re: Infrastructure for large Lucene index

Yonik Seeley-2
In reply to this post by Slava Imeshev
On 10/6/06, Slava Imeshev <[hidden email]> wrote:

> The problem is that over time indexes develop a "personality", and term
> frequencies can vary significantly from index to index....

A global idf calculation is possible though... MultiSearcher already
does this when searching across multiple indices.  The downside of
doing it across remote indices is an increase in the number of RPC
calls.  In general, it's probably better to try to keep index shards
balanced.
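
For illustration, the heart of that global calculation is just summing
per-shard document frequencies before computing idf (a sketch in the
spirit of MultiSearcher, not its exact code):

    import java.io.IOException;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.Searchable;

    public class GlobalDf {
        // One docFreq() call per shard -- an extra RPC per query term when
        // the Searchables are remote, which is the overhead noted above.
        public static int globalDocFreq(Searchable[] shards, Term t)
                throws IOException {
            int df = 0;
            for (Searchable shard : shards) {
                df += shard.docFreq(t);
            }
            return df;
        }
    }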


-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search server

Re: Infrastructure for large Lucene index

Doug Cutting
In reply to this post by james-17
James wrote:
> Let me check with the powers that be
> here, and then get the code into a more polished form.  We hope to have it
> really enterprise-ready over the next couple of months.

Great!  Once you have permission, please post it sooner rather than
later, then others can help with polishing, or at least be informed by
your methods.  What I'd hate to happen is for you to get permission but
never have time to polish it, and hence never contribute it.  An
unpolished patch is better than no patch at all.

Thanks!

Doug

Re: Infrastructure for large Lucene index

kkrugler
In reply to this post by Slava Imeshev

A few quick comments to this, including some of the subsequent thread
discussion.

1. Unless you have a lot of RAM, sufficient to effectively keep the
entire index in memory, you're better off maximizing the number of
spindles. Using one big file server is, IMHO, a Really Bad Idea for
this type of application. You'll pay top dollar for something with
the reliability and performance that you think you need, and then
you'll still wind up being I/O bound.

I'd say the best configuration is a dual-CPU/dual-core box with 4
fast drives and a boat-load of RAM - say 8GB for starters. You run
four JVMs per box, one index per JVM, where each index is on a
separate drive. Assume the file system will do a reasonable job of
caching data for you, so don't bother trying to use RAMDirectory or
MMapDirectory.
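
A sketch of that layout -- one JVM per drive, each opening its index
straight off the local filesystem (the path argument and the RMI export
are illustrative):

    import org.apache.lucene.search.IndexSearcher;

    // Launched once per drive, e.g.:  java DriveSearcher /drive3/index
    // A plain filesystem-backed searcher lets the OS page cache hold the
    // hot parts of the index, per the advice above.
    public class DriveSearcher {
        public static void main(String[] args) throws Exception {
            IndexSearcher searcher = new IndexSearcher(args[0]);
            // ... export over RMI as in the RemoteSearchable sketch above ...
        }
    }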

2. It's easy to get hung up on document frequency skews. As James and
others have noted, in general things seem to work OK by just
randomizing which document goes to what index - e.g. do it by hash of
the document URL/name, and make sure that every new batch of
documents (if you're doing incremental updates) gets spread this same
way. As long as your hash function has nothing to do with searchable
terms that you care about, you should be OK.

3. If you're worried about high availability, then one fairly simple
approach is to have two parallel sets of search clusters, with a load
balancer in front. For each cluster, monitor both the front-end
server (where the results get combined) and each of the back-end
search servers - for example, something like Big Brother or Ganglia.
Then if one of the search servers (or, god forbid, the front end
server) goes down, you can automatically remove that cluster from the
load balancer's active set.

-- Ken
--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"Find Code, Find Answers"
Reply | Threaded
Open this post in threaded view
|

Re: Infrastructure for large Lucene index

Chris Hostetter-3

: 3. If you're worried about high availability, then one fairly simple
: approach is to have two parallel sets of search clusters, with a load
: balancer in front. For each cluster, monitor both the front-end
: server (where the results get combined) and each of the back-end
: search servers - for example, something like Big Brother or Ganglia.
: Then if one of the search servers (or, god forbid, the front end
: server) goes down, you can automatically remove that cluster from the
: load balancer's active set.

the availability of this approach doesn't scale very cleanly though ... if
any one box in either cluster goes down, the entire cluster becomes
unusable.  Doubling the size of your collection would only double the
number of boxes you need -- but the reliability would be cut in half,
meaning you'd really need to quadruple the number of boxes (doubling the
number of clusters) to maintain the same level of reliability ... if I'm
not mistaken, the number of boxes would need to grow quadratically as your
index size grows linearly.

A system where every individual node in the cluster is load balanced
across 2 physical boxes would require the same amount of hardware to start
with, but would require a lot less hardware to grow.
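
A worked version of that arithmetic (the per-box availability figure is
an assumed number, purely for illustration):

    // If each box is up with probability p, a cluster of n boxes is fully
    // up with probability p^n, and k independent cluster copies give
    // overall availability 1 - (1 - p^n)^k.
    public class Availability {
        public static void main(String[] args) {
            double p = 0.99;  // assumed per-box availability
            for (int n : new int[] {10, 20}) {
                double cluster = Math.pow(p, n);
                double twoCopies = 1 - Math.pow(1 - cluster, 2);
                System.out.printf("n=%2d  one cluster=%.3f  two clusters=%.4f%n",
                                  n, cluster, twoCopies);
            }
            // n=10 gives ~0.904 per cluster; n=20 gives ~0.818 -- doubling
            // the cluster roughly doubles its chance of being down, so
            // holding availability constant with whole-cluster copies
            // forces the quadratic growth described above.
        }
    }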



-Hoss


Re: Infrastructure for large Lucene index

Doug Cutting
Chris Hostetter wrote:

> : 3. If you're worried about high availability, then one fairly simple
> : approach is to have two parallel sets of search clusters, with a load
> : balancer in front. For each cluster, monitor both the front-end
> : server (where the results get combined) and each of the back-end
> : search servers - for example, something like Big Brother or Ganglia.
> : Then if one of the search servers (or, god forbid, the front end
> : server) goes down, you can automatically remove that cluster from the
> : load balancer's active set.
>
> the availability of this approach doesn't scale very cleanly though ... if
> any one box in either cluster goes down, the entire cluster becomes
> unusable.

A cost-effective variation works as follows: if you have 10 indexes and
11 nodes, then you keep one node as a spare.  When any of the 10 active
nodes fails, the spare takes over its duties.  While the spare node is
launching you search only 9 out of the 10 indexes, so failover is not
entirely seamless, but it's a lot cheaper than mirroring all nodes.

Doug