maximum index size


maximum index size

Kevin Osborn-2
I know there are a bunch of variables here (RAM, number of fields, hits, etc.), but I am trying to get a sense of how big an index, in terms of number of documents, Solr can reasonably handle. I have heard of indexes of 3-4 million documents running fine. But I have no idea what a reasonable upper limit might be.

I have a large number of documents and about 200-300 customers would have access to varying subsets of those documents. So, one possible strategy is to have everything in a large index, but duplicate the documents for each customer that has access to that document. But that would really make the total number of documents huge. So, I am trying to get a sense of how big is too big. Each document will probably have about 30 fields. Most of them will be strings, but there will be some text, ints, and floats.

An extension to this strategy is to segment the customers among various instances of Solr.
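
To make the scale concrete (illustrative numbers only, not from any real deployment): if each of N documents is visible to an average of k customers, the duplication strategy produces k × N documents. With, say, 5 million source documents each shared with 20 customers, the combined index would hold 100 million documents, a 20x blowup over storing each document once.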


Re: maximum index size

Mike Klaas
On 3/27/07, Kevin Osborn <[hidden email]> wrote:
> I know there are a bunch of variables here (RAM, number of fields, hits, etc.), but I am trying to get a sense of how big an index, in terms of number of documents, Solr can reasonably handle. I have heard of indexes of 3-4 million documents running fine. But I have no idea what a reasonable upper limit might be.

People have constructed (lucene) indices with over a billion
documents.  But if "reasonable" means something like "<1s query time
for a medium-complexity query on non-astronomical hardware", I
wouldn't go much higher than the figure you quote.

> I have a large number of documents and about 200-300 customers would have access to varying subsets of those documents. So, one possible strategy is to have everything in a large index, but duplicate the documents for each customer that has access to that document. But that would really make the total number of documents huge. So, I am trying to get a sense of how big is too big. Each document will probably have about 30 fields. Most of them will be strings, but there will be some text, ints, and floats.

If you are going to store a document for each customer then some field
must indicate to which customer the document instance belongs.  In
that case, why not index a single copy of each document, with a field
containing a list of customers having access?

-Mike
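
A minimal sketch of what that single-copy approach could look like (field names and values are invented for illustration; this assumes a multi-valued string field in schema.xml and the standard fq filter-query parameter):

    <!-- schema.xml: a multi-valued field naming the customers with access -->
    <field name="customer" type="string" indexed="true" stored="false"
           multiValued="true"/>

    <!-- one copy of each document, posted to /solr/update -->
    <add>
      <doc>
        <field name="id">doc-1234</field>
        <field name="title">Quarterly report</field>
        <field name="customer">acme</field>
        <field name="customer">globex</field>
      </doc>
    </add>

Each customer's searches would then carry a filter such as q=report&fq=customer:acme, so the access filter is cached separately from the main query.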

Re: maximum index size

Kevin Osborn-2
In reply to this post by Kevin Osborn-2


----- Original Message ----
From: Mike Klaas <[hidden email]>
To: [hidden email]
Sent: Tuesday, March 27, 2007 3:20:40 PM
Subject: Re: maximum index size

> If you are going to store a document for each customer then some field
> must indicate to which customer the document instance belongs.  In
> that case, why not index a single copy of each document, with a field
> containing a list of customers having access?

Unfortunately, each customer will also potentially customize the way they do their searches. If it was just an ACL, that is probably what I would do.

Re: maximum index size

Mike Klaas
On 3/27/07, Kevin Osborn <[hidden email]> wrote:
>
> > If you are going to store a document for each customer then some field
> > must indicate to which customer the document instance belongs.  In
> > that case, why not index a single copy of each document, with a field
> > containing a list of customers having access?
>
> Unfortunately, each customer will also potentially customize the way they do their searches. If it was just an ACL, that is probably what I would do.

If there is per-document, per-client data which is non-trivial and
cannot be efficiently expressed using some technique like dynamic
fields (sounds like it), then they are effectively different
documents.

Sounds like you have lots of small fields.  The performance of
combining these documents all into one index will depend greatly on
how you sort, field overlap, etc., perhaps more so than the number of
docs.  I don't have much lucene-sort fu, though, so an expert
should chime in...

-Mike
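
For reference, the dynamic-field technique mentioned above might look something like this in schema.xml (field and customer names are hypothetical, and the type name depends on your schema): one wildcard declaration lets each customer hang its own small fields off the shared copy of a document without schema changes:

    <!-- schema.xml: one declaration covers price_acme_f, price_globex_f, ... -->
    <dynamicField name="*_f" type="float" indexed="true" stored="true"/>

    <doc>
      <field name="id">doc-1234</field>
      <field name="price_acme_f">9.99</field>
      <field name="price_globex_f">12.50</field>
    </doc>

As Mike notes, this breaks down once the per-customer data is too rich to express as a handful of extra fields.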

Re: maximum index size

Venkatesh Seetharam
In reply to this post by Kevin Osborn-2
I have 50 million documents, each about 10K in size, split into 4 index
partitions of 12.5 million documents each. Each index partition is about
80GB. A search typically takes about 3-5 seconds. Single-word searches are
faster than multi-word searches. I'm still working on finding the ideal
index size that Solr can handle well within a second.

Thanks,
Venkatesh
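
One plausible shape for the merge step such a broker performs is a k-way merge of the per-partition hit lists by score. A minimal sketch in Java; the Hit class and the list-of-lists input are invented for illustration and are not Solr or Lucene APIs:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.PriorityQueue;

    public class BrokerMerge {
        // Stand-in for one hit returned by a partition.
        static class Hit {
            final String id;
            final float score;
            Hit(String id, float score) { this.id = id; this.score = score; }
        }

        /** Merge per-partition result lists (each already sorted by
         *  descending score) into a single top-n list. */
        static List<Hit> mergeTopN(List<List<Hit>> partitions, int n) {
            // Heap holds {partition, offset} pairs, highest score on top.
            PriorityQueue<int[]> heap = new PriorityQueue<>((a, b) ->
                Float.compare(partitions.get(b[0]).get(b[1]).score,
                              partitions.get(a[0]).get(a[1]).score));
            for (int p = 0; p < partitions.size(); p++)
                if (!partitions.get(p).isEmpty()) heap.add(new int[] {p, 0});

            List<Hit> merged = new ArrayList<>();
            while (!heap.isEmpty() && merged.size() < n) {
                int[] top = heap.poll();
                List<Hit> hits = partitions.get(top[0]);
                merged.add(hits.get(top[1]));
                if (top[1] + 1 < hits.size())  // advance within that partition
                    heap.add(new int[] {top[0], top[1] + 1});
            }
            return merged;
        }
    }

A real broker also has to fetch stored fields for the merged ids and handle duplicates, but the score merge is the core of it.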


RE: maximum index size

Andre Basse
> I have 50 million documents, each about 10K in size, split into 4 index
> partitions of 12.5 million documents each. Each index partition is about
> 80GB. A search typically takes about 3-5 seconds. Single-word searches
> are faster than multi-word searches. I'm still working on finding the
> ideal index size that Solr can handle well within a second.

Hi Venkatesh,

I'm looking at a similar size of archive. What hardware are you running?
Do you use collection distribution?


Thanks,

Andre



Re: maximum index size

Venkatesh Seetharam
Hi Andre,

Comments are inline.

> What hardware are you running?
Four dual-proc, 64 GB blades, one for each searcher, plus a broker that
merges results; 64-bit SUSE Linux running JDK 1.6 with an 8GB heap.

> Do you use collection distribution?
Nope. I use hadoop to index the documents.

Thanks,
Venkatesh


Re: maximum index size

Otis Gospodnetic-2
In reply to this post by Kevin Osborn-2
Hi Mike,

I'm curious about what you said there: "People have constructed (lucene)
indices with over a billion documents." Are you referring to somebody
specific? I've never heard of anyone creating a single Lucene index that
large, but I'd love to know who did that.

Thanks,
Otis

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Simpy -- http://www.simpy.com/  -  Tag  -  Search  -  Share


Re: maximum index size

Venkatesh Seetharam
In reply to this post by Mike Klaas
Hi Mike,

I'd be interested to know what the ideal size is for an index to achieve a
1-second response time for queries. I'd appreciate it if you could share any numbers.

Thanks,
Venkatesh


Re: maximum index size

Mike Klaas
In reply to this post by Otis Gospodnetic-2
Hi Otis,

I'm afraid I wasn't thinking of anyone specific--just something I
recall reading on the lucene list.  I assumed that the "document" was
a very small piece of data.

Of course, it is also possible that the message I recall reading was
something like http://java2.5341.com/msg/91276.html, which doesn't
exactly boast completion of such a feat!

-Mike


Index Files

Michael Beccaria
Simple curious question from a newbie:

Can I have another computer index my data, then copy the index folder
files into my live system and run a commit?

The project idea I have is for a library catalog which will update
holdings information (whether a book is checked out) for an item record
(also a Solr/Lucene record). My collection is small enough that the
entire library collection can be indexed in about 20 minutes. If I have
another computer continually indexing, then copy those files to my live
system and commit, will that successfully update the index?

Mike

--------------------
Mike Beccaria
Systems Librarian
Head of Digital Initiatives
Paul Smith's College
518.327.6376
[hidden email]

Re: Index Files

Yonik Seeley-2
On 3/29/07, Michael Beccaria <[hidden email]> wrote:

> Can I have another computer index my data, then copy the index folder
> files into my live system and run a commit?

Yes, this is essentially what Solr's distribution scripts do in an
automated way.

-Yonik
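
For anyone who wants the manual version, a rough sketch of the copy-then-commit sequence (paths and URL are hypothetical; Solr's own snapshooter/snappuller distribution scripts do this more safely using rsync and hard-linked snapshots):

    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.nio.file.*;

    public class PushIndex {
        public static void main(String[] args) throws Exception {
            Path src = Paths.get("/indexer/solr/data/index");  // built offline
            Path dst = Paths.get("/live/solr/data/index");     // live index

            // Copy the freshly built segment files into the live index dir.
            try (DirectoryStream<Path> files = Files.newDirectoryStream(src)) {
                for (Path f : files)
                    Files.copy(f, dst.resolve(f.getFileName()),
                               StandardCopyOption.REPLACE_EXISTING);
            }

            // Ask the live Solr to commit, i.e. open a searcher on the
            // new files, by posting <commit/> to the update handler.
            URL url = new URL("http://localhost:8983/solr/update");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setDoOutput(true);
            conn.setRequestProperty("Content-Type", "text/xml");
            try (OutputStream out = conn.getOutputStream()) {
                out.write("<commit/>".getBytes("UTF-8"));
            }
            System.out.println("commit returned HTTP " + conn.getResponseCode());
        }
    }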

Re: maximum index size

Chris Hostetter-3
In reply to this post by Venkatesh Seetharam

: I'd be interested to know what the ideal size is for an index to achieve a
: 1-second response time for queries. I'd appreciate it if you could share any numbers.

that's a fairly impossible question to answer ... the lucene email
archives have lots of discussion about how the number of documents isn't
really the biggest factor when considering raw search performance ... the
number of unique terms in the index and the average number of terms per
document are typically more significant factors.

there's also the question of what you mean by a "query" ... a simple term
query is a lot cheaper/faster than a complex boolean query or a phrase
query.




-Hoss
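
One way to get a feel for the term statistics Hoss mentions is to walk the term dictionary directly. A small sketch against the pre-4.0 Lucene API in use around this time (later Lucene replaced TermEnum with Terms/TermsEnum); the index path is a placeholder:

    import java.io.File;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.TermEnum;
    import org.apache.lucene.store.FSDirectory;

    public class TermStats {
        public static void main(String[] args) throws Exception {
            IndexReader reader = IndexReader.open(
                FSDirectory.open(new File("/path/to/index")));
            long unique = 0;
            TermEnum terms = reader.terms();  // iterates the term dictionary
            while (terms.next()) unique++;
            terms.close();
            System.out.println("docs = " + reader.numDocs()
                             + ", unique terms = " + unique);
            reader.close();
        }
    }

Two indexes with the same document count can behave very differently if one has ten times the unique terms.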


Re: maximum index size

James liu-2
If you want to do it with one index file, I think you already know how by
the time you read this mail.

I think maybe you can divide it into several partitions (I don't know
exactly what to call them), each with one master and several slaves if you
use Solr, so one request becomes several queries.
That reduces index file size and indexing time.
But it will cost some search performance... maybe that can be made up with
more PCs or servers.




--
regards
jl