Large index question


Large index question

Scott Smith-2
Suppose I want to index 500,000 documents (average document size is
4 KB).  Let's assume I create a single index and that the index is
static (I'm not going to add any new documents to it).  I would guess
the index would be around 2 GB.

 

Now, I do searches against this on a somewhat beefy machine (2 GB RAM,
Core 2 Duo, Windows XP).  Does anyone have any idea what kinds of search
times I can expect for moderately complicated searches (several sets of
keywords against several fields)?  Are there things I can do to increase
search performance?  For example, does Lucene like lots of RAM, lots of
CPU, a faster hard drive, all of the above?  Am I better off splitting the
index into two (or N) pieces and searching multiple indexes simultaneously?

 

Anyone have any thoughts about this?

 

Scott

 


Re: Large index question

Doron Cohen
"Scott Smith" <[hidden email]> wrote on 12/10/2006 14:14:57:

> Suppose I want to index 500,000 documents (average document size is
> 4 KB).  Let's assume I create a single index and that the index is
> static (I'm not going to add any new documents to it).  I would guess
> the index would be around 2 GB.

The input data size is ~2GB but the index itself may be smaller,
particularly if not storing fields/termvectors.

> Now, I do searches against this on a somewhat beefy machine (2 GB RAM,
> Core 2 Duo, Windows XP).  Does anyone have any idea what kinds of search
> times I can expect for moderately complicated searches (several sets of
> keywords against several fields)?  Are there things I can do to increase
> search performance?  For example, does Lucene like lots of RAM, lots of
> CPU, a faster hard drive, all of the above?  Am I better off splitting the
> index into two (or N) pieces and searching multiple indexes simultaneously?
>
> Anyone have any thoughts about this?

Indexing time (at least for plain text or simple HTML) would be something
near half an hour, so you might just give it a try. If the index turns out
to be small enough to reside in RAM (and you don't need the RAM for other
activities at the same time) you could try RAMDirectory. I wonder if anyone
has ever compared a RAMDirectory to a "hot" searcher over an FSDirectory -
it seems that having all the index data in RAM would be faster than relying
on the system's IO caching, but if for some reason the RAMDirectory cannot
stay in RAM all the time, I would assume that paging in/out would make it
more costly than using an FSDirectory and just counting on the system's IO
caching. In the latter case, see the relevant discussions on warming a
searcher and caching filters.
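
For example, something like this is roughly what I mean - just a minimal
sketch assuming the 1.9/2.0-era Lucene API; the index path and the "body"
field name are placeholders:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.RAMDirectory;

public class RamSearchSketch {
    public static void main(String[] args) throws Exception {
        // Open the existing on-disk index, then copy it into memory.
        FSDirectory fsDir = FSDirectory.getDirectory("/path/to/index", false);
        RAMDirectory ramDir = new RAMDirectory(fsDir);

        // Search against the in-memory copy.
        IndexSearcher searcher = new IndexSearcher(ramDir);
        Query query = new QueryParser("body", new StandardAnalyzer())
                .parse("several keywords");
        Hits hits = searcher.search(query);
        System.out.println("Matches: " + hits.length());
        searcher.close();
    }
}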




Re: Large index question

chrislusf
Lots of memory will help a lot. One of my DBSight customers is using an
Intel Core Duo with everything configured in memory. The index size is
about 700 MB, and when I checked his system's average response time, it
was 12 ms! I guess you can estimate what you will get from your beefy
machine.

So it may be a good idea to try your index on a 64-bit JVM with the whole
index in memory.
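
As a rough sanity check before loading everything into memory (just a
sketch - the 2 GB figure is only the estimated index size from the
original post), you can verify the heap the JVM actually got:

public class HeapCheck {
    public static void main(String[] args) {
        // Estimated index size from the original post; adjust to your data.
        long indexBytes = 2L * 1024 * 1024 * 1024;

        // Maximum heap the JVM will try to use (controlled by -Xmx; a
        // 64-bit JVM lets this grow past the ~2 GB limit of 32-bit JVMs).
        long maxHeap = Runtime.getRuntime().maxMemory();

        System.out.println("Max heap: " + (maxHeap / (1024 * 1024)) + " MB");
        if (maxHeap < indexBytes + indexBytes / 2) {
            System.out.println("Heap is probably too small to hold the whole "
                    + "index in memory plus search working space.");
        }
    }
}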

For indexing, it's better to have faster disks, since that is an IO-intensive process.

Chris Lu
-------------------------
Instant Full-Text Search On Any Database/Application
site: http://www.dbsight.net
demo: http://search.dbsight.com





Re: Large index question

Artem Vasiliev
In reply to this post by Scott Smith-2
Hello Scott!

I think your index is just not that large, really. My Sharehound index of
my corporate LAN is now about 10 GB / 10 million (really small) documents,
and queries take very little time: less than a second for non-sorted
queries and somewhat more for sorted ones. The machine is a P4 with 1 GB
of RAM, and I use just FSDirectory.
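
Just to illustrate the sorted vs. non-sorted difference - a small sketch
(the index path and the "body"/"date" field names are made up; the first
sorted search also pays the cost of building the field cache):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Sort;

public class SortedVsUnsorted {
    public static void main(String[] args) throws Exception {
        IndexSearcher searcher = new IndexSearcher("/path/to/index");
        Query query = new QueryParser("body", new StandardAnalyzer())
                .parse("several keywords");

        // Non-sorted: results come back in relevance order.
        Hits unsorted = searcher.search(query);

        // Sorted by a field: the first such search also has to populate
        // the field cache, which is why these take noticeably longer.
        Hits sorted = searcher.search(query, new Sort("date"));

        System.out.println(unsorted.length() + " hits / "
                + sorted.length() + " hits sorted");
        searcher.close();
    }
}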

Best regards,
Artem.

--
Best regards,
 Artem                          mailto:[hidden email]




Re: Large index question

Mark Miller-3
In reply to this post by Scott Smith-2
I recently played around with a 2 million doc index of docs that averaged
between 2 and 10 KB. The system had 4 GB of RAM and a 3 GHz dual-core
processor (not using a parallel searcher to take advantage of the extra
core)...pretty beefy, but with 4 times the docs you're talking about. I
didn't see a query that took over a second without a sort.

A similar setup on a single-core AMD 64 3200+ with a gig of RAM was also
blazingly fast (again, no sorts involved).
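
If you do decide to split the index into N pieces, ParallelMultiSearcher
is the usual way to put both cores to work - a rough sketch with
placeholder paths and a made-up field/term:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ParallelMultiSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Searchable;
import org.apache.lucene.search.TermQuery;

public class SplitIndexSearch {
    public static void main(String[] args) throws Exception {
        // One searcher per sub-index; the paths are placeholders.
        Searchable[] halves = {
            new IndexSearcher("/path/to/index-part1"),
            new IndexSearcher("/path/to/index-part2")
        };

        // ParallelMultiSearcher queries each sub-index in its own thread,
        // so a dual-core machine can work on both halves at the same time.
        ParallelMultiSearcher searcher = new ParallelMultiSearcher(halves);

        Query query = new TermQuery(new Term("body", "lucene"));
        Hits hits = searcher.search(query);
        System.out.println("Matches across both halves: " + hits.length());
        searcher.close();
    }
}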

- Mark
