Best Lucene hardware

james-17
Hi,

 

I'm wondering if someone familiar with the way Lucene accesses data could
give their opinion on whether hard drive seek time or throughput is more
important in Lucene performance, assuming a very large index that cannot fit
in RAM.  I'm looking at buying some new servers that will be running Lucene,
and wonder if I should go with SCSI RAID, or if perhaps spending the extra
money on processors (and going with SATA for drives) is better.  I'm not
sure where the bottleneck is in an average system, and I don't have any SCSI
RAID systems available for testing.

 

Thanks,

James

Re: Best Lucene hardware

sivan v
Hello Mr. James,

You can get some info from the following link:
   
  http://lucene.apache.org/java/docs/benchmarks.html
   
   
 

Enduringly yours,
V. Sivanarul, M.Tech.




               
RE: Best Lucene hardware

james-17
Hi,

Thanks for the info.  Unfortunately, most of that has to do with indexing,
whereas I am concerned with retrieval speed.  And, there really isn't enough
information there to make good comparisons -- there are several completely
different systems with no way to pin down what the important changes in
hardware are.  But, thanks for the link!

Sincerely,
James


RE: Best Lucene hardware

Wolfgang Täger
Dear James,

I recently had the same question, but no definitive answer to offer.

I would guess that whether throughput or access time matters more depends on:
a) document size (the larger the documents, the more throughput matters)
b) how many documents you actually want to read (only a few to display,
   or all of them for further processing). If you want to read many
   documents, seek time becomes more important.

My best guess is that access time is more important for you, unless you
store only very few very large documents.
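
To make that trade-off concrete, here is a rough back-of-the-envelope sketch. The drive figures (8 ms average access time, 60 MB/s sustained transfer) and the document counts and sizes are assumptions picked for illustration, not measurements of any particular disk or index, and the function name is hypothetical. It also assumes one seek per document and ignores caching.

```python
def read_time_seconds(num_docs, doc_bytes, seek_ms, throughput_mb_s):
    """Estimate seek time and transfer time for num_docs random reads.

    Assumes one seek per document and no caching (a simplification).
    """
    seek_s = num_docs * seek_ms / 1000.0
    transfer_s = num_docs * doc_bytes / (throughput_mb_s * 1_000_000)
    return seek_s, transfer_s


# Hypothetical workloads: (label, number of documents, bytes per document)
workloads = [
    ("many small documents", 1000, 10_000),
    ("few large documents", 5, 50_000_000),
]

for label, num_docs, doc_bytes in workloads:
    # Assumed drive: 8 ms access time, 60 MB/s sustained throughput
    seek_s, transfer_s = read_time_seconds(num_docs, doc_bytes, 8.0, 60.0)
    print(f"{label}: seek {seek_s:.2f} s, transfer {transfer_s:.2f} s")
```

Under these assumed numbers, seeking dominates when reading many small documents, and transfer dominates only when reading a handful of very large ones.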

Of course, you should look for disks that support native command queuing
(the disk may reorder the read commands to reduce seek time).
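
As a toy illustration of why reordering helps, the sketch below compares total head travel when requests are serviced in arrival order with a simple sweep ordering. This is a deliberately simplified model: real drive firmware also accounts for rotational position, and the block addresses, starting position, and function names here are hypothetical.

```python
def head_travel(start, requests):
    """Total distance the head moves when servicing requests in the given order."""
    travel, pos = 0, start
    for block in requests:
        travel += abs(block - pos)
        pos = block
    return travel


def sweep_order(start, requests):
    """Service blocks at or above the start in ascending order, then the rest descending."""
    ahead = sorted(b for b in requests if b >= start)
    behind = sorted((b for b in requests if b < start), reverse=True)
    return ahead + behind


queue = [900, 100, 850, 150, 800, 200]  # hypothetical block addresses, in arrival order
start = 500                              # hypothetical current head position

print("arrival order:", head_travel(start, queue))
print("sweep order:  ", head_travel(start, sweep_order(start, queue)))
```

With this queue the arrival order zigzags across the disk, while the sweep visits nearby blocks together, so the head covers far less distance.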

Another option, if your memory requirements are not too large, is a solid
state disk; see e.g.
http://techreport.com/reviews/2006q1/gigabyte-iram/index.x?pg=7

The second version is expected to support up to 16 GB; see
http://www.vr-zone.com.sg/?i=3052

Best regards,

Wolfgang


RE: Best Lucene hardware

james-17
Thanks for the feedback.  I saw those solid-state drives, and they are
definitely an interesting option if I am I/O limited.  However, I suspect
that I am CPU limited, which (ironically, after all the investigation I have
done) seems to make commodity server farms the best option.

Thanks,
James


Term Vectors -- searching or just ranking?

james-17
Hi,

We are implementing term vectors, and there is something I am unclear about:
can term vectors be used to perform a search in its entirety (e.g., rank all
1 million documents in a database in order, and then return the top 100)?
Or, due to computational time requirements, are term vectors only intended
as a ranking method for a small subset of data that is the result of a
Boolean search (e.g., we know the 100 documents that are possible answers;
now put them in relevancy order)?

Thanks,
James

Re: Term Vectors -- searching or just ranking?

Fredrik Andersson-2-2
Hi James,

I can't speak for anyone else, but in my experience the general approach
is to first select a subset based on the angle between the query vector
and the document vector in their non-reduced forms (this is a normal
keyword search, which is what Lucene does by default, expressed in vector
notation). From there, you pick up the documents in the subset along with
their reduced term vectors and compare their angles to the reduced query
vector.

If you skip the first step, you will have one dot product (query vector
and document vector) for every document in your database, but you will
only need to store the reduced term vectors. That's a lot of computation,
but it's necessary if you want to match documents that are related to a
query but do not contain some or all of the words in it. In my experience,
this approach makes for a cool feature, but the hits returned are usually
pretty poor. If a document doesn't get a hit on a normal keyword search,
just leave it out (note, this is only my opinion).

Some terminology, in case you did not follow: "reduced" refers to the
projection of a vector onto a smaller subspace (you can normally reduce
the dimension / column space of the term-document matrix by ~60% and have
virtually no loss of precision in your searches). See "singular value
decomposition" for more on that.
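
A minimal sketch of the two strategies, under some loud assumptions: the reduced vectors are taken as already computed (for example by an SVD done elsewhere), the first stage is simplified to "shares at least one query term" rather than a full keyword-search score, and all names and data below are hypothetical. It uses plain Python and does not call any Lucene API.

```python
import math


def cosine(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def two_stage_search(query_terms, query_reduced, docs, top_n):
    """Keyword filter first, then rank only the survivors by reduced-vector angle.

    docs maps doc_id -> (set of terms, precomputed reduced vector).
    """
    candidates = [d for d, (terms, _) in docs.items() if terms & query_terms]
    candidates.sort(key=lambda d: cosine(query_reduced, docs[d][1]), reverse=True)
    return candidates[:top_n]


def full_scan_search(query_reduced, docs, top_n):
    """Skip the filter: one similarity computation per document in the collection."""
    ranked = sorted(docs, key=lambda d: cosine(query_reduced, docs[d][1]), reverse=True)
    return ranked[:top_n]


# Hypothetical three-document collection with made-up 2-dimensional reduced vectors
docs = {
    "a": ({"lucene", "index"}, [0.9, 0.1]),
    "b": ({"disk", "seek"}, [0.8, 0.3]),
    "c": ({"lucene", "raid"}, [0.2, 0.9]),
}
query_terms = {"lucene"}
query_reduced = [1.0, 0.0]

print(two_stage_search(query_terms, query_reduced, docs, 2))
print(full_scan_search(query_reduced, docs, 2))
```

In this toy data the full scan surfaces document "b", which shares no term with the query, while the two-stage version only ever ranks documents that matched a keyword. The full scan costs one similarity computation per document in the collection.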

Hope that helps,
Fredrik



