TermFreqVector and performance, index size

Philippe Deslauriers (Beetext)

Hello,

We are upgrading from Lucene 1.3 to 1.9.
We plan to use the Highlight package for highlighting, replacing our
in-house highlight classes.

From what I can read, the Highlight package requires TermFreqVectors to be
added to the index. I will get into the Highlight package later, but right
now I am trying to understand what the TermFreqVector is used for and what
its impact is.

When adding the “content” field, I ran a few tests with the different
options to measure indexing time and index size, as we are working with
HUGE indexes (1 GB+). For the tests I used roughly 4500 random text
documents.

With Lucene 1.3: time 2m:05s, index size 13.0 MB

With Lucene 1.9:

Field.TermVector NO                      (time 1m:45s, index size 7.1 MB)
Field.TermVector WITH_POSITIONS_OFFSETS  (time 2m:45s, index size 25 MB !!!)
Field.TermVector YES NO                  (time 2m:01s, index size 13.3 MB)

What are the OFFSETS and POSITIONS used for? Do I need them for highlighting?
Can I create the TermFreqVector on the fly for a document, or do I have to
include it in the index?


Philippe
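
For readers unsure what "positions" and "offsets" refer to in the question
above, here is a minimal plain-Java sketch. It does not use Lucene at all:
the class TermVectorSketch, its analyze() method and the simple
letter-or-digit tokenizer are assumptions made up for illustration. It only
shows the three kinds of per-term data involved (frequency, token position,
character offsets), which gives a rough idea of why storing all of them per
document makes an index larger than storing frequencies alone.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class TermVectorSketch {

    /** One occurrence of a term in a document (hypothetical helper type). */
    static final class Occurrence {
        final int position;    // index of the token in the token stream
        final int startOffset; // index of the token's first character in the text
        final int endOffset;   // index one past the token's last character

        Occurrence(int position, int startOffset, int endOffset) {
            this.position = position;
            this.startOffset = startOffset;
            this.endOffset = endOffset;
        }
    }

    /**
     * Splits the text into runs of letters/digits, lowercases each run and
     * records every occurrence of every term. The term frequency is simply
     * the size of the occurrence list.
     */
    static Map<String, List<Occurrence>> analyze(String text) {
        Map<String, List<Occurrence>> vector = new LinkedHashMap<>();
        int position = 0;
        int i = 0;
        while (i < text.length()) {
            if (!Character.isLetterOrDigit(text.charAt(i))) {
                i++; // skip separators
                continue;
            }
            int start = i;
            while (i < text.length() && Character.isLetterOrDigit(text.charAt(i))) {
                i++;
            }
            String term = text.substring(start, i).toLowerCase();
            vector.computeIfAbsent(term, k -> new ArrayList<>())
                  .add(new Occurrence(position, start, i));
            position++;
        }
        return vector;
    }

    public static void main(String[] args) {
        Map<String, List<Occurrence>> vector = analyze("To be or not to be");
        for (Map.Entry<String, List<Occurrence>> e : vector.entrySet()) {
            StringBuilder line = new StringBuilder(e.getKey())
                    .append(" freq=").append(e.getValue().size());
            for (Occurrence o : e.getValue()) {
                line.append(" pos=").append(o.position)
                    .append(" offsets=[").append(o.startOffset)
                    .append(',').append(o.endOffset).append(')');
            }
            System.out.println(line);
        }
    }
}
```

In this sketch a frequency-only vector would need one number per distinct
term, while positions and offsets add three more numbers per occurrence.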



---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Re: TermFreqVector and performance, index size

Daniel Naber-5
On Thursday 27 April 2006 14:32, Philippe Deslauriers (Beetext) wrote:

> What are the OFFSETS and POSITIONS used for? Do I need it for
> Highlighting?

No, you can provide an analyzer to Highlighter.getBestFragment() and it will
re-analyze your text without the need for term vectors.

Regards
 Daniel

--
http://www.danielnaber.de
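
To illustrate the idea of highlighting by re-analysis described in the
reply, here is a small plain-Java sketch. It is not the Lucene Highlight
package and makes no claim about how that package works internally: the
class ReanalyzeHighlightSketch, the highlight() method, the <B> markup and
the letter-or-digit tokenizer are all assumptions for illustration. The
point is only that the stored text can be tokenized again at display time
and matched against the query terms, so nothing extra has to be kept in
the index.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class ReanalyzeHighlightSketch {

    /**
     * Re-tokenizes the text at display time and wraps every token whose
     * lowercased form is one of the query terms in <B>...</B>.
     * Separators and non-matching tokens are copied through unchanged.
     */
    static String highlight(String text, Set<String> queryTerms) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < text.length()) {
            if (!Character.isLetterOrDigit(text.charAt(i))) {
                out.append(text.charAt(i)); // copy separator as-is
                i++;
                continue;
            }
            int start = i;
            while (i < text.length() && Character.isLetterOrDigit(text.charAt(i))) {
                i++;
            }
            String token = text.substring(start, i);
            if (queryTerms.contains(token.toLowerCase())) {
                out.append("<B>").append(token).append("</B>");
            } else {
                out.append(token);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Set<String> terms = new HashSet<>(Arrays.asList("term", "index"));
        System.out.println(
                highlight("Term vectors add term data to the index.", terms));
    }
}
```

The trade-off in this sketch is that the text is tokenized again for every
displayed hit, so display time grows with document length while the index
stays small.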