Retrieving term positions without storing the term vectors


syga
Dear all,

   Am I correct to believe that a quoted (phrase) search, like "red dog", returns documents containing the consecutive words "red" and "dog" in that order, even without storing the term vector (Field.TermVector.NO)?

   If the inverted index (with Field.TermVector.NO and Field.Store.NO) is able to check whether the words are consecutive and in the right order, then I suppose that the inverted index must somehow contain the positional information of the words in the documents.
   
   If my supposition is correct, then is it possible to access this positional information via the Lucene API? Of course, I am not speaking about indexReader.getTermFreqVector(doc, field), which returns null if we use Field.TermVector.NO.

   If my supposition is incorrect, could you please explain how the inverted index is able to deal with quoted searches without having this positional information?

   Thank you so much,
SG.
Re: Retrieving term positions without storing the term vectors

Michael McCandless-2

Indeed, Lucene stores position information in the inverted index and
uses it when running phrase queries.  It is stored separately from
term vectors.
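
To make the idea concrete, here is a minimal sketch of a positional inverted index and a two-word phrase check. It is a toy model, not Lucene's implementation: the class PositionalIndexSketch, its methods, and the whitespace tokenizer are all hypothetical names made up for illustration. In the Lucene 2.x API, I believe the equivalent per-term view is what IndexReader.termPositions(Term) exposes (iterating documents and calling nextPosition()), but treat that as a pointer to check against the javadoc rather than as tested code.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy positional inverted index; an illustration only, not how Lucene stores data. */
final class PositionalIndexSketch {
    // term -> (docId -> positions of that term in the document, in ascending order)
    private final Map<String, TreeMap<Integer, List<Integer>>> postings = new HashMap<>();

    /** Indexes a document by lowercasing and splitting on whitespace (assumed analyzer). */
    void add(int docId, String text) {
        String[] tokens = text.toLowerCase().trim().split("\\s+");
        for (int pos = 0; pos < tokens.length; pos++) {
            postings.computeIfAbsent(tokens[pos], t -> new TreeMap<>())
                    .computeIfAbsent(docId, d -> new ArrayList<>())
                    .add(pos);
        }
    }

    /** Returns the ids of documents in which `first` is immediately followed by `second`. */
    List<Integer> phrase(String first, String second) {
        List<Integer> hits = new ArrayList<>();
        Map<Integer, List<Integer>> a = postings.getOrDefault(first, new TreeMap<>());
        Map<Integer, List<Integer>> b = postings.getOrDefault(second, new TreeMap<>());
        for (Map.Entry<Integer, List<Integer>> e : a.entrySet()) {
            List<Integer> secondPositions = b.get(e.getKey());
            if (secondPositions == null) {
                continue; // this document does not contain the second term at all
            }
            for (int p : e.getValue()) {
                if (secondPositions.contains(p + 1)) { // consecutive and in order
                    hits.add(e.getKey());
                    break;
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        PositionalIndexSketch index = new PositionalIndexSketch();
        index.add(0, "the red dog barked");
        index.add(1, "the dog is red");
        index.add(2, "red cat and red dog");
        System.out.println(index.phrase("red", "dog")); // prints [0, 2]
    }
}
```

Document 1 contains both words but not consecutively and in order, so it is not a hit. No term vector and no stored field is involved; the positions in the postings are enough.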

However, the positions are "inverted": for a given term you can find
all documents that contain that term, as well as the positions at
which the term occurs in each of those documents.  Because of this
inversion, the full list of terms and their positions for a single
document is not easily reconstructed.  It is feasible to do so, but
the amount of computation and I/O makes it unrealistic in most
situations.  This is why term vectors, which are not inverted, are
used when you want to retrieve all terms, positions, and offsets for a
single document.
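
As a rough illustration of why going from the inverted postings back to one document is costly, the sketch below rebuilds a document's term sequence by visiting every term in a toy index. The class UninvertSketch and its members are hypothetical and exist only for this example. A real index would also have to scan each term's postings list to find the document, which this in-memory model hides behind a map lookup.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy model of "un-inverting" one document from positional postings. */
final class UninvertSketch {
    // term -> (docId -> positions of that term in the document)
    private final Map<String, Map<Integer, List<Integer>>> postings = new HashMap<>();

    // Number of distinct terms visited by the most recent uninvert() call.
    int termsVisited;

    /** Indexes a document by lowercasing and splitting on whitespace (assumed analyzer). */
    void add(int docId, String text) {
        String[] tokens = text.toLowerCase().trim().split("\\s+");
        for (int pos = 0; pos < tokens.length; pos++) {
            postings.computeIfAbsent(tokens[pos], t -> new HashMap<>())
                    .computeIfAbsent(docId, d -> new ArrayList<>())
                    .add(pos);
        }
    }

    /** Rebuilds position -> term for one document by walking the whole term dictionary. */
    TreeMap<Integer, String> uninvert(int docId) {
        termsVisited = 0;
        TreeMap<Integer, String> byPosition = new TreeMap<>();
        for (Map.Entry<String, Map<Integer, List<Integer>>> e : postings.entrySet()) {
            termsVisited++; // every term is touched, even if the document lacks it
            List<Integer> positions = e.getValue().get(docId);
            if (positions == null) {
                continue;
            }
            for (int p : positions) {
                byPosition.put(p, e.getKey());
            }
        }
        return byPosition;
    }

    public static void main(String[] args) {
        UninvertSketch index = new UninvertSketch();
        index.add(0, "the red dog barked");
        index.add(1, "the dog is red");
        index.add(2, "red cat and red dog");
        System.out.println(String.join(" ", index.uninvert(1).values()));
        System.out.println("terms visited: " + index.termsVisited);
    }
}
```

Here the four-word document is recovered only after touching all seven distinct terms in the index. With a vocabulary of millions of terms the same walk is what makes the approach impractical, whereas a term vector holds the per-document list directly.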

Mike

PS -- it's better to use the java-user mailing list for this sort of
question.
