Use case for term vector's token position/offset?

classic Classic list List threaded Threaded
2 messages Options
Reply | Threaded
Open this post in threaded view
|

Use case for term vector's token position/offset?

Jong Kim
Hi,
 
When I look at org.apache.lucene.document.Field.TermVector,
it defines the following 5 options as to the detailed info
that can be stored wrt term vectors.

1. NO
2. WITH_OFFSETS
3. WITH_POSITIONS
4. WITH_POSITIONS_OFFSETS
5. YES

It isn't difficult to understand where the basic term vector
information (ie, terms and their number of occurences - option 5)
might be useful. I believe it can be used to implement features
like "concept search" or "more like this" functionalities.

However, it isn't clear to me how the other extra info (ie,
token position information and/or token offset information)
might be used? Can anyone help me understand what kind of
(advanced) search techniques people use these extra
information for, or even better, any pointer to real world
examples?

Thanks
/Jong


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: Use case for term vector's token position/offset?

Grant Ingersoll-2
Hi Jong,

I think these are useful for things like highlighting (I think  
contrib/highlighter can use them); other post processing algorithms  
such as: question answering, calculating co-occurrences (find the 6  
terms to the left and right of the term at position 16).  Perhaps you  
want to give higher scores to documents where your terms occur in a  
certain part of the document (like the beginning)

Really, any application where you need to know the relationships  
between the terms in a document or the document and the original.

HTH,
Grant

On Nov 21, 2006, at 10:36 AM, Jong Kim wrote:

> Hi,
>
> When I look at org.apache.lucene.document.Field.TermVector,
> it defines the following 5 options as to the detailed info
> that can be stored wrt term vectors.
>
> 1. NO
> 2. WITH_OFFSETS
> 3. WITH_POSITIONS
> 4. WITH_POSITIONS_OFFSETS
> 5. YES
>
> It isn't difficult to understand where the basic term vector
> information (ie, terms and their number of occurences - option 5)
> might be useful. I believe it can be used to implement features
> like "concept search" or "more like this" functionalities.
>
> However, it isn't clear to me how the other extra info (ie,
> token position information and/or token offset information)
> might be used? Can anyone help me understand what kind of
> (advanced) search techniques people use these extra
> information for, or even better, any pointer to real world
> examples?
>
> Thanks
> /Jong
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]
>

--------------------------
Grant Ingersoll
Sr. Software Engineer
Center for Natural Language Processing
Syracuse University
335 Hinds Hall
Syracuse, NY 13244
http://www.cnlp.org




---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]