get terms by positions

get terms by positions

Renzo Scheffer
Hi,

Can anybody tell me whether it is possible to look up a Term by its
position?

I search for a term (for example "soccer") and get back the doc IDs and
positions as follows:

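// Assumes `reader` is an open org.apache.lucene.index.IndexReader.
TermPositions termPos = reader.termPositions(new Term("contents", "soccer"));
try {
    while (termPos.next()) {
        int docNumber = termPos.doc();   // the same for every position in this doc
        int freq = termPos.freq();       // occurrences of the term in this doc
        for (int i = 0; i < freq; i++) {
            int position = termPos.nextPosition();
            System.out.println("DocId: " + docNumber + "; Pos: " + position);
        }
    }
} finally {
    termPos.close();
}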

Output:

DocId: 0; Pos: 1
DocId: 0; Pos: 4
DocId: 0; Pos: 7
DocId: 1; Pos: 3
DocId: 1; Pos: 7

Now I am trying to get the terms one position before and after "soccer". I
considered taking the position and incrementing or decrementing it, but I
can't find a way to get a term back from a given position.

Can anybody help me?

Thanks, Renzo


Re: get terms by positions

Nicolas Lalevée-2
On Monday 02 October 2006 23:06, Renzo Scheffer wrote:

> Now I try to get back terms, one position before/after "soccer". I
> considered to take the Position and increase or decrease it. But I can't
> find a way to get back a term, according to the given Position.

I don't think it makes sense to look up a term this way. In Lucene you search
with a term; you don't try to retrieve terms. Basically, Lucene keeps a list
of terms pointing to documents, not the reverse.

Maybe if you explain why you are trying to do that, we can find a better way
to do it.

Nicolas

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


Re: get terms by positions

Doron Cohen
In reply to this post by Renzo Scheffer
You can store TermVectors with position info, but I don't think this would
be enough for what you are asking, because it is not meant for direct
access to a term by its position, and because TermVectors store tokens,
i.e. the "indexed" form of the word, which I am not sure is what you need.

It seems doable by the following:

(1) store the field with the document - see
http://lucene.apache.org/java/docs/api/org/apache/lucene/document/Field.Store.html#YES

(2) store term vectors with offsets - see
http://lucene.apache.org/java/docs/api/org/apache/lucene/document/Field.TermVector.html#WITH_OFFSETS

(3) access the TermVector of a document by docid - see
http://lucene.apache.org/java/docs/api/org/apache/lucene/index/IndexReader.html#getTermFreqVector(int, java.lang.String)
and
http://lucene.apache.org/java/docs/api/org/apache/lucene/index/TermPositionVector.html

(4) Use the offset info to extract the relevant part of the original text
from the field, iterating backwards and forwards for a whitespace or so -
see
http://lucene.apache.org/java/docs/api/org/apache/lucene/index/IndexReader.html#document(int, org.apache.lucene.document.FieldSelector)

All this seems like a lot of work, and I am also not sure about the resulting
performance: the index size would grow due to both TermVectors and the
stored field. If the field content for each doc is also stored in another
store (say, a db) then this is less of a concern, but it is still lots of
work, and measurable extra IO (in addition to search) if this is part of a
search application.
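
Step (4) above is the only part that does not map directly onto an API call, so here is a minimal sketch of it in plain Java. It assumes the stored field text and the start/end character offsets of one occurrence are already in hand (from steps 1 to 3). The class and method names are made up for illustration, and a "word" here is simply a run of non-whitespace characters, so punctuation stays attached to it.

```java
import java.util.Optional;

final class OffsetNeighbours {

    /** Word immediately before the occurrence that starts at startOffset, if any. */
    static Optional<String> wordBefore(String text, int startOffset) {
        int end = startOffset;
        // Skip the whitespace that separates the occurrence from the previous word.
        while (end > 0 && Character.isWhitespace(text.charAt(end - 1))) end--;
        int begin = end;
        // Walk back to the first character of the previous word.
        while (begin > 0 && !Character.isWhitespace(text.charAt(begin - 1))) begin--;
        return begin == end ? Optional.empty() : Optional.of(text.substring(begin, end));
    }

    /** Word immediately after the occurrence that ends at endOffset (exclusive), if any. */
    static Optional<String> wordAfter(String text, int endOffset) {
        int begin = endOffset;
        // Skip the whitespace that follows the occurrence.
        while (begin < text.length() && Character.isWhitespace(text.charAt(begin))) begin++;
        int end = begin;
        // Walk forward to the end of the next word.
        while (end < text.length() && !Character.isWhitespace(text.charAt(end))) end++;
        return begin == end ? Optional.empty() : Optional.of(text.substring(begin, end));
    }

    public static void main(String[] args) {
        String text = "I play soccer every Sunday";
        int start = text.indexOf("soccer");    // stand-in for a stored start offset
        int end = start + "soccer".length();   // stand-in for a stored end offset
        System.out.println(wordBefore(text, start).orElse("<none>")
                + " | " + wordAfter(text, end).orElse("<none>"));
    }
}
```

Note that this returns the original surface form of the neighbours, not the analyzed tokens, which may or may not be what is wanted.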

Perhaps you can expand on the context of this? How is this feature going to
be used?

Re: get terms by positions

catalinmititelu
In reply to this post by Renzo Scheffer
Hi,
I have the same problem.
This is useful when you want to extract the contexts of a certain term (the terms before and after it), for example.
I found a solution, but it performs badly: to retrieve those contexts you have to re-tokenize the documents containing the given term (e.g. "soccer") and keep track of the preceding and following tokens using TokenStream and the current position. Re-tokenizing may be OK for small files, but for large files it makes the application slow.
Any different solution is welcome.
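
For reference, the re-tokenizing approach described above can be sketched in one pass without Lucene. This is only an illustration: the lower-casing and splitting on non-alphanumerics is a hypothetical stand-in for whatever Analyzer/TokenStream the index was built with, and the class names are invented.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class RetokenizeContexts {

    /** One hit: token position plus its neighbours (null at either end of the text). */
    static final class Context {
        final int position;
        final String before;
        String after;

        Context(int position, String before) {
            this.position = position;
            this.before = before;
        }

        @Override
        public String toString() {
            return position + ": " + before + " _ " + after;
        }
    }

    static List<Context> contexts(String text, String target) {
        List<Context> hits = new ArrayList<>();
        String previous = null;
        Context pending = null;   // a hit still waiting for its right neighbour
        int position = 0;
        // Stand-in analysis step: lower-case, then split on runs of non-alphanumerics.
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (token.isEmpty()) continue;   // split() can yield a leading empty string
            if (pending != null) {           // the current token is the right neighbour
                pending.after = token;
                pending = null;
            }
            if (token.equals(target)) {
                pending = new Context(position, previous);
                hits.add(pending);
            }
            previous = token;
            position++;
        }
        return hits;
    }

    public static void main(String[] args) {
        for (Context c : contexts("We play soccer. Soccer fans love soccer", "soccer")) {
            System.out.println(c);
        }
    }
}
```

The cost noted above is visible here: every document containing the term has to be read and tokenized in full, whatever the number of hits.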

Catalin.


AW: get terms by positions

Renzo Scheffer
In reply to this post by Nicolas Lalevée-2
I am trying to get a list of all left and right neighbours of a search term.
Then I will count them to find out how often a specific word occurs as a
neighbour of the search term. I know that the results vary with the
Analyzer/Filter used. It's just an experiment, and first I want to find out
whether something like this is possible with Lucene.

Renzo

-----Original Message-----
From: Nicolas Lalevée [mailto:[hidden email]]
Sent: Tuesday, 3 October 2006 00:04
To: [hidden email]
Subject: Re: get terms by positions

I think this is a non-sense to try to find a term. In Lucene, you search with
a term, you are not trying to get some. Basically, in Lucene, you have a list
of term pointing on documents, not the reverse.

Maybe if you explain why you are trying to do that, we can find a better way
to do it.

Nicolas

Re: AW: get terms by positions

Nicolas Lalevée-2
On Tuesday 03 October 2006 12:14, Renzo Scheffer wrote:
> I try to get back a list of all left or right neighbours of a searchterm.
> Then I will count them to get back the Information, how often a specific
> word is used as neighbour of the searchterm. I know that the results are
> variable according to the used Analyzer/Filter. It's just an experiment and
> first I'll try to find out if it is possible to do something like that with
> Lucene.

So you are trying to do some analysis of the documents stored in your
index. I would say that Lucene isn't designed for that at all; Lucene is
built for searching.

BTW, as Doron showed us, there may be a way to achieve this. See also
the "similarity" contribution of Lucene.

Nicolas


Re: AW: get terms by positions

Grant Ingersoll
In reply to this post by Renzo Scheffer
We often calculate co-occurrence information as an offline task and  
store it and then it is just a simple lookup at run time.   You just  
have to put together the appropriate loops based on the window size  
that you want for any given term.  Probably not efficient if your  
index is changing a lot.
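
As a rough illustration of "the appropriate loops based on the window size", the sketch below builds a co-occurrence table offline from already-tokenized documents and then answers lookups from it. It is not Grant's actual code: the token lists are a stand-in for whatever the index or analyzer provides, and the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class CooccurrenceTable {

    /** Result: counts.get(a).get(b) = how often b occurs within `window` positions of a. */
    static Map<String, Map<String, Integer>> build(List<List<String>> docs, int window) {
        Map<String, Map<String, Integer>> counts = new HashMap<>();
        for (List<String> tokens : docs) {
            for (int i = 0; i < tokens.size(); i++) {
                // Clamp the window to the document boundaries.
                int from = Math.max(0, i - window);
                int to = Math.min(tokens.size() - 1, i + window);
                for (int j = from; j <= to; j++) {
                    if (j == i) continue;   // a token is not its own neighbour
                    counts.computeIfAbsent(tokens.get(i), k -> new HashMap<>())
                          .merge(tokens.get(j), 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<List<String>> docs = Arrays.asList(
                Arrays.asList("i", "play", "soccer", "every", "sunday"),
                Arrays.asList("we", "play", "soccer", "today"));
        // Window of 1 = direct left and right neighbours only.
        Map<String, Map<String, Integer>> table = build(docs, 1);
        // At run time this is the "simple lookup".
        System.out.println(table.get("soccer"));
    }
}
```

Because the table is computed once over the whole collection, it goes stale whenever documents are added or removed, which is the caveat about frequently changing indexes.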

-Grant

--------------------------
Grant Ingersoll
Sr. Software Engineer
Center for Natural Language Processing
Syracuse University
335 Hinds Hall
Syracuse, NY 13244
http://www.cnlp.org

Voice: 315-443-5484
Skype: grant_ingersoll
Fax: 315-443-6886





Re: AW: get terms by positions

Grant Ingersoll-4
I should note, though, that we do this using the Lucene index, using  
the TermDocs, etc.
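
One way to read this: walk every term's postings once and turn them into a per-document "position to term" map, after which the neighbour of any hit is a plain lookup at position - 1 or position + 1. The sketch below shows only that inversion step in plain Java. The nested map is a hypothetical in-memory stand-in for what enumerating terms and their positions over a real index would yield, and the names are invented.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

final class PositionIndex {

    /**
     * Input:  term -> (docId -> positions), i.e. the shape of an inverted index.
     * Output: docId -> (position -> term), i.e. the forward view per document.
     */
    static Map<Integer, SortedMap<Integer, String>> invert(
            Map<String, Map<Integer, int[]>> postings) {
        Map<Integer, SortedMap<Integer, String>> byDoc = new HashMap<>();
        for (Map.Entry<String, Map<Integer, int[]>> termEntry : postings.entrySet()) {
            for (Map.Entry<Integer, int[]> docEntry : termEntry.getValue().entrySet()) {
                SortedMap<Integer, String> slots =
                        byDoc.computeIfAbsent(docEntry.getKey(), k -> new TreeMap<>());
                for (int pos : docEntry.getValue()) {
                    // If two terms share a position (e.g. synonyms), the last one wins.
                    slots.put(pos, termEntry.getKey());
                }
            }
        }
        return byDoc;
    }

    public static void main(String[] args) {
        Map<String, Map<Integer, int[]>> postings = new HashMap<>();
        postings.put("we", Collections.singletonMap(0, new int[] {0}));
        postings.put("play", Collections.singletonMap(0, new int[] {1}));
        postings.put("soccer", Collections.singletonMap(0, new int[] {2, 4}));
        postings.put("and", Collections.singletonMap(0, new int[] {3}));

        SortedMap<Integer, String> doc0 = invert(postings).get(0);
        for (int pos : postings.get("soccer").get(0)) {
            // A missing neighbour (document edge or removed stop word) shows up as null.
            System.out.println(doc0.get(pos - 1) + " <- soccer@" + pos + " -> " + doc0.get(pos + 1));
        }
    }
}
```

The neighbours obtained this way are the indexed tokens, so they depend on the analyzer, as Renzo already noted.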

On Oct 3, 2006, at 8:42 AM, Grant Ingersoll wrote:

> We often calculate co-occurrence information as an offline task and  
> store it and then it is just a simple lookup at run time.   You  
> just have to put together the appropriate loops based on the window  
> size that you want for any given term.  Probably not efficient if  
> you index is changing a lot.
>
> -Grant

------------------------------------------------------
Grant Ingersoll
http://www.grantingersoll.com/


