number of term occurrences

beatriz ramos
Hello,
I'm working with Lucene. I need to get the number of occurrences of a term
in a document. I have looked through the documentation and I can't find anything.
Do you have any idea?
Thanks.

Re: number of term occurrences

Erick Erickson
Use TermDocs.seek(Term) to get to the term. That'll position your TermDocs
variable at a list, ordered by document ID, of the occurrences of a term. Then
TermDocs.skipTo(docId) will get you to the list of terms for that document
(you have to know which Lucene doc ID you care about here).

Now TermDocs.next() will increment through and you can count. Something
like:

TermDocs td = reader.termDocs();  // reader is an open IndexReader
td.seek(new Term("field", "value"));
td.skipTo(docId);
int idx = 0;
while (td.next()) {
    if (td.doc() != docId) break;
    ++idx;
}


idx should contain your count now.

Note, the above could be off by one, you'll have to check whether skipTo
puts you on a document already....

Best
Erick




Re: number of term occurrences

Paul Elschot
On Monday 23 October 2006 21:16, Erick Erickson wrote:

> Use TermDocs.seek(Term) to get to the term. That'll position your TermDocs
> variable at a list, ordered by document ID of the occurrences of a term. Then
> TermDocs.skipTo(doc ID) will get you to the list of terms for that document
> (you have to know what Lucene DocId you care about here.).
>
> Now TermDocs.next() will increment through and you can count. Something
> like.
>
> TermDocs td = IndexReader.termDocs();
> td.seek(new Term("field", "value"));
> td.skipTo(docId);

and then td.freq() should give the answer without counting.
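
To make the difference between counting next() calls and reading freq() concrete, here is a minimal plain-Java sketch of a postings list for a single term. It is a toy model and not Lucene code: the class ToyPostings, its fields and the helper countInDoc are made up for illustration, and only the method names next(), skipTo(), doc() and freq() are borrowed from TermDocs. The assumption it encodes is the one stated above: each entry in the list is a document, and the per-document count is already stored with the entry.

```java
import java.util.Map;
import java.util.TreeMap;

/** Toy stand-in for the postings of ONE term: doc ID -> occurrences in that doc. */
public class ToyPostings {
    private final int[] docs;   // doc IDs in ascending order
    private final int[] freqs;  // freqs[i] = occurrences of the term in docs[i]
    private int pos = -1;       // index of the current entry; -1 = not positioned yet

    public ToyPostings(TreeMap<Integer, Integer> docToFreq) {
        docs = new int[docToFreq.size()];
        freqs = new int[docToFreq.size()];
        int i = 0;
        for (Map.Entry<Integer, Integer> e : docToFreq.entrySet()) {
            docs[i] = e.getKey();
            freqs[i] = e.getValue();
            i++;
        }
    }

    /** Advances to the next DOCUMENT (not the next occurrence). */
    public boolean next() {
        if (pos + 1 >= docs.length) { pos = docs.length; return false; }
        pos++;
        return true;
    }

    /** Moves forward to the first entry whose doc ID is >= target. */
    public boolean skipTo(int target) {
        do {
            if (!next()) return false;
        } while (docs[pos] < target);
        return true;
    }

    public int doc()  { return docs[pos]; }
    public int freq() { return freqs[pos]; }

    /** Occurrences of the term in docId, or 0 if the term is not in that doc. */
    public static int countInDoc(ToyPostings p, int docId) {
        return (p.skipTo(docId) && p.doc() == docId) ? p.freq() : 0;
    }

    public static void main(String[] args) {
        TreeMap<Integer, Integer> m = new TreeMap<>();
        m.put(2, 1); m.put(5, 3); m.put(9, 2);
        System.out.println(countInDoc(new ToyPostings(m), 5)); // prints 3
    }
}
```

In this model a loop that counts next() calls ends up counting documents, which is why the count has to come from freq() once skipTo() has landed on the wanted document.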

Regards,
Paul Elschot

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


Re: number of term occurrences

Erick Erickson
Yeah, but I haven't used the termfreq thingy enough to think of it
automatically <G>...

Besides, I'm learning that if I put a foolish answer out there, someone'll
correct me.
Thanks
Erick


Re: number of term occurrences

Grant Ingersoll-2
In reply to this post by beatriz ramos
You can also use Term Vectors, at the cost of extra storage. Search
this list for Term Vectors for info on how to implement.


--------------------------
Grant Ingersoll
Sr. Software Engineer
Center for Natural Language Processing
Syracuse University
335 Hinds Hall
Syracuse, NY 13244
http://www.cnlp.org






Re: number of term occurrences

beatriz ramos
Hi, thanks for all your answers, but they don't work.

I have tried the 3 options and with all of them we get termDoc = 0.
I have checked my index with Luke and termDoc is 1 there, so my
index is correct.

Is it possible that I have a problem with the reader? (My index is
all right.)

Thanks

(When I talk about termDocs, I mean the number of documents in which the term
appears.)




Re: number of term occurrences

Paz Belmonte
Hi,

I have tried these options too and the term vector returns null.

What do you think the problem is?



RE: number of term occurrences

Samir Abdou
Hi,

You indexed without storing vectors! This is why the term vector is null.

Samir






Re: number of term occurrences

Paz Belmonte
I don't know. How are these vectors stored?
Could you show me an example (or documentation where I can find one)?


Re: number of term occurrences

pgwillia
When you create a Document by adding Field(s)
(http://lucene.apache.org/java/docs/api/org/apache/lucene/document/Field.html),
consider the last constructor, which allows you to specify whether the field
will have its TermVector stored.  Also, Luke has a column in
its document view which tells you whether the TermVector is stored,
by the presence or absence of a + under the T column.
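
As a rough picture of why the lookup came back null: a term vector is per-document, per-field data that exists only when the field was added with the term-vector option switched on. The sketch below models that in plain Java. ToyIndex, addField, getTermVector and termFreq are hypothetical names used for illustration and are not Lucene API; the whitespace tokenizing is also a simplification. In Lucene itself the switch is the TermVector argument of the Field constructor mentioned above.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model: term vectors exist only for fields added with storeVector = true. */
public class ToyIndex {
    // key "docId/field" -> (term -> frequency); absent key = no vector stored
    private final Map<String, Map<String, Integer>> vectors = new HashMap<>();

    /** Adds a field; the per-document term counts are kept only on request. */
    public void addField(int docId, String field, String text, boolean storeVector) {
        if (!storeVector) return;              // field is added, but no vector is kept
        Map<String, Integer> tf = new HashMap<>();
        for (String term : text.toLowerCase().split("\\s+")) {
            if (!term.isEmpty()) tf.merge(term, 1, Integer::sum);
        }
        vectors.put(docId + "/" + field, tf);
    }

    /** Returns the term vector, or null if none was stored for this doc and field. */
    public Map<String, Integer> getTermVector(int docId, String field) {
        return vectors.get(docId + "/" + field);
    }

    /** Frequency of term in the field, or -1 if no vector was stored. */
    public int termFreq(int docId, String field, String term) {
        Map<String, Integer> tf = getTermVector(docId, field);
        if (tf == null) return -1;
        return tf.getOrDefault(term, 0);
    }

    public static void main(String[] args) {
        ToyIndex idx = new ToyIndex();
        idx.addField(0, "body", "to be or not to be", true);
        idx.addField(1, "body", "to be or not to be", false);
        System.out.println(idx.termFreq(0, "body", "be"));   // prints 2
        System.out.println(idx.getTermVector(1, "body"));    // prints null
    }
}
```

The second document has the same text but no vector, so the lookup returns null even though the content was added; that is the situation a null term vector points to.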

Cheers,
Tricia

On Tue, 24 Oct 2006, Paz Belmonte wrote:

> I don't know. How are this vectors stored?
> Could you show me an example? (or documentation where I can find it)
>
> 2006/10/24, Samir Abdou <[hidden email]>:
>>
>> Hi,
>>
>> You indexed without storing vectors! This is why the term vector is null.
>>
>> Samir
>>



Re: number of term occurrences

Doron Cohen
In reply to this post by beatriz ramos
I don't know why the termDocs option did not work for you. Perhaps you did
not (re)open the searcher after the index was populated?  Anyhow, here is a
small code snippet that does just this, see if it works for you, then you
can compare it to your code...

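  // Imports needed by the two methods below (put the methods in any class).
  import java.io.IOException;
  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.document.Document;
  import org.apache.lucene.document.Field;
  import org.apache.lucene.index.IndexReader;
  import org.apache.lucene.index.IndexWriter;
  import org.apache.lucene.index.Term;
  import org.apache.lucene.index.TermDocs;
  import org.apache.lucene.store.RAMDirectory;

  void numberOfTermOcc() throws Exception {
    System.out.println("======== populate index");
    RAMDirectory dir = new RAMDirectory();
    // true = create a new index
    IndexWriter iw = new IndexWriter(dir,
                                     new StandardAnalyzer(),true);
    for (int i = 0; i < 10; i++) {
      Document doc = new Document();
      for (int j = 0; j < 10; j++) {
        // each field gets value_(i+j) twice and value_(i+j+1) once
        doc.add(new Field("field_"+(i+j), "value_"+(i+j),
                          Field.Store.NO, Field.Index.TOKENIZED));
        doc.add(new Field("field_"+(i+j), "value_"+(i+j),
                          Field.Store.NO, Field.Index.TOKENIZED));
        doc.add(new Field("field_"+(i+j), "value_"+(i+j+1),
                          Field.Store.NO, Field.Index.TOKENIZED));
      }
      iw.addDocument(doc);
    }
    iw.close();

    // open the reader only after the writer is closed, so it sees all docs
    IndexReader ir = IndexReader.open(dir);
    printTermDocs(ir, new Term("field_7","value_7"));
    printTermDocs(ir, new Term("field_7","value_8"));
    ir.close();
  }

  void printTermDocs(IndexReader ir, Term t) throws IOException {
    System.out.println("========= iterate docs for "+t);
    TermDocs td = ir.termDocs(t);

    // next() moves to the next document containing t;
    // freq() is the number of occurrences of t in that document
    while (td.next()) {
      System.out.println("term frequency in doc "+td.doc()+
                         " is: "+ td.freq());
    }
    td.close();
  }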




Re: number of term occurrences

beatriz ramos
In reply to this post by Samir Abdou
Thank you

I had forgotten "Field.TermVector.YES" when I created the new Field.


