How to get mapping of query terms to number of their occurrences in a doc?

classic Classic list List threaded Threaded
8 messages Options
Reply | Threaded
Open this post in threaded view
|

How to get mapping of query terms to number of their occurrences in a doc?

Dmitry Goldenberg
Given a query, I want to be able to, for each query term, get the number of occurrences of the term.  I have tried what I'm including below and it does not seem to provide reliable results.  Seems to work fine with exact matching but as soon as stemming kicks in, all bets are off as to value of the number of occurrences returned.
 
Any ideas, anyone?  Can this be written in a simpler and/or more efficient way?
Thanks -
 
      int totalOccurrences = 0;
 
      reader = IndexReader.open(getDirectory(indexDirPath));
      HashSet terms = new HashSet();
      query.extractTerms(terms);
 
      TermFreqVector[] tfvs = reader.getTermFreqVectors(docId);
      if (tfvs != null) {

        // For each term frequency vector (i.e. for each field)
        for (int i = 0; i < tfvs.length; i++) {
          String field = tfvs[i].getField();
          String[] strTerms = tfvs[i].getTerms();
          int[] tfs = tfvs[i].getTermFrequencies();
 
          if (strTerms != null) {

            // For each term in the query
            for (Iterator iter = terms.iterator(); iter.hasNext();) {

              Term term = (Term) iter.next();
              // For each term in the vector
              for (int j = 0; j < strTerms.length; j++) {

                // If found the query term among the vector terms
                if (field.equals(term.field()) && strTerms[j].equals(term.text())) {

                  // Add the term frequency to the total
                  totalOccurrences += tfs[j];

                }
              }
            }
          }
        }
      }
Reply | Threaded
Open this post in threaded view
|

Re: How to get mapping of query terms to number of their occurrences in a doc?

Chris Hostetter-3

A cursory reading of your code looks ok ... stemming shouldn't be an issue
as long as your measure of success is comparing docs that match your
orriginal query with the counts you get out.

What i mean by that is that any stemming should have already been taken
care of when your query object was constructed (either by you manually, or
by QueryParser).  the direct equals comparisons you are dong should be
fine.

have you tried adding logging of the raw term field/text and the freq
counts you get back to see if that helps you spot the problem?


: Date: Mon, 6 Feb 2006 14:34:05 -0800
: From: Dmitry Goldenberg <[hidden email]>
: Reply-To: [hidden email]
: To: [hidden email]
: Subject: How to get mapping of query terms to number of their occurrences
:     in a doc?
:
: Given a query, I want to be able to, for each query term, get the number of occurrences of the term.  I have tried what I'm including below and it does not seem to provide reliable results.  Seems to work fine with exact matching but as soon as stemming kicks in, all bets are off as to value of the number of occurrences returned.
:
: Any ideas, anyone?  Can this be written in a simpler and/or more efficient way?
: Thanks -
:
:       int totalOccurrences = 0;
:
:       reader = IndexReader.open(getDirectory(indexDirPath));
:       HashSet terms = new HashSet();
:       query.extractTerms(terms);
:
:       TermFreqVector[] tfvs = reader.getTermFreqVectors(docId);
:       if (tfvs != null) {
:
:         // For each term frequency vector (i.e. for each field)
:         for (int i = 0; i < tfvs.length; i++) {
:           String field = tfvs[i].getField();
:           String[] strTerms = tfvs[i].getTerms();
:           int[] tfs = tfvs[i].getTermFrequencies();
:
:           if (strTerms != null) {
:
:             // For each term in the query
:             for (Iterator iter = terms.iterator(); iter.hasNext();) {
:
:               Term term = (Term) iter.next();
:               // For each term in the vector
:               for (int j = 0; j < strTerms.length; j++) {
:
:                 // If found the query term among the vector terms
:                 if (field.equals(term.field()) && strTerms[j].equals(term.text())) {
:
:                   // Add the term frequency to the total
:                   totalOccurrences += tfs[j];
:
:                 }
:               }
:             }
:           }
:         }
:       }
:



-Hoss


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

RE: How to get mapping of query terms to number of their occurrences in a doc?

Dmitry Goldenberg
Chris,
 
That's what I did, for debugging.  The query is "biology", and here's what the API tells me for term frequencies:
biolog 15
biologi 31
biologist 4

I actually see 13 occurrences of "biologist" and "biologists", 64 occurrences of "biology", 27 occurrences of "biological".

I see "inform 22" but the actual count of the word "information" in the document is 33.  But "ioniz 7" is correct.

I don't see much correlation between what the API is telling me and what I see in the actual document.  Am I missing something?

Thanks

________________________________

From: Chris Hostetter [mailto:[hidden email]]
Sent: Tue 2/7/2006 4:10 PM
To: [hidden email]
Subject: Re: How to get mapping of query terms to number of their occurrences in a doc?




A cursory reading of your code looks ok ... stemming shouldn't be an issue
as long as your measure of success is comparing docs that match your
orriginal query with the counts you get out.

What i mean by that is that any stemming should have already been taken
care of when your query object was constructed (either by you manually, or
by QueryParser).  the direct equals comparisons you are dong should be
fine.

have you tried adding logging of the raw term field/text and the freq
counts you get back to see if that helps you spot the problem?


: Date: Mon, 6 Feb 2006 14:34:05 -0800
: From: Dmitry Goldenberg <[hidden email]>
: Reply-To: [hidden email]
: To: [hidden email]
: Subject: How to get mapping of query terms to number of their occurrences
:     in a doc?
:
: Given a query, I want to be able to, for each query term, get the number of occurrences of the term.  I have tried what I'm including below and it does not seem to provide reliable results.  Seems to work fine with exact matching but as soon as stemming kicks in, all bets are off as to value of the number of occurrences returned.
:
: Any ideas, anyone?  Can this be written in a simpler and/or more efficient way?
: Thanks -
:
:       int totalOccurrences = 0;
:
:       reader = IndexReader.open(getDirectory(indexDirPath));
:       HashSet terms = new HashSet();
:       query.extractTerms(terms);
:
:       TermFreqVector[] tfvs = reader.getTermFreqVectors(docId);
:       if (tfvs != null) {
:
:         // For each term frequency vector (i.e. for each field)
:         for (int i = 0; i < tfvs.length; i++) {
:           String field = tfvs[i].getField();
:           String[] strTerms = tfvs[i].getTerms();
:           int[] tfs = tfvs[i].getTermFrequencies();
:
:           if (strTerms != null) {
:
:             // For each term in the query
:             for (Iterator iter = terms.iterator(); iter.hasNext();) {
:
:               Term term = (Term) iter.next();
:               // For each term in the vector
:               for (int j = 0; j < strTerms.length; j++) {
:
:                 // If found the query term among the vector terms
:                 if (field.equals(term.field()) && strTerms[j].equals(term.text())) {
:
:                   // Add the term frequency to the total
:                   totalOccurrences += tfs[j];
:
:                 }
:               }
:             }
:           }
:         }
:       }
:



-Hoss


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]





---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]
Reply | Threaded
Open this post in threaded view
|

RE: How to get mapping of query terms to number of their occurrences in a doc?

Chris Hostetter-3

: That's what I did, for debugging.  The query is "biology", and here's
: what the API tells me for term frequencies:
: biolog 15
: biologi 31
: biologist 4
:
: I actually see 13 occurrences of "biologist" and "biologists", 64
: occurrences of "biology", 27 occurrences of "biological".
:
: I see "inform 22" but the actual count of the word "information" in the
: document is 33.  But "ioniz 7" is correct.

I think I missunderstood what you ment when you said the counts don't
match up.  Are you comparing the number you get from that code with the
number of times you personally see the word in the source document before
it has been analyzed?

If so, then there could be a couple of things going on ... i would start
by using a tool like Luke to see the actual lists of Terms for each doc --
there may be something else your analyzer is doing that you don't realize.

It's also possible that you are hitting the maxFieldLength in the
IndexWriter ... when that happens IndexWriter throws away any remaining
tokens, so if your documenst are really large.

Lastly, I would add a *lot* more debugging to your code.  Print out the
contents of "terms", when you loop over "tfvs" print out the field and the
full list of strTerms, in the inner most loop when you incriment the
count, print out the field/text/and count.

that's the best advise i have for spotting what's wrong.



: ________________________________
:
: From: Chris Hostetter [mailto:[hidden email]]
: Sent: Tue 2/7/2006 4:10 PM
: To: [hidden email]
: Subject: Re: How to get mapping of query terms to number of their occurrences in a doc?
:
:
:
:
: A cursory reading of your code looks ok ... stemming shouldn't be an issue
: as long as your measure of success is comparing docs that match your
: orriginal query with the counts you get out.
:
: What i mean by that is that any stemming should have already been taken
: care of when your query object was constructed (either by you manually, or
: by QueryParser).  the direct equals comparisons you are dong should be
: fine.
:
: have you tried adding logging of the raw term field/text and the freq
: counts you get back to see if that helps you spot the problem?
:
:
: : Date: Mon, 6 Feb 2006 14:34:05 -0800
: : From: Dmitry Goldenberg <[hidden email]>
: : Reply-To: [hidden email]
: : To: [hidden email]
: : Subject: How to get mapping of query terms to number of their occurrences
: :     in a doc?
: :
: : Given a query, I want to be able to, for each query term, get the number of occurrences of the term.  I have tried what I'm including below and it does not seem to provide reliable results.  Seems to work fine with exact matching but as soon as stemming kicks in, all bets are off as to value of the number of occurrences returned.
: :
: : Any ideas, anyone?  Can this be written in a simpler and/or more efficient way?
: : Thanks -
: :
: :       int totalOccurrences = 0;
: :
: :       reader = IndexReader.open(getDirectory(indexDirPath));
: :       HashSet terms = new HashSet();
: :       query.extractTerms(terms);
: :
: :       TermFreqVector[] tfvs = reader.getTermFreqVectors(docId);
: :       if (tfvs != null) {
: :
: :         // For each term frequency vector (i.e. for each field)
: :         for (int i = 0; i < tfvs.length; i++) {
: :           String field = tfvs[i].getField();
: :           String[] strTerms = tfvs[i].getTerms();
: :           int[] tfs = tfvs[i].getTermFrequencies();
: :
: :           if (strTerms != null) {
: :
: :             // For each term in the query
: :             for (Iterator iter = terms.iterator(); iter.hasNext();) {
: :
: :               Term term = (Term) iter.next();
: :               // For each term in the vector
: :               for (int j = 0; j < strTerms.length; j++) {
: :
: :                 // If found the query term among the vector terms
: :                 if (field.equals(term.field()) && strTerms[j].equals(term.text())) {
: :
: :                   // Add the term frequency to the total
: :                   totalOccurrences += tfs[j];
: :
: :                 }
: :               }
: :             }
: :           }
: :         }
: :       }
: :
:
:
:
: -Hoss
:
:
: ---------------------------------------------------------------------
: To unsubscribe, e-mail: [hidden email]
: For additional commands, e-mail: [hidden email]
:
:
:
:
:



-Hoss


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

RE: How to get mapping of query terms to number of their occurrences in a doc?

Dmitry Goldenberg
Duh! Bingo! Mistery solved. I should have thought of this :)
The discrepancies come in with larger documents, definitely > 10K terms which is Lucene's default maxFieldLength.
 
Thanks for your help, Chris
 
- Dmitry

________________________________

From: Chris Hostetter [mailto:[hidden email]]
Sent: Wed 2/8/2006 10:04 AM
To: [hidden email]
Subject: RE: How to get mapping of query terms to number of their occurrences in a doc?




: That's what I did, for debugging.  The query is "biology", and here's
: what the API tells me for term frequencies:
: biolog 15
: biologi 31
: biologist 4
:
: I actually see 13 occurrences of "biologist" and "biologists", 64
: occurrences of "biology", 27 occurrences of "biological".
:
: I see "inform 22" but the actual count of the word "information" in the
: document is 33.  But "ioniz 7" is correct.

I think I missunderstood what you ment when you said the counts don't
match up.  Are you comparing the number you get from that code with the
number of times you personally see the word in the source document before
it has been analyzed?

If so, then there could be a couple of things going on ... i would start
by using a tool like Luke to see the actual lists of Terms for each doc --
there may be something else your analyzer is doing that you don't realize.

It's also possible that you are hitting the maxFieldLength in the
IndexWriter ... when that happens IndexWriter throws away any remaining
tokens, so if your documenst are really large.

Lastly, I would add a *lot* more debugging to your code.  Print out the
contents of "terms", when you loop over "tfvs" print out the field and the
full list of strTerms, in the inner most loop when you incriment the
count, print out the field/text/and count.

that's the best advise i have for spotting what's wrong.



: ________________________________
:
: From: Chris Hostetter [mailto:[hidden email]]
: Sent: Tue 2/7/2006 4:10 PM
: To: [hidden email]
: Subject: Re: How to get mapping of query terms to number of their occurrences in a doc?
:
:
:
:
: A cursory reading of your code looks ok ... stemming shouldn't be an issue
: as long as your measure of success is comparing docs that match your
: orriginal query with the counts you get out.
:
: What i mean by that is that any stemming should have already been taken
: care of when your query object was constructed (either by you manually, or
: by QueryParser).  the direct equals comparisons you are dong should be
: fine.
:
: have you tried adding logging of the raw term field/text and the freq
: counts you get back to see if that helps you spot the problem?
:
:
: : Date: Mon, 6 Feb 2006 14:34:05 -0800
: : From: Dmitry Goldenberg <[hidden email]>
: : Reply-To: [hidden email]
: : To: [hidden email]
: : Subject: How to get mapping of query terms to number of their occurrences
: :     in a doc?
: :
: : Given a query, I want to be able to, for each query term, get the number of occurrences of the term.  I have tried what I'm including below and it does not seem to provide reliable results.  Seems to work fine with exact matching but as soon as stemming kicks in, all bets are off as to value of the number of occurrences returned.
: :
: : Any ideas, anyone?  Can this be written in a simpler and/or more efficient way?
: : Thanks -
: :
: :       int totalOccurrences = 0;
: :
: :       reader = IndexReader.open(getDirectory(indexDirPath));
: :       HashSet terms = new HashSet();
: :       query.extractTerms(terms);
: :
: :       TermFreqVector[] tfvs = reader.getTermFreqVectors(docId);
: :       if (tfvs != null) {
: :
: :         // For each term frequency vector (i.e. for each field)
: :         for (int i = 0; i < tfvs.length; i++) {
: :           String field = tfvs[i].getField();
: :           String[] strTerms = tfvs[i].getTerms();
: :           int[] tfs = tfvs[i].getTermFrequencies();
: :
: :           if (strTerms != null) {
: :
: :             // For each term in the query
: :             for (Iterator iter = terms.iterator(); iter.hasNext();) {
: :
: :               Term term = (Term) iter.next();
: :               // For each term in the vector
: :               for (int j = 0; j < strTerms.length; j++) {
: :
: :                 // If found the query term among the vector terms
: :                 if (field.equals(term.field()) && strTerms[j].equals(term.text())) {
: :
: :                   // Add the term frequency to the total
: :                   totalOccurrences += tfs[j];
: :
: :                 }
: :               }
: :             }
: :           }
: :         }
: :       }
: :
:
:
:
: -Hoss
:
:
: ---------------------------------------------------------------------
: To unsubscribe, e-mail: [hidden email]
: For additional commands, e-mail: [hidden email]
:
:
:
:
:



-Hoss


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]





---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]
Reply | Threaded
Open this post in threaded view
|

Re: How to get mapping of query terms to number of their occurrences in a doc?

Erik Hatcher
This is a real gotcha with Lucene in it's out of the box  
configuration.  In the several applications I've built to index  
documents I've always hit this and had to set the maxFieldLength to  
its maximum possible value.  Is there still an argument to be made to  
keep the default at 10K or would it be reasonable to bump this up  
even if there are, the few, cases where setting it lower is  
desirable?   We made the compound file index be the default to  
prevent file handle limitations at the expense of some (generally  
irrelevant) performance, so maybe we could also make this more common  
setting the default also?

        Erik

On Feb 8, 2006, at 2:17 PM, Dmitry Goldenberg wrote:

> Duh! Bingo! Mistery solved. I should have thought of this :)
> The discrepancies come in with larger documents, definitely > 10K  
> terms which is Lucene's default maxFieldLength.
>
> Thanks for your help, Chris
>
> - Dmitry
>
> ________________________________
>
> From: Chris Hostetter [mailto:[hidden email]]
> Sent: Wed 2/8/2006 10:04 AM
> To: [hidden email]
> Subject: RE: How to get mapping of query terms to number of their  
> occurrences in a doc?
>
>
>
>
> : That's what I did, for debugging.  The query is "biology", and  
> here's
> : what the API tells me for term frequencies:
> : biolog 15
> : biologi 31
> : biologist 4
> :
> : I actually see 13 occurrences of "biologist" and "biologists", 64
> : occurrences of "biology", 27 occurrences of "biological".
> :
> : I see "inform 22" but the actual count of the word "information"  
> in the
> : document is 33.  But "ioniz 7" is correct.
>
> I think I missunderstood what you ment when you said the counts don't
> match up.  Are you comparing the number you get from that code with  
> the
> number of times you personally see the word in the source document  
> before
> it has been analyzed?
>
> If so, then there could be a couple of things going on ... i would  
> start
> by using a tool like Luke to see the actual lists of Terms for each  
> doc --
> there may be something else your analyzer is doing that you don't  
> realize.
>
> It's also possible that you are hitting the maxFieldLength in the
> IndexWriter ... when that happens IndexWriter throws away any  
> remaining
> tokens, so if your documenst are really large.
>
> Lastly, I would add a *lot* more debugging to your code.  Print out  
> the
> contents of "terms", when you loop over "tfvs" print out the field  
> and the
> full list of strTerms, in the inner most loop when you incriment the
> count, print out the field/text/and count.
>
> that's the best advise i have for spotting what's wrong.
>
>
>
> : ________________________________
> :
> : From: Chris Hostetter [mailto:[hidden email]]
> : Sent: Tue 2/7/2006 4:10 PM
> : To: [hidden email]
> : Subject: Re: How to get mapping of query terms to number of their  
> occurrences in a doc?
> :
> :
> :
> :
> : A cursory reading of your code looks ok ... stemming shouldn't be  
> an issue
> : as long as your measure of success is comparing docs that match your
> : orriginal query with the counts you get out.
> :
> : What i mean by that is that any stemming should have already been  
> taken
> : care of when your query object was constructed (either by you  
> manually, or
> : by QueryParser).  the direct equals comparisons you are dong  
> should be
> : fine.
> :
> : have you tried adding logging of the raw term field/text and the  
> freq
> : counts you get back to see if that helps you spot the problem?
> :
> :
> : : Date: Mon, 6 Feb 2006 14:34:05 -0800
> : : From: Dmitry Goldenberg <[hidden email]>
> : : Reply-To: [hidden email]
> : : To: [hidden email]
> : : Subject: How to get mapping of query terms to number of their  
> occurrences
> : :     in a doc?
> : :
> : : Given a query, I want to be able to, for each query term, get  
> the number of occurrences of the term.  I have tried what I'm  
> including below and it does not seem to provide reliable results.  
> Seems to work fine with exact matching but as soon as stemming  
> kicks in, all bets are off as to value of the number of occurrences  
> returned.
> : :
> : : Any ideas, anyone?  Can this be written in a simpler and/or  
> more efficient way?
> : : Thanks -
> : :
> : :       int totalOccurrences = 0;
> : :
> : :       reader = IndexReader.open(getDirectory(indexDirPath));
> : :       HashSet terms = new HashSet();
> : :       query.extractTerms(terms);
> : :
> : :       TermFreqVector[] tfvs = reader.getTermFreqVectors(docId);
> : :       if (tfvs != null) {
> : :
> : :         // For each term frequency vector (i.e. for each field)
> : :         for (int i = 0; i < tfvs.length; i++) {
> : :           String field = tfvs[i].getField();
> : :           String[] strTerms = tfvs[i].getTerms();
> : :           int[] tfs = tfvs[i].getTermFrequencies();
> : :
> : :           if (strTerms != null) {
> : :
> : :             // For each term in the query
> : :             for (Iterator iter = terms.iterator(); iter.hasNext
> ();) {
> : :
> : :               Term term = (Term) iter.next();
> : :               // For each term in the vector
> : :               for (int j = 0; j < strTerms.length; j++) {
> : :
> : :                 // If found the query term among the vector terms
> : :                 if (field.equals(term.field()) && strTerms
> [j].equals(term.text())) {
> : :
> : :                   // Add the term frequency to the total
> : :                   totalOccurrences += tfs[j];
> : :
> : :                 }
> : :               }
> : :             }
> : :           }
> : :         }
> : :       }
> : :
> :
> :
> :
> : -Hoss
> :
> :
> :  
> ---------------------------------------------------------------------
> : To unsubscribe, e-mail: [hidden email]
> : For additional commands, e-mail: [hidden email]
> :
> :
> :
> :
> :
>
>
>
> -Hoss
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]
>
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: How to get mapping of query terms to number of their occurrences in a doc?

Otis Gospodnetic-2
In reply to this post by Dmitry Goldenberg
There is a HighFreqTerms class in contib/misc..... that may be interesting to you.  I just modified it slightly locally last night to limit things to a specific field, and will commit it later.

Otis

----- Original Message ----
From: Dmitry Goldenberg <[hidden email]>
To: [hidden email]
Sent: Mon 06 Feb 2006 05:34:05 PM EST
Subject: How to get mapping of query terms to number of their occurrences in a doc?

Given a query, I want to be able to, for each query term, get the number of occurrences of the term.  I have tried what I'm including below and it does not seem to provide reliable results.  Seems to work fine with exact matching but as soon as stemming kicks in, all bets are off as to value of the number of occurrences returned.
 
Any ideas, anyone?  Can this be written in a simpler and/or more efficient way?
Thanks -
 
      int totalOccurrences = 0;
 
      reader = IndexReader.open(getDirectory(indexDirPath));
      HashSet terms = new HashSet();
      query.extractTerms(terms);
 
      TermFreqVector[] tfvs = reader.getTermFreqVectors(docId);
      if (tfvs != null) {

        // For each term frequency vector (i.e. for each field)
        for (int i = 0; i < tfvs.length; i++) {
          String field = tfvs[i].getField();
          String[] strTerms = tfvs[i].getTerms();
          int[] tfs = tfvs[i].getTermFrequencies();
 
          if (strTerms != null) {

            // For each term in the query
            for (Iterator iter = terms.iterator(); iter.hasNext();) {

              Term term = (Term) iter.next();
              // For each term in the vector
              for (int j = 0; j < strTerms.length; j++) {

                // If found the query term among the vector terms
                if (field.equals(term.field()) && strTerms[j].equals(term.text())) {

                  // Add the term frequency to the total
                  totalOccurrences += tfs[j];

                }
              }
            }
          }
        }
      }




---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

RE: How to get mapping of query terms to number of their occurrences in a doc?

Dmitry Goldenberg
In reply to this post by Erik Hatcher
It seems, from the javadoc, that the 10K default is enforced to avoid a possible OutOfMemoryError.  I wonder how safe/unsafe it is to set the value to maximum possible, if we don't impose any limit on customers' document sizes.  Perhaps, the best solution is to expose the value as configurable by the customer, rather than impose the 10K limit or run the risk of an out-of-memory condition...

________________________________

From: Erik Hatcher [mailto:[hidden email]]
Sent: Thu 2/9/2006 1:10 AM
To: [hidden email]
Subject: Re: How to get mapping of query terms to number of their occurrences in a doc?



This is a real gotcha with Lucene in it's out of the box
configuration.  In the several applications I've built to index
documents I've always hit this and had to set the maxFieldLength to
its maximum possible value.  Is there still an argument to be made to
keep the default at 10K or would it be reasonable to bump this up
even if there are, the few, cases where setting it lower is
desirable?   We made the compound file index be the default to
prevent file handle limitations at the expense of some (generally
irrelevant) performance, so maybe we could also make this more common
setting the default also?

        Erik

On Feb 8, 2006, at 2:17 PM, Dmitry Goldenberg wrote:

> Duh! Bingo! Mistery solved. I should have thought of this :)
> The discrepancies come in with larger documents, definitely > 10K
> terms which is Lucene's default maxFieldLength.
>
> Thanks for your help, Chris
>
> - Dmitry
>
> ________________________________
>
> From: Chris Hostetter [mailto:[hidden email]]
> Sent: Wed 2/8/2006 10:04 AM
> To: [hidden email]
> Subject: RE: How to get mapping of query terms to number of their
> occurrences in a doc?
>
>
>
>
> : That's what I did, for debugging.  The query is "biology", and
> here's
> : what the API tells me for term frequencies:
> : biolog 15
> : biologi 31
> : biologist 4
> :
> : I actually see 13 occurrences of "biologist" and "biologists", 64
> : occurrences of "biology", 27 occurrences of "biological".
> :
> : I see "inform 22" but the actual count of the word "information"
> in the
> : document is 33.  But "ioniz 7" is correct.
>
> I think I missunderstood what you ment when you said the counts don't
> match up.  Are you comparing the number you get from that code with
> the
> number of times you personally see the word in the source document
> before
> it has been analyzed?
>
> If so, then there could be a couple of things going on ... i would
> start
> by using a tool like Luke to see the actual lists of Terms for each
> doc --
> there may be something else your analyzer is doing that you don't
> realize.
>
> It's also possible that you are hitting the maxFieldLength in the
> IndexWriter ... when that happens IndexWriter throws away any
> remaining
> tokens, so if your documenst are really large.
>
> Lastly, I would add a *lot* more debugging to your code.  Print out
> the
> contents of "terms", when you loop over "tfvs" print out the field
> and the
> full list of strTerms, in the inner most loop when you incriment the
> count, print out the field/text/and count.
>
> that's the best advise i have for spotting what's wrong.
>
>
>
> : ________________________________
> :
> : From: Chris Hostetter [mailto:[hidden email]]
> : Sent: Tue 2/7/2006 4:10 PM
> : To: [hidden email]
> : Subject: Re: How to get mapping of query terms to number of their
> occurrences in a doc?
> :
> :
> :
> :
> : A cursory reading of your code looks ok ... stemming shouldn't be
> an issue
> : as long as your measure of success is comparing docs that match your
> : orriginal query with the counts you get out.
> :
> : What i mean by that is that any stemming should have already been
> taken
> : care of when your query object was constructed (either by you
> manually, or
> : by QueryParser).  the direct equals comparisons you are dong
> should be
> : fine.
> :
> : have you tried adding logging of the raw term field/text and the
> freq
> : counts you get back to see if that helps you spot the problem?
> :
> :
> : : Date: Mon, 6 Feb 2006 14:34:05 -0800
> : : From: Dmitry Goldenberg <[hidden email]>
> : : Reply-To: [hidden email]
> : : To: [hidden email]
> : : Subject: How to get mapping of query terms to number of their
> occurrences
> : :     in a doc?
> : :
> : : Given a query, I want to be able to, for each query term, get
> the number of occurrences of the term.  I have tried what I'm
> including below and it does not seem to provide reliable results.  
> Seems to work fine with exact matching but as soon as stemming
> kicks in, all bets are off as to value of the number of occurrences
> returned.
> : :
> : : Any ideas, anyone?  Can this be written in a simpler and/or
> more efficient way?
> : : Thanks -
> : :
> : :       int totalOccurrences = 0;
> : :
> : :       reader = IndexReader.open(getDirectory(indexDirPath));
> : :       HashSet terms = new HashSet();
> : :       query.extractTerms(terms);
> : :
> : :       TermFreqVector[] tfvs = reader.getTermFreqVectors(docId);
> : :       if (tfvs != null) {
> : :
> : :         // For each term frequency vector (i.e. for each field)
> : :         for (int i = 0; i < tfvs.length; i++) {
> : :           String field = tfvs[i].getField();
> : :           String[] strTerms = tfvs[i].getTerms();
> : :           int[] tfs = tfvs[i].getTermFrequencies();
> : :
> : :           if (strTerms != null) {
> : :
> : :             // For each term in the query
> : :             for (Iterator iter = terms.iterator(); iter.hasNext
> ();) {
> : :
> : :               Term term = (Term) iter.next();
> : :               // For each term in the vector
> : :               for (int j = 0; j < strTerms.length; j++) {
> : :
> : :                 // If found the query term among the vector terms
> : :                 if (field.equals(term.field()) && strTerms
> [j].equals(term.text())) {
> : :
> : :                   // Add the term frequency to the total
> : :                   totalOccurrences += tfs[j];
> : :
> : :                 }
> : :               }
> : :             }
> : :           }
> : :         }
> : :       }
> : :
> :
> :
> :
> : -Hoss
> :
> :
> :
> ---------------------------------------------------------------------
> : To unsubscribe, e-mail: [hidden email]
> : For additional commands, e-mail: [hidden email]
> :
> :
> :
> :
> :
>
>
>
> -Hoss
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]
>
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]





---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]