Searching for instances within a document

jnance
Hi,

I am indexing lots of text files and need to see how many times a certain word comes up in each text file. Right now I have this method for "search":

    static void search(Searcher searcher, String queryString) throws ParseException, IOException {
        QueryParser parser = new QueryParser("content", new StandardAnalyzer());
        Query query = parser.parse(queryString);
        Hits hits = searcher.search(query);

        // hits.length() is the number of matching documents, not the number of occurrences
        int hitCount = hits.length();
        System.out.println(hitCount + " documents contain the word \"" + queryString + ".\"");
    }

This tells me how many documents contain the word I'm looking for... but how do I get it to tell me how many times the word occurs within each document?

Thanks,

James

Re: Searching for instances within a document

Erick Erickson
I know this has been discussed before, so if you search the
archive you might find an answer more quickly. I don't
remember what the resolution was, so I can't help there.

Best
Erick

On Wed, Jul 9, 2008 at 9:49 AM, jnance wrote:

> --
> View this message in context:
> http://www.nabble.com/Searching-for-instances-within-a-document-tp18362075p18362075.html
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]

Re: Searching for instances within a document

jnance
Ok, I'll see if I can find anything.

Thanks,

James

Re: Searching for instances within a document

Karl Wettin
In reply to this post by jnance
Maybe you are looking for the document TermFreqVector?


        karl


Re: Searching for instances within a document

Ajay Lakhani
Hi James,

Try this:

    // Lucene 2.x API. Needs java.util.* plus org.apache.lucene.document,
    // .index, .queryParser, .search and .analysis.standard.
    Searcher searcher = new IndexSearcher(dir);
    QueryParser parser = new QueryParser("content", new StandardAnalyzer());
    Query query = parser.parse(queryString);

    // collect the terms that make up the query
    HashSet queryTerms = new HashSet();
    query.extractTerms(queryTerms);

    Hits hits = searcher.search(query);

    IndexReader reader = IndexReader.open(dir);

    for (int i = 0; i < hits.length(); i++) {
      Document d = hits.doc(i);
      // assumes stored fields named "id" and "title"; get() returns null if one is missing
      System.out.println("id is " + d.get("id"));
      System.out.println("title is " + d.get("title"));

      // null unless "content" was indexed with term vectors (e.g. Field.TermVector.YES)
      TermFreqVector tfv = reader.getTermFreqVector(hits.id(i), "content");
      if (tfv == null) {
        continue;
      }
      String[] terms = tfv.getTerms();
      int[] freqs = tfv.getTermFrequencies(); // freqs[j] belongs to terms[j]

      // for each term in the query
      for (Iterator iter = queryTerms.iterator(); iter.hasNext();) {
        Term term = (Term) iter.next();

        // for each term in the vector
        for (int j = 0; j < terms.length; j++) {
          if (terms[j].equals(term.text())) {
            System.out.println("frequency of term [" + term.text() + "] is " + freqs[j]);
          }
        }
      }
    }

    reader.close();
    searcher.close();

Let me know if this helps.
Cheers
AJ

2008/7/10 Karl Wettin:

> Maybe you are looking for the document TermFreqVector?
>
>
>       karl

Re: Searching for instances within a document

jnance
Yes, the term frequency vector is exactly what I needed. Thanks!

-James


Re: Searching for instances within a document

jnance
The TermFreqVector works perfectly for normal query strings. But if I add a wildcard (*) onto words to search for different forms of the word, I get an ArrayIndexOutOfBoundsException because the index is -1. Why does this happen? And is there any way to avoid it?

Thanks,

James



Re: Searching for instances within a document

Karl Wettin

On 11 Jul 2008 at 15:28, jnance wrote:
>
> The TermFreqVector works perfectly for normal query strings. But if I
> add a wildcard (*) onto words to search for different forms of the word, I
> get an ArrayIndexOutOfBoundsException because the index is -1. Why does
> this happen? And is there any way to avoid it?

It is because

> public interface TermFreqVector {
>
>   /** Return an index in the term numbers array returned from
>    *  <code>getTerms</code> at which the term with the specified
>    *  <code>term</code> appears. If this term does not appear in the array,
>    *  return -1.
>    */
>   public int indexOf(String term);

accepts a term text value, and you are trying to feed it an unparsed query
string. The response is -1 because no term in the document has the
term text value 'foo*'. Using that -1 as an array index is what throws the
ArrayIndexOutOfBoundsException.
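To illustrate the contract quoted above, here is a minimal sketch that models a term vector as two parallel arrays, the way getTerms() and getTermFrequencies() return them. It does not use Lucene itself: the class TermLookup and both of its methods are made-up names for this example, and the linear scan is only a stand-in for whatever the real indexOf does internally.

```java
// Hypothetical stand-in for a term vector: terms[i] occurs freqs[i] times.
final class TermLookup {

    // Same contract as the quoted Javadoc: position of the term, or -1.
    static int indexOf(String[] terms, String term) {
        for (int i = 0; i < terms.length; i++) {
            if (terms[i].equals(term)) {
                return i;
            }
        }
        return -1;
    }

    // Check for -1 before indexing into freqs; an absent term counts as zero.
    static int frequencyOf(String[] terms, int[] freqs, String term) {
        int idx = indexOf(terms, term);
        return idx < 0 ? 0 : freqs[idx];
    }

    public static void main(String[] args) {
        String[] terms = {"bar", "foo", "food"};
        int[] freqs = {1, 3, 2};
        System.out.println(frequencyOf(terms, freqs, "foo"));  // exact term text
        System.out.println(frequencyOf(terms, freqs, "foo*")); // raw query string, not a term
    }
}
```

With this sample data the first call finds "foo" and the second finds nothing, so guarding on a negative index turns the exception into a frequency of zero.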

One way to solve your problem is to enumerate the terms of the vector
and check whether each one startsWith("foo").
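A rough sketch of that prefix approach, again using plain parallel arrays in place of a real TermFreqVector so that it runs without Lucene. PrefixFrequency and countWithPrefix are hypothetical names, and the caller is assumed to have already stripped the trailing "*" from the query string.

```java
// Hypothetical helper: terms[i] occurs freqs[i] times in one document.
final class PrefixFrequency {

    // Adds up the frequencies of all terms that begin with the given prefix.
    static int countWithPrefix(String[] terms, int[] freqs, String prefix) {
        int total = 0;
        for (int i = 0; i < terms.length; i++) {
            if (terms[i].startsWith(prefix)) {
                total += freqs[i];
            }
        }
        return total;
    }

    public static void main(String[] args) {
        String[] terms = {"bar", "foo", "food", "fool"};
        int[] freqs = {1, 3, 2, 4};
        // "foo", "food" and "fool" match: 3 + 2 + 4 = 9
        System.out.println(countWithPrefix(terms, freqs, "foo"));
    }
}
```

Against a real index, the two arrays would come from tfv.getTerms() and tfv.getTermFrequencies() for each hit.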


You should probably explain what you are trying to achieve by doing this.


     karl
