Catching BooleanQuery.TooManyClauses


Catching BooleanQuery.TooManyClauses

bb-6
Hi Lucene Users,

I would like to catch BooleanQuery.TooManyClauses exception for certain
wildcard searches and display a 'subset' of results.  I have used the
WildcardTermEnum to give me the first X documents matching the wildcard
query.  Below is the code I use to implement the solution.  

Setting performance concerns aside, is this the best solution?
Or should I just tell the user to refine their query?

Thanks

Ben

===== QueryParserTest.java ================================================
...
public class QueryParserTest extends LuceneTestCase {
        ...
        private static int MAX_HITS = 10;

        public void testCatchTooManyClauses() throws Exception {
                reader = IndexReader.open(directory);
                String queryStr = "9*";
                String field = "PART_NBR";
                Hits hits = null;
                Vector docList;
                try {
                        System.out.println("query: " + queryStr);
                        System.out.println("field: " + field);
                        hits = searcher.search(parser.parse(field + ":" + queryStr));
                        docList = new Vector(hits.length());
                        Iterator docListIt = hits.iterator();
                        while (docListIt.hasNext()) {
                                docList.add(((Hit) docListIt.next()).getDocument());
                        }
                }
                catch (BooleanQuery.TooManyClauses ex) {
                        System.out.println("caught BooleanQuery.TooManyClauses, refining query");
                        // Fall back to the first MAX_HITS terms matching the
                        // wildcard, running a separate TermQuery for each one.
                        Term term = new Term(field, queryStr);
                        WildcardTermEnum wte = new WildcardTermEnum(reader, term);
                        int cnt = 0;
                        docList = new Vector(MAX_HITS);
                        while (wte.next() && cnt++ < MAX_HITS) {
                                term = wte.term();
                                TermQuery query = new TermQuery(new Term(field, term.text()));
                                System.out.println("search for " + query.getTerm().text());
                                hits = searcher.search(query);
                                Iterator docListIt = hits.iterator();
                                while (docListIt.hasNext()) {
                                        docList.add(((Hit) docListIt.next()).getDocument());
                                }
                        }
                }
                System.out.println("found: " + docList.size());
        }
...
===== QueryParserTest.java ================================================



---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


Re: Catching BooleanQuery.TooManyClauses

Erick Erickson
With the warning that I'm not the most experienced Lucene user in the
world...

I *think* that, rather than searching for each term, it's more efficient to
just use IndexReader.termDocs... i.e.

IndexReader ir = <whatever>;
TermDocs termDocs = ir.termDocs();
WildcardTermEnum wildEnum = <whatever>;

for (Term term = null; (term = wildEnum.term()) != null; wildEnum.next()) {
      termDocs.seek(term);
      while (termDocs.next()) {
            Document doc = ir.document(termDocs.doc());
      }
}

I know that for loop looks odd, but I just peeked at the source code for the
TermEnum classes and saw why it works.
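The positioned-cursor pattern behind that loop can be sketched in plain Java with a hypothetical stand-in class (no Lucene needed): the enumerator is already sitting on its first element after construction, `current()` reads the element under the cursor (like `TermEnum.term()`), and `advance()` moves on (like `TermEnum.next()`), so the read happens in the loop condition and the advance in the update clause.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class CursorDemo {
    // Hypothetical stand-in for a TermEnum-style cursor: already
    // positioned on its first element after construction.
    static class Cursor {
        private final Iterator<String> it;
        private String current;

        Cursor(List<String> source) {
            it = source.iterator();
            current = it.hasNext() ? it.next() : null; // position on first element
        }

        String current() { return current; }            // like TermEnum.term()
        void advance() {                                // like TermEnum.next()
            current = it.hasNext() ? it.next() : null;
        }
    }

    public static void main(String[] args) {
        Cursor c = new Cursor(Arrays.asList("9a", "9b", "9c"));
        StringBuilder seen = new StringBuilder();
        // Same shape as the loop above: read in the condition, advance in
        // the update clause; current() is null once the cursor is exhausted.
        for (String term = null; (term = c.current()) != null; c.advance()) {
            seen.append(term).append(' ');
        }
        System.out.println(seen.toString().trim()); // prints: 9a 9b 9c
    }
}
```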

One warning: as the folks on the board have pointed out to me, the Hits
object is not entirely efficient when you fetch lots of docs (more than 100
has been mentioned), so you should think about TopDocs or some such.

Also, if you can avoid fetching the document (i.e. get everything you want
from the index) you'll gain efficiency. I have no clue how much you're
returning to the user, so I don't know whether that would work for you...

Hope this helps
Erick

P.S. I feel kind of odd writing things like this given that Chris, Yonik,
Erik & etc. are looking over my shoulder, but if I actually offer good
advice, maybe I can save them some time since they've certainly helped me
out. And if they make alternate suggestions, they'll be doing code reviews
for me! Cool!!!!! <G>

Re: Catching BooleanQuery.TooManyClauses

Paul Elschot
On Saturday 15 April 2006 13:44, Erick Erickson wrote:

> With the warning that I'm not the most experienced Lucene user in the
> world...
>
> I *think*, that rather than search for each term, it's more efficient to
> just use IndexReader.termDocs..... i.e.
>
> IndexReader ir = <whatever>;
> TermDocs termDocs = ir.termDocs();
> WildcardTermEnum wildEnum = <whatever>;
>
> for (Term term = null; (term = wildEnum.term()) != null; wildEnum.next()) {
>       termDocs.seek(term);

This avoids the buffer space needed for each TermDocs by using each term
separately. A BooleanQuery over all the terms will use termDocs.next() and
termDocs.doc() for all terms at the same time. It has to, because multiple
terms might match the same document, and it has to compute the query score
for each document.

>       while (termDocs.next()) {
>             Document doc = reader.document(termDocs.doc())

The methods termDocs.next() and reader.document()
go to different places in the Lucene index (see the index format),
so this will send the disk head up and down.
It's better to collect the termDocs.doc() values first, for example in a
BitSet, and then retrieve the Documents in numerical order.
By the way, this is what ConstantScoreRangeQuery does to avoid using all
terms at the same time.
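The collect-then-visit pattern Paul describes needs nothing beyond java.util.BitSet; here is a minimal sketch (the doc IDs are made-up values for illustration) showing that however the IDs arrive, nextSetBit walks them in strictly ascending order with duplicates collapsed, which keeps the later document fetches sequential on disk:

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

public class BitSetOrderDemo {
    public static void main(String[] args) {
        // Doc IDs as they might arrive from several TermDocs passes:
        // out of order, with a duplicate (a doc matching two terms).
        int[] arrivals = {42, 7, 19, 42, 3};

        BitSet bs = new BitSet(64);
        for (int doc : arrivals) {
            bs.set(doc);            // collecting is cheap: one bit per doc
        }

        // Second pass: visit the ids strictly ascending, duplicates gone.
        List<Integer> ordered = new ArrayList<Integer>();
        for (int i = bs.nextSetBit(0); i >= 0; i = bs.nextSetBit(i + 1)) {
            ordered.add(i);         // here you would call reader.document(i)
        }
        System.out.println(ordered); // prints: [3, 7, 19, 42]
    }
}
```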

>       }
> }
>
> I know that for loop looks odd, but I just peeked at the source code for the
> TermEnum classes and see why it works.
>
> One warning, as the folks on the board have pointed out to me is that the
> Hits object is not entirely efficient when you fetch lots of docs (more than
> 100 has been mentioned) and you should think about TopDocs or some such.
>
> Also, if you can avoid fetching the document (i.e. get everything you want
> from the index) you'll add efficiency. I have no clue how much you're
> returning to the user, so I don't know whether that would work for you.....

In other words, one can use the above BitSet in a Filter later on
during an IndexSearcher.search() (or in a ConstantScoreQuery),
and use Hits or TopDocs for document retrieval.
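Mechanically, a Filter of that Lucene vintage boils down to a bit set consulted during search: only documents whose bit is set are allowed into the results. The intersection step itself can be sketched with plain java.util.BitSet (the doc IDs below are invented for illustration):

```java
import java.util.BitSet;

public class FilterIntersectDemo {
    public static void main(String[] args) {
        // Bits for docs matching the truncated wildcard expansion,
        // as collected earlier -- stand-in values for illustration.
        BitSet filterBits = new BitSet();
        filterBits.set(3);
        filterBits.set(7);
        filterBits.set(19);

        // Docs the main query matched.
        BitSet queryDocs = new BitSet();
        queryDocs.set(5);
        queryDocs.set(7);
        queryDocs.set(19);
        queryDocs.set(30);

        // A filter keeps only the query hits whose filter bit is set:
        // an in-place intersection of the two sets.
        queryDocs.and(filterBits);
        System.out.println(queryDocs); // prints: {7, 19}
    }
}
```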

Regards,
Paul Elschot.



Re: Catching BooleanQuery.TooManyClauses

Erick Erickson
Cool, thanks for the clarification...

Erick

RE: Catching BooleanQuery.TooManyClauses

bb-6
In reply to this post by Paul Elschot
Thanks Erick & Paul,

I also found a great example of a custom filter in LIA (section 6.4, "Using a custom filter").

Here's my updated test case if anybody is interested...

===== QueryParserTest.java ================================================
...
public class QueryParserTest extends LuceneTestCase {
        ...
        private static int MAX_HITS = 10;

        public void testCatchTooManyClauses() throws Exception {
                System.out.println("===>testCatchTooManyClauses");
                Vector docList = null;
                try {
                        causeTooManyClauses();
                        fail("expected BooleanQuery.TooManyClauses");
                }
                catch (BooleanQuery.TooManyClauses ex) {
                        Term term = new Term(field, queryStr);
                        final BitSet bs = new BitSet(reader.maxDoc());
                        TermDocs termDocs = reader.termDocs();
                        WildcardTermEnum wte = new WildcardTermEnum(reader, term);
                        int cnt = 0;
                        docList = new Vector(MAX_HITS);
                        /*
                         * The methods termDocs.next() and reader.document() go to
                         * different places in the Lucene index, so interleaving them
                         * would send the disk head up and down.  Collect the doc ids
                         * first; see http://lucene.apache.org/java/docs/fileformats.html
                         */
                        for (term = null; (term = wte.term()) != null && cnt < MAX_HITS; wte.next()) {
                                // get doc ids from the .frq file
                                termDocs.seek(term);
                                while (termDocs.next() && cnt++ < MAX_HITS) {
                                        bs.set(termDocs.doc());
                                }
                        }
                        termDocs.close();
                        // retrieve the Documents in numerical order
                        for (int i = bs.nextSetBit(0); i >= 0; i = bs.nextSetBit(i + 1)) {
                                docList.add(reader.document(i));
                        }
                }
                System.out.println("found: " + docList.size());
                assertTrue(docList.size() == MAX_HITS);
        }
...
===== QueryParserTest.java ================================================
 


