TooManyClauses in BooleanQuery

TooManyClauses in BooleanQuery

Harald Stowasser
Hello lucene-list readers,

First, I want to introduce myself a little, because I am new to this list.

I am a programmer at a publishing company, 32 years old, and you can
find my picture at http://www.idowa.de/service/kontakt.
We publish several local newspapers and a website (http://www.idowa.de)
with the main focus on regional content.

We use Lucene to build an index over the entire newspaper and website
content, so there is more than 2 GB of text to index.

Now to the problems in my implementation [1]:

1. Sorting by date is ruinously slow, so I deactivated it.
2. Because sorting is so slow, I want to let the user specify a
date range. But Lucene throws a BooleanQuery$TooManyClauses [2].
I read somewhere that giving Lucene a higher MaxClauseCount would
solve that problem, but it doesn't work :-(
3. I also read that we should store the date as a YYYYMMDD string. I don't
like this solution, because I don't know whether it will work, and then I
would have to reindex all the data!

Could you give me a little hint on how I can solve my date problems?



[1]
Implementation:

  BooleanQuery query= new BooleanQuery();
  query.setMaxClauseCount(262144);
  Query q1= QueryParser.parse(query,"content",analyzer);
  query.add(q1,true,false);
  if(area.length()>2)
  {
    Query q2=new TermQuery( new Term("bereich",area) );
    query.add(q2,true,false);
  }
  try {
    DateFormat df = DateFormat.getDateInstance(
       DateFormat.DATE_FIELD, Locale.GERMAN);
    df.setLenient(true);
    Date d1 = df.parse(date_from);
    Date d2 = df.parse(date_to);
    date_from = DateField.dateToString(d1);
    date_to = DateField.dateToString(d2);
  }   catch (Exception e) { }
  Query q3=new RangeQuery( new Term("datum",date_from),
                           new Term("datum",date_to),true );
  query.add(q3,true,false);
  /*Sort csort= new Sort();
  if (sort.length()>2)
  {
     csort.setSort(sort,reverse);
  }*/
  Hits hits = searcher.search(query);
  //Hits hits = searcher.search(query,csort);
  makeOutput(hits, start, length);
  Date ende= new Date();
  long zeit=(ende.getTime()-anfang.getTime())/100 ;
  ausgabe.append("|" + (float)zeit/10);



  private void makeOutput(Hits hits,int start,int length)
    throws Exception
  {
    int i=start;
    if (hits.length()>0)
    {
      ausgabe.append("<table>");
      for (;(i<hits.length() && (i<start+length));i++)
      {
        Document doc=hits.doc(i);
        ausgabe.append("<tr><td>");
        ausgabe.append(doc.getField("bereich").stringValue());
        ausgabe.append("</td><td>");
        DateFormat df = DateFormat.getDateInstance(
          DateFormat.DATE_FIELD, Locale.GERMAN);
        df.setLenient(true);
        ausgabe.append(df.format(
          DateField.stringToDate(doc.getField("datum").stringValue())));
        ausgabe.append("</td><td>");
        ausgabe.append("<a href=\""+doc.getField("link").stringValue());
        ausgabe.append(doc.getField("content_id").stringValue()+ "\">");
        ausgabe.append(doc.getField("content_vorschau").stringValue());
        ausgabe.append("</a>");
        ausgabe.append("</td></tr>");
      }
      ausgabe.append("</table>");
    }
    ausgabe.append("|X|" + hits.length() + "|" + start + "|" + i);
  }

__________________________________________________

[2]
StackTrace:

org.apache.lucene.search.BooleanQuery$TooManyClauses
        at org.apache.lucene.search.BooleanQuery.add(BooleanQuery.java:79)
        at org.apache.lucene.search.BooleanQuery.add(BooleanQuery.java:71)
        at org.apache.lucene.search.RangeQuery.rewrite(RangeQuery.java:99)
        at org.apache.lucene.search.BooleanQuery.rewrite(BooleanQuery.java:243)
        at org.apache.lucene.search.IndexSearcher.rewrite(IndexSearcher.java:166)
        at org.apache.lucene.search.Query.weight(Query.java:84)
        at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:117)
        at org.apache.lucene.search.Hits.getMoreDocs(Hits.java:64)
        at org.apache.lucene.search.Hits.<init>(Hits.java:51)
        at org.apache.lucene.search.Searcher.search(Searcher.java:41)
        at suchmaschine.LuceneSearcher.erweitert(LuceneSearcher.java:138)
        at suchmaschine.XmlRpcSearcher.erweitert(XmlRpcSearcher.java:49)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:585)
        at org.apache.xmlrpc.Invoker.execute(Invoker.java:168)
        at org.apache.xmlrpc.XmlRpcWorker.invokeHandler(XmlRpcWorker.java:123)
        at org.apache.xmlrpc.XmlRpcWorker.execute(XmlRpcWorker.java:185)
        at org.apache.xmlrpc.XmlRpcServer.execute(XmlRpcServer.java:151)
        at org.apache.xmlrpc.XmlRpcServer.execute(XmlRpcServer.java:139)
        at org.apache.xmlrpc.WebServer$Connection.run(WebServer.java:773)
        at org.apache.xmlrpc.WebServer$Runner.run(WebServer.java:656)
        at java.lang.Thread.run(Thread.java:595)

__________________________________________________
[3]
My Fields:
  neu.setBoost( boost  );
  neu.add(Field.UnStored("content",content));
  neu.add(Field.Keyword("keyword",keyword));
  ConfDate date = new ConfDate(datum);
  neu.add(Field.Keyword("datum",(Date)date.getUtilDate()));
  neu.add(Field.UnIndexed("content_vorschau",content_vorschau));
  neu.add(Field.UnIndexed("content_id",""+content_id));
  neu.add(Field.UnIndexed("zeitstempel",zeitstempel));
  neu.add(Field.UnIndexed("link",link));
  neu.add(Field.Keyword("bereich",bereich));
  index.addDocument(neu);




Re: TooManyClauses in BooleanQuery

a.herberger
Hi Harald,

It's nice to see that there are others out there in Germany dealing with
the same problems we have been dealing with over the past years :-)

For the "too many clauses" problem I have a solution that I
want to share:
Just include the following call somewhere at the very beginning of your
program (the retrieval part):

BooleanQuery.setMaxClauseCount(1000*1000);

We have had similar problems (it also applies to searches with left
truncation: *word) and could work around them quite well by increasing
this setting.

Regarding sorting, we implemented our own class (at that time
there was no sorting support in Lucene), but it was very
application-specific, and due to speed limitations we had to limit
sorting to about 5000 hits. I can give you more information on this
if you want.

I hope I have been of some help.
Best regards from Wiesbaden

Andreas M. Herberger
mailto: [hidden email]
http://www.makrolog.de






From: Harald Stowasser <[hidden email]>
Sent: 13.06.2005 13:47
Reply-To: [hidden email]
To: [hidden email]
Subject: TooManyClauses in BooleanQuery







Re: TooManyClauses in BooleanQuery

Harald Stowasser
[hidden email] wrote:

> Hi Harald,
>
> It's nice to see that there are others out there in Germany dealing with
> the same problems we have been dealing with over the past years :-)
>
> For the "too many clauses" problem I have a solution that I
> want to share:
> Just include the following call somewhere at the very beginning of your
> program (the retrieval part):
>
> BooleanQuery.setMaxClauseCount(1000*1000);

As you can see in the source code, I already tried this:
  query.setMaxClauseCount(262144);
It doesn't work even with higher values; it just crashed with a "not
enough memory" error :-(


Re: TooManyClauses in BooleanQuery

Erik Hatcher
In reply to this post by Harald Stowasser

On Jun 13, 2005, at 7:47 AM, Harald Stowasser wrote:
> 1. Sorting by Date is ruinously slow. So I deactivated it.

How were you sorting by date?

> 3. I also read that we should save the Date as YYYYMMDD-String. I don't
> like this solution, because I don't know that this will work. And then I
> have to reindex the whole Data!

It will work :)  Terms need to be lexicographically orderable - and  
using YYYYMMDD will do just that as long as you don't need  
granularity beyond day.  However, before reindexing with YYYYMMDD -  
what are your searching/sorting needs?  If day is the granularity,  
then YYYYMMDD will be fine.  However you may want to break it into  
more fields such as year, month, and day separately.  Note: keep  
numbers padded to the same number of characters (1 for a day field  
should be "01" for example).
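
To illustrate the ordering point, here is a minimal sketch using only the JDK. The class and method names (DayKey, toDayKey, utcDate) are made up for this example, and formatting in UTC is an assumption; use whatever time zone your publication dates are defined in.

```java
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.Date;
import java.util.GregorianCalendar;
import java.util.TimeZone;

// Hypothetical helper, not part of Lucene.
class DayKey {
    private static final TimeZone UTC = TimeZone.getTimeZone("UTC");

    /** Formats a date as a fixed-width, zero-padded yyyyMMdd string (UTC is an assumption). */
    static String toDayKey(Date date) {
        // SimpleDateFormat is not thread-safe, so a new instance is created per call.
        SimpleDateFormat format = new SimpleDateFormat("yyyyMMdd");
        format.setTimeZone(UTC);
        return format.format(date);
    }

    /** Demo helper: midnight UTC on the given calendar day (month is 1-based here). */
    static Date utcDate(int year, int month, int day) {
        Calendar calendar = new GregorianCalendar(UTC);
        calendar.clear();
        calendar.set(year, month - 1, day);
        return calendar.getTime();
    }

    public static void main(String[] args) {
        String june9 = toDayKey(utcDate(2005, 6, 9));
        String june13 = toDayKey(utcDate(2005, 6, 13));
        // Padded keys compare as strings in the same order as the dates themselves.
        System.out.println(june9 + " sorts before " + june13 + ": "
                + (june9.compareTo(june13) < 0));
        // Without padding, string order breaks: "9" sorts after "13".
        System.out.println("\"9\" sorts before \"13\": " + ("9".compareTo("13") < 0));
    }
}
```

The resulting string is what would be passed to Field.Keyword and used as the lower and upper term of a range.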

For sorting, you may find that once you've used YYYYMMDD that you can  
then sort with the field type as INT on that same field (use  
Field.Keyword for indexing).
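
As a rough illustration of why INT works here: an eight-digit YYYYMMDD value fits comfortably in a Java int, and integer order matches date order. The sketch below uses plain JDK collections rather than Lucene, and IntDateSort and newestFirst are made-up names. In Lucene itself this would presumably be a Sort built on a SortField of type INT for the date field; please check that against the API of your version.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical demo class, not part of Lucene.
class IntDateSort {
    /** Returns a copy of the yyyyMMdd keys ordered newest first, comparing them as ints. */
    static List<String> newestFirst(List<String> dayKeys) {
        List<String> sorted = new ArrayList<>(dayKeys);
        sorted.sort(Comparator.comparingInt((String key) -> Integer.parseInt(key)).reversed());
        return sorted;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("20050613", "20041231", "20050101");
        System.out.println(newestFirst(keys));
    }
}
```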

> [3]
> My Fields:
>   neu.setBoost( boost  );
>   neu.add(Field.UnStored("content",content));
>   neu.add(Field.Keyword("keyword",keyword));
>   ConfDate date = new ConfDate(datum);
>   neu.add(Field.Keyword("datum",(Date)date.getUtilDate()));
>   neu.add(Field.UnIndexed("content_vorschau",content_vorschau));
>   neu.add(Field.UnIndexed("content_id",""+content_id));
>   neu.add(Field.UnIndexed("zeitstempel",zeitstempel));
>   neu.add(Field.UnIndexed("link",link));
>   neu.add(Field.Keyword("bereich",bereich));
>   index.addDocument(neu);

What kind of granularity for dates does ConfDate.getUtilDate() return?

Using Date for Field.Keyword indexes to the millisecond granularity -  
that is very unlikely to be of use to you at that level.
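
The stack trace in the first post shows RangeQuery.rewrite adding clauses to a BooleanQuery, which suggests one clause per distinct term inside the range. Here is a small simulation, with made-up numbers, of how many distinct terms a year of documents produces at millisecond versus day granularity. It uses only the JDK; TermCount and its methods are hypothetical names, and the limit of 1024 mentioned in the comment is the default clause limit as far as I know.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Random;
import java.util.Set;
import java.util.TimeZone;
import java.util.TreeSet;

// Hypothetical simulation, not part of Lucene.
class TermCount {
    static final long DAY_MS = 24L * 60 * 60 * 1000;

    /** Counts the distinct terms the timestamps would produce at ms or day granularity. */
    static int distinctTerms(long[] timestamps, boolean dayGranularity) {
        SimpleDateFormat dayFormat = new SimpleDateFormat("yyyyMMdd");
        dayFormat.setTimeZone(TimeZone.getTimeZone("UTC"));
        Set<String> terms = new TreeSet<>();
        for (long timestamp : timestamps) {
            terms.add(dayGranularity
                    ? dayFormat.format(new Date(timestamp))
                    : Long.toString(timestamp));
        }
        return terms.size();
    }

    /** Generates pseudo-random timestamps spread over the given number of days. */
    static long[] sampleTimestamps(int count, int days, long seed) {
        Random random = new Random(seed);
        long[] result = new long[count];
        for (int i = 0; i < count; i++) {
            result[i] = (long) (random.nextDouble() * days * DAY_MS);
        }
        return result;
    }

    public static void main(String[] args) {
        // Assumed workload: 100,000 documents spread over one year.
        long[] timestamps = sampleTimestamps(100_000, 365, 42L);
        // A range over the whole year needs one clause per distinct term.
        System.out.println("millisecond terms: " + distinctTerms(timestamps, false));
        System.out.println("day terms:         " + distinctTerms(timestamps, true));
        // Compare both numbers with the clause limit (1024 by default, as far as I know).
    }
}
```

With day granularity the number of terms is bounded by the number of days in the range, regardless of how many documents there are.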

     Erik


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


Re: TooManyClauses in BooleanQuery

Harald Stowasser
In reply to this post by Harald Stowasser
Harald Stowasser wrote:

P.S.
I have now tried using DateFilter. This works, but it is very slow for longer
date ranges (30 sec.).




RE: TooManyClauses in BooleanQuery

Omar Didi
In reply to this post by Harald Stowasser
If you get an OutOfMemoryError, I believe the only thing you can do
is increase the JVM heap to a larger size.
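
For reference, the heap limit is set with the -Xmx option when starting the JVM, for example "java -Xmx512m YourSearchServer" (the 512m value and the class name are only placeholders). A small sketch to check which limit the running JVM actually got; HeapInfo is a made-up class name:

```java
// Hypothetical helper for checking the effective heap limit.
class HeapInfo {
    /** Returns the maximum heap the JVM will attempt to use, in megabytes. */
    static long maxHeapMegabytes() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println("Max heap: " + maxHeapMegabytes() + " MB");
    }
}
```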

-----Original Message-----
From: Harald Stowasser [mailto:[hidden email]]
Sent: Monday, June 13, 2005 8:28 AM
To: [hidden email]
Subject: Re: TooManyClauses in BooleanQuery


As you can see in the source code, I already tried this:
  query.setMaxClauseCount(262144);
It doesn't work even with higher values; it just crashed with a "not
enough memory" error :-(


Re: TooManyClauses in BooleanQuery

Erik Hatcher
In reply to this post by Harald Stowasser

On Jun 13, 2005, at 8:44 AM, Harald Stowasser wrote:

> Harald Stowasser wrote:
>
> P.S.
> I tried now to use DateFilter. This works, but is very slow on longer
> Date-Ranges. (30sec. )

Filters in general were meant for one-time creation and caching. If
the date ranges are fixed and the index is not being updated, then
DateFilters will work fine, as you only create each filter once. If
the index is updated, that's OK, as you can simply reinstantiate the
filters when that occurs.
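
A rough sketch of that create-once-and-reuse idea, written against plain JDK types so it stands alone. RangeFilterCache is a made-up class, the type parameter F stands in for whatever filter object you build, and the index version number is assumed to come from your own bookkeeping of index updates.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.BiFunction;

// Hypothetical cache, not part of Lucene.
class RangeFilterCache<F> {
    private final Map<String, F> cache = new HashMap<>();
    private final BiFunction<String, String, F> builder;
    private long indexVersion;

    RangeFilterCache(BiFunction<String, String, F> builder, long indexVersion) {
        this.builder = builder;
        this.indexVersion = indexVersion;
    }

    /** Returns the cached filter for the range, building it only on first use. */
    synchronized F get(String from, String to, long currentIndexVersion) {
        if (currentIndexVersion != indexVersion) {
            // The index changed, so all cached filters are stale.
            cache.clear();
            indexVersion = currentIndexVersion;
        }
        return cache.computeIfAbsent(from + "-" + to, key -> builder.apply(from, to));
    }

    public static void main(String[] args) {
        int[] builds = {0};
        RangeFilterCache<String> filters = new RangeFilterCache<String>(
                (from, to) -> { builds[0]++; return from + ".." + to; }, 1L);
        filters.get("20050101", "20050131", 1L); // built
        filters.get("20050101", "20050131", 1L); // served from the cache
        filters.get("20050101", "20050131", 2L); // index changed, rebuilt
        System.out.println("filters built: " + builds[0]);
    }
}
```

This only pays off if users pick from a limited set of ranges (last week, last month, a given year); arbitrary ranges would each miss the cache once.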

My recommendation is for you to consider using YYYYMMDD format for  
your dates to begin with, but I'd like to see more about the range of  
dates that you're indexing and what kind of ranges you need to  
accommodate for searching.

     Erik


