PrefixQuery,WildcardQuery,RangeQuery and FuzzyQuery PROBLEM

MariLuz Elola
Hello,

    I have been reading about the "Too many clauses" error. As I understand it, if the maximum were set too high, the inefficiency would make the search unusable.
    I am testing Lucene's performance, and searches take too long. I have also run into OutOfMemoryError several times.
    The index currently holds about 250,000 documents, but in the future it will need to hold millions.
 
These are the kinds of queries:
 
1. Greater-than or less-than requests

A RangeQuery with Integer.MAX_VALUE as the upper bound for "greater than", or Integer.MIN_VALUE as the lower bound for "less than"

2. RangeQuery

Example:

        Field:[minValue TO maxValue]

3. WildcardQuery

Example:

    Field:value*

etc.

The problem is that PrefixQuery, WildcardQuery, RangeQuery and FuzzyQuery all expand to a series of OR'ed boolean queries.
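To make the expansion concrete, here is a minimal sketch in plain Java, not Lucene API code. A sorted set stands in for the term dictionary of one field, and the hypothetical expandPrefix helper collects every term a prefix would match; each collected term corresponds to one OR'ed clause. The class name, the helper, and the sample terms are assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

class PrefixExpansionSketch {
    // Returns every term that starts with the prefix; one clause per term.
    static List<String> expandPrefix(SortedSet<String> terms, String prefix) {
        List<String> clauses = new ArrayList<>();
        // tailSet starts at the first term >= prefix; terms sharing the
        // prefix are contiguous in sorted order, so stop at the first miss.
        for (String term : terms.tailSet(prefix)) {
            if (!term.startsWith(prefix)) {
                break;
            }
            clauses.add(term);
        }
        return clauses;
    }

    public static void main(String[] args) {
        // Assumed sample data: one timestamp term per hour for 28 days.
        SortedSet<String> terms = new TreeSet<>();
        for (int day = 1; day <= 28; day++) {
            for (int hour = 0; hour < 24; hour++) {
                terms.add(String.format("200507%02d%02d0000", day, hour));
            }
        }
        System.out.println(expandPrefix(terms, "200507").size());   // 672 clauses
        System.out.println(expandPrefix(terms, "20050713").size()); // 24 clauses
    }
}
```

The clause count grows with the number of distinct terms the pattern covers, not with the number of documents, which is why fine-grained fields such as second-level timestamps hit the clause limit quickly.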
 
I have read about BitSetQuery, FilteringQuery, ConstantScoreQuery, and so on, and I am confused.
 
I can't use a Filter (DateFilter, QueryFilter, etc.) because the client wants to search across all documents, without anything being filtered out.
I can't split a field into subfields to make the query more specific. For example, the user wants the date in the format YYYYMMDDHHMMSS, not six fields: one for the year, one for the month, one for the day, one for the hour, etc.
I can't add more system resources.
 
My environment is as follows:
----LUCENE 1.4.3-------
INDEX ==> 200,000 documents to millions of documents
EACH DOCUMENT about 20 fields (metadata)
TEXT SIZE PER DOCUMENT 1k
-----SERVER (dedicated) -------
Red Hat
2 GB Memory
jboss + lucene
JAVA_OPTS -Xmx640M -Xms640M
 
My question is very simple: is it possible to use Lucene as a full-text search engine in the environment and on the server described above, running the queries described above, with good performance and without OutOfMemoryError?
 
Thanks in advance
 
                                        Mari Luz

 

 

---------------------------------------------------
Mari Luz Elola
Developer Engineer


Caleruega, 67
28033 Madrid (Spain)
Tel.: +34 91 768 46 58
mailto: [hidden email]

---------------------------------------------------

Re: PrefixQuery,WildcardQuery,RangeQuery and FuzzyQuery PROBLEM

Erik Hatcher

On Jul 13, 2005, at 6:21 AM, MariLuz Elola wrote:

>     I have been reading about the "Too many clauses" error. As I
> understand it, if the maximum were set too high, the inefficiency
> would make the search unusable.
>     I am testing Lucene's performance, and searches take too long.
> I have also run into OutOfMemoryError several times.
>     The index currently holds about 250,000 documents, but in the
> future it will need to hold millions.
>
> These are the kinds of queries:
>
> 1. Greater-than or less-than requests
> A RangeQuery with Integer.MAX_VALUE as the upper bound for "greater
> than", or Integer.MIN_VALUE as the lower bound for "less than"
>
> 2. RangeQuery
>
> Example:
>
>         Field:[minValue TO maxValue]

Keep in mind that dealing with numeric information requires some
adjustments both in how you index and in how RangeQuerys are formed.
Terms are compared as strings, so if you index "1" through "10", a
RangeQuery of [1 TO 5] will also find "10" unless you account for it
with a special QueryParser subclass.
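A minimal sketch of this pitfall, in plain Java rather than Lucene API calls. A TreeSet stands in for the sorted term dictionary, and the hypothetical range helper keeps every term between the two bounds in string order. Zero-padding to a fixed width (the hypothetical pad helper, which assumes non-negative values and a width of 4) is one common adjustment; it has to be applied both when indexing and when building the query.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

class PaddedRangeSketch {
    // Terms compare as strings: keep every term t with lower <= t <= upper.
    static List<String> range(TreeSet<String> terms, String lower, String upper) {
        return new ArrayList<>(terms.subSet(lower, true, upper, true));
    }

    // Assumption: values are non-negative and fit in four digits.
    static String pad(int value) {
        return String.format("%04d", value);
    }

    public static void main(String[] args) {
        TreeSet<String> raw = new TreeSet<>();
        TreeSet<String> padded = new TreeSet<>();
        for (int i = 1; i <= 10; i++) {
            raw.add(Integer.toString(i));
            padded.add(pad(i));
        }
        // "10" sorts between "1" and "2", so it falls inside [1 TO 5].
        System.out.println(range(raw, "1", "5"));
        // prints [1, 10, 2, 3, 4, 5]

        // With fixed-width terms, string order matches numeric order.
        System.out.println(range(padded, pad(1), pad(5)));
        // prints [0001, 0002, 0003, 0004, 0005]
    }
}
```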

> 3. WildcardQuery
>
> Example:
>
>     Field:value*
>
> etc.
>
> The problem is that PrefixQuery, WildcardQuery, RangeQuery and
> FuzzyQuery all expand to a series of OR'ed boolean queries.
>
> I have read about BitSetQuery, FilteringQuery, ConstantScoreQuery,
> and so on, and I am confused.

There certainly are lots of options.  The Query classes you mention,  
though, are not currently exposed via QueryParser, so you would need  
to subclass QueryParser to have them created instead, or create your  
own parser, or mix and match some query expression parsing and join  
it with some API created Querys via BooleanQuery.

> I can't use a Filter (DateFilter, QueryFilter, etc.) because the
> client wants to search across all documents, without anything
> being filtered out.

This doesn't make sense to me.  Implicitly the user is "filtering"  
documents by adding constraints to a query expression using  
Field:value* or Field:[min TO max].

> I can't split a field into subfields to make the query more
> specific. For example, the user wants the date in the format
> YYYYMMDDHHMMSS, not six fields: one for the year, one for the
> month, one for the day, one for the hour, etc.

The index structure needs to be a bit more abstracted from the user  
in your case, it seems.  The user does not need to know explicitly  
that the index is split into multiple fields for dates in order to  
make searching more efficient.  If the user is not doing queries down  
to the second level, but rather always at the day level, then you
can build the index to account for that type of usage and improve the  
experience.

I encourage you to reconsider your "can't"'s and investigate  
alternative approaches.  Such considerations might be - does the user  
really need FuzzyQuery?  Are WildcardQuery's desired?  If so, what  
types of wildcard queries are needed?  (this can affect how you index  
and construct queries - a WildcardQuery literally is not the only way  
to achieve the same sort of thing, as has been mentioned using a  
PhraseQuery for numeric information)  Can the user interface be  
crafted to be more structured rather than just a Google-like search  
box where the user has to enter field selectors and know QueryParser  
voodoo?  (perhaps the date field constraint can use a date picker  
rather than a textual expression?)

> My question is very simple: is it possible to use Lucene as a
> full-text search engine in the environment and on the server
> described above, running the queries described above, with good
> performance and without OutOfMemoryError?

Short answer: yes.

Longer answer: see above for some techniques to consider

     Erik

Re: PrefixQuery,WildcardQuery,RangeQuery and FuzzyQuery PROBLEM

Paul Elschot
On Wednesday 13 July 2005 12:53, Erik Hatcher wrote:

>
> On Jul 13, 2005, at 6:21 AM, MariLuz Elola wrote:
> > I can't split a field into subfields to make the query more
> > specific. For example, the user wants the date in the format
> > YYYYMMDDHHMMSS, not six fields: one for the year, one for the
> > month, one for the day, one for the hour, etc.
>
> The index structure needs to be a bit more abstracted from the user  
> in your case, it seems.  The user does not need to know explicitly  
> that the index is split into multiple fields for dates in order to  
> make searching more efficient.  If the user is not doing queries down  
> to the second level, but rather always at the day level, then  you  
> can build the index to account for that type of usage and improve the  
> experience.

One can also index all of these (or even more) and hide them from the user:

YYYY
YYYYMM
YYYYMMDD
YYYYMMDDHH
YYYYMMDDHHMM
YYYYMMDDHHMMSS

With this, searching ranges would require subclassing the QueryParser
with classes that implement the range search using as few terms as possible.
That should bring down the number of terms used to some logarithm
of the total range size.
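A rough sketch of how such a range could be broken into few terms, assuming every document is indexed with all six prefix terms listed above. This is plain Java using java.time, not Lucene code; the class name and the cover and floor helpers are hypothetical, and wiring the resulting terms into a QueryParser subclass is not shown. The loop walks from the start of the range and, at each step, emits the coarsest prefix that begins exactly at the current position and still fits inside the range.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.time.temporal.ChronoUnit;
import java.util.ArrayList;
import java.util.List;

class DateRangeCoverSketch {
    // Granularities from coarsest to finest, with the matching term format.
    static final ChronoUnit[] UNITS = {
        ChronoUnit.YEARS, ChronoUnit.MONTHS, ChronoUnit.DAYS,
        ChronoUnit.HOURS, ChronoUnit.MINUTES, ChronoUnit.SECONDS
    };
    static final String[] PATTERNS = {
        "yyyy", "yyyyMM", "yyyyMMdd", "yyyyMMddHH", "yyyyMMddHHmm", "yyyyMMddHHmmss"
    };

    // Rounds t down to the start of its year, month, day, hour, minute or second.
    static LocalDateTime floor(LocalDateTime t, int level) {
        switch (level) {
            case 0: return LocalDateTime.of(t.getYear(), 1, 1, 0, 0);
            case 1: return LocalDateTime.of(t.getYear(), t.getMonthValue(), 1, 0, 0);
            default: return t.truncatedTo(UNITS[level]);
        }
    }

    // Covers [from, toInclusive] at one-second resolution with prefix terms.
    static List<String> cover(LocalDateTime from, LocalDateTime toInclusive) {
        List<String> terms = new ArrayList<>();
        LocalDateTime cursor = from.truncatedTo(ChronoUnit.SECONDS);
        LocalDateTime end = toInclusive.truncatedTo(ChronoUnit.SECONDS).plusSeconds(1);
        while (cursor.isBefore(end)) {
            for (int level = 0; level < UNITS.length; level++) {
                LocalDateTime next = cursor.plus(1, UNITS[level]);
                boolean aligned = cursor.equals(floor(cursor, level));
                // The seconds level always qualifies, so the loop makes progress.
                if (aligned && !next.isAfter(end)) {
                    terms.add(cursor.format(DateTimeFormatter.ofPattern(PATTERNS[level])));
                    cursor = next;
                    break;
                }
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(cover(
            LocalDateTime.of(2005, 7, 13, 22, 0, 0),
            LocalDateTime.of(2005, 7, 15, 0, 59, 59)));
        // prints [2005071322, 2005071323, 20050714, 2005071500]
    }
}
```

In this example a 27-hour range is covered by four terms, where a second-level range expansion could touch up to 97,200 distinct terms.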

Regards,
Paul Elschot

RE: PrefixQuery,WildcardQuery,RangeQuery and FuzzyQuery PROBLEM

james-17
> One can also index all of these (or even more) and hide them from the
> user:
>
> YYYY
> YYYYMM
> YYYYMMDD
> YYYYMMDDHH
> YYYYMMDDHHMM
> YYYYMMDDHHMMSS
>
> With this, searching ranges would require subclassing the QueryParser
> with classes that implement the range search using as few terms as
> possible.
> That should bring down the number of terms used to some logarithm
> of the total range size.

Yep, that's what we do for large date ranges and it works fine.  Just a
little logic to determine the way to expand the user input to the least
possible terms.

Sincerely,
James