Lucene search optimization

Sami Dalouche
Hi,

I have 2 million documents, each with a name property (~15 to 20
characters).
Fuzzy searching against this property takes around 3 seconds, which is
way too much for what I plan to do, so I am considering the possible
optimizations. I can add a property to each of the documents, that could
partition the document space into 400 spaces. Each space would then be
limited to 5000 documents, which should be small enough to make the
fuzzy search faster.

However, my question is: how do I take advantage of this additional
property? With a traditional RDBMS, I would add an index on the field,
but with Lucene, I'm not sure how to proceed. Would filters be the way
to go?
(http://lucene.apache.org/java/docs/api/org/apache/lucene/search/Filter.html)
Could a CachingWrapperFilter help even more?
(http://lucene.apache.org/java/docs/api/org/apache/lucene/search/CachingWrapperFilter.html)

Additionally, the additional property is an id, so can I store it as a
number so that comparison is faster (I guess) than with strings?

Thanks a lot,
Sami Dalouche




---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Re: Lucene search optimization

Erik Hatcher
Sami,

You're on to the right approach seeking something other than  
FuzzyQuery.  FuzzyQuery is rarely generally useful and there are  
other ways to achieve the same sort of thing (soundex, metaphone) in  
an efficient manner.
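
To make the phonetic-key idea concrete, here is a minimal sketch in plain Java (no Lucene calls). It assumes American Soundex as the key function; the `PhoneticIndex` class and its methods are hypothetical names for illustration. Each name is filed under its key at index time, so a lookup is a single exact-match probe instead of an edit-distance scan. This simple version skips non-ASCII letters, so it is English-oriented.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Lucene: groups names by Soundex key.
final class PhoneticIndex {
    // Soundex digit for each letter A..Z ('0' means "not coded").
    private static final String CODES = "01230120022455012623010202";
    private final Map<String, List<String>> byKey = new HashMap<>();

    static String soundex(String name) {
        StringBuilder key = new StringBuilder();
        char last = '0';
        for (char raw : name.toCharArray()) {
            char c = Character.toUpperCase(raw);
            if (c < 'A' || c > 'Z') continue;       // skip spaces, hyphens, accented letters
            char code = CODES.charAt(c - 'A');
            if (key.length() == 0) {                // the first letter is kept as-is
                key.append(c);
                last = code;
                continue;
            }
            if (c == 'H' || c == 'W') continue;     // H and W do not separate equal codes
            if (code != '0' && code != last) key.append(code);
            last = code;
            if (key.length() == 4) break;
        }
        while (key.length() < 4) key.append('0');   // pad short keys with zeros
        return key.toString();
    }

    // Files the name under its phonetic key.
    void add(String name) {
        byKey.computeIfAbsent(soundex(name), k -> new ArrayList<>()).add(name);
    }

    // Returns every stored name whose key equals the query's key.
    List<String> lookup(String query) {
        return byKey.getOrDefault(soundex(query), Collections.emptyList());
    }
}
```

With "Rambouillet" added, `lookup("Ramboulliet")` finds it because both spellings reduce to the same key. In a Lucene setup the key would presumably be stored in an extra field and matched with an ordinary term query, but that wiring is an assumption and is not shown here.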

If you could share some details about these properties and how you
need to query them, I'm sure the community could offer suggestions on
an efficient and clean implementation.  Without details, it's not
possible to (easily) know which specific technique to recommend.

        Erik


(In reply to Sami Dalouche's message of May 30, 2006, 11:12 AM.)

Re: Lucene search optimization

mark harwood
In reply to this post by Sami Dalouche
Take a look at "FuzzyLikeThisQuery" in
contrib/queries.

I use it for name searches on large indexes.
Unlike FuzzyQuery it:
a) limits the number of query terms produced
b) provides better ranking (disables idf factor which
otherwise boosts rare misspellings)

The cost of running a query is strongly related to the
quantity of terms in the query.
FuzzyQuery only limits the number of terms by quality
(which means you can unexpectedly produce a large
quantity of terms and therefore have a slow query).
FuzzyLikeThis is more explicit - it limits the
*quantity* of terms used (and automatically shortlists
to the best quality terms using the same edit-distance
metric as FuzzyQuery for ranking quality).
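
As a rough illustration of "limit the quantity, keep the best quality", here is a minimal plain-Java sketch. It is not the FuzzyLikeThisQuery implementation; the `TermShortlist` class is a hypothetical name, and plain Levenshtein distance is assumed as the quality metric.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Illustrative only: mimics the idea of capping the number of expanded terms.
final class TermShortlist {
    // Classic dynamic-programming Levenshtein distance, O(a.length * b.length).
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;  // reuse the two rows
        }
        return prev[b.length()];
    }

    // Returns at most maxTerms index terms, closest to the query first.
    // The distance is recomputed during sorting; a real implementation would
    // cache it or keep a bounded priority queue instead.
    static List<String> closest(String query, List<String> indexTerms, int maxTerms) {
        List<String> sorted = new ArrayList<>(indexTerms);
        sorted.sort(Comparator.comparingInt((String t) -> editDistance(query, t)));
        return new ArrayList<>(sorted.subList(0, Math.min(maxTerms, sorted.size())));
    }
}
```

The cap bounds how many terms end up in the final query; in this sketch every index term is still compared against the query once or more to build the shortlist.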


Cheers,
Mark



       
       
               

Re: Lucene search optimization

Sami Dalouche
In reply to this post by Erik Hatcher
Hi,

I didn't want to bother you with the exact details of my document, but
since you're asking.. :-)

So, I have a list of all the world's cities, and would like to let users
search for their city while tolerating small mistakes.
Additionally, cities sometimes have several names or spellings
(for example, a small city near mine is called Le Perray en Yvelines,
sometimes spelled Le-Perray-en-Yvelines, Le Perray en Ynes, Le Perray,
etc.).

To limit the number of documents searched, I was thinking of
specifying the country, which would then divide the search space,
but if you can think of something better, I am open to any suggestion.
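
The partitioning idea can be sketched in plain Java as follows. This is only an in-memory illustration under stated assumptions: the `CountryPartition` class and the country codes are hypothetical, and it says nothing about how Lucene itself would apply such a restriction.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: groups city names by a hypothetical country code.
final class CountryPartition {
    private final Map<String, List<String>> citiesByCountry = new HashMap<>();

    // Files the city under its country.
    void add(String countryCode, String cityName) {
        citiesByCountry.computeIfAbsent(countryCode, c -> new ArrayList<>()).add(cityName);
    }

    // Only these names need an (expensive) fuzzy comparison for this country.
    List<String> candidates(String countryCode) {
        return citiesByCountry.getOrDefault(countryCode, Collections.emptyList());
    }
}
```

With 2 million names spread evenly over 400 partitions, a fuzzy scan would touch about 5000 names instead of all of them.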

Soundex and metaphone are language-specific, right? Would they work
for city names?

The cities are available as XML from http://www.sirika.com/data/xmlgz/

If you need more information, just ask.
Regards,
Sami Dalouche


(In reply to Erik Hatcher's message of Tuesday, May 30, 2006, 11:22 -0400.)

Re: Lucene search optimization

Chris Hostetter-3
In reply to this post by Sami Dalouche

: Fuzzy searching against this property takes around 3 seconds, which is
: way too much for what I plan to do, so I am considering the possible

whenever anyone has a question about how to speed up a search, and the
current amount of time the search takes is more than a second, there are a
few questions i always want to ask:

 1) what method exactly on the Searcher interface are you using to
    execute the search?
 2) what exactly are you timing? (the time the search method call takes?,
    the time it takes you to iterate over the results? etc...)
 3) are you sorting by any particular field?
 4) are you reusing the Searcher instance for more than one query?   are
    you timing more than one query and taking the average?
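
For questions 2 and 4, a small timing helper makes it explicit what is being measured. This is a minimal sketch using only the JDK; `SearchTimer` is a hypothetical name, and the `search` argument stands in for whatever call you want to time.

```java
import java.util.function.Supplier;

// Generic timing helper; `search` stands in for the call being measured.
final class SearchTimer {
    // Runs `search` `warmup` times untimed, then returns the mean time in
    // milliseconds over `runs` timed executions.
    static <T> double averageMillis(Supplier<T> search, int warmup, int runs) {
        for (int i = 0; i < warmup; i++) search.get();   // let caches and the JIT settle
        long totalNanos = 0;
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            search.get();
            totalNanos += System.nanoTime() - start;
        }
        return totalNanos / 1_000_000.0 / runs;
    }
}
```

Timing the search call and the iteration over results as two separate suppliers shows which of the two dominates.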


-Hoss



Re: Lucene search optimization

Sami Dalouche
Hi,

1) Actually, I am not using Lucene directly, but a wrapper called
Compass. I am using the find() method of CompassSession, whose code
is:
    public CompassHits find(String query) throws CompassException {
        return createQueryBuilder().queryString(query).toQuery().hits();
    }
All of these objects are pure wrappers around their Lucene equivalents,
nothing more.


2) What I am timing is only the find call:
    // start timer
    CompassHits hits = compassSession.find("cityName:" + name + "~");
    // stop timer

3) I am not sorting anything, but Lucene is returning the hits by
relevance. Does this count as sorting?

4) I tried timing it for ~10 queries, and the results are
roughly the same. It can go down to 2 seconds, which is still way too
much...

Thanks for helping
Sami Dalouche

(In reply to Chris Hostetter's message of Tue, 2006-05-30, 13:58 -0700.)

Re: Lucene search optimization

mark harwood
>>Actually, I am not using Lucene directly, but a wrapper called compass


I don't know what controls it offers you then.
One option which could offer a speed up is to raise
the minimum quality match threshold above the default
of 0.5 and use a query string like this:

  cityName:London~0.8

This would reduce the number of alternative terms
considered and therefore the query time.
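
A small worked calculation shows why a higher threshold admits fewer terms. The sketch below assumes the similarity is 1 - edits / termLength and that a term qualifies only when its similarity is strictly above the threshold; both are assumptions about FuzzyQuery's scoring rather than something stated in this thread, and `FuzzyThreshold` is a hypothetical name.

```java
// Illustrative arithmetic only; the similarity formula is an assumption.
final class FuzzyThreshold {
    // Similarity assumed here: 1 - edits / termLength.
    static double similarity(int edits, int termLength) {
        return 1.0 - (double) edits / termLength;
    }

    // Largest number of edits whose similarity is still strictly above minSimilarity.
    static int maxEdits(int termLength, double minSimilarity) {
        int edits = 0;
        while (edits < termLength && similarity(edits + 1, termLength) > minSimilarity) {
            edits++;
        }
        return edits;
    }
}
```

Under these assumptions an 11-letter name such as "Rambouillet" tolerates up to 5 edits at 0.5 but only 2 at 0.8, so far fewer index terms qualify as alternatives.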



Re: Lucene search optimization

Chris Hostetter-3
In reply to this post by Sami Dalouche

: public CompassHits find(String query) throws CompassException {
:         return createQueryBuilder().queryString(query).toQuery().hits();
:     }
: And all of these objects are pure wrappers around lucene equivalents,
: nothing more.

: 2) What I am timing is only the find call :
: -- start timer
: CompassHits hits = compassSession.find("cityName:"+ name+"~");
: -- stop timer

ok, but a thin wrapper around *which* lucene equivalents? .. there are
many different methods for doing a search in lucene, each with a different
usage pattern and performance characteristics ... if for example that code
uses a HitCollector and just pulls back the IDs into the CompassHits
that's going to be faster than if it gets a Hits object and then iterates
over each Hit storing the full Document in the CompassHits object --
especially if you've got more than 50 or so results ... in which case
using a Hits object will actually result in your search being executed
again and again as you iterate farther down the list of results.

exactly what those methods do can make a big difference.

then again: maybe they don't,  maybe fuzzy queries really are that slow (i
don't know, i've never used them) I just want to make sure you think about
those issues.



-Hoss



Re: Lucene search optimization

Sami Dalouche
In reply to this post by mark harwood

Hi,

Compass offers me every kind of control Lucene does. It gives access to
the low-level Lucene API if you want it, so if you have a nice way of
optimizing this, I can have Compass adapt to that.


I tried cityName:city~0.8, and it is still not fast enough:
something around 2 seconds to return only 2 results
(city:Rambouillet~0.8).

Sami Dalouche

(In reply to mark harwood's message of Wednesday, May 31, 2006, 09:28 +0100.)

Re: Lucene search optimization

Sami Dalouche
In reply to this post by mark harwood
Hi,

Thanks for the tip. However, my slowness issues do not seem to be
caused by the number of search results returned, since cityName:XX~0.8
took 2 seconds to return 2 results.

So the problem seems to be more related to scanning the index.

Thanks,
Sami Dalouche

(In reply to mark harwood's message of Tuesday, May 30, 2006, 16:55 +0100.)

Re: Lucene search optimization

mark harwood
In reply to this post by Sami Dalouche
>>I tried the cityName:city~0.8, and it is still not fast enough..
>>something around 2 seconds... to return only 2 results...

OK, so we trimmed down the search terms we actually used in the query but I suspect what you are seeing is the effect of having to perform edit-distance comparisons on ALL town names to get to this shortlist. If this is the case then you'll probably be seeing a lot of CPU activity. One way of avoiding this is to set the "prefix length" parameter on fuzzy queries to at least one. This determines if you are comparing Rambouillet with ALL terms (as the default zero prefix length setting does) or just those beginning with "R".

Assuming an even spread of town names to letters that would cut the computation down to 1/26th of the original cost.
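
The effect of a prefix length can be sketched in plain Java. A sorted set stands in for the index's term dictionary here; `PrefixCandidates` is a hypothetical name, and the actual term enumeration inside Lucene is not shown.

```java
import java.util.SortedSet;
import java.util.TreeSet;

// Illustrative only: a sorted set stands in for the sorted term dictionary.
final class PrefixCandidates {
    // Terms that share the first prefixLength characters with the query.
    // With prefixLength == 0 every term is a candidate.
    static SortedSet<String> candidates(TreeSet<String> terms, String query, int prefixLength) {
        if (prefixLength <= 0) return terms;
        String prefix = query.substring(0, Math.min(prefixLength, query.length()));
        // Strings starting with `prefix` sort inside [prefix, prefix + '\uffff').
        return terms.subSet(prefix, prefix + Character.MAX_VALUE);
    }
}
```

Only the returned terms need an edit-distance comparison. The trade-off is that a typo inside the prefix itself (for example in the first letter) can no longer match.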

Cheers
Mark





Re: Lucene search optimization

eks dev
or you could try an n-gram approach with SpellChecker (you will find it in the contrib area).
call suggestSimilar() and form your query from the results, or even better a ConstantScoreQuery via a Filter. It works OK.

Or, if you do not have too many Terms (and can afford to load all of them in memory), you could try a TernarySearchTree: get all Terms that differ by at most N characters, then calculate the edit distance only on them, form the Query... and there you go


Lucene is fast, but calculating edit distance is O(n*m), which is slow, so you have to figure out how to reduce the number of comparisons...
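
Here is a minimal sketch of n-gram candidate generation in plain Java. It is not the contrib SpellChecker; `TrigramIndex` is a hypothetical in-memory stand-in, and the choice of trigrams and of a shared-gram threshold are assumptions.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Illustrative only: a tiny in-memory trigram index.
final class TrigramIndex {
    private final Map<String, Set<String>> termsByGram = new HashMap<>();

    // Splits a word into its overlapping 3-character substrings.
    static Set<String> trigrams(String word) {
        Set<String> grams = new HashSet<>();
        for (int i = 0; i + 3 <= word.length(); i++) grams.add(word.substring(i, i + 3));
        return grams;
    }

    // Registers the term under each of its trigrams.
    void add(String term) {
        for (String gram : trigrams(term)) {
            termsByGram.computeIfAbsent(gram, g -> new HashSet<>()).add(term);
        }
    }

    // Terms sharing at least minShared distinct trigrams with the query.
    Set<String> candidates(String query, int minShared) {
        Map<String, Integer> shared = new HashMap<>();
        for (String gram : trigrams(query)) {
            for (String term : termsByGram.getOrDefault(gram, new HashSet<>())) {
                shared.merge(term, 1, Integer::sum);
            }
        }
        Set<String> result = new HashSet<>();
        for (Map.Entry<String, Integer> e : shared.entrySet()) {
            if (e.getValue() >= minShared) result.add(e.getKey());
        }
        return result;
    }
}
```

Only the candidates returned here would then go through the O(n*m) edit-distance check, which is the reduction in comparisons described above.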

good luck


(In reply to markharw00d's message of Wednesday, 31 May, 2006, 9:53:27 PM.)

Re: Lucene search optimization

Sami Dalouche
In reply to this post by mark harwood
Hi,

Thanks for the tip! Yes, basically, I would like to reduce the number
of comparisons. Using this prefix length seems doable for my problem
(even though I'm not 100% sure it is appropriate; this has to be
investigated).

Is there a way to use this prefix length (or something similar) on some
property other than the city? In fact, I can also index the country, so
if this prefix length could be used on the country, this could easily
divide the search space by 400, which is way better than dividing it by 26.

Any ideas?
Sami Dalouche


On Wednesday, 31 May 2006 at 20:53 +0100, markharw00d wrote:

> >>I tried the cityName:city~0.8, and it is still not fast enough..
> >>something around 2 seconds... to return only 2 results...
>
> OK, so we trimmed down the search terms we actually used in the query but I suspect what you are seeing is the effect of having to perform edit-distance comparisons on ALL town names to get to this shortlist. If this is the case then you'll probably be seeing a lot of CPU activity. One way of avoiding this is to set the "prefix length" parameter on fuzzy queries to at least one. This determines if you are comparing Rambouillet with ALL terms (as the default zero prefix length setting does) or just those beginning with "R".
>
> Assuming an even spread of town names to letters that would cut the computation down to 1/26th of the original cost.
>
> Cheers
> Mark



Re: Lucene search optimization

mark harwood
See QueryParser.setFuzzyPrefixLength()

This will apply to all fields parsed by the parser and is probably
generally advisable anyway to avoid server CPU overload.
Many production apps disable fuzzy searching completely in the search
syntax for this reason.
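
To make the effect of the prefix length concrete, here is a minimal, library-free sketch in plain Java. It is not Lucene code and does not call QueryParser.setFuzzyPrefixLength(); the class PrefixFuzzySketch, its methods, the term list and the maxEdits cutoff are all hypothetical names chosen for illustration. It only shows the idea described earlier in the thread: a cheap prefix check skips the expensive edit-distance comparison for every term that does not share the query's first characters.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only: this is NOT how Lucene is implemented internally.
public class PrefixFuzzySketch {

    // Number of edit-distance computations done by the last fuzzyMatch call.
    static int comparisons = 0;

    // Levenshtein edit distance using two rolling rows.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Returns the terms within maxEdits of the query, but only computes the
    // edit distance for terms sharing the first prefixLength characters.
    static List<String> fuzzyMatch(List<String> terms, String query,
                                   int prefixLength, int maxEdits) {
        comparisons = 0;
        String prefix = query.substring(0, Math.min(prefixLength, query.length()));
        List<String> hits = new ArrayList<String>();
        for (String term : terms) {
            if (!term.startsWith(prefix)) {
                continue; // cheap rejection, no edit-distance work
            }
            comparisons++;
            if (editDistance(term, query) <= maxEdits) {
                hits.add(term);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Hypothetical term dictionary standing in for the indexed city names.
        List<String> terms = Arrays.asList(
            "Rambouillet", "Rambouilet", "Paris", "London", "Rennes", "Lyon");

        List<String> all = fuzzyMatch(terms, "Rambouillet", 0, 2);
        System.out.println("prefix 0: " + all + ", comparisons=" + comparisons);

        List<String> prefixed = fuzzyMatch(terms, "Rambouillet", 1, 2);
        System.out.println("prefix 1: " + prefixed + ", comparisons=" + comparisons);
    }
}
```

The trade-off is that a term whose typo falls inside the prefix (for example a misspelled first letter) can no longer match, so the prefix length should stay small.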






               