Sort by length percentage match

Alejandro Cuesta
Hi,

I have a field containing city names, and I'd like to sort the results by
length percentage match.

Example:

Assuming I've got these cities in the index:

   london, south west london, londonderry, oxford

And I search for "london", I'd like to get a list sorted like this:

london               (6/6,  100% match)
londonderry          (6/11,  54% match)
south west london    (6/17,  35% match)
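
For reference, the arithmetic behind this ordering can be sketched in plain Python, outside Lucene. This is only an illustration of the requested ranking, not a Lucene or Solr API: the function names are hypothetical, a "match" is assumed to mean that the query occurs as a substring of the city name, and percentages are truncated to whole numbers as in the list above.

```python
def length_match_percent(query: str, candidate: str) -> int:
    """Percentage of the candidate's length covered by the query.

    Assumption: a candidate matches only if the query is a substring of it;
    non-matching candidates score 0. The result is truncated, not rounded.
    """
    if not query or query not in candidate:
        return 0
    return len(query) * 100 // len(candidate)


def rank_by_length_match(query: str, candidates: list[str]) -> list[tuple[str, int]]:
    """Return (candidate, percent) pairs for matching candidates, best first."""
    scored = [(c, length_match_percent(query, c)) for c in candidates]
    hits = [(c, pct) for c, pct in scored if pct > 0]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


cities = ["london", "south west london", "londonderry", "oxford"]
ranked = rank_by_length_match("london", cities)
for city, pct in ranked:
    print(f"{city:<20} {pct}% match")
```

Here "oxford" is dropped because it does not contain the query at all.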

I know Lucene uses a different scoring algorithm, based on term frequency and
inverse document frequency (tf & idf), but in my specific case I need to
use this scoring strategy.

Can anyone give me a clue or a starting point, please?
Is there a better technology to perform this kind of search?

Thanks,

Alejandro

RE: Sort by length percentage match

steve_rowe
Hi Alejandro,

N-grams <http://en.wikipedia.org/wiki/N-gram> might be a good fit.

Using bigrams (n-grams of length 2) for "london", you'd get tokens "lo", "on", "nd", "do", "on".  This should provide the hit ordering you want.
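
To make the bigram idea concrete, here is a minimal Python sketch. It is not how Lucene scores documents: `char_ngrams` and `bigram_overlap` are hypothetical helper names, and the overlap measure (the fraction of a city's bigrams that also appear in the query) is only an assumed stand-in to show why shorter names containing the same n-grams tend to come out ahead.

```python
def char_ngrams(text: str, n: int = 2) -> list[str]:
    """All contiguous character n-grams of text, in order, duplicates kept."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def bigram_overlap(query: str, candidate: str) -> float:
    """Fraction of the candidate's bigrams that also occur in the query.

    This is an illustrative measure only, not Lucene's scoring formula.
    """
    query_grams = set(char_ngrams(query))
    candidate_grams = char_ngrams(candidate)
    if not candidate_grams:
        return 0.0
    shared = sum(1 for gram in candidate_grams if gram in query_grams)
    return shared / len(candidate_grams)


bigrams = char_ngrams("london")
print(bigrams)

cities = ["london", "south west london", "londonderry", "oxford"]
ordered = sorted(cities, key=lambda c: bigram_overlap("london", c), reverse=True)
print(ordered)
```

With this toy measure, "london" shares all of its bigrams with the query, "londonderry" shares 6 of 10, "south west london" shares 5 of 16, and "oxford" shares none.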

Although it's not listed on Solr's analysis factories wiki page <http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters>, there is an NGramFilterFactory, with attributes maxGramSize and minGramSize.  See the example usage on the javadocs here: <http://lucene.apache.org/solr/api/org/apache/solr/analysis/NGramFilterFactory.html>.  Also a tokenizer variant: <http://lucene.apache.org/solr/api/org/apache/solr/analysis/NGramTokenizerFactory.html>.

Steve

-----Original Message-----
From: Alejandro Cuesta [mailto:[hidden email]]
Sent: Wednesday, May 16, 2012 12:51 PM
To: [hidden email]
Subject: Sort by length percentage match
