Partial token matches


Partial token matches

Eric Isakson
Hi All,

Just wanted to throw out something I'm working on. It is working well for me, but I wanted to see if anyone can suggest any other alternatives that might perform better than what I'm doing now.

I have a field in my index that contains keywords (back-of-the-book index terms) and a UI feature that lets the user find documents containing a partial keyword supplied by the user. So a particular document in my index might have the token "informat" in the keywords field, and if the user supplies "form" in the UI, I should get a match.

My old implementation does not use Lucene; it just uses String.matches with a regular expression that looks like ".*form.*". I reimplemented using Lucene, tokenizing the field so that I get the tokens:

informat
nformat
format
ormat
rmat
mat
at
t

Then I use a prefix query to find hits. Both implementations ignore case in the search and the hit order is controlled by another field that I'm sorting on, so relevance ranking is not important in this use case. Search time performance is crucial, time to create the index and index size are not really important. The index is created statically at application startup or possibly delivered to the application and is not updated while the application is using it.
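For concreteness, the tokenization step could be sketched like this (a toy illustration with made-up names; my real code does this inside an analyzer):

```java
import java.util.ArrayList;
import java.util.List;

public class SuffixTokens {
    // Emits every suffix of the keyword as its own token, e.g.
    // "informat" -> informat, nformat, format, ormat, rmat, mat, at, t.
    // A prefix query for the user's partial text then matches any substring.
    public static List<String> suffixes(String token) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < token.length(); i++) {
            out.add(token.substring(i));
        }
        return out;
    }
}
```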

Thanks for any suggestions,
Eric

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


Re: Partial token matches

Erick Erickson
I'm sure the guys will chime in, but I think you're in significant danger of
getting a "too many clauses" exception thrown. Try searching on, say, "an".
Under the covers, Lucene expands your query to have a clause for *every*
term in your index that starts with "an", so there's a clause for "an", "ana",
"anb", "anaa", "anab", and so on. The shorter your term, the more there'll be,
and if there are more than 1024, you'll get the exception above. You can set
the number of clauses to a bigger number, but that may not scale well.
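To make the expansion concrete, here's a toy sketch (plain Java, not Lucene internals; the names are made up): the rewrite produces one clause per indexed term that starts with the prefix, so the clause count grows with the vocabulary.

```java
import java.util.List;

public class PrefixExpansion {
    // One clause per distinct indexed term starting with the prefix --
    // this is what can blow past the 1024-clause default.
    public static long clauseCount(List<String> indexTerms, String prefix) {
        return indexTerms.stream()
                         .filter(t -> t.startsWith(prefix))
                         .count();
    }
}
```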

Consider writing a filter (see Lucene In Action). The filter will return a
bitset with a bit turned on for each potential match, and avoid this issue.
RegexTermEnum helps a lot here.

Try searching the archive for a thread started by me, titled "I just don't
get wildcards at all" for an exposition by the guys on this sort of thing.
That thread centers on wildcard queries, but I'm pretty sure PrefixQuery
suffers from the same issue.

Chris, Erik, Yonik... Do I have this right????

Erick

Re: Partial token matches

Chris Hostetter-3

: I'm sure the guys will chime in, but I think you're in significant danger of
: getting a "too many clauses" exception thrown. Try searching on, say, "an".
: Under the covers, Lucene expands your query to have a clause for *every*
: item in your index that starts with "an", so there's a clause for "an" "ana"
: "anb", "anaa", "anab", ....... The shorter your term, the more there'll be,
: and if there are more than 1024, you'll get the exception above. You can set
: the number of clauses to a bigger number, but that may not scale well.

When using any of the queries that expand into a BooleanQuery, there is
almost always the possibility of hitting TooManyClauses -- but this
approach of using PrefixQuery is definitely safer/faster than a straight
use of WildcardQuery, at the expense of a bigger index.

The idea mentioned in this thread is basically the same as an idea Erik
Hatcher has suggested in the past, which I've taken to referring to as
"wildcard term rotating"...
  http://www.mail-archive.com/lucene-user@.../msg12261.html

: Consider writing a filter (see Lucene In Action). The filter will return a
: bitset with a bit turned on for each potential match, and avoid this issue.

Very true -- but at the expense of scoring information (i.e., how many times
does the term appear in the document?). It's all a question of priorities.





-Hoss




Re: Partial token matches

Paul.Illingworth
In reply to this post by Eric Isakson





Another approach may be to use n-grams. Index each word as follows:

2-gram field
in nf fo or rm ma at

3-gram field
inf nfo for orm rma mat

4-gram field
info nfor form orma rmat

and so on.

To search for term "form" simply search the 4 gram field.

The prefix query approach may suffer from the "too many clauses" exception if
you have lots of words beginning with, in your example, "form". The
approach above would avoid this but will obviously have a much bigger index
size.
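A small sketch of the n-gram generation (illustrative names only; in practice an analyzer would do this at index time):

```java
import java.util.ArrayList;
import java.util.List;

public class NGrams {
    // Overlapping character n-grams of a word, e.g. for n = 4:
    // "informat" -> info, nfor, form, orma, rmat
    public static List<String> ngrams(String word, int n) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + n <= word.length(); i++) {
            out.add(word.substring(i, i + n));
        }
        return out;
    }
}
```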

Regards

Paul I.



                                                                           


Re: Partial token matches

Karl Wettin-3

On 27 Apr 2006, at 10:05, [hidden email] wrote:
>
> Another approach maybe to use n-grams.

The spell checker in contrib could probably be used as a code base  
for that.



RE: Partial token matches

Eric Isakson
In reply to this post by Eric Isakson
Thank you all for the ideas and thanks to the developers for producing such a great tool. I hadn't considered the "too many clauses" problem in my original implementation and I'm definitely hitting it.

I decided to use a bi-gram tokenization approach combined with a PhraseQuery to get the "terms containing" feature. This keeps me from needing N fields each with different length n-grams at the expense of a little more work at query time. My indexed field ends up looking like:

in nf fo or rm ma at

and my query for "form" looks like:

"fo or rm"

I build the query using the PhraseQuery object and my bi-gram tokenizer. I haven't tried to solve single-character queries yet, but I don't expect that to be too much trouble. I'll probably just use a single-character tokenization strategy in another field of the index and make that a special case.
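The bi-gram phrase text is derived like this (a minimal sketch with made-up names; in my code the PhraseQuery terms come straight from the tokenizer output):

```java
import java.util.StringJoiner;

public class BigramPhrase {
    // Builds the phrase text for a "contains" query, e.g. "form" -> "fo or rm".
    public static String phrase(String text) {
        StringJoiner sj = new StringJoiner(" ");
        for (int i = 0; i + 2 <= text.length(); i++) {
            sj.add(text.substring(i, i + 2));
        }
        return sj.toString();
    }
}
```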

This bi-gram approach actually produces an index smaller than the one I was getting with my previous algorithm and the search performance is great.

I also had the "too many clauses" problem in my "terms starting with" feature, which used PrefixQuery. I solved this case using the same bi-gram tokenized field. For this case, I need the user-supplied text at the beginning of the field, so I build a SpanNearQuery with the tokens to get a phrase-like query using spans, and then a SpanFirstQuery with the number of tokens to force the query to match only at the start of the field. So a "starts with" query for "form" looks like:

SpanQuery fo = new SpanTermQuery(new Term(FIELD_CONTAINS, "fo"));
SpanQuery or = new SpanTermQuery(new Term(FIELD_CONTAINS, "or"));
SpanQuery rm = new SpanTermQuery(new Term(FIELD_CONTAINS, "rm"));
SpanQuery form = new SpanNearQuery(new SpanQuery[] {fo, or, rm}, 0, true);
SpanQuery query = new SpanFirstQuery(form, 3);

From what I read, this strategy will not perform as fast as a prefix query, but it will allow me to use arbitrarily short query text without running into the "too many clauses" problem.

Thanks again for the suggestions!
Eric
