Splitting and matching words

Eric Jain
I'd like to have "PowerShot", "powershot" and "power-shot" match each
other. Solr has a WordDelimiterFilter, which works quite well, except that
"powershot" still won't match "PowerShot" (which is tokenized into "power
(shot powershot)", so "power powershot" would match...). Any suggestions?

Re: Splitting and matching words

Yonik Seeley
On 6/25/06, Eric Jain <[hidden email]> wrote:
> I'd like to have "PowerShot", "powershot" and "power-shot" match each
> other. Solr has a WordDelimiterFilter, which works quite well, except that
> "powershot" still won't match "PowerShot" (tokenized into "power (shot
> powershot)", so "power powershot" would match..."). Any suggestions?

You mean that if the indexed text is "powershot" and the query text is
"PowerShot", then it won't match (but the reverse case will)?

That is a problem... if one does both catenation and splitting on the
query side, you end up with "Power" in the first position, and both
"Shot" and "PowerShot" in the second.  While this works fine for the
indexing side, on the query side it's interpreted as a
MultiPhraseQuery meaning "Power" followed by either "Shot" or
"PowerShot".

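A toy sketch of the situation described above, in plain Python rather than Lucene. The names word_delimiter and multi_phrase_match are hypothetical, the splitting rules are a simplification (hyphens and lower-to-upper case changes only), and lowercasing is assumed to happen after splitting. It only illustrates why the query-side token stream behaves like "Power" followed by either "Shot" or "PowerShot".

```python
import re

def word_delimiter(text, catenate=True):
    """Toy stand-in for a word-delimiter filter (not Solr's implementation).

    Splits one word on non-letters and lower-to-upper case changes and
    returns (term, position) pairs. With catenate=True the joined subwords
    are added at the position of the last subword.
    """
    parts = re.findall(r"[A-Z]?[a-z]+|[A-Z]+", text)
    tokens = [(part, pos) for pos, part in enumerate(parts)]
    if catenate and len(parts) > 1:
        tokens.append(("".join(parts), len(parts) - 1))
    return tokens

def as_positions(tokens):
    """Group lowercased terms by position: one set of alternatives per slot."""
    slots = {}
    for term, pos in tokens:
        slots.setdefault(pos, set()).add(term.lower())
    return [slots[p] for p in sorted(slots)]

def multi_phrase_match(query_tokens, index_tokens):
    """True if some run of document positions satisfies every query position,
    where a query position is satisfied by any one of its alternatives."""
    query = as_positions(query_tokens)
    doc = as_positions(index_tokens)
    return any(
        all(q & d for q, d in zip(query, doc[start:start + len(query)]))
        for start in range(len(doc) - len(query) + 1)
    )

# Query "PowerShot" against indexed "powershot": needs "power" first, so no match.
print(multi_phrase_match(word_delimiter("PowerShot"), word_delimiter("powershot")))  # False
# The reverse case: query "powershot" finds the catenated token in the index.
print(multi_phrase_match(word_delimiter("powershot"), word_delimiter("PowerShot")))  # True
```
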
Workarounds:
  1) a new QueryParser smart enough to make a boolean query instead of
     a MultiPhraseQuery: "Power Shot" OR "PowerShot"
  2) index the field a second time via copyField, but have its query
     analyzer catenate instead of split subwords, then query across
     both fields.
  3) do more client-side processing: change "PowerShot" to
     "PowerShot" OR "powershot" (i.e. create a boolean query yourself,
     with the second clause having the subword delimiters removed).

(1) is much harder to do in a generic way, but would be most useful.
(2) is much easier and can be done now.
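
For option (3), the client-side rewrite could look roughly like the sketch below. expand_query_term is a hypothetical helper, not part of Solr or Lucene, and it assumes that "removing subword delimiters" means dropping non-alphanumeric characters and lowercasing. The output uses the usual query parser syntax of quoted terms joined by OR.

```python
import re

def expand_query_term(term):
    """Build '"term" OR "collapsed"' where collapsed is the term with
    subword delimiters removed and case folded (assumed rules, see above)."""
    collapsed = re.sub(r"[^A-Za-z0-9]", "", term).lower()
    if collapsed == term:
        # Nothing to collapse, so a second clause would be redundant.
        return f'"{term}"'
    return f'"{term}" OR "{collapsed}"'

print(expand_query_term("PowerShot"))   # "PowerShot" OR "powershot"
print(expand_query_term("power-shot"))  # "power-shot" OR "powershot"
print(expand_query_term("powershot"))   # "powershot"
```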

-Yonik

Re: Splitting and matching words

Yonik Seeley
On 6/25/06, Yonik Seeley <[hidden email]> wrote:
>   1) a new QueryParser smart enough to make a boolean query instead of
> a MultiPhraseQuery.   "Power Shot" OR "PowerShot"

Thinking about this option a bit more...
The problem is ambiguity.  Sometimes a MultiPhraseQuery is the correct
interpretation and sometimes a boolean query is needed.  The same
problem exists on the query side for multi-token synonyms.  There
isn't enough information about what the "synonyms" actually are.

Take the case of lap/0 top/1 notebook/1  (where /0 and /1 are token positions).
There isn't enough info to understand if notebook is a synonym for
"top" or for "lap top".
Even if we added extra info (I recently committed a Lucene patch to
allow subclassing Token), it's not an easy problem.

Consider something like "my PowerShot lap-top", and trying to
represent that with a boolean query of phrase queries... you need all
the possibilities.

"my Power Shot lap top"
"my PowerShot lap top"
"my Power Shot laptop"
"my PowerShot laptop"

perhaps span queries could avoid generating all the possibilities...
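
To make the growth concrete, here is a small Python sketch that enumerates the phrases listed above as a Cartesian product of per-word alternatives. phrase_variants and the slots layout are illustrative only. Each word with two readings doubles the number of phrase queries needed.

```python
from itertools import product

def phrase_variants(slots):
    """Return every phrase obtained by picking one alternative per slot."""
    return [" ".join(choice) for choice in product(*slots)]

# One slot per original query word; each slot lists its possible readings.
slots = [
    ["my"],
    ["Power Shot", "PowerShot"],
    ["lap top", "laptop"],
]

variants = phrase_variants(slots)
for phrase in variants:
    print(f'"{phrase}"')
print(len(variants))  # 4 phrases for two ambiguous words
```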

-Yonik

Re: Splitting and matching words

Chris Hostetter

: perhaps span queries could avoid generating all the possibilities...

I remember coming up with a design for dealing with cases like this a while
back ... it did involve using SpanNear/SpanOr queries -- but it also
required added information in the Tokens at query time to resolve the
"lap/0 top/1 notebook/1" ambiguity.

I'll see if I can dig that up (not sure if I ever typed it up, or if it
was just a whiteboard thing that got erased when I never did anything with
it).



-Hoss


Re: Splitting and matching words

Eric Jain
Eric Jain wrote:
> I'd like to have "PowerShot", "powershot" and "power-shot" match each
> other. Solr has a WordDelimiterFilter, which works quite well, except
> that "powershot" still won't match "PowerShot" (tokenized into "power
> (shot powershot)", so "power powershot" would match..."). Any suggestions?

The workaround I'll probably use for the time being is to lowercase the
tokens before applying the WordDelimiterFilter, in the analyzer that is
used for parsing queries (but for indexing the order remains unchanged).

This way matches are case-insensitive, which is essential for our
application. "power-shot" (query) still won't match "powershot" (index),
but all the other combinations should work.
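
A toy check of the nine query/index combinations, as a Python simulation rather than real Solr analyzers. It assumes the index analyzer splits and catenates first and lowercases afterwards, that the query analyzer lowercases first and then splits, and that a multi-token query is matched as a phrase with alternatives per position. All function names are hypothetical.

```python
import re
from itertools import product

def split_subwords(text):
    """Toy word-delimiter step: one set of alternative terms per position,
    with the catenated form added at the last position."""
    parts = re.findall(r"[A-Z]?[a-z]+|[A-Z]+", text)
    slots = [{p} for p in parts]
    if len(parts) > 1:
        slots[-1].add("".join(parts))
    return slots

def index_analyzer(text):
    # Index side: split on case changes and hyphens first, lowercase after.
    return [{t.lower() for t in slot} for slot in split_subwords(text)]

def query_analyzer(text):
    # Query side: lowercase first, so case changes can no longer split.
    return split_subwords(text.lower())

def matches(query_text, indexed_text):
    """Phrase-style match: every query position must hit the document at
    consecutive positions, using any one of its alternatives."""
    query = query_analyzer(query_text)
    doc = index_analyzer(indexed_text)
    return any(
        all(q & d for q, d in zip(query, doc[start:start + len(query)]))
        for start in range(len(doc) - len(query) + 1)
    )

forms = ["PowerShot", "powershot", "power-shot"]
misses = [(q, d) for q, d in product(forms, forms) if not matches(q, d)]
print(misses)  # [('power-shot', 'powershot')] as (query, indexed) pairs
```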