How to promote an unstemmed match over a stemmed match in an index that's stemmed...

Michael Stoppelman
Hi all,
I've got an index whose tokens are stemmed. Sometimes I really need to boost
the unstemmed version of a query word to get the most relevant documents.

Example:
Query: [olives].

I don't want to match documents with the words: oliver, oliver's, etc...

Since I'm stemming when creating the index, is there a way to store both
versions (stemmed/unstemmed) with setPositionIncrement()? Is that the correct
way to deal with this? I was reading old archives and this didn't seem to be
a great decision, since it breaks PhraseQuery [1].

It seems like it would be useful if, at query scoring time, I could at least
see the original string values of the tokens in this case.

Thanks in advance,

-M

[1] http://www.mail-archive.com/lucene-user@.../msg07416.html

Re: How to promote an unstemmed match over a stemmed match in an index that's stemmed...

Erick Erickson
You have to be a bit clever. You can certainly inject the original with a
position increment of 0; see SynonymAnalyzer in Lucene in Action. This will
not break phrase queries, since your two tokens occupy the same position.
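
A minimal sketch of why stacking two tokens at one position leaves phrase matching intact. It is plain Java with no Lucene dependency, so `Tok`, `toyStem`, `analyze`, `positions` and `phraseMatches` are hypothetical names, and the toy stemmer (it only strips a plural "s") is an assumption made for the example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Plain-Java illustration only; none of these names are Lucene APIs.
class SamePositionSketch {

    // A term plus its distance from the previous token's position.
    // An increment of 0 stacks the token on top of the previous one.
    static final class Tok {
        final String term;
        final int positionIncrement;

        Tok(String term, int positionIncrement) {
            this.term = term;
            this.positionIncrement = positionIncrement;
        }
    }

    // Toy stemmer (an assumption): only strips a plural "s".
    static String toyStem(String word) {
        return (word.endsWith("s") && word.length() > 3)
                ? word.substring(0, word.length() - 1)
                : word;
    }

    // Emit the stem at the next position, then the original at increment 0.
    static List<Tok> analyze(String text) {
        List<Tok> out = new ArrayList<>();
        for (String raw : text.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (raw.isEmpty()) continue;
            String stem = toyStem(raw);
            out.add(new Tok(stem, 1));
            if (!stem.equals(raw)) out.add(new Tok(raw, 0));
        }
        return out;
    }

    // Turn increments into absolute positions: term -> positions it occupies.
    static Map<String, Set<Integer>> positions(List<Tok> tokens) {
        Map<String, Set<Integer>> index = new HashMap<>();
        int pos = -1;
        for (Tok t : tokens) {
            pos += t.positionIncrement;
            index.computeIfAbsent(t.term, k -> new TreeSet<>()).add(pos);
        }
        return index;
    }

    // Exact phrase check: the i-th phrase term must sit at (start + i).
    static boolean phraseMatches(Map<String, Set<Integer>> index, String... phrase) {
        if (phrase.length == 0) return false;
        for (int start : index.getOrDefault(phrase[0], Collections.<Integer>emptySet())) {
            boolean ok = true;
            for (int i = 1; i < phrase.length && ok; i++) {
                ok = index.getOrDefault(phrase[i], Collections.<Integer>emptySet())
                        .contains(start + i);
            }
            if (ok) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        Map<String, Set<Integer>> index = positions(analyze("fresh green olives"));
        System.out.println(index.get("olive") + " " + index.get("olives"));
        System.out.println(phraseMatches(index, "green", "olives"));
        System.out.println(phraseMatches(index, "green", "olive"));
    }
}
```

In this sketch "olive" and "olives" land on the same position, so both the phrase "green olives" and the phrase "green olive" match the document.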

But you'll have to do something like add a $ to the original at index time.
That way, for exact matches you can search on olive$, boosted however
you want. When you want the stemmed version you can search for olive.
Or you could add a clause with the unstemmed version boosted. Or
something like that <G>. Note that whether you add the $ to the stemmed
or unstemmed version is up to you.

Watch what analyzer you use, to be sure it doesn't strip out the special
symbol.
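
A rough sketch of the marker idea, again in plain Java rather than Lucene. `MARK`, `toyStem`, `indexTerms` and `score` are hypothetical names, the stemmer is a toy that deliberately conflates olives, oliver and oliver's, and the weighted sum is only a stand-in for real scoring.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Plain-Java illustration only; none of these names are Lucene APIs.
class MarkedOriginalSketch {

    static final String MARK = "$";

    // Toy stemmer (an assumption) that maps olives / oliver / oliver's
    // to "olive", mirroring the collision described in the question.
    static String toyStem(String word) {
        String w = word;
        if (w.endsWith("'s")) w = w.substring(0, w.length() - 2);
        if (w.endsWith("s") && w.length() > 3) w = w.substring(0, w.length() - 1);
        if (w.endsWith("er") && w.length() > 4) w = w.substring(0, w.length() - 1);
        return w;
    }

    // Index-time view of one document: term -> frequency. Each word yields
    // its stem plus the original with MARK appended. Splitting on whitespace
    // keeps the "$"; a tokenizer that strips punctuation would lose it.
    static Map<String, Integer> indexTerms(String text) {
        Map<String, Integer> tf = new HashMap<>();
        for (String raw : text.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (raw.isEmpty()) continue;
            tf.merge(toyStem(raw), 1, Integer::sum);
            tf.merge(raw + MARK, 1, Integer::sum);
        }
        return tf;
    }

    // Two clauses per query word: the stem (weight 1) and the marked
    // original (weight exactBoost).
    static double score(Map<String, Integer> doc, String queryWord, double exactBoost) {
        String w = queryWord.toLowerCase(Locale.ROOT);
        return doc.getOrDefault(toyStem(w), 0)
                + exactBoost * doc.getOrDefault(w + MARK, 0);
    }

    public static void main(String[] args) {
        Map<String, Integer> olivesDoc = indexTerms("olives and feta");
        Map<String, Integer> oliverDoc = indexTerms("oliver twist");
        System.out.println(score(olivesDoc, "olives", 2.0));
        System.out.println(score(oliverDoc, "olives", 2.0));
    }
}
```

With a boost of 2.0, the document containing "olives" gets credit from both clauses, while the "oliver" document only matches on the shared stem and so ranks lower.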

Best
Erick

On Feb 11, 2008 12:56 PM, Michael Stoppelman <[hidden email]> wrote:

> Hi all,
> I've got an index with tokens that are stemmed. Sometimes I really need to
> boost the unstemmed version of a query word to get the most relevant
> documents.

Re: How to promote an unstemmed match over a stemmed match in an index that's stemmed...

Michael Stoppelman
Ah, very cool. Thanks for the tip.

-M


Re: How to promote an unstemmed match over a stemmed match in an index that's stemmed...

Jake Mannix
In reply to this post by Michael Stoppelman
The way I've always done this is to index two fields, say "contents"
and "contents_unstemmed" (using a PerFieldAnalyzerWrapper), and then query
on both of them. This has the double effect of a) boosting unstemmed
hits, because every unstemmed match is also a stemmed one, so the
BooleanQuery combining the stemmed and unstemmed queries scores higher
in this case; and b) allowing you to query by *only* the unstemmed
variant if, e.g., the user puts their search term in quotes,
indicating they really want an exact match.
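
A small sketch of the two-field idea in plain Java, not Lucene code. The field names come from the post; `toyStem`, `indexDocument` and `score` are hypothetical, the stemmer is a toy, queries are assumed to be a single word, and adding up term frequencies is only a crude stand-in for how a combined query would be scored.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Plain-Java illustration only; none of these names are Lucene APIs.
class TwoFieldSketch {

    static final String STEMMED = "contents";
    static final String UNSTEMMED = "contents_unstemmed";

    // Toy stemmer (an assumption) that maps olives / oliver / oliver's
    // to "olive".
    static String toyStem(String word) {
        String w = word;
        if (w.endsWith("'s")) w = w.substring(0, w.length() - 2);
        if (w.endsWith("s") && w.length() > 3) w = w.substring(0, w.length() - 1);
        if (w.endsWith("er") && w.length() > 4) w = w.substring(0, w.length() - 1);
        return w;
    }

    // One document as field -> (term -> frequency): the same text is
    // analyzed twice, once with stemming and once without.
    static Map<String, Map<String, Integer>> indexDocument(String text) {
        Map<String, Integer> stemmed = new HashMap<>();
        Map<String, Integer> unstemmed = new HashMap<>();
        for (String raw : text.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (raw.isEmpty()) continue;
            stemmed.merge(toyStem(raw), 1, Integer::sum);
            unstemmed.merge(raw, 1, Integer::sum);
        }
        Map<String, Map<String, Integer>> doc = new HashMap<>();
        doc.put(STEMMED, stemmed);
        doc.put(UNSTEMMED, unstemmed);
        return doc;
    }

    // Single-word query. A quoted query searches only the unstemmed field;
    // otherwise hits from both fields are added, so exact matches count twice.
    static int score(Map<String, Map<String, Integer>> doc, String query) {
        String q = query.toLowerCase(Locale.ROOT).trim();
        boolean exactOnly = q.length() >= 2 && q.startsWith("\"") && q.endsWith("\"");
        if (exactOnly) q = q.substring(1, q.length() - 1);
        int exactHits = doc.get(UNSTEMMED).getOrDefault(q, 0);
        if (exactOnly) return exactHits;
        return doc.get(STEMMED).getOrDefault(toyStem(q), 0) + exactHits;
    }

    public static void main(String[] args) {
        Map<String, Map<String, Integer>> olivesDoc = indexDocument("olives and feta");
        Map<String, Map<String, Integer>> oliverDoc = indexDocument("oliver twist");
        System.out.println(score(olivesDoc, "olives") + " vs " + score(oliverDoc, "olives"));
        System.out.println(score(olivesDoc, "\"olives\"") + " vs " + score(oliverDoc, "\"olives\""));
    }
}
```

Here the unquoted query ranks the "olives" document above the "oliver" one, and the quoted query drops the "oliver" document entirely.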

  -jake



