Searching on plurals and phrases in a single field


Lucifer Hammer
Hi,

We have a requirement to give our users the ability to search on exact
phrases within a field or, if they prefer, to match on plurals (either via
stems or another plural algorithm). However, the two modes are mutually
exclusive. For example, given the following field in the index:

IndexField1: "The quick brown dog jumped over the lazy fox"

If the user chooses an exact phrase search such as "brown dog jumped", then
it'll match. However, if they stay in exact phrase mode and search for
"brown dogs", it shouldn't match.

If the user chooses a plural search, then both of the above searches should
match.

So the question really is: can I do this all in one field, or will I
have to index the data twice, once in a field that holds the exact text and
again in a second field in which I index the terms and their word stems (or
plurals)?

If it's possible to do this in a single field, that would be much
preferred...

Thanks for any help!
Lucifer

Re: Searching on plurals and phrases in a single field

Erick Erickson
I faced a very similar requirement and solved it by indexing multiple
tokens at the same position. For instance, say you're indexing
the word "foxes". Index something like fox$ and foxes at the same
position (see SynonymAnalyzer in Lucene in Action for an example).
You almost certainly need to index the extra term with a position
increment of 0 (more on that below).

Take the example phrase "red foxes are plentiful".

With both fox$ and foxes in the index, you can distinguish between
the stemmed and unstemmed versions of the word, and you can search
for exactly "red foxes". If instead you want to search for the stemmed
version, you can search for "red fox$". But "red fox" will NOT match,
because the bare term fox was never indexed for this text.
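
As a rough illustration of the idea above, here is a toy model in plain Python. It is not Lucene code: naive_stem, analyze, build_index and phrase_match are hypothetical helpers, and the plural stripper is deliberately crude. In this sketch every word is indexed twice at the same position, once as written and once as a stem with a trailing $ marker.

```python
def naive_stem(word):
    """Very rough plural stripper, for illustration only (not a real stemmer)."""
    if word.endswith("es") and word[:-2].endswith(("x", "s", "ch", "sh")):
        return word[:-2]
    if word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def analyze(text):
    """Yield (term, position); the marked stem shares the original word's position."""
    for position, word in enumerate(text.lower().split()):
        yield word, position                    # exact form, e.g. "foxes"
        yield naive_stem(word) + "$", position  # marked stem, e.g. "fox$"


def build_index(text):
    """Map each term to the set of positions where it occurs."""
    index = {}
    for term, position in analyze(text):
        index.setdefault(term, set()).add(position)
    return index


def phrase_match(index, terms):
    """Exact phrase match: the terms must sit at consecutive positions."""
    return any(
        all(start + offset in index.get(term, set())
            for offset, term in enumerate(terms))
        for start in index.get(terms[0], set())
    )


index = build_index("red foxes are plentiful")
print(phrase_match(index, ["red", "foxes"]))  # True: exact form was indexed
print(phrase_match(index, ["red", "fox$"]))   # True: marked stem was indexed
print(phrase_match(index, ["red", "fox"]))    # False: bare "fox" was never indexed
```

In a real setup the query side would presumably apply the same stemming and marker when the user picks the plural search, and leave the terms untouched for an exact phrase search.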

The reason you need to index these with a position increment of 0 is
so phrase queries work. If you let the position advance for each token
and indexed a phrase like "red foxes are plentiful", then a
proximity search on "red plentiful"~2 would fail, because fox$ and
foxes would each take up one position, leaving three positions between
"red" and "plentiful". But if fox$ and foxes share the same position,
it'll work.
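
The position arithmetic behind that can be sketched in plain Python. This is a simplified model, not Lucene's implementation: positions and within_slop are hypothetical helpers, and the slop check only handles two terms in their original order, whereas Lucene's sloppy phrase matching is more general.

```python
def positions(tokens):
    """tokens is a list of (term, increment) pairs; returns {term: [positions]}.

    An increment of 1 moves to the next position; an increment of 0
    stacks the term on top of the previous one.
    """
    index = {}
    position = -1
    for term, increment in tokens:
        position += increment
        index.setdefault(term, []).append(position)
    return index


def within_slop(index, first, second, slop):
    """True if `second` follows `first` with at most `slop` positions in between."""
    return any(
        0 <= p2 - p1 - 1 <= slop
        for p1 in index.get(first, [])
        for p2 in index.get(second, [])
    )


# fox$ and foxes stacked at the same position (increment 0 for the second one)
stacked = positions(
    [("red", 1), ("fox$", 1), ("foxes", 0), ("are", 1), ("plentiful", 1)]
)
# every token advances the position, so the stem takes up a slot of its own
separate = positions(
    [("red", 1), ("fox$", 1), ("foxes", 1), ("are", 1), ("plentiful", 1)]
)

print(stacked["plentiful"], separate["plentiful"])     # [3] [4]
print(within_slop(stacked, "red", "plentiful", 2))     # True
print(within_slop(separate, "red", "plentiful", 2))    # False
```

With the stacked layout "plentiful" stays at position 3, exactly where it would be without the extra stem token, so slop values behave the way users expect.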

And it's all in the same index, one field, etc.

Hope this helps
Erick


Re: Searching on plurals and phrases in a single field

Lucifer Hammer
Hi Erick,

Thanks for the great idea, it's exactly the kind of suggestion I was looking
for!

Lucifer
