Making stop-words optional with DisMax?

ronbraun
I've followed the stop-word discussion with some interest, but I've
yet to find a solution that completely satisfies our needs.  I was
wondering if anyone could suggest some other options to try short of a
custom handler or building our own queries (DisMax does such a fine
job generally!).

We are using DisMax, and indexing media titles (books, music).  We
want our queries to be sensitive to stop-words, but not so sensitive
that we fail to match on missing or incorrect stop-words.  For
example, here are a set of queries and desired behavior:

* it -> matches It by Stephen King (high relevance) and other titles
with it therein, e.g. Some Like It Hot (lower relevance)
* the the -> matches music by The The, other titles with the therein
at lower relevance are fine
* the sound of music -> matches The Sound of Music high relevance
* a sound of music -> still matches The Sound of Music, lower relevance is fine
* the doors -> matches music by The Doors, even though it is indexed
just as "Doors" (our data supplier drops the definite article)
* the life -> matches titles The Life with high relevance, matches
titles of just Life with lower relevance

Basically, we want direct matches (including stop-words) to be highly
relevant and we use the phrase query mechanism for that, but we also
want matches if the user mis-remembers the correct (stopped)
prepositions or inserts a few irrelevant stop-words (like articles).
We see this in the wild with non-trivial frequency -- the wrong choice
of preposition ("on mice and men") or an article used that our data
supplier didn't include in the original version ("doors").

One thing we tried is to include both a stopped version and a
non-stopped version of the title in the qf field, in the hopes that
this would retrieve all titles without stop-words and still allow us
to include pure stop-word queries ("it").  However, DisMax constructs
queries such that mixing stopped and non-stopped fields doesn't work
as one might hope, as described well here:

http://www.nabble.com/DisMax-request-handler-doesn%27t-work-with-stopwords--td11015905.html#a11112461

Since qf controls the initial set of results retrieved for DisMax, and
we don't want to use a pure stopped set of fields there (because we
won't match on "it" as a query) nor a pure non-stopped set (won't get
results for "a sound of music"), we'd seem to be out of luck unless we
can figure out a way to augment the qf coverage.
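
To make the failure mode concrete, here is a rough Python sketch of how DisMax appears to assemble its per-term clauses, as far as can be told from the thread linked above. It is a simplified model, not the real parser (no boosts, pf, or mm), and the field names title_exact and title_stopped and the stop list are made up for illustration.

```python
# Illustrative model only: each query term becomes one clause that is a
# disjunction over the qf fields, and a field whose analyzer removes the
# term simply drops out of that term's clause.
STOP_WORDS = {"a", "an", "and", "it", "of", "on", "the"}

# Hypothetical qf fields: name -> set of words that field's analyzer strips.
QF_FIELDS = {
    "title_exact": set(),          # keeps stop-words
    "title_stopped": STOP_WORDS,   # removes stop-words
}


def build_clauses(query, fields=QF_FIELDS):
    """Return one list of 'field:term' alternatives per query term."""
    clauses = []
    for term in query.lower().split():
        alternatives = [
            f"{name}:{term}"
            for name, stops in fields.items()
            if term not in stops
        ]
        if alternatives:
            clauses.append(alternatives)
    return clauses


for clause in build_clauses("the doors"):
    print(" | ".join(clause))
```

In this model the clause for "the" can only be satisfied by title_exact. If every clause is required, a record indexed as plain "Doors" cannot match "the doors", even though title_stopped alone would have accepted it.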

We've tried relaxing query term requirements to allow a missing word
or two in the query via mm, but recall is amped up too much since
non-stop-words tend to be dropped and you get a lot of results that
match primarily just across stop-words.

We've also considered creating a sort of equivalence class for all
stop-words (defining synonyms to map stops to some special token)
which would allow mis-remembered stop-words to be conflated, but then
something like "it" would match anything that contained any stop-word
-- again, too high on the recall.

What I think we want is something like an "optional stop-word DisMax"
that would mark stops as optional and construct queries such that
stop-words aren't passed into fields that apply stop-word removal in
query clauses (if that makes sense).  Has anyone done anything similar
or found a better way to handle stops that exhibits the desired
behavior?

Thanks in advance for any thoughts!  And, being new to Solr, apologies
if I'm confused in my reasoning somewhere.

Ron

Re: Making stop-words optional with DisMax?

Otis Gospodnetic-2
Hi Ron,

I skimmed your email.  You are indexing book and music titles.  Those tend to be short.  Do you really benefit from removing stop words in the first place?  I'd try keeping all the stop words and seeing if that has any negative side-effects in your context.

Otis

--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
From: Ronald K. Braun <[hidden email]>
To: [hidden email]
Sent: Wednesday, March 26, 2008 11:41:46 AM
Subject: Making stop-words optional with DisMax?


Re: Making stop-words optional with DisMax?

ronbraun
In reply to this post by ronbraun
Hi Otis,

> I skimmed your email.  You are indexing book and music titles.  Those tend to be short.
> Do you really benefit from removing stop words in the first place?  I'd try keeping all the stop
> words and seeing if that has any negative side-effects in your context.

Thanks for your skim and response!  We do keep all stop-words -- as
you say, makes sense since we aren't dealing with long free text
fields and because some titles are pure stops.

The negative side-effects lie in stop-words being treated with the
same importance as non-stop-words for matching purposes.  This
manifests in two ways:  1. Users occasionally get the stop-words wrong
-- say, wrong choice of preposition, which torpedoes the query since
some of the query terms aren't present in the target.  For example "on
mice and men" may return nothing (no match for "on") even though it is
equivalent to "of mice and men" in a stopped sense.  2. Our original
indexed data doesn't always have leading articles and such.  For
example, we index on "Doors" since that is our sourced data but
frequently get queried for "The Doors".  Articles and prepositions
(the stuff of good stop-lists) seem to me to be in a fuzzier class --
use 'em if you have 'em during matching, but don't kill your queries
because of them.  Hence some desire to make them in some way
"optional" during matching.

Ron

Re: Making stop-words optional with DisMax?

Walter Underwood, Netflix
We use two fields, one with and one without stopwords. The exact
field has a higher boost than the other. That works pretty well.
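
A minimal sketch of what such a request might look like, assuming two hypothetical fields (title_exact keeps stop-words, title_stopped removes them) and a handler already configured for DisMax. The boost values are placeholders, not tuned numbers from any real deployment.

```python
from urllib.parse import urlencode

# Hypothetical field names and guessed weights, for illustration only.
params = {
    "q": "the sound of music",
    # The exact field is boosted above the stopped field.
    "qf": "title_exact^4 title_stopped^1",
    # The same idea applied to phrase matches.
    "pf": "title_exact^8 title_stopped^2",
}

query_string = urlencode(params)
print(query_string)
```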

It helps to have an automated relevance test when tuning the boost
(and other things). I extracted queries and clicks from the logs
for a couple of months. Not perfect, but it is hard to argue with
32 million clicks.
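
One possible shape for such a test, sketched in Python. The post does not say which metric was used; mean reciprocal rank of the clicked document is an assumption here, and fake_search stands in for a real Solr call made with the boosts under test.

```python
def mean_reciprocal_rank(click_log, search):
    """Score a ranking function against logged clicks.

    click_log: iterable of (query, clicked_doc_id) pairs.
    search: callable returning a ranked list of doc ids for a query.
    """
    total = 0.0
    count = 0
    for query, clicked in click_log:
        ranked = search(query)
        # A clicked document that is not returned contributes 0.
        if clicked in ranked:
            total += 1.0 / (ranked.index(clicked) + 1)
        count += 1
    return total / count if count else 0.0


def fake_search(query):
    """Toy stand-in for querying Solr with one particular boost setting."""
    index = {
        "the doors": ["doors-best-of", "the-doors-1967"],
        "it": ["it-king"],
    }
    return index.get(query, [])


log = [
    ("the doors", "the-doors-1967"),
    ("it", "it-king"),
    ("on mice and men", "of-mice-and-men"),
]
print(mean_reciprocal_rank(log, fake_search))
```

Running the same log against two boost settings and comparing the scores gives a quick signal of whether a change helped overall, not just on the handful of queries one thought to try by hand.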

wunder


Re: Making stop-words optional with DisMax?

hossman
In reply to this post by ronbraun

: frequently get queried for "The Doors".  Articles and prepositions
: (the stuff of good stop-lists) seem to me to be in a fuzzier class --
: use 'em if you have 'em during matching, but don't kill your queries
: because of them.  Hence some desire to make them in some way
: "optional" during matching.

sure, but what logic would you suggest be used to decide when to make them
optional?  :)

based on your problem description (which was excellent by the way ...
questions full of details are so great, you never have to worry that you
are misunderstanding the problem)  the best suggestion i can give is one
that i usually discourage:  execute multiple queries.

start by hitting Solr using a qf with fields that contain stop words.  if
you get 0 hits, then query with a qf that contains all fields that don't
have stop words in them, (but you can leave them in pf).
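
The two-pass approach could be sketched roughly like this in Python. The search argument is a hypothetical callable that sends the given params to Solr and returns a list of hits, and the field names title_exact and title_stopped are invented for the example.

```python
def search_with_fallback(search, user_query):
    """Try the stop-word-preserving field first, then fall back.

    search: callable taking a dict of request params, returning a list of hits.
    """
    # pf keeps both fields in either pass so exact phrases are still boosted.
    pf = "title_exact title_stopped"

    # First pass: qf over the field that still contains stop-words.
    hits = search({"q": user_query, "qf": "title_exact", "pf": pf})
    if hits:
        return hits

    # Second pass, only on zero hits: qf over the stopped field.
    return search({"q": user_query, "qf": "title_stopped", "pf": pf})
```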

In an ideal world, the DisMax handler would let you specify N qf options,
and each one would be used to build a separate DisjunctionMaxQuery and
then they'd all be combined into the uber BooleanQuery as optional clauses
-- but in the absence of that, two queries is probably your best bet.

(hmmm... actually qf is currently a single value param -- multiple values
aren't supported -- so if someone wrote a patch to do something like i
described it would be backward compatible ... anybody interested?)


-Hoss


Re: Making stop-words optional with DisMax?

Otis Gospodnetic-2
In reply to this post by ronbraun
If you have "doors" in your index and a person enters "the doors", why not just drop stop-words at query time?
If a person searches for "music by the doors", really using quotes to get the exact phrase, and you have "music doors" in the index, you can try it as Hoss said and retry without stop words if you get an inadequate response from the first query. Or you could drop the stop words from the phrase but add some slop to the phrase to account for the gaps.
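
A small Python sketch of the second option, assuming a made-up stop list and a naive whitespace split. Using one unit of slop per dropped word is a guess at a reasonable default, not a recommendation from this thread.

```python
STOP_WORDS = {"a", "an", "and", "by", "of", "on", "the"}  # illustrative list


def stopped_phrase(phrase, stop_words=STOP_WORDS):
    """Drop stop-words from a phrase and add slop for the gaps they leave."""
    terms = phrase.lower().split()
    kept = [t for t in terms if t not in stop_words]
    dropped = len(terms) - len(kept)
    if not kept:
        return None  # the phrase was all stop-words; nothing left to search
    quoted = '"' + " ".join(kept) + '"'
    # Lucene query syntax writes phrase slop as "..."~N.
    return f"{quoted}~{dropped}" if dropped else quoted


print(stopped_phrase("music by the doors"))
```

Note the None case: a phrase such as "the the" has nothing left after stopping, so it still needs to go to a field that keeps stop-words.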

Otis

--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
From: Ronald K. Braun <[hidden email]>
To: [hidden email]
Sent: Wednesday, March 26, 2008 9:05:08 PM
Subject: Re: Making stop-words optional with DisMax?


Re: Making stop-words optional with DisMax?

ronbraun
In reply to this post by ronbraun
> We use two fields, one with and one without stopwords. The exact
> field has a higher boost than the other. That works pretty well.

Thanks for the tip, wunder!  We are doing likewise for our pf parm of
DisMax and that part works well -- exact matches are highly relevant
and stopped-matches less so but still present in the results set.  The
main problem is getting past the qf parm such that we don't have
invisible titles (stop-words removed by the qf pipeline leaving an
empty query) or over-specified generated queries (where stop-words
turn out to be required but can't match for various reasons).

> It helps to have an automated relevance test when tuning the boost
> (and other things). I extracted queries and clicks from the logs
> for a couple of months. Not perfect, but it is hard to argue with
> 32 million clicks.

I'd say -- a dream data set.  :-)  Good idea on the relevance test --
eyeballing boost changes seems definitely prone to unexpected effects
across all of the queries one didn't think to try.  (A dark art, boost
tuning...)

Ron

Re: Making stop-words optional with DisMax?

ronbraun
In reply to this post by ronbraun
> sure, but what logic would you suggest be used to decide when to make them
> optional?  :)

Operationally, I was thinking a tokenizer could use the stop-word list
(or an optional-word list) to mark tokens as optional rather than
removing them from the token stream.  DisMaxOptional would then
generate appropriate queries with the non-optionals as the core and
then permute the optionals around those as optional clauses.  I say
this with no deep understanding of how DisMax does its thing, of
course, so feel free to call me naive.

As to what words to put in the optionals list, the function words
(articles and prepositions) seem to be the ones that folks either omit
or confuse, so they'd be good candidates.

> start by hitting Solr using a qf with fields that contain stop words.  if
> you get 0 hits, then query with a qf that contains all fields that don't
> have stop words in them, (but you can leave them in pf).

I think I've so internalized list advice *not* to generate multiple
queries that that didn't readily occur to me.  :-)   One problem I
suppose is that query might return some results but not the desired
one (perhaps there is a title On Men and Mice) and so I don't get to
the second query ("mice men" once stopped) that would get me Of Mice
and Men.  But an improvement in cases where no results come back from
an overspecified query, I'd agree.

The other thought I've had is to just do some query analysis up front
prior to submission -- if the query is all stops, send it to a
separate handler that doesn't do stop-word removal in the qf
specification, otherwise if any non-stop-word exists, send it to a
handler with a qf that does remove stops and rely on the pf component
to boost up exact matches.  I hate the analysis step which would
probably duplicate the tokenization done by solr, but might be worth
it.  There'd still be some problematic queries, but this may be as
close as it'll get.
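
For illustration, the up-front routing could look something like the Python sketch below. The handler paths and the stop list are hypothetical, and the whitespace split is exactly the kind of duplicated, approximate tokenization mentioned above.

```python
STOP_WORDS = {"a", "an", "and", "it", "of", "on", "the"}  # illustrative list


def choose_handler(query, stop_words=STOP_WORDS):
    """Pick a request handler based on whether the query is all stop-words."""
    terms = query.lower().split()
    if terms and all(t in stop_words for t in terms):
        # Pure stop-word query ("it", "the the"): qf must keep stop-words.
        return "/select_exact"
    # Otherwise use a qf that removes stops and let pf boost exact matches.
    return "/select_stopped"


for q in ["it", "the the", "a sound of music"]:
    print(q, "->", choose_handler(q))
```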

Thanks for the suggestions, Hoss!

Ron

Re: Making stop-words optional with DisMax?

hossman

: Operationally, I was thinking a tokenizer could use the stop-word list
: (or an optional-word list) to mark tokens as optional rather than
: removing them from the token stream.  DisMaxOptional would then
: generate appropriate queries with the non-optionals as the core and
: then permute the optionals around those as optional clauses.  I say
: this with no deep understanding of how DisMax does its thing, of
: course, so feel free to call me naive.

you're not naive ... the problem is just that *all* of the clauses are
already optional (unless the term had a "+" or "-" in front of it),
that's where the mm param comes in, it decides how many of those optional
clauses should be mandatory.

it sounds like what you want is for a new DisMaxOptional parser to look at
this...

    on mice and men

and because it knows "on" and "and" are stop words, treat it the same as
if the current DisMax parsed this...

    on +mice and +men

which is another interesting idea, but it changes the meaning of "mm"
significantly, in that dismax with a low mm would no longer be tolerant of
misspelled (or missing) words unless they were stop words.
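
The rewrite described above can be imitated on the client side. This is only a sketch of the idea in Python, with an invented stop list; no such DisMaxOptional parser exists in this thread.

```python
STOP_WORDS = {"a", "an", "and", "it", "of", "on", "the"}  # illustrative list


def mark_required(query, stop_words=STOP_WORDS):
    """Prefix every non-stop-word term with '+' and leave stop-words optional."""
    out = []
    for term in query.split():
        if term.lower() in stop_words or term[0] in "+-":
            out.append(term)        # stop-word or already marked: leave as is
        else:
            out.append("+" + term)  # content word: make it mandatory
    return " ".join(out)


print(mark_required("on mice and men"))
```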

my gut tells me changing dismax so that having multiple qf params result
in multiple dismax queries would address your problem more directly.

: I think I've so internalized list advice *not* to generate multiple
: queries that that didn't readily occur to me.  :-)   One problem I
: suppose is that query might return some results but not the desired
: one (perhaps there is a title On Men and Mice) and so I don't get to
: the second query ("mice men" once stopped) that would get me Of Mice
: and Men.  But an improvement in cases where no results come back from
: an overspecified query, I'd agree.

...which is why multiple dismax queries as clauses in the main query
would be good ... the results from each would be blended together.

: The other thought I've had is to just do some query analysis up front
: prior to submission -- if the query is all stops, send it to a
        ...
: to boost up exact matches.  I hate the analysis step which would
: probably duplicate the tokenization done by solr, but might be worth
: it.  There'd still be some problematic queries, but this may be as
: close as it'll get.

you could probably skip the external analysis by swapping the order of
your queries and looking at the debugging output when hitting the "second"
query ... if your stopworded fields don't appear in the parsed query
structure, then it's all stopwords, so you do need your "first" query.
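
A rough Python sketch of that check, assuming the response was requested with debugQuery=true and parsed from JSON so the parsed query sits under debug / parsedquery. The field name title_stopped is invented, and the sample strings below are hand-written, only loosely shaped like real debug output.

```python
def is_all_stop_words(response, stopped_fields):
    """True if no stopped field shows up in the parsed query string."""
    parsed = response.get("debug", {}).get("parsedquery", "")
    return not any(f"{field}:" in parsed for field in stopped_fields)


# Hand-written examples, not captured from a real Solr instance.
all_stops = {"debug": {"parsedquery": "+() ()"}}
has_content = {"debug": {"parsedquery": "+(title_stopped:doors) ()"}}

print(is_all_stop_words(all_stops, ["title_stopped"]))
print(is_all_stop_words(has_content, ["title_stopped"]))
```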


-Hoss