Index & search questions; special cases

Michael Imbeault
Hello again,

- Let's say I index "HIV-1" with <filter
class="solr.WordDelimiterFilterFactory" generateWordParts="1"
generateNumberParts="1" catenateWords="1" catenateNumbers="1"
catenateAll="1"/>. Would a search on HIV AND 1 (or even HIV-1, which
after parsing by the above filter would yield HIV1 or HIV 1) also find
documents which have HIV and the number "1" somewhere in the document,
but not directly after HIV? If so, how should I fix this? I could boost
score by proximity, but I'm doing a sort on date anyway, so I guess it
would be pointless to do so.

- Somewhat related : Let's say I index "Polymyxin B". If I stopword
single letters, would a phrase search ("Polymyxin B") still find the
right documents (I don't think so, but still)? If not, I'll have to
index single letters; how do I prevent the same problem as in the first
question (i.e., a search on Polymyxin B yielding documents with
Polymyxin and B, but not close to one another).

My thought is to parse the user query and rephrase it to do phrase
searches on nearby terms containing single letters / numbers. If a user
searches for HIV 1 hepatitis, I'd rewrite it as ("HIV 1" AND hepatitis) OR
("1 hepatitis" AND hiv). Is it a sensible solution?

Thanks,

--
Michael Imbeault
CHUL Research Center (CHUQ)
2705 boul. Laurier
Ste-Foy, QC, Canada, G1V 4G2
Tel: (418) 654-2705, Fax: (418) 654-2212

Re: Index & search questions; special cases

Chris Hostetter-3

: - Let's say I index "HIV-1" with <filter
: class="solr.WordDelimiterFilterFactory" generateWordParts="1"
: generateNumberParts="1" catenateWords="1" catenateNumbers="1"
: catenateAll="1"/>. Would a search on HIV AND 1 (or even HIV-1, which
: after parsing by the above filter would yield HIV1 or HIV 1) also find
: documents which have HIV and the number "1" somewhere in the document,
: but not directly after HIV? If so, how should I fix this? I could boost
: score by proximity, but I'm doing a sort on date anyway, so I guess it
: would be pointless to do so.

A couple of things make your question really hard to answer ... first off,
you can specify different analyzer chains for index time and query time --
when dealing with the WordDelim filter (or the synonym filter) this is
frequently necessary -- so the answers to your questions really depend on
whether you use WordDelim at both index time and query time (or if you do
use it in both cases, but configure it differently).

Have you by any chance played with the "Analysis" page on your Solr index?
  http://localhost:8983/solr/admin/analysis.jsp?name=&verbose=on&highlight=on&qverbose=on&

...it makes it really easy to see exactly how your various fields will get
parsed at index time and query time.  I would also suggest you use the
"debugQuery=on" option when doing some searches -- even if there aren't
any documents in your index, that will help you see how your query is
getting parsed and what Query structure QueryParser is building based on
the tokens it gets from each of the Analyzers.
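
To make the debugQuery suggestion concrete, here is a minimal Python sketch
that builds such request URLs. The host and port are copied from the Analysis
page link above; the /solr/select path and the helper name debug_query_url
are assumptions for illustration only, so adjust them to match your install.

```python
from urllib.parse import urlencode

# Assumed base URL: host and port come from the Analysis page link above;
# the /solr/select path is an assumption about a default install.
SOLR_SELECT = "http://localhost:8983/solr/select"


def debug_query_url(query, base=SOLR_SELECT):
    """Return a select URL that asks Solr to report how `query` was parsed."""
    params = {"q": query, "debugQuery": "on"}
    return base + "?" + urlencode(params)


if __name__ == "__main__":
    # Compare how the three spellings from the question get parsed.
    for q in ("HIV AND 1", "HIV-1", '"HIV 1"'):
        print(debug_query_url(q))
```

Opening each printed URL and comparing the parsed-query section of the
response should show whether the spellings end up as the same Query
structure.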

: - Somewhat related : Let's say I index "Polymyxin B". If I stopword
: single letters, would a phrase search ("Polymyxin B") still find the
: right documents (I don't think so, but still)? If not, I'll have to

depends on what the "right documents" are .. if you strip stopwords out
both at index time and at query time then it will ultimately match exactly
the same thing as a query on "Polymyxin" which i guess must be the "right
documents" since no documents will contain the letter "B" so what else
could be right? :)
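
A toy illustration of that point, in Python rather than a real Lucene
analyzer: the stopword list (every single letter) and the analyze helper are
hypothetical, and the sketch only lowercases, splits on whitespace and drops
stopwords.

```python
# Hypothetical stopword list: every single ASCII letter.
STOPWORDS = set("abcdefghijklmnopqrstuvwxyz")


def analyze(text):
    """Toy analyzer: lowercase, split on whitespace, drop stopwords."""
    return [t for t in text.lower().split() if t not in STOPWORDS]


if __name__ == "__main__":
    # With the same analysis on both sides, the "B" is gone everywhere.
    print(analyze("Polymyxin B"))
    print(analyze("Polymyxin"))
```

Both calls produce the same token list, so with this analysis on the index
side and the query side the two queries cannot be told apart.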

: index single letters; how do I prevent the same problem as in the first
: question (i.e., a search on Polymyxin B yielding documents with
: Polymyxin and B, but not close to one another).
:
: My thought is to parse the user query and rephrase it to do phrase
: searches on nearby terms containing single letters / numbers. If a user
: searches for HIV 1 hepatitis, I'd rewrite it as ("HIV 1" AND hepatitis) OR
: ("1 hepatitis" AND hiv). Is it a sensible solution?

that's kind of a strange behavior for a search application to have ... you
might just want to trust that your users will be smart and if they find
that 'HIV 1 hepatitis' is matching docs where "1" doesn't appear near
"HIV" or "hepatitis" then they will start entering '"HIV 1" hepatitis' (or
'HIV "1 hepatitis"' if that's what they meant.)




-Hoss

Re: Index & search questions; special cases

Michael Imbeault
Chris Hostetter wrote:
> A couple of things make your question really hard to answer ... first off,
> you can specify different analyzer chains for index time and query time --
> when dealing with the WordDelim filter (or the synonym filter) this is
> frequently necessary -- so the answers to your questions really depend on
> whether you use WordDelim at both index time and query time (or if you do
> use it in both cases, but configure it differently)
>  
For clarification, I'm using the filter both at index and query time.

> Have you by any chance played with the "Analysis" page on your Solr index?
>   http://localhost:8983/solr/admin/analysis.jsp?name=&verbose=on&highlight=on&qverbose=on&
>
> ...it makes it really easy to see exactly how your various fields will get
> parsed at index time and query time.  I would also suggest you use the
> "debugQuery=on" option when doing some searches -- even if there aren't
> any documents in your index, that will help you see how your query is
> getting parsed and what Query structure QueryParser is building based on
> the tokens it gets from each of the Analyzers.
>  
Will try that, played with it in the past, but not for this particular
problem, good idea :)

> : My thought is to parse the user query and rephrase it to do phrase
> : searches on nearby terms containing single letters / numbers. If a user
> : searches for HIV 1 hepatitis, I'd rewrite it as ("HIV 1" AND hepatitis) OR
> : ("1 hepatitis" AND hiv). Is it a sensible solution?
>
> that's kind of a strange behavior for a search application to have ... you
> might just want to trust that your users will be smart and if they find
> that 'HIV 1 hepatitis' is matching docs where "1" doesn't appear near
> "HIV" or "hepatitis" then they will start entering '"HIV 1" hepatitis' (or
> 'HIV "1 hepatitis"' if that's what they meant.)
>  
Sadly I can't rely on users' smartness for this :) I have concerns that
for stuff like Hepatitis A, it will match just about every document
containing hepatitis and the very common 'a' word, anywhere in the
document. I can't stopword single letters, because then there would be no
way to find documents about 'hepatitis c' and not about 'hepatitis b'
for example. I will test my solution and report; if you have any other
ideas, just tell me.

And thanks for the help! :)

Re: Index & search questions; special cases

Walter Underwood, Netflix
On 11/12/06 8:52 PM, "Michael Imbeault" <[hidden email]>
wrote:

> Sadly I can't rely on users' smartness for this :) I have concerns that
> for stuff like Hepatitis A, it will match just about every document
> containing hepatitis and the very common 'a' word, anywhere in the
> document. I can't stopword single letters, because then there would be no
> way to find documents about 'hepatitis c' and not about 'hepatitis b'
> for example. I will test my solution and report; if you have any other
> ideas, just tell me.

Nutch has phrase pre-filtering which helps with this. It indexes the
phrase fragments as separate terms and uses that set of matches to
filter the set of matching documents.

Another approach is to implement protected phrases, similar to the
protected words in stemming. These would be protected from stopword
processing.

A list of exception words and phrases is a pretty common trick in
other engines. Otherwise, you go nuts trying to get your analyzer
to handle ".NET" and "vitamin a". I know that AltaVista and Inktomi
did this.

wunder
--
Walter Underwood
Search Guru, Netflix

 

Re: Index & search questions; special cases

Yonik Seeley-2
On 11/13/06, Walter Underwood <[hidden email]> wrote:
> Another approach is to implement protected phrases, similar to the
> protected words in stemming. These would be protected from stopword
> processing.

One could use the synonym filter (which can handle multi-token
synonyms) to get this effect.

WordDelimiterFilter => SynonymFilter => StopwordFilter => Stemmer

The SynonymFilter could have the following config:
hepatitis a, hepatitis_a

Do expand="true" on the indexing analyzer, and expand="false" on the
query analyzer

Then, a doc with "hepatitis a" will end up indexing "hepatitis" and
"hepatitis_a"
And at query time all the following searches will find the doc:
   text:hepatitis
   text:"hepatitis a"
   text:"hepatitis-a"

> A list of exception words and phrases is a pretty common trick in
> other engines. Otherwise, you go nuts trying to get your analyzer
> to handle ".NET" and "vitamin a". I know that AltaVista and Inktomi
> did this.

That's not a bad idea... most of the code from the multi-token
SynonymFilter could be reused to efficiently recognize multi-token
matches.

-Yonik

Re: Index & search questions; special cases

Chris Hostetter-3
In reply to this post by Walter Underwood, Netflix

: > Sadly I can't rely on users' smartness for this :) I have concerns that
: > for stuff like Hepatitis A, it will match just about every document
: > containing hepatitis and the very common 'a' word, anywhere in the
: > document. I can't stopword single letters, because then there would be no
: > way to find documents about 'hepatitis c' and not about 'hepatitis b'

: Nutch has phrase pre-filtering which helps with this. It indexes the
: phrase fragments as separate terms and uses that set of matches to
: filter the set of matching documents.

That reminds me ... i seem to remember someone saying once that Nutch also
builds word based n-grams out of its stop words, so searches on "the"
or "on" won't match anything because those words are never indexed as
single tokens, but if a document contains "the dog in the house" it would
match a search on "in the" because the Analyzer would treat that as a
single token "in_the".

something like that might work as well.
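
The bigram idea can be sketched in a few lines of Python. This is a toy
illustration of the concept as remembered above, not the Nutch code: the
function name common_grams, the "_" joiner and the small word list are
assumptions.

```python
def common_grams(tokens, common):
    """Emit every token, plus a bigram 'x_y' wherever x or y is a common word."""
    out = []
    for i, tok in enumerate(tokens):
        out.append(tok)
        if i + 1 < len(tokens):
            nxt = tokens[i + 1]
            if tok in common or nxt in common:
                out.append(tok + "_" + nxt)
    return out


if __name__ == "__main__":
    common = {"the", "in", "on"}  # hypothetical common-word list
    print(common_grams("the dog in the house".split(), common))
```

If the bare common words are then dropped by a stop filter, "in_the"
survives as an indexed term while "the" and "in" on their own do not.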



-Hoss

Re: Index & search questions; special cases

Yonik Seeley-2
In reply to this post by Michael Imbeault
On 11/12/06, Michael Imbeault <[hidden email]> wrote:
> - Somewhat related : Let's say I index "Polymyxin B". If I stopword
> single letters, would a phrase search ("Polymyxin B") still find the
> right documents (I don't think so, but still)? If not, I'll have to
> index single letters; how do I prevent the same problem as in the first
> question (i.e., a search on Polymyxin B yielding documents with
> Polymyxin and B, but not close to one another).

The general problem seems to be that you can't tell what should be in a
phrase search and what shouldn't.

You could try throwing everything in a sloppy phrase query, so at
least scores will go up when terms are closer together (in general).

You could also try an exact phrase query, and if you don't get enough
results, follow it up with another strategy (like what you have
below).
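
A minimal Python sketch of the second suggestion, using a sloppy phrase as
the follow-up strategy. The "..."~N form is Lucene's proximity syntax;
everything else (function names, the run_query callable, the min_hits and
slop defaults, and the fake index in the demo) is hypothetical.

```python
def exact_phrase(terms):
    """Build an exact phrase query string from a list of terms."""
    return '"%s"' % " ".join(terms)


def sloppy_phrase(terms, slop):
    """Build a proximity query: terms may be up to `slop` positions apart."""
    return '"%s"~%d' % (" ".join(terms), slop)


def search_with_fallback(terms, run_query, min_hits=10, slop=10):
    """Try the exact phrase first; widen to a sloppy phrase if too few hits.

    `run_query` is a caller-supplied function that takes a query string
    and returns a list of hits.
    """
    hits = run_query(exact_phrase(terms))
    if len(hits) >= min_hits:
        return hits
    return run_query(sloppy_phrase(terms, slop))


if __name__ == "__main__":
    # Fake index standing in for a real search call.
    fake_index = {
        '"HIV 1 hepatitis"': ["doc7"],
        '"HIV 1 hepatitis"~10': ["doc7", "doc12", "doc31"],
    }
    lookup = lambda q: fake_index.get(q, [])
    print(search_with_fallback(["HIV", "1", "hepatitis"], lookup, min_hits=2))
```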

> My thought is to parse the user query and rephrase it to do phrase
> searches on nearby terms containing single letters / numbers. If a user
> searches for HIV 1 hepatitis, I'd rewrite it as ("HIV 1" AND hepatitis) OR
> ("1 hepatitis" AND hiv). Is it a sensible solution?

That might work.
Whatever general strategy you end up trying, you can probably boost
relevancy with some domain specific knowledge injected with something
like the SynonymFilter.
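
For reference, the rewrite quoted above could be sketched like this in
Python. It is one possible reading of the proposal, not a tested production
rewriter: the helper names and the definition of a "short" term (a single
letter or an all-digit token) are assumptions, and queries with several
adjacent short terms would need more care.

```python
def is_short(term):
    """A term is 'short' if it is a single letter or consists only of digits."""
    return term.isdigit() or (len(term) == 1 and term.isalpha())


def rewrite(query):
    """Glue each short term to its left or right neighbour as a phrase."""
    terms = query.split()
    alternatives = []
    for i, term in enumerate(terms):
        if not is_short(term):
            continue
        # Try the pair (left neighbour, term), then (term, right neighbour).
        for lo in (i - 1, i):
            hi = lo + 1
            if lo < 0 or hi >= len(terms):
                continue
            phrase = '"%s %s"' % (terms[lo], terms[hi])
            rest = terms[:lo] + terms[hi + 1:]
            clause = " AND ".join([phrase] + rest)
            alternatives.append("(" + clause + ")" if rest else phrase)
    if not alternatives:
        return query  # nothing to rewrite
    return " OR ".join(alternatives)


if __name__ == "__main__":
    print(rewrite("HIV 1 hepatitis"))
    print(rewrite("Polymyxin B"))
```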

-Yonik

Re: Index & search questions; special cases

Yonik Seeley-2
In reply to this post by Yonik Seeley-2
On 11/13/06, Yonik Seeley <[hidden email]> wrote:
> The SynonymFilter could have the following config:
> hepatitis a, hepatitis_a

Oops, the synonyms should be reversed like so:
hepatitis_a, hepatitis a
so that when expand="false" for querying, hepatitis a is mapped to hepatitis_a
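
A toy Python model of that rule may help show why the order matters. It
assumes the usual reading of a comma-separated rule (expand="false" maps
every entry to the first one, expand="true" maps every entry to all of
them), handles a single rule only, and flattens the output into a plain
list instead of stacking tokens at the same position as the real filter
would. The names apply_synonyms and RULE are hypothetical.

```python
RULE = ["hepatitis_a", "hepatitis a"]  # the corrected rule, first entry wins


def apply_synonyms(tokens, rule, expand):
    """Toy multi-token synonym matcher for one comma-separated rule.

    expand=False: any entry of the rule is replaced by the first entry.
    expand=True:  any entry is replaced by the tokens of every entry.
    """
    entries = [e.split() for e in rule]
    targets = entries if expand else entries[:1]
    replacement = []
    for entry in targets:
        for tok in entry:
            if tok not in replacement:
                replacement.append(tok)
    out, i = [], 0
    while i < len(tokens):
        # Prefer the longest matching entry at this position.
        for entry in sorted(entries, key=len, reverse=True):
            if tokens[i:i + len(entry)] == entry:
                out.extend(replacement)
                i += len(entry)
                break
        else:
            out.append(tokens[i])
            i += 1
    return out


if __name__ == "__main__":
    print(apply_synonyms("chronic hepatitis a".split(), RULE, expand=True))   # index side
    print(apply_synonyms("hepatitis a".split(), RULE, expand=False))          # query side
```

With the stop filter running afterwards, the bare "a" is dropped on the
index side while "hepatitis_a" stays searchable.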

-Yonik

Re: Index & search questions; special cases

Erik Hatcher
In reply to this post by Chris Hostetter-3

On Nov 13, 2006, at 1:51 PM, Chris Hostetter wrote:
> That reminds me ... i seem to remember someone saying once that Nutch also
> builds word based n-grams out of its stop words, so searches on "the"
> or "on" won't match anything because those words are never indexed as
> single tokens, but if a document contains "the dog in the house" it would
> match a search on "in the" because the Analyzer would treat that as a
> single token "in_the".


Yup.... we covered this in LIA:

        <http://lucenebook.com/search?query=nutch+stop+words>


Re: Index & search questions; special cases

Otis Gospodnetic-2
In reply to this post by Michael Imbeault
Indeed.  CommonGrams.java in Nutch is the place to look.

Otis

Re: Index & search questions; special cases

Michael Imbeault
In reply to this post by Chris Hostetter-3
Hello everyone,

Thanks for all your answers; synonym-based approaches won't work
because the medical / research field is evolving way too fast; the list
would be huge and would become unmaintainable very quickly. Anyway,
I can't rely on score because I'm sorting by date, so I need to
completely eliminate the problem of 'hiv' appearing in one part of the
doc and '1' in another (if I want docs that fit HIV-1, or Polymyxin B, or
hepatitis A - I don't want docs that fit 'A patient was cured of
hepatitis C' if I search for 'hepatitis a').
> : Nutch has phrase pre-filtering which helps with this. It indexes the
> : phrase fragments as separate terms and uses that set of matches to
> : filter the set of matching documents.
>  
Is this a filter that I could implement easily into Solr? I never did
java, but it can't be that complicated I guess. Any help would be
appreciated.

> That reminds me ... i seem to remember someone saying once that Nutch also
> builds word based n-grams out of its stop words, so searches on "the"
> or "on" won't match anything because those words are never indexed as
> single tokens, but if a document contains "the dog in the house" it would
> match a search on "in the" because the Analyzer would treat that as a
> single token "in_the".
>  

This looks like exactly what I'm looking for. Is it related to the above
'nutch pre-filtering'? This way if I stopword single letters and
numbers, it would still index 'hepatitis_a' as a single token, and match
a search on 'hepatitis a' (non-phrase search) without hitting 'a patient
has hepatitis'? I guess I'd have to apply the filter to the query too,
so it turns the query into hepatitis_a?

Basically, it's another way to do what I proposed as a solution - rewrite
the query to include phrase queries when you find a stopword, if you
index them anyway. Still, this solution looks better, as the size of the
index would probably be smaller than if I didn't stopword single letters
at all? For reference, what I proposed was:

> My thought is to parse the user query and rephrase it to do phrase
> searches on nearby terms containing single letters / numbers. If a
> user searches for HIV 1 hepatitis, I'd rewrite it as ("HIV 1" AND
> hepatitis) OR ("1 hepatitis" AND hiv). Is it a sensible solution?

Any chance at all this kind of filter gets implemented into Solr? If
not, pointers on how to do it myself would be appreciated - I can't
say I have a clue right now (never did Java, the only Lucene programming
I did was via a PHP bridge).

Thanks for the help,

Michael Imbeault
CHUL Research Center (CHUQ)
2705 boul. Laurier
Ste-Foy, QC, Canada, G1V 4G2
Tel: (418) 654-2705, Fax: (418) 654-2212



Re: Re: Index & search questions; special cases

Mike Klaas
On 11/13/06, Michael Imbeault <[hidden email]> wrote:
> Hello everyone,
>
> Thanks for all your answers; synonym-based approaches won't work
> because the medical / research field is evolving way too fast; it would

Another approach is to extract the terms explicitly.  An
easy-to-implement option is the C/NC (C-value/NC-value) automatic term
recognition (ATR) algorithm.

-Mike

Re: Index & search questions; special cases

Chris Hostetter-3
In reply to this post by Michael Imbeault

: > : Nutch has phrase pre-filtering which helps with this. It indexes the
: > : phrase fragments as separate terms and uses that set of matches to
: > : filter the set of matching documents.

: > That reminds me ... i seem to remember someone saying once that Nutch also
: > builds word based n-grams out of its stop words, so searches on "the"
: > or "on" won't match anything because those words are never indexed as
: > single tokens, but if a document contains "the dog in the house" it would
: > match a search on "in the" because the Analyzer would treat that as a
: > single token "in_the".

: This looks like exactly what I'm looking for. Is it related to the above
: 'nutch pre-filtering'? This way if I stopword single letters and
: numbers, it would still index 'hepatitis_a' as a single token, and match
: a search on 'hepatitis a' (non-phrase search) without hitting 'a patient
: has hepatitis'? I guess I'd have to apply the filter to the query too,
: so it turns the query into hepatitis_a?

right ... i think we were both talking about the same feature, which Otis
says is in Nutch's "CommonGrams" class...

http://svn.apache.org/viewvc/lucene/nutch/trunk/src/java/org/apache/nutch/analysis/CommonGrams.java?view=markup

: Any chance at all this kind of filter gets implemented into solr? If
: not, indications on how to do it myself would be appreciated - I can't

CommonGrams itself seems to have some other dependencies on nutch because
of other utilities in the same class, but based on a quick skim, what you
really want is the nested "private static class Filter extends
TokenFilter" which doesn't really have any external dependencies.  If you
extract that class into some more specifically named "CommonGramsFilter",
all you need after that to use it in Solr is a simple little
"FilterFactory" so you can reference it in your schema.xml ... you can use
the StopFilterFactory as a template since you'll need exactly the same
initialization (get the name of a word list file from the init params,
parse it, and build a word set out of it)...

http://svn.apache.org/viewvc/incubator/solr/trunk/src/java/org/apache/solr/analysis/StopFilterFactory.java?view=markup

...all you really need to change is that the "create" method should return
a new "CommonGramsFilter" instead of a StopFilter.

Incidentally: most of the code in CommonGrams.Filter seems to be dealing
with the buffering of tokens ... it may be easier to reimplement the logic
with Solr's BufferedTokenStream as a base class.

-Hoss

Re: Index & search questions; special cases

Erik Hatcher

On Nov 14, 2006, at 2:00 PM, Chris Hostetter wrote:
> CommonGrams itself seems to have some other dependencies on nutch because
> of other utilities in the same class, but based on a quick skim, what you
> really want is the nested "private static class Filter extends
> TokenFilter" which doesn't really have any external dependencies.  If you
> extract that class into some more specifically named "CommonGramsFilter",...

Yeah, the Nutch code is highly intertwined with its unique  
configuration infrastructure and makes it hard to pull pieces of it  
out like this.

        Erik

Re: Index & search questions; special cases

Sami Siren-2
Erik Hatcher wrote:

> Yeah, the Nutch code is highly intertwined with its unique configuration
> infrastructure and makes it hard to pull pieces of it out like this.

This is a critique that has been heard a lot (mainly because it's true :)
It would be really cool if different camps of lucene could build these
nice utilities to be usable between projects. Not exactly sure how this
could be accomplished but anyway something to consider.

--
  Sami Siren

Re: Index & search questions; special cases

Chris Hostetter-3

: > Yeah, the Nutch code is highly intertwined with its unique configuration
: > infrastructure and makes it hard to pull pieces of it out like this.

that CommonGrams inner Filter class seemed like it could be extracted
easily enough.

: This is a critique that has been heard a lot (mainly because it's true :)
: It would be really cool if different camps of lucene could build these
: nice utilities to be usable between projects. Not exactly sure how this
: could be accomplished but anyway something to consider.

general@lucene is probably the best place to raise this discussion if
you're interested in pursuing it ... i think the best way to deal with it
may just be on a case by case basis ... if you find cool code in
sub-project XYZ, start by working with XYZ-dev to refactor it into an
extractable chunk, then work with java-dev to "promote" it up in the
lucene Java code base, and then circle back to XYZ-dev to deprecate the
copy in the XYZ code repository and replace it with a dependency on the
newly promoted version.


-Hoss

Re: Index & search questions; special cases

Michael Imbeault
In reply to this post by Chris Hostetter-3
>
> CommonGrams itself seems to have some other dependencies on nutch because
> of other utilities in the same class, but based on a quick skim, what you
> really want is the nested "private static class Filter extends
> TokenFilter" which doesn't really have any external dependencies.  If you
> extract that class into some more specifically named "CommonGramsFilter",
> all you need after that to use it in Solr is a simple little
> "FilterFactory" so you can reference it in your schema.xml ... you can use
> the StopFilterFactory as a template since you'll need exactly the same
> initialization (get the name of a word list file from the init params,
> parse it, and build a word set out of it)...  

Chris, thanks for the tips (or should I say, detailed explanation!). I
actually got it working! It was a pain at first (never did any java, and
all this ant, junit, war, jar, java, .classes are confusing!). I had
some compile errors that I cleaned up. Playing around with the filter in
the admin panel analyser yields expected results; I can't thank you
enough for your help. I now use :

<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
generateNumberParts="1" catenateWords="0" catenateNumbers="0"
catenateAll="0"/>
<filter class="solr.CommonGramsFilterFactory"
words="stopwords-complete.txt" ignoreCase="true"/>
<filter class="solr.StopFilterFactory" words="stopwords-complete.txt"
ignoreCase="true"/>

And it works perfectly.
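
For readers who want to see why this chain separates 'hepatitis a' from
'hepatitis c', here is a rough Python simulation of it. It is only a toy
model under stated assumptions: the stopword set stands in for
stopwords-complete.txt (whose contents are not shown here), the word
delimiter step only splits on non-alphanumeric characters, and the whole
query string is analyzed as one unit, which may differ from how the real
query parser hands text to the analyzer.

```python
import re

# Hypothetical stand-in for stopwords-complete.txt: single letters,
# single digits and a few common words.
STOPWORDS = set("abcdefghijklmnopqrstuvwxyz0123456789") | {"was", "of", "has"}


def analyze(text):
    """Toy version of the chain: tokenize, lowercase, split, grams, stop."""
    # WhitespaceTokenizer + LowerCaseFilter
    tokens = text.lower().split()
    # Toy WordDelimiterFilter: split on non-alphanumerics, no catenation
    parts = [p for t in tokens for p in re.split(r"[^a-z0-9]+", t) if p]
    # CommonGrams step: add 'x_y' when either neighbour is a stopword
    grams = []
    for i, tok in enumerate(parts):
        grams.append(tok)
        if i + 1 < len(parts) and (tok in STOPWORDS or parts[i + 1] in STOPWORDS):
            grams.append(tok + "_" + parts[i + 1])
    # StopFilter: drop the bare stopwords, keep the bigrams
    return [t for t in grams if t not in STOPWORDS]


def matches(query, doc):
    """AND semantics: every analyzed query token must occur in the document."""
    return set(analyze(query)) <= set(analyze(doc))


if __name__ == "__main__":
    print(analyze("hepatitis a"))
    print(matches("hepatitis a", "Outbreak of hepatitis A in 2006"))
    print(matches("hepatitis a", "A patient was cured of hepatitis C"))
    print(matches("HIV-1", "HIV patients, group 1"))
```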

If Solr is interested in the filter, just tell me (and how I should go
about contributing it).

Michael Imbeault
CHUL Research Center (CHUQ)
2705 boul. Laurier
Ste-Foy, QC, Canada, G1V 4G2
Tel: (418) 654-2705, Fax: (418) 654-2212




Re: Index & search questions; special cases

Chris Hostetter-3

: Chris, thanks for the tips (or should I say, detailed explanation!). I
: actually got it working! It was a pain at first (never did any java, and

good to know .. glad it worked out for you.

: If Solr is interested in the filter, just tell me (and how I should go
: about contributing it).

The full list of instructions on how to submit a patch can be found on the
wiki...
http://wiki.apache.org/solr/HowToContribute

...ideally a patch should include unit tests demonstrating the new
feature, but if you don't have any of those (and don't feel like writing
them) a patch can still be useful to other people (who might be
interested in writing unit tests to encourage getting the changes added)


if you do open a Jira issue and attach your code, please note this thread
and the URL of the original class in nutch, so people who may stumble
across it in Jira know where the original version is.

-Hoss

Fuzzy searching, tildes and solr

Walter Lewis-2
In reply to this post by Yonik Seeley-2
This is quite possibly a Lucene question rather than a Solr one, so my
apologies if you think it's out of scope.

Underlying the solr search, are some very useful Lucene constructs.

One of the most powerful, imho, is the tilde number combination for a
"fuzzy" search.

In one of my data sets
    q=Sutherland returns 41 results
    q=Sutherland~0.75 returns 275
    q=Sutherland~0.70 returns 484
etc., all of which fits a pattern. Add a first name and
    q=(James Sutherland) returns 13
    q=(James~0.75 Sutherland~0.75) returns 1
    q=(James~0.70 Sutherland~0.70) returns 97
Qualify only one term and there is a consistent pattern.  But routinely
qualifying two terms yields a smaller number than a string match.
Trying
    q=(James~0.75 AND Sutherland~0.75) returns the same record (the
schema has default set to AND)

Why would the ~0.75 *narrow* rather than broaden a search? Is there some
pattern in the solr syntax I'm overlooking?

Walter



   
   
Re: Fuzzy searching, tildes and solr

Yonik Seeley-2
On 1/23/07, Walter Lewis <[hidden email]> wrote:

> This is quite possibly a Lucene question rather than a Solr one, so my
> apologies if you think it's out of scope.
>
> Underlying the solr search, are some very useful Lucene constructs.
>
> One of the most powerful, imho, is the tilde number combination for a
> "fuzzy" search.
>
> In one of my data sets
>     q=Sutherland returns 41 results
>     q=Sutherland~0.75 returns 275
>     q=Sutherland~0.70 returns 484
> etc., all of which fits a pattern. Add a first name and
>     q=(James Sutherland) returns 13
>     q=(James~0.75 Sutherland~0.75) returns 1
>     q=(James~0.70 Sutherland~0.70) returns 97
> Qualify only one term and there is a consistent pattern.  But routinely
> qualifying two terms yields a smaller number than a string match.
> Trying
>     q=(James~0.75 AND Sutherland~0.75) returns the same record (the
> schema has default set to AND)
>
> Why would the ~0.75 *narrow* rather than broaden a search? Is there some
> pattern in the solr syntax I'm overlooking?

That's a great question... that doesn't make sense.
Could you post your debug-query output (add debugQuery=on)?
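
In the meantime, one way to sanity-check which indexed terms a given
threshold can admit is a small similarity calculation. The Python sketch
below assumes the similarity is 1 minus the edit distance divided by the
length of the shorter term, and that a term matches when its similarity is
strictly greater than the threshold. That formula is an assumption about
how the fuzzy matching works, not something confirmed in this thread, and
it does not by itself explain the narrowing reported above. The candidate
terms are made up.

```python
def edit_distance(a, b):
    """Classic Levenshtein distance using a rolling row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def similarity(term, candidate):
    """Assumed formula: 1 - distance / length of the shorter string."""
    return 1.0 - edit_distance(term, candidate) / min(len(term), len(candidate))


def fuzzy_matches(term, candidates, min_sim):
    """Return the candidates whose similarity is strictly above min_sim."""
    return [c for c in candidates if similarity(term, c) > min_sim]


if __name__ == "__main__":
    names = ["james", "jame", "jamie"]  # made-up indexed terms
    for threshold in (0.75, 0.70):
        print(threshold, fuzzy_matches("james", names, threshold))
```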

-Yonik