Index & search questions; special cases

29 messages

Re: Fuzzy searching, tildes and solr

Walter Lewis-2
Yonik Seeley wrote:

> On 1/23/07, Walter Lewis <[hidden email]> wrote:
>> This is quite possibly a Lucene question rather than a solr one, so my
>> apologies if you think it's out of scope.
>>
>> Underlying the solr search, are some very useful Lucene constructs.
>>
>> One of the most powerful, imho, is the tilde number combination for a
>> "fuzzy" search.
>>
>> In one of my data sets
>>     q=Sutherland returns 41 results
>>     q=Sutherland~0.75 returns 275
>>     q=Sutherland~0.70 returns 484
>> etc., all of which fits a pattern.  Add a first name and
>>    q=(James Sutherland) returns 13
>>    q=(James~0.75 Sutherland~0.75) returns 1
>>     q=(James~0.70 Sutherland~0.70) returns 97
>> Qualify only one term and there is a consistent pattern.  But routinely
>> qualifying two terms yields a smaller number than a string match.
>> Trying
>>    q=(James~0.75 AND Sutherland~0.75) returns the same record (the
>> schema has default set to AND)
>>
>> Why would the ~0.75 *narrow* rather than broaden a search? Is there some
>> pattern in the solr syntax I'm overlooking?
>
> That's a great question... that doesn't make sense.
> Could you post your debug-query output (add debugQuery=on)?
My apologies for the delay and for the generally excessive top quoting
here.  I thought it might save a bit of time to keep the alternatives
together.  I should also note that I simplified the queries above.
Each ran with a searchSet constraint, which was the same value.  The
"normal" queries also carry significant baggage of fields and facets,
which are also consistent across the whole set of them.

I ran the debug against the two following queries:

   q=(James Sutherland) returns 13
   q=(James~0.75 Sutherland~0.75) returns 1

I have attached the debug fragments below.

Walter


====
<lst name="debug">
<str name="rawquerystring">(james sutherland) searchSet:testSet</str>
<str name="querystring">(james sutherland) searchSet:testSet</str>
<str name="parsedquery">+(+text:jame +text:sutherland) +searchSet:testSet</str>
<str name="parsedquery_toString">+(+text:jame +text:sutherland) +searchSet:testSet</str>
<lst name="explain">
<str name="id=MHGL.502,internal_docid=80313">

2.2928324 = (MATCH) sum of:
  2.2204013 = (MATCH) sum of:
    0.444597 = (MATCH) weight(text:jame in 80313), product of:
      0.46986106 = queryWeight(text:jame), product of:
        4.370453 = idf(docFreq=3085)
        0.107508555 = queryNorm
      0.94623077 = (MATCH) fieldWeight(text:jame in 80313), product of:
        1.7320508 = tf(termFreq(text:jame)=3)
        4.370453 = idf(docFreq=3085)
        0.125 = fieldNorm(field=text, doc=80313)
    1.7758043 = (MATCH) weight(text:sutherland in 80313), product of:
      0.8738745 = queryWeight(text:sutherland), product of:
        8.128418 = idf(docFreq=71)
        0.107508555 = queryNorm
      2.0321045 = (MATCH) fieldWeight(text:sutherland in 80313), product of:
        2.0 = tf(termFreq(text:sutherland)=4)
        8.128418 = idf(docFreq=71)
        0.125 = fieldNorm(field=text, doc=80313)
  0.072431125 = (MATCH) weight(searchSet:testSet in 80313), product of:
    0.124795556 = queryWeight(searchSet:testSet), product of:
      1.1607965 = idf(docFreq=76441)
      0.107508555 = queryNorm
    0.58039826 = (MATCH) fieldWeight(searchSet:testSet in 80313),
product of:
      1.0 = tf(termFreq(searchSet:testSet)=1)
      1.1607965 = idf(docFreq=76441)
      0.5 = fieldNorm(field=searchSet, doc=80313)
</str>
    <str name="id=MHGL.503,internal_docid=80314">

2.1340907 = (MATCH) sum of:
  2.0616596 = (MATCH) sum of:
    0.43047923 = (MATCH) weight(text:jame in 80314), product of:
      0.46986106 = queryWeight(text:jame), product of:
        4.370453 = idf(docFreq=3085)
        0.107508555 = queryNorm
      0.91618407 = (MATCH) fieldWeight(text:jame in 80314), product of:
        2.236068 = tf(termFreq(text:jame)=5)
        4.370453 = idf(docFreq=3085)
        0.09375 = fieldNorm(field=text, doc=80314)
    1.6311804 = (MATCH) weight(text:sutherland in 80314), product of:
      0.8738745 = queryWeight(text:sutherland), product of:
        8.128418 = idf(docFreq=71)
        0.107508555 = queryNorm
      1.8666072 = (MATCH) fieldWeight(text:sutherland in 80314), product of:
        2.4494898 = tf(termFreq(text:sutherland)=6)
        8.128418 = idf(docFreq=71)
        0.09375 = fieldNorm(field=text, doc=80314)
  0.072431125 = (MATCH) weight(searchSet:testSet in 80314), product of:
    0.124795556 = queryWeight(searchSet:testSet), product of:
      1.1607965 = idf(docFreq=76441)
      0.107508555 = queryNorm
    0.58039826 = (MATCH) fieldWeight(searchSet:testSet in 80314),
product of:
      1.0 = tf(termFreq(searchSet:testSet)=1)
      1.1607965 = idf(docFreq=76441)
      0.5 = fieldNorm(field=searchSet, doc=80314)
</str>
    <str name="id=MHGL.501,internal_docid=80312">

1.5031691 = (MATCH) sum of:
  1.430738 = (MATCH) sum of:
    0.32086027 = (MATCH) weight(text:jame in 80312), product of:
      0.46986106 = queryWeight(text:jame), product of:
        4.370453 = idf(docFreq=3085)
        0.107508555 = queryNorm
      0.68288326 = (MATCH) fieldWeight(text:jame in 80312), product of:
        1.0 = tf(termFreq(text:jame)=1)
        4.370453 = idf(docFreq=3085)
        0.15625 = fieldNorm(field=text, doc=80312)
    1.1098777 = (MATCH) weight(text:sutherland in 80312), product of:
      0.8738745 = queryWeight(text:sutherland), product of:
        8.128418 = idf(docFreq=71)
        0.107508555 = queryNorm
      1.2700653 = (MATCH) fieldWeight(text:sutherland in 80312), product of:
        1.0 = tf(termFreq(text:sutherland)=1)
        8.128418 = idf(docFreq=71)
        0.15625 = fieldNorm(field=text, doc=80312)
  0.072431125 = (MATCH) weight(searchSet:testSet in 80312), product of:
    0.124795556 = queryWeight(searchSet:testSet), product of:
      1.1607965 = idf(docFreq=76441)
      0.107508555 = queryNorm
    0.58039826 = (MATCH) fieldWeight(searchSet:testSet in 80312),
product of:
      1.0 = tf(termFreq(searchSet:testSet)=1)
      1.1607965 = idf(docFreq=76441)
      0.5 = fieldNorm(field=searchSet, doc=80312)
</str>
<str name="id=http://archeion-aao.fis.utoronto.ca/cgi-bin/ifetch?DBRootName=ON&RecordKey=42&FieldKey=F&FilePath=E:\Documents\archeion\/ON00313f/ON00313-f0000358.xml,internal_docid=12073">

0.6628341 = (MATCH) sum of:
  0.5722952 = (MATCH) sum of:
    0.1283441 = (MATCH) weight(text:jame in 12073), product of:
      0.46986106 = queryWeight(text:jame), product of:
        4.370453 = idf(docFreq=3085)
        0.107508555 = queryNorm
      0.2731533 = (MATCH) fieldWeight(text:jame in 12073), product of:
        1.0 = tf(termFreq(text:jame)=1)
        4.370453 = idf(docFreq=3085)
        0.0625 = fieldNorm(field=text, doc=12073)
    0.44395107 = (MATCH) weight(text:sutherland in 12073), product of:
      0.8738745 = queryWeight(text:sutherland), product of:
        8.128418 = idf(docFreq=71)
        0.107508555 = queryNorm
      0.5080261 = (MATCH) fieldWeight(text:sutherland in 12073), product of:
        1.0 = tf(termFreq(text:sutherland)=1)
        8.128418 = idf(docFreq=71)
        0.0625 = fieldNorm(field=text, doc=12073)
  0.090538904 = (MATCH) weight(searchSet:testSet in 12073), product of:
    0.124795556 = queryWeight(searchSet:testSet), product of:
      1.1607965 = idf(docFreq=76441)
      0.107508555 = queryNorm
    0.72549784 = (MATCH) fieldWeight(searchSet:testSet in 12073),
product of:
      1.0 = tf(termFreq(searchSet:testSet)=1)
      1.1607965 = idf(docFreq=76441)
      0.625 = fieldNorm(field=searchSet, doc=12073)
</str>
<str name="id=http://archeion-aao.fis.utoronto.ca/cgi-bin/ifetch?DBRootName=ON&RecordKey=42&FieldKey=F&FilePath=ON00313f/ON00313-f0000358.xml,internal_docid=60185">

0.6628341 = (MATCH) sum of:
  0.5722952 = (MATCH) sum of:
    0.1283441 = (MATCH) weight(text:jame in 60185), product of:
      0.46986106 = queryWeight(text:jame), product of:
        4.370453 = idf(docFreq=3085)
        0.107508555 = queryNorm
      0.2731533 = (MATCH) fieldWeight(text:jame in 60185), product of:
        1.0 = tf(termFreq(text:jame)=1)
        4.370453 = idf(docFreq=3085)
        0.0625 = fieldNorm(field=text, doc=60185)
    0.44395107 = (MATCH) weight(text:sutherland in 60185), product of:
      0.8738745 = queryWeight(text:sutherland), product of:
        8.128418 = idf(docFreq=71)
        0.107508555 = queryNorm
      0.5080261 = (MATCH) fieldWeight(text:sutherland in 60185), product of:
        1.0 = tf(termFreq(text:sutherland)=1)
        8.128418 = idf(docFreq=71)
        0.0625 = fieldNorm(field=text, doc=60185)
  0.090538904 = (MATCH) weight(searchSet:testSet in 60185), product of:
    0.124795556 = queryWeight(searchSet:testSet), product of:
      1.1607965 = idf(docFreq=76441)
      0.107508555 = queryNorm
    0.72549784 = (MATCH) fieldWeight(searchSet:testSet in 60185),
product of:
      1.0 = tf(termFreq(searchSet:testSet)=1)
      1.1607965 = idf(docFreq=76441)
      0.625 = fieldNorm(field=searchSet, doc=60185)
</str>
<str name="id=http://archeion-aao.fis.utoronto.ca/cgi-bin/ifetch?DBRootName=ON&RecordKey=42&FieldKey=F&FilePath=E:\Documents\archeion\/ON00093f/ON00093-f93-9.xml,internal_docid=10564">

0.48144954 = (MATCH) sum of:
  0.39091066 = (MATCH) sum of:
    0.11344123 = (MATCH) weight(text:jame in 10564), product of:
      0.46986106 = queryWeight(text:jame), product of:
        4.370453 = idf(docFreq=3085)
        0.107508555 = queryNorm
      0.24143569 = (MATCH) fieldWeight(text:jame in 10564), product of:
        1.4142135 = tf(termFreq(text:jame)=2)
        4.370453 = idf(docFreq=3085)
        0.0390625 = fieldNorm(field=text, doc=10564)
    0.27746943 = (MATCH) weight(text:sutherland in 10564), product of:
      0.8738745 = queryWeight(text:sutherland), product of:
        8.128418 = idf(docFreq=71)
        0.107508555 = queryNorm
      0.31751633 = (MATCH) fieldWeight(text:sutherland in 10564),
product of:
        1.0 = tf(termFreq(text:sutherland)=1)
        8.128418 = idf(docFreq=71)
        0.0390625 = fieldNorm(field=text, doc=10564)
  0.090538904 = (MATCH) weight(searchSet:testSet in 10564), product of:
    0.124795556 = queryWeight(searchSet:testSet), product of:
      1.1607965 = idf(docFreq=76441)
      0.107508555 = queryNorm
    0.72549784 = (MATCH) fieldWeight(searchSet:testSet in 10564),
product of:
      1.0 = tf(termFreq(searchSet:testSet)=1)
      1.1607965 = idf(docFreq=76441)
      0.625 = fieldNorm(field=searchSet, doc=10564)
</str>
<str name="id=http://archeion-aao.fis.utoronto.ca/cgi-bin/ifetch?DBRootName=ON&RecordKey=42&FieldKey=F&FilePath=ON00093f/ON00093-f93-9.xml,internal_docid=58676">

0.48144954 = (MATCH) sum of:
  0.39091066 = (MATCH) sum of:
    0.11344123 = (MATCH) weight(text:jame in 58676), product of:
      0.46986106 = queryWeight(text:jame), product of:
        4.370453 = idf(docFreq=3085)
        0.107508555 = queryNorm
      0.24143569 = (MATCH) fieldWeight(text:jame in 58676), product of:
        1.4142135 = tf(termFreq(text:jame)=2)
        4.370453 = idf(docFreq=3085)
        0.0390625 = fieldNorm(field=text, doc=58676)
    0.27746943 = (MATCH) weight(text:sutherland in 58676), product of:
      0.8738745 = queryWeight(text:sutherland), product of:
        8.128418 = idf(docFreq=71)
        0.107508555 = queryNorm
      0.31751633 = (MATCH) fieldWeight(text:sutherland in 58676),
product of:
        1.0 = tf(termFreq(text:sutherland)=1)
        8.128418 = idf(docFreq=71)
        0.0390625 = fieldNorm(field=text, doc=58676)
  0.090538904 = (MATCH) weight(searchSet:testSet in 58676), product of:
    0.124795556 = queryWeight(searchSet:testSet), product of:
      1.1607965 = idf(docFreq=76441)
      0.107508555 = queryNorm
    0.72549784 = (MATCH) fieldWeight(searchSet:testSet in 58676),
product of:
      1.0 = tf(termFreq(searchSet:testSet)=1)
      1.1607965 = idf(docFreq=76441)
      0.625 = fieldNorm(field=searchSet, doc=58676)
</str>
    <str name="id=ECF.873,internal_docid=18553">

0.25359273 = (MATCH) sum of:
  0.16305381 = (MATCH) sum of:
    0.07981298 = (MATCH) weight(text:jame in 18553), product of:
      0.46986106 = queryWeight(text:jame), product of:
        4.370453 = idf(docFreq=3085)
        0.107508555 = queryNorm
      0.16986507 = (MATCH) fieldWeight(text:jame in 18553), product of:
        3.3166249 = tf(termFreq(text:jame)=11)
        4.370453 = idf(docFreq=3085)
        0.01171875 = fieldNorm(field=text, doc=18553)
    0.08324082 = (MATCH) weight(text:sutherland in 18553), product of:
      0.8738745 = queryWeight(text:sutherland), product of:
        8.128418 = idf(docFreq=71)
        0.107508555 = queryNorm
      0.0952549 = (MATCH) fieldWeight(text:sutherland in 18553), product of:
        1.0 = tf(termFreq(text:sutherland)=1)
        8.128418 = idf(docFreq=71)
        0.01171875 = fieldNorm(field=text, doc=18553)
  0.090538904 = (MATCH) weight(searchSet:testSet in 18553), product of:
    0.124795556 = queryWeight(searchSet:testSet), product of:
      1.1607965 = idf(docFreq=76441)
      0.107508555 = queryNorm
    0.72549784 = (MATCH) fieldWeight(searchSet:testSet in 18553),
product of:
      1.0 = tf(termFreq(searchSet:testSet)=1)
      1.1607965 = idf(docFreq=76441)
      0.625 = fieldNorm(field=searchSet, doc=18553)
</str>
    <str name="id=ECF.373,internal_docid=18055">

0.2336127 = (MATCH) sum of:
  0.1430738 = (MATCH) sum of:
    0.032086026 = (MATCH) weight(text:jame in 18055), product of:
      0.46986106 = queryWeight(text:jame), product of:
        4.370453 = idf(docFreq=3085)
        0.107508555 = queryNorm
      0.068288326 = (MATCH) fieldWeight(text:jame in 18055), product of:
        1.0 = tf(termFreq(text:jame)=1)
        4.370453 = idf(docFreq=3085)
        0.015625 = fieldNorm(field=text, doc=18055)
    0.11098777 = (MATCH) weight(text:sutherland in 18055), product of:
      0.8738745 = queryWeight(text:sutherland), product of:
        8.128418 = idf(docFreq=71)
        0.107508555 = queryNorm
      0.12700653 = (MATCH) fieldWeight(text:sutherland in 18055),
product of:
        1.0 = tf(termFreq(text:sutherland)=1)
        8.128418 = idf(docFreq=71)
        0.015625 = fieldNorm(field=text, doc=18055)
  0.090538904 = (MATCH) weight(searchSet:testSet in 18055), product of:
    0.124795556 = queryWeight(searchSet:testSet), product of:
      1.1607965 = idf(docFreq=76441)
      0.107508555 = queryNorm
    0.72549784 = (MATCH) fieldWeight(searchSet:testSet in 18055),
product of:
      1.0 = tf(termFreq(searchSet:testSet)=1)
      1.1607965 = idf(docFreq=76441)
      0.625 = fieldNorm(field=searchSet, doc=18055)
</str>
    <str name="id=ECF.2476,internal_docid=20148">

0.2336127 = (MATCH) sum of:
  0.1430738 = (MATCH) sum of:
    0.032086026 = (MATCH) weight(text:jame in 20148), product of:
      0.46986106 = queryWeight(text:jame), product of:
        4.370453 = idf(docFreq=3085)
        0.107508555 = queryNorm
      0.068288326 = (MATCH) fieldWeight(text:jame in 20148), product of:
        1.0 = tf(termFreq(text:jame)=1)
        4.370453 = idf(docFreq=3085)
        0.015625 = fieldNorm(field=text, doc=20148)
    0.11098777 = (MATCH) weight(text:sutherland in 20148), product of:
      0.8738745 = queryWeight(text:sutherland), product of:
        8.128418 = idf(docFreq=71)
        0.107508555 = queryNorm
      0.12700653 = (MATCH) fieldWeight(text:sutherland in 20148),
product of:
        1.0 = tf(termFreq(text:sutherland)=1)
        8.128418 = idf(docFreq=71)
        0.015625 = fieldNorm(field=text, doc=20148)
  0.090538904 = (MATCH) weight(searchSet:testSet in 20148), product of:
    0.124795556 = queryWeight(searchSet:testSet), product of:
      1.1607965 = idf(docFreq=76441)
      0.107508555 = queryNorm
    0.72549784 = (MATCH) fieldWeight(searchSet:testSet in 20148),
product of:
      1.0 = tf(termFreq(searchSet:testSet)=1)
      1.1607965 = idf(docFreq=76441)
      0.625 = fieldNorm(field=searchSet, doc=20148)
</str>
</lst>
</lst>

=========

<lst name="debug">
<str name="rawquerystring">(james~0.75 AND sutherland~0.75) searchSet:testSet</str>
<str name="querystring">(james~0.75 AND sutherland~0.75) searchSet:testSet</str>
<str name="parsedquery">+(+text:james~0.75 +text:sutherland~0.75) +searchSet:testSet</str>
<str name="parsedquery_toString">+(+text:james~0.75 +text:sutherland~0.75) +searchSet:testSet</str>
<lst name="explain">
<str name="id=ECF.2227,internal_docid=19900">

0.10142321 = (MATCH) sum of:
  0.04733514 = (MATCH) sum of:
    0.03207182 = (MATCH) sum of:
      0.03207182 = (MATCH) weight(text:rames^0.20000005 in 19900),
product of:
        0.1452334 = queryWeight(text:rames^0.20000005), product of:
          0.20000005 = boost
          11.306472 = idf(docFreq=2)
          0.06422576 = queryNorm
        0.22082953 = (MATCH) fieldWeight(text:rames in 19900), product of:
          1.0 = tf(termFreq(text:rames)=1)
          11.306472 = idf(docFreq=2)
          0.01953125 = fieldNorm(field=text, doc=19900)
    0.015263321 = (MATCH) sum of:
      0.015263321 = (MATCH) weight(text:netherland^0.20000005 in 19900),
product of:
        0.10019111 = queryWeight(text:netherland^0.20000005), product of:
          0.20000005 = boost
          7.799914 = idf(docFreq=99)
          0.06422576 = queryNorm
        0.15234207 = (MATCH) fieldWeight(text:netherland in 19900),
product of:
          1.0 = tf(termFreq(text:netherland)=1)
          7.799914 = idf(docFreq=99)
          0.01953125 = fieldNorm(field=text, doc=19900)
  0.05408807 = (MATCH) weight(searchSet:testSet in 19900), product of:
    0.07455304 = queryWeight(searchSet:testSet), product of:
      1.1607965 = idf(docFreq=76441)
      0.06422576 = queryNorm
    0.72549784 = (MATCH) fieldWeight(searchSet:testSet in 19900),
product of:
      1.0 = tf(termFreq(searchSet:testSet)=1)
      1.1607965 = idf(docFreq=76441)
      0.625 = fieldNorm(field=searchSet, doc=19900)
</str>
</lst>
</lst>

Re: Fuzzy searching, tildes and solr

Yonik Seeley-2
On 1/25/07, Walter Lewis <[hidden email]> wrote:
> I ran the debug against the two following queries:
>
>    q=(James Sutherland) returns 13
>    q=(James~0.75 Sutherland~0.75) returns 1

OK, I have an idea of what's going on... here are your two parsed
queries side by side:

> +(+text:jame +text:sutherland) +searchSet:testSet
> +(+text:james~0.75 +text:sutherland~0.75) +searchSet:testSet

I can tell from the first that this is a stemmed field... "james" is
transformed to "jame".
The Lucene query parser doesn't do stemming or other analysis for
things like prefix or fuzzy queries (that would have its own big set
of problems), but instead just lowercases.

So your second fuzzy query of "james~0.75" doesn't match exactly
what is indexed.

Lucene expands something like james~0.75 to the closest terms by edit distance.
But the number of terms is limited to BooleanQuery.maxClauseCount
(1024 by default).  So my guess is that there are more than 1024 other
terms closer to "james" than "jame" is, so "jame" (the actual indexed
form of "james" when it is stemmed) isn't included.

I'm not an expert at edit distance, but the implementing classes are here:
http://svn.apache.org/viewvc/lucene/java/trunk/src/java/org/apache/lucene/search/FuzzyTermEnum.java?view=markup
http://svn.apache.org/viewvc/lucene/java/trunk/src/java/org/apache/lucene/search/FuzzyQuery.java?view=markup
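For reference, classic Lucene scores fuzzy candidates as similarity = 1 - editDistance / min(term lengths).  A quick sketch of that formula (my reconstruction of the idea, not code lifted from FuzzyTermEnum) shows why "jame" sits right on the 0.75 boundary while a term like "rames" (which shows up in the fuzzy debug output above) scores strictly higher:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def fuzzy_similarity(query_term: str, indexed_term: str) -> float:
    # Lucene-style fuzzy similarity: 1 - distance / min(lengths)
    return 1.0 - levenshtein(query_term, indexed_term) / min(
        len(query_term), len(indexed_term))

print(fuzzy_similarity("james", "jame"))   # 0.75: right at the threshold
print(fuzzy_similarity("james", "rames"))  # 0.8: closer to "james" than "jame" is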

So, you could
 - increase the size of maxClauseCount (it will slow down fuzzy and
wildcard type queries though)
 - index the field twice using copyField, and then do fuzzy queries on
the non-stemmed version.
 - ask the lucene list for other ideas
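The copyField route could look something like the following schema.xml fragment.  The type and field names (text_unstemmed, text_fuzzy) are hypothetical, not from Walter's schema; only the element and factory names are standard Solr:

```xml
<!-- Hypothetical: an unstemmed twin of the "text" field for fuzzy queries -->
<fieldType name="text_unstemmed" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<field name="text_fuzzy" type="text_unstemmed" indexed="true" stored="false"/>

<!-- Copy the raw input of "text" into the unstemmed field at index time -->
<copyField source="text" dest="text_fuzzy"/>
```

Fuzzy queries would then target the unstemmed field (q=text_fuzzy:james~0.75) while normal queries keep using the stemmed text field.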

-Yonik

Re: Fuzzy searching, tildes and solr

Walter Lewis-2
Yonik Seeley wrote:
> +(+text:jame +text:sutherland) +searchSet:testSet
>> +(+text:james~0.75 +text:sutherland~0.75) +searchSet:testSet
>
> I can tell from the first that this is a stemmed field... "james" is
> transformed to "jame"
"James" being the plural of "Jame" according to the stemmer.  I guess my
mind hadn't run in that direction. :)

I guess I wasn't expecting the fuzzy query logic to bypass the
stemming.  Would it be correct that if I were to add "james" to the
protwords.txt file, this *specific* problem would go away?  Obviously
there are a significant quantity of proper names where this would have
an impact, so a more generic solution is preferable.

> So, you could
> - index the field twice using copyField, and then do fuzzy queries on
> the non-stemmed version. [plus two other good suggestions]
As I look at the field types in the example schema, would you recommend
something like text_lu without the EnglishPorterFilterFactory, or are
there other issues I'm overlooking?

Walter Lewis
(aka Walt Lewi apparently)

Re: Fuzzy searching, tildes and solr

Yonik Seeley-2
On 1/26/07, Walter Lewis <[hidden email]> wrote:

> Yonik Seeley wrote:
> > +(+text:jame +text:sutherland) +searchSet:testSet
> >> +(+text:james~0.75 +text:sutherland~0.75) +searchSet:testSet
> >
> > I can tell from the first that this is a stemmed field... "james" is
> > transformed to "jame"
> "James" being the plural of "Jame" according to the stemmer.  I guess my
> mind hadn't run in that direction. :)
>
> I guess I wasn't expecting the fuzzy query logic to bypass the
> stemming.

I would expect there to be at least as many problems trying to do
stemming on partial or misspelled words.

For a simpler example, consider prefix queries...
If you tried title:a* or title:an* to find titles including anaconda,
and you did full "analysis" of the terms first, they would be removed
as stop words and you would find nothing.
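That failure mode is easy to demonstrate with a toy analysis chain (the stop list here is illustrative, not Solr's actual one):

```python
STOP_WORDS = {"a", "an", "and", "the", "of"}  # illustrative stop list

def analyze(token: str):
    """Toy analysis chain: lowercase, then drop stop words."""
    token = token.lower()
    return None if token in STOP_WORDS else token

# Fully analyzing the prefix of a prefix query destroys the query:
print(analyze("an"))        # None -> an* would match nothing
print(analyze("anaconda"))  # the indexed term the user actually wanted
```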

>  Would it be correct that if I were to add "james" to the
> protwords.txt file that this *specific* problem would go away?

Yes, it should.

> Obviously
> there are a significant quantity of proper names where this would have
> an impact, so a more generic solution is preferable.
> > So, you could
> > - index the field twice using copyField, and then do fuzzy queries on
> > the non-stemmed version. [plus two other good suggestions]
> As I look at the field types in the example schema would you recommend
> something like text_lu without the EnglishPorterFilterFactory, or are
> there other issues I'm overlooking.

text_lu also has stemming.

The text field types are examples, and you should be customizing your own.
It depends on how you want to "normalize" text.

You could make a new field type by starting with your current
text type and removing the stemmer.

-Yonik

DisMax and truncation

Mark Phillips-9
In reply to this post by Walter Lewis-2
We are looking at using solr's DisMax handler for our implementation on
our digital collections at UNT, and I had a quick question about it.

Is there a way to do truncation with DisMax?

Just being able to search for (photo*) would be helpful in quite a few
places.

So far I am really liking how the dismax allows us to give priority to
specific fields in our collection.

Thanks


Mark

Re: DisMax and truncation

Chris Hostetter-3

: Is there a way to do truncation with DisMax?
:
: Just being able to search for (photo*) would be helpful in quite a few
: places.

I'm afraid i'm not understanding your question ... what do you mean by
"truncation" ? ... your example:  (photo*) seems to be asking about
a prefix or wildcard type search, which might match photomat,
photographer, photoaaaaaaaaaaa, etc... is that what you're asking about?

there is no way at the moment to do prefix or wildcard searches ... but
off the top of my head i think it might be fairly easy - there is a
partial escaping that happens to escape all QueryParser meta characters
except + and - .. we could probably make that configurable so "*" was left
alone as well. ... i think there would need to be some additions to the
DisjunctionMaxQueryParser to loop over the aliased fields too though.
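The partial-escaping idea can be sketched roughly like this.  The metacharacter set follows classic Lucene QueryParser syntax (with && and || simplified to single characters); the configurable `keep` parameter is the hypothetical part being proposed:

```python
# Classic Lucene QueryParser metacharacters (simplified: & and | stand
# in for the two-character && and || operators)
QP_META = set('\\+-!():^[]"{}~*?&|')

def partial_escape(user_input: str, keep: str = "+-") -> str:
    """Escape every QueryParser metacharacter except those in `keep`.

    With the default keep="+-" this mirrors the current dismax behavior;
    making `keep` configurable (e.g. keep="+-*") would let a trailing *
    through as a wildcard.
    """
    out = []
    for ch in user_input:
        if ch in QP_META and ch not in keep:
            out.append("\\")
        out.append(ch)
    return "".join(out)

print(partial_escape("photo* (test)"))              # photo\* \(test\)
print(partial_escape("photo* (test)", keep="+-*"))  # photo* \(test\)
```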



-Hoss


Re: DisMax and truncation

Yonik Seeley-2
On 1/30/07, Chris Hostetter <[hidden email]> wrote:

>
> : Is there a way to do truncation with DisMax?
> :
> : Just being able to search for (photo*) would be helpful in quite a few
> : places.
>
> I'm afraid i'm not understanding your question ... what do you mean by
> "truncation" ? ... your example:  (photo*) seems to be asking about
> a prefix or wildcard type search, which might match photomat,
> photographer, photoaaaaaaaaaaa, etc... is that what you're asking about?
>
> there is no way at the moment to do prefix or wildcard searches ... but
> off the top of my head i think it might be fairly easy - there is a
> partial escaping that happens to escape all QueryParser meta characters
> except + and - .. we could probably make that configurable so "*" was left
> alone as well. ... i think there would need to be some additions to the
> DisjunctionMaxQueryParser to loop over the aliased fields too though.

A more dismaxy way might be to specify the structure of the query in a
different param rather than in the query string.  haven't thought it
through though...

-Yonik

Re: DisMax and truncation

Mark Phillips-9
In reply to this post by Chris Hostetter-3


>>> Chris Hostetter <[hidden email]> 1/30/2007 12:45 PM >>>

: Is there a way to do truncation with DisMax?
:
: Just being able to search for (photo*) would be helpful in quite a few
: places.

: I'm afraid i'm not understanding your question ... what do you mean by
: "truncation" ? ... your example:  (photo*) seems to be asking about
: a prefix or wildcard type search, which might match photomat,
: photographer, photoaaaaaaaaaaa, etc... is that what you're asking about?

: there is no way at the moment to do prefix or wildcard searches ... but
: off the top of my head i think it might be fairly easy - there is a
: partial escaping that happens to escape all QueryParser meta characters
: except + and - .. we could probably make that configurable so "*" was left
: alone as well. ... i think there would need to be some additions to the
: DisjunctionMaxQueryParser to loop over the aliased fields too though.

This is exactly what I am thinking about.  I used the wrong term: I was
thinking wildcard and wrote truncation... d'oh.

: -Hoss


Re: DisMax and truncation

Chris Hostetter-3
In reply to this post by Yonik Seeley-2

: A more dismaxy way might be to specify the structure of the query in a
: different param rather than in the query string.  haven't thought it
: through though...

it depends on the goal ... i was interpreting the question as "i want my
users to be able to type in prefix/wildcard expressions and have them
applied across many fields using various boosts wrapped in a
DisjunctionMaxQuery" ... but yeah, the spirit of the DisMaxRequestHandler
isn't specific to building a DisjunctionMaxQuery -- it's having the info
about the query structure separate from the raw user input.  right now
most of that structure is in code with params to let you hook in and
specify parts of that structure -- letting you change the structure itself
is a pretty cool idea ... but i'm having trouble picturing what it would
look like.


-Hoss
