stemming (maybe?) question

stemming (maybe?) question

Jon Drukman
is it possible to make solr think that "omeara" and "o'meara" are the
same thing?

-jsd-

Re: stemming (maybe?) question

Yonik Seeley-2-2
On Thu, Mar 12, 2009 at 1:36 PM, Jon Drukman <[hidden email]> wrote:
> is it possible to make solr think that "omeara" and "o'meara" are the same
> thing?

WordDelimiter would handle it if the document had "o'meara" (but you
may or may not want the other stuff that comes with
WordDelimiterFilter).
You could also use a PatternReplaceFilter to normalize tokens like this.

-Yonik
http://www.lucidimagination.com
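
For reference, a minimal sketch of the PatternReplaceFilter approach suggested in this reply. The field type name "text_name" is made up for illustration, and the pattern assumes only the straight apostrophe needs to be stripped. Because a single analyzer is used at both index and query time, "o'meara" and "omeara" should both end up as the token "omeara". Unlike WordDelimiterFilter, this would not match a query for "o meara", since no word parts are generated.

```xml
<!-- Hypothetical field type; the name "text_name" is an assumption -->
<fieldType name="text_name" class="solr.TextField" positionIncrementGap="100">
  <!-- An analyzer with no type attribute applies at both index and query time -->
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- Remove every straight apostrophe, so o'meara becomes omeara -->
    <filter class="solr.PatternReplaceFilterFactory"
            pattern="'" replacement="" replace="all"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

A field using this type would need to be reindexed before existing documents match.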

Re: stemming (maybe?) question

Jon Drukman
Yonik Seeley wrote:
> On Thu, Mar 12, 2009 at 1:36 PM, Jon Drukman <[hidden email]> wrote:
>> is it possible to make solr think that "omeara" and "o'meara" are the same
>> thing?
>
> WordDelimiter would handle it if the document had "o'meara" (but you
> may or may not want the other stuff that comes with
> WordDelimiterFilter).
> You could also use a PatternReplaceFilter to normalize tokens like this.

the document does have o'meara in it.  i tried creating a new field type
based on the wiki information.

<fieldType name="text_user" class="solr.TextField"
positionIncrementGap="100">
   <fieldtype name="subword" class="solr.TextField">
       <analyzer type="query">
           <tokenizer class="solr.WhitespaceTokenizerFactory"/>
           <filter class="solr.WordDelimiterFilterFactory"
                 generateWordParts="1"
                 generateNumberParts="1"
                 catenateWords="0"
                 catenateNumbers="0"
                 catenateAll="0"
                 />
           <filter class="solr.LowerCaseFilterFactory"/>
       </analyzer>
       <analyzer type="index">
           <tokenizer class="solr.WhitespaceTokenizerFactory"/>
           <filter class="solr.WordDelimiterFilterFactory"
                 generateWordParts="1"
                 generateNumberParts="1"
                 catenateWords="1"
                 catenateNumbers="1"
                 catenateAll="0"
                 />
           <filter class="solr.LowerCaseFilterFactory"/>
       </analyzer>
     </fieldtype>
</fieldType>


i reindexed everything but now any search on that field returns zero
results.  what did i do wrong?

-jsd-

Re: stemming (maybe?) question

Yonik Seeley-2-2
Not sure... I just took the stock solr example, and it worked fine.

I inserted "o'meara" into example/exampledocs/solr.xml
 <field name="features">Advanced o'meara Full-Text Search
Capabilities using Lucene</field>

Then indexed everything:  ./post.sh *.xml

Then queried in various ways:
q=o'meara
q=omeara
q=o%20meara

All of the queries found the solr doc.

-Yonik
http://www.lucidimagination.com

Re: stemming (maybe?) question

Jon Drukman
Yonik Seeley wrote:

> Not sure... I just took the stock solr example, and it worked fine.
>
> I inserted "o'meara" into example/exampledocs/solr.xml
>  <field name="features">Advanced o'meara Full-Text Search
> Capabilities using Lucene</field>
>
> Then indexed everything:  ./post.sh *.xml
>
> Then queried in various ways:
> q=o'meara
> q=omeara
> q=o%20meara
>
> All of the queries found the solr doc.

i grabbed the original example schema.xml and made my username field use
the following definition:

<fieldType name="text_user" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.WordDelimiterFilterFactory"
                generateWordParts="1" generateNumberParts="1"
                catenateWords="1" catenateNumbers="1"
                catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    </analyzer>
    <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory"
                synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.WordDelimiterFilterFactory"
                generateWordParts="1" generateNumberParts="1"
                catenateWords="0" catenateNumbers="0"
                catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    </analyzer>
</fieldType>


i removed the stopwords and porter stuff because for proper names i
don't want that.

seems to work fine now, thanks!
-jsd-