How to tokenize/analyze docs for the spellchecker - at indexing and query time


martin.grotzke
Hi,

I'm just starting with the spellchecker component provided by solr - it
is really cool!

Now I'm thinking about the source-field in the spellchecker ("spell"):
how should fields be analyzed during indexing, and how should the
queryAnalyzerFieldType be configured.

If I have brands like e.g. "Apple" or "Ed Hardy" I would copy them (the
field "brand") directly to the "spell" field. The "spell" field is of
type "string".

Other fields, e.g. the product title, I would first copy to some
whitespace-tokenized field (a field type with WhitespaceTokenizerFactory)
and afterwards to the "spell" field. The product title might be e.g.
"Canon EOS 450D EF-S 18-55 mm".

This is the process I have in mind during indexing (though I'm not sure
if some tokens/terms should be removed, but I'd assume that any term
might be misspelled by the user).

Now when it comes to searching, the query should be analyzed using the
queryAnalyzerFieldType definition, which has a StandardTokenizerFactory
in the schema.xml of the solr example.

Shouldn't this be a WhitespaceTokenizerFactory, or is it better to use a
StandardTokenizerFactory here?

Or should I use a StandardTokenizerFactory for the "spell" field, so
that fields copied into this field get tokenized/analyzed in the same
way as the query will get tokenized/analyzed?

Do you have any experience with this and/or recommendations regarding
this?

Are there other things to consider?

Thanx for your help,
cheers,
Martin




Re: How to tokenize/analyze docs for the spellchecker - at indexing and query time

Jason Rennie-2
Hi Martin,

I'm a relative newbie to Solr, but I've been playing with the spellcheck
component and seem to have it working.  I certainly can't explain everything
that's going on, but with any luck I can help you get the spellchecker
up and running.  Additional replies inlined below.

On Wed, Oct 1, 2008 at 7:11 AM, Martin Grotzke <[hidden email]> wrote:

> Now I'm thinking about the source-field in the spellchecker ("spell"):
> how should fields be analyzed during indexing, and how should the
> queryAnalyzerFieldType be configured.


I followed the conventions in the default solrconfig.xml and schema.xml
files.  So I created a "textSpell" field type (schema.xml):

    <!-- field type for the spell checker which doesn't stem -->
    <fieldtype name="textSpell" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
    </fieldtype>

and used this for the queryAnalyzerFieldType.  I also created a spellField
to store the text I want to spell check against and used the same analyzer
(figuring that the query and indexed data should be analyzed the same way)
(schema.xml):

   <!-- Spell check field -->
   <field name="spellField" type="textSpell" indexed="true" stored="true" />



> If I have brands like e.g. "Apple" or "Ed Hardy" I would copy them (the
> field "brand") directly to the "spell" field. The "spell" field is of
> type "string".


We're copying description to spellField.  I'd recommend using a type like
the above textSpell type since "The StringField type is not analyzed, but
indexed/stored verbatim" (schema.xml):

  <copyField source="description" dest="spellField" />

Other fields like e.g. the product title I would first copy to some
> whitespaceTokinized field (field type with WhitespaceTokenizerFactory)
> and afterwards to the "spell" field. The product title might be e.g.
> "Canon EOS 450D EF-S 18-55 mm".


Hmm... I'm not sure this would work, as I don't think the analyzer is
applied until after the copy is made.  FWIW, I've had trouble copying
multiple fields to spellField (i.e. adding a second copyField w/
dest="spellField"), so we just index the spellchecker on a single field...

Shouldn't this be a WhitespaceTokenizerFactory, or is it better to use a
> StandardTokenizerFactory here?


I think if you use the same analyzer for indexing and queries, the
distinction probably isn't tremendously important.  When I went searching,
it looked like the StandardTokenizer split on non-letters.  I'd guess the
rationale for using the StandardTokenizer is that it won't recommend
non-letter characters.  I was seeing some weirdness earlier (no
inserts/deletes), but that disappeared now that I'm using the
StandardTokenizer.
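
To illustrate where the token boundaries differ, here's a rough sketch — these regex splits are only crude stand-ins for the real Lucene tokenizers, which handle many more cases, but they show why a hyphenated term like "EF-S" comes out differently:

```java
import java.util.Arrays;
import java.util.List;

public class TokenizerSketch {

    // Rough stand-in for a whitespace tokenizer: split only on spaces.
    static List<String> whitespaceTokenize(String text) {
        return Arrays.asList(text.toLowerCase().split("\\s+"));
    }

    // Rough stand-in for a "split on non-letters" tokenizer:
    // break on anything that is not a letter or digit.
    static List<String> standardishTokenize(String text) {
        return Arrays.asList(text.toLowerCase().split("[^\\p{L}\\p{N}]+"));
    }

    public static void main(String[] args) {
        String title = "Canon EOS 450D EF-S 18-55 mm";
        System.out.println(whitespaceTokenize(title));
        // [canon, eos, 450d, ef-s, 18-55, mm]
        System.out.println(standardishTokenize(title));
        // [canon, eos, 450d, ef, s, 18, 55, mm]
    }
}
```

So whichever you pick, the spellchecker's dictionary terms and the query terms only line up if the same analysis is applied on both sides.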

Cheers,

Jason
Re: How to tokenize/analyze docs for the spellchecker - at indexing and query time

martin.grotzke
Hi Jason,

what about multi-word searches like "harry potter"? When I do a search
in our index for "harry poter", I get the suggestion "harry
spotter" (using spellcheck.collate=true and jarowinkler distance).
Searching for "harry spotter" (we're searching AND, not OR) then gives
no results. I assume that this is because suggestions are made for each
word separately, which does not require that both/all suggestions are
contained in the same document.

I wonder what's the standard approach for searches with multiple words.
Are these working ok for you?

Cheers,
Martin

--
Martin Grotzke
http://www.javakaffee.de/blog/

Re: How to tokenize/analyze docs for the spellchecker - at indexing and query time

Grant Ingersoll-2

On Oct 6, 2008, at 3:51 AM, Martin Grotzke wrote:

> Hi Jason,
>
> what about multi-word searches like "harry potter"? When I do a search
> in our index for "harry poter", I get the suggestion "harry
> spotter" (using spellcheck.collate=true and jarowinkler distance).
> Searching for "harry spotter" (we're searching AND, not OR) then gives
> no results. I assume that this is because suggestions are done for words
> separately, and this does not require that both/all suggestions are
> contained in the same document.
>

Yeah, the SpellCheckComponent is not phrase aware.  My guess would be
that you would somehow need a QueryConverter (see
http://wiki.apache.org/solr/SpellCheckComponent) that preserved phrases
as a single token.  Likewise, you would need that on your indexing side
as well for the spell checker.  In short, I suppose it's possible, but
it would be work.  You could probably use the shingle filter
(token-based n-grams).

Alternatively, by using extendedResults, you can get back the frequency
of each of the words, and then you could decide whether the collation is
going to have any results, assuming they are all OR'd together.  For
phrases and AND queries, I'm not sure.  It's doable, I'm sure, but it
would be a lot more involved.
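
A shingle filter emits token n-grams up to a maximum size, optionally keeping the unigrams as well. The following is a simplified sketch of the output shape — not Lucene's actual ShingleFilter, just an illustration of what "token-based n-grams" means here (Lucene's filter also does this per position, with a configurable separator):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ShingleSketch {

    // For each token position, emit the unigram (if requested) and
    // every n-gram of length 2..maxShingleSize starting there.
    static List<String> shingles(List<String> tokens, int maxShingleSize,
                                 boolean outputUnigrams) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            if (outputUnigrams) out.add(tokens.get(i));
            StringBuilder sb = new StringBuilder(tokens.get(i));
            for (int n = 2; n <= maxShingleSize && i + n - 1 < tokens.size(); n++) {
                sb.append(' ').append(tokens.get(i + n - 1));
                out.add(sb.toString());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(shingles(Arrays.asList("harry", "potter", "dvd"), 2, true));
        // [harry, harry potter, potter, potter dvd, dvd]
    }
}
```

With "harry potter" indexed as one shingle term, a misspelled phrase can be matched as a whole instead of word by word.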


>

--------------------------
Grant Ingersoll

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ
Re: How to tokenize/analyze docs for the spellchecker - at indexing and query time

Walter Underwood, Netflix
In reply to this post by martin.grotzke
This is why OR is a better choice. With AND, one miss means no results
at all. Spelling suggestions will never be good enough to make AND work.

wunder


Re: How to tokenize/analyze docs for the spellchecker - at indexing and query time

martin.grotzke
In reply to this post by Grant Ingersoll-2
On Mon, 2008-10-06 at 09:00 -0400, Grant Ingersoll wrote:

> Yeah, the SpellCheckComponent is not phrase aware.  My guess would be  
> that you would somehow need a QueryConverter (see http://wiki.apache.org/solr/SpellCheckComponent)
>    that preserved phrases as a single token.  Likewise, you would need  
> that on your indexing side as well for the spell checker.  In short, I  
> suppose it's possible, but it would be work.  You probably could use  
> the shingle filter (token based n-grams).
I also thought about something like this, and also stumbled upon the
ShingleFilter :)

So I would change the "spell" field to use the ShingleFilter?

Did I understand the answer to the posting "chaining copyFields"
correctly, that I cannot pipe the title through some "shingledTitle"
field and copy it afterwards to the "spell" field (while other fields
like brand are copied directly to the spell field)?

Thanx && cheers,
Martin




Re: How to tokenize/analyze docs for the spellchecker - at indexing and query time

martin.grotzke
Thanx for your help so far, I just wanted to post my results here...

In short: I now use the ShingleFilter to create shingles when copying my
fields into my field "spellMultiWords". For query time, I implemented a
MultiWordSpellingQueryConverter that just leaves the query as is, so
that there's only one token that is checked for spelling suggestions.

Here's the detailed configuration:

= schema.xml =
    <fieldType name="textSpellMultiWords" class="solr.TextField" positionIncrementGap="100" >
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.ShingleFilterFactory" maxShingleSize="3" outputUnigrams="true"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
    </fieldType>

   <field name="spellMultiWords" type="textSpellMultiWords" indexed="true" stored="true" multiValued="true"/>

   <copyField source="name" dest="spellMultiWords" />
   <copyField source="cat" dest="spellMultiWords" />
   ... and more ...


= solrconfig.xml =
 
  <searchComponent name="spellcheckMultiWords" class="solr.SpellCheckComponent">

    <!-- this is not used at all, can probably be omitted -->
    <str name="queryAnalyzerFieldType">textSpellMultiWords</str>

    <lst name="spellchecker">
      <!-- Optional, it is required when more than one spellchecker is configured -->
      <str name="name">default</str>
      <str name="field">spellMultiWords</str>
      <str name="spellcheckIndexDir">./spellcheckerMultiWords1</str>
      <str name="accuracy">0.5</str>
      <str name="buildOnCommit">true</str>
    </lst>
    <lst name="spellchecker">
      <str name="name">jarowinkler</str>
      <str name="field">spellMultiWords</str>
      <str name="distanceMeasure">org.apache.lucene.search.spell.JaroWinklerDistance</str>
      <str name="spellcheckIndexDir">./spellcheckerMultiWords2</str>
      <str name="buildOnCommit">true</str>
    </lst>
  </searchComponent>
 
  <queryConverter name="queryConverter" class="my.proj.solr.MultiWordSpellingQueryConverter"/>


= MultiWordSpellingQueryConverter =

import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;

import org.apache.lucene.analysis.Token;
import org.apache.solr.spelling.QueryConverter;

public class MultiWordSpellingQueryConverter extends QueryConverter {

    /**
     * Converts the original query string to a collection of Lucene Tokens.
     * The whole query is kept as a single token so that it can be matched
     * against the shingled terms of the spellchecker field.
     *
     * @param original the original query string
     * @return a Collection of Lucene Tokens
     */
    @Override
    public Collection<Token> convert( String original ) {
        if ( original == null ) {
            return Collections.emptyList();
        }
        // one token spanning the whole query string
        final Token token = new Token( 0, original.length() );
        token.setTermBuffer( original );
        return Arrays.asList( token );
    }

}
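
The behavior of the converter boils down to this (a simplified stand-in using plain Strings instead of Lucene Tokens, just to show the contract — the whole query becomes a single "token" rather than one per word):

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;

public class WholeQueryConverterSketch {

    // Unlike the default converter, which would split "harry poter"
    // into two tokens, the whole query stays intact as one token.
    static Collection<String> convert(String original) {
        if (original == null) {
            return Collections.emptyList();
        }
        return Arrays.asList(original);
    }

    public static void main(String[] args) {
        System.out.println(convert("harry poter"));
        // [harry poter]
    }
}
```

That single token is then compared against the shingles in "spellMultiWords", so a collated suggestion like "harry potter" comes from terms that actually co-occur in a document.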



There are some issues still to be resolved:
- terms are lowercased in the index, so some case restoration should
happen
- we use stemming for our text field, so the spellchecker might suggest
searches that lead to identical search results (e.g. the german2 stemmer
stems both "hose" and "hosen" to "hos" -> "Hose" and "Hosen" give the
same results)
- inconsistent/strange sorting of suggestions (as described in
http://www.nabble.com/spellcheck%3A-issues-td19845539.html)


Cheers,
Martin


--
Martin Grotzke
http://www.javakaffee.de/blog/
