Help with spellchecker integration

Otis Gospodnetic-2
Hi,
I'm trying to integrate the Lucene-based spellchecker (http://wiki.apache.org/jakarta-lucene/SpellChecker + contrib/spellchecker under Lucene) with Solr (http://issues.apache.org/jira/browse/SOLR-81) in order to provide a query spellchecking service (you enter Speers and it suggests pant^H^H ... Spears).  I've created a generic NGramTokenizer (+ NGramTokenizerFactory + unit test) that I'll attach to SOLR-81 shortly.

What I'm not yet sure about is:
1) integration of this generic n-grammer with that Lucene SpellChecker code - SpellChecker & TRStringDistance classes in particular.
2) mapping n-gram Tokens that come out of my NGramTokenizer to specific field names, like 3start, 4start, gram1, gram2, gram3.... is there a schema.xml trick one can use to accomplish this?
3) once 2) is done, getting the.... request handler(?) to n-gram the query appropriately and hit the SpellChecker index to try and find alternative spelling suggestions.

Damn, that's a lot of unknowns... on top of that my computer started freezing every half an hour.  Hi Murphy.



Any pointers will be greatly appreciated. Thanks,
Otis



Re: Help with spellchecker integration

Thorsten Scherler-3
On Thu, 2006-12-21 at 21:27 -0800, Otis Gospodnetic wrote:
> Hi,
> I'm trying to integrate the Lucene-based spellchecker (http://wiki.apache.org/jakarta-lucene/SpellChecker + contrib/spellchecker under Lucene) with Solr (http://issues.apache.org/jira/browse/SOLR-81) in order to provide a query spellchecking service (you enter Speers and it suggests pant^H^H ... Spears).  I've created a generic NGramTokenizer (+ NGramTokenizerFactory + unit test) that I'll attach to SOLR-81 shortly.
>
> What I'm not yet sure about is:
> 1) integration of this generic n-grammer with that Lucene SpellChecker code - SpellChecker & TRStringDistance classes in particular.

Hmm, reading SOLR-81, you actually have everything you need.

> 2) mapping n-gram Tokens that come out of my NGramTokenizer to specific field names, like 3start, 4start, gram1, gram2, gram3.... is there a schema.xml trick one can use to accomplish this?

It is in the issue:
...
<!-- Here you map @source="word" to @dest="gram2".
     What it does is copy the word input to the gram2 field. -->
<copyField source="word" dest="gram2"/>
...
<!-- Here you define what happens when the field "gram2" gets indexed.
     The solr.NGramTokenizerFactory will return the different
     combinations of tokens. -->
<fieldtype name="gram2" class="solr.TextField">
  <analyzer>
    <!--more tokenizer -->
    <tokenizer
      class="solr.NGramTokenizerFactory" minGram="2" maxGram="2"/>
  </analyzer>
</fieldtype>

The above shows how to configure the second (spellcheck) index, however
if you want to update both indexes at the same time you need to write
your own implementation of the update servlet.
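
To make the effect of minGram="2" maxGram="2" concrete, here is a minimal plain-Java sketch of character n-gramming. It is only an illustration: the class and method names (NGramSketch, ngrams) are made up for this example, and it is not the NGramTokenizer attached to SOLR-81, whose exact token order and offsets may differ.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java illustration only; NOT the NGramTokenizer from SOLR-81.
// Class and method names are hypothetical.
class NGramSketch {

    // Returns every contiguous character n-gram of `word` whose length is
    // between minGram and maxGram (inclusive), shortest grams first.
    static List<String> ngrams(String word, int minGram, int maxGram) {
        List<String> grams = new ArrayList<>();
        for (int n = minGram; n <= maxGram; n++) {
            for (int i = 0; i + n <= word.length(); i++) {
                grams.add(word.substring(i, i + n));
            }
        }
        return grams;
    }

    public static void main(String[] args) {
        // With minGram=2 and maxGram=2 the word is split into bigrams.
        System.out.println(ngrams("spears", 2, 2)); // prints [sp, pe, ea, ar, rs]
    }
}
```

A word shorter than minGram simply produces no grams in this sketch.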

> 3) once 2) is done, getting the.... request handler(?) to n-gram the query appropriately and hit the SpellChecker index to try and find alternative spelling suggestions.

Hmm, not sure. IMHO that depends heavily on how you plan to use
it in the end; there is more than one way to use spell checking.

In the issue they talked about AJAX suggestions, but IMO that would happen
before the actual search request. If you want to have it in the request
handler, then you need to decide how and when the spellchecker comes into
play.

For example, only when the normal search does not return a result, or in
parallel. Parallel would search the spellcheck index for alternatives, use
these alternatives to dispatch the alternative word query, and later on pass
the result directly into the output writer. Here again you have
different alternatives: you can attack the Solr index directly (losing
all the cool features).

Or you want the Google thingy, "Did you mean".

... in any form,
start with:
public class NGramRequestHandler extends StandardRequestHandler
        implements SolrRequestHandler, SolrInfoMBean {
    public void handleRequest(SolrQueryRequest req, SolrQueryResponse rsp) {
        // Depending on the use case, do your processing here
    }
}

This way you just need to implement the class-specific methods.


>
> Damn, that's a lot of unknowns... on top of that my computer started freezing every half an hour.  Hi Murphy.
>
>
>
> Any pointers will be greatly appreciated. Thanks,

HTH a wee bit.

salu2

> Otis
>
>
>

Re: Help with spellchecker integration

Otis Gospodnetic-2
Hi Thorsten,

Some comments to your comments, inlined and prefixed with "OG".

----- Original Message ----
From: Thorsten Scherler <[hidden email]>
To: [hidden email]
Sent: Friday, December 22, 2006 5:53:19 AM
Subject: Re: Help with spellchecker integration

On Thu, 2006-12-21 at 21:27 -0800, Otis Gospodnetic wrote:
> Hi,
> I'm trying to integrate the Lucene-based spellchecker (http://wiki.apache.org/jakarta-lucene/SpellChecker + contrib/spellchecker under Lucene) with Solr (http://issues.apache.org/jira/browse/SOLR-81) in order to provide a query spellchecking service (you enter Speers and it suggests pant^H^H ... Spears).  I've created a generic NGramTokenizer (+ NGramTokenizerFactory + unit test) that I'll attach to SOLR-81 shortly.
>
> What I'm not yet sure about is:
> 1) integration of this generic n-grammer with that Lucene SpellChecker code - SpellChecker & TRStringDistance classes in particular.

Hmm, reading SOLR-81, you actually have everything you need.

> 2) mapping n-gram Tokens that come out of my NGramTokenizer to specific field names, like 3start, 4start, gram1, gram2, gram3.... is there a schema.xml trick one can use to accomplish this?

It is in the issue:
<!-- Here you define what happens when the field "gram2" gets indexed.
     The solr.NGramTokenizerFactory will return the different combinations of tokens. -->
<fieldtype name="gram2" class="solr.TextField">
  <analyzer>
    <!--more tokenizer -->
    <tokenizer
      class="solr.NGramTokenizerFactory" minGram="2" maxGram="2"/>
  </analyzer>
</fieldtype>

OG: Yes, adding those separate fieldtype definitions was my attempt at
getting separate sets of n-grams of different sizes: uni-gram,
bi-gram... But how do I get "3start", "4start", "2end", and "4end"?  It looks like I'd have to do this:
- To get 3start, pass "query string" to the "gram3" type tokenizer, and keep only the first token.
- To get 3end, pass "query string" to the "gram3" type tokenizer, and keep only the last token (this could be the same n-gram if the query string is a 3-letter word)


But can this be configured somehow?  I don't see a way to configure Solr to do this.
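
The start/end grams described above can be sketched in plain Java as follows. This is only an illustration of which values would end up in which field: the class and method names (SpellFieldsSketch, fieldsFor) are hypothetical, the field names follow the "gram3" / "3start" / "3end" pattern from this thread, and nothing here goes through Solr's analysis chain.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustration only: computes the field values by hand, not via Solr analysis.
// Class and method names are hypothetical.
class SpellFieldsSketch {

    // For a given word and gram size n, returns the values for the fields
    // "gram<n>" (all n-grams), "<n>start" (first n-gram), "<n>end" (last n-gram).
    static Map<String, List<String>> fieldsFor(String word, int n) {
        Map<String, List<String>> fields = new LinkedHashMap<>();
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= word.length(); i++) {
            grams.add(word.substring(i, i + n));
        }
        if (grams.isEmpty()) {
            return fields; // word is shorter than n: nothing to index
        }
        fields.put("gram" + n, grams);
        fields.put(n + "start", Collections.singletonList(grams.get(0)));
        fields.put(n + "end", Collections.singletonList(grams.get(grams.size() - 1)));
        return fields;
    }
}
```

For a word whose length equals n, the start and end grams are the same token, which matches the 3-letter-word case mentioned above.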

<!-- Here you map @source="word" to @dest="gram2".
     What it does is copy the word input to the gram2 field. -->
<copyField source="word" dest="gram2"/>
...

OG: But doesn't this tell Solr to copy the _whole_ "word" into a field _named_ "gram2"?  The above fieldtype is a definition for a field of _type_ "gram2".
What I need to tell Solr is:
"Take the field named word, analyze it as fieldtype gram2 and index it into a field named gram2"
"Take the field named word, analyze it as fieldtype gram3 and index it into a field named gram3"
...
"Take the field named word, analyze it as fieldtype gram2 and index only the 1st token into a field named 2start"
"Take the field named word, analyze it as fieldtype gram3 and index only the 1st token into a field named 3start"
...
"Take the field named word, analyze it as fieldtype gram2 and index only the last token into a field named 2end"
"Take the field named word, analyze it as fieldtype gram3 and index only the last token into a field named 3end"

OG: I think :).  Doable?

The above shows how to configure the second (spellcheck) index, however
if you want to update both indexes at the same time you need to write
your own implementation of the update servlet.

OG: Right.  I think the spellchecker index will be small enough that it could be rebuilt from scratch on demand or at least separately from the main index being searched.

> 3) once 2) is done, getting the.... request handler(?) to n-gram the query appropriately and hit the SpellChecker index to try and find alternative spelling suggestions.

Hmm, not sure. IMHO that depends heavily on how you plan to use
it in the end; there is more than one way to use spell checking.

In the issue they talked about AJAX suggestions, but IMO that would happen
before the actual search request. If you want to have it in the request
handler, then you need to decide how and when the spellchecker comes into
play.

OG: The goal is a "did you mean" type of functionality.  In other words, run the real query + run the query against the spellchecker index.  If the spellchecker returns something, offer that on the results page as a "did you mean: <suggested query>"

For example, only when the normal search does not return a result, or in
parallel. Parallel would search the spellcheck index for alternatives, use
these alternatives to dispatch the alternative word query, and later on pass
the result directly into the output writer. Here again you have
different alternatives: you can attack the Solr index directly (losing
all the cool features).

Or you want the Google thingy, "Did you mean".

... in any form,
start with:
public class NGramRequestHandler extends StandardRequestHandler
        implements SolrRequestHandler, SolrInfoMBean {
    public void handleRequest(SolrQueryRequest req, SolrQueryResponse rsp) {
        // Depending on the use case, do your processing here
    }
}

This way you just need to implement the class-specific methods.

OG: I see I'll be losing my RequestHandler virginity.  Ah, the innocence.  I suppose at this point, if I manage to get all the n-grams into the right fields, I can use Spellchecker.suggest(....) from the Lucene spellchecker and return any suggestions as matching documents.
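
As a rough picture of the "did you mean" step, the sketch below ranks candidate words by edit distance and returns the closest one. It is a stand-in under stated assumptions, not the Lucene spellchecker itself: the class and method names (DidYouMeanSketch, editDistance, suggest) are hypothetical, plain Levenshtein distance is used, and the candidate list is assumed to have already been fetched from the n-gram index.

```java
import java.util.List;

// Illustration only; does not use the Lucene SpellChecker classes.
// All names here are hypothetical.
class DidYouMeanSketch {

    // Classic Levenshtein distance using two rolling rows.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Returns the candidate closest to `query` within maxDistance edits,
    // or null if the query is itself a known word or nothing is close enough.
    static String suggest(String query, List<String> candidates, int maxDistance) {
        if (candidates.contains(query)) {
            return null; // spelled like a known word: no suggestion needed
        }
        String best = null;
        int bestDistance = maxDistance + 1;
        for (String candidate : candidates) {
            int d = editDistance(query, candidate);
            if (d < bestDistance) {
                best = candidate;
                bestDistance = d;
            }
        }
        return best;
    }
}
```

A request handler could run the real query as usual and attach a non-null result of suggest(...) to the response as the "did you mean" text.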

> Damn, that's a lot of unknowns... on top of that my computer started freezing every half an hour.  Hi Murphy.
> Any pointers will be greatly appreciated. Thanks,

HTH a wee bit.

Thanks!
Otis



Re: Help with spellchecker integration

Mike Klaas
On 12/22/06, Otis Gospodnetic <[hidden email]> wrote:

> OG: Yes, adding those separate fieldtype definitions was my attempt at
> getting separate sets of n-grams of different sizes: uni-gram,
> bi-gram... But how do I get "3start", "4start", "2end", and "4end"?  It looks like I'd have to do this:
> - To get 3start, pass "query string" to "gram3" type tokenizer, and keep only the first token.
> - To get 3end, pass "query string" to "gram3" type tokenizer, and keep only the last token (this could be the same n-gram if query string is a 3-letter word)
>
> But can this be configured somehow?  I don't see a way to configure Solr to do this.
>
> <!-- Here you map @source="word" to @dest="gram2".
>      What it does is copy the word input to the gram2 field. -->
> <copyField source="word" dest="gram2"/>
> ...
>
> OG: But doesn't this tell Solr to copy the _whole_ "word" into a field _named_ "gram2"?  The above fieldtype is a definition for a field of _type_ "gram2".

Let's say you define a field as follows:
<field type="gram2" name="gram2field"/>

Then you can copy contents into it using:
<copyField source="word" dest="gram2field"/>

The text will be analyzed as field type "gram2".

> What I need to tell Solr is:
> "Take the field named word, analyze it as fieldtype gram2 and index it into a field named gram2"
> "Take the field named word, analyze it as fieldtype gram3 and index it into a field named gram3"
> ...

This is covered by the above.

> "Take the field named word, analyze it as fieldtype gram2 and index only the 1st token into a field named 2start"
> "Take the field named word, analyze it as fieldtype gram3 and index only the 1st token into a field named 3start"
> ...
> "Take the field named word, analyze it as fieldtype gram2 and index only the last token into a field named 2end"
> "Take the field named word, analyze it as fieldtype gram3 and index only the last token into a field named 3end"
>
> OG: I think :).  Doable?

Hmm, the only way I can think of to do that is to define fieldtypes
firstgram2, lastgram3, etc., which discard everything but the
first/last token.  This means you will be re-analyzing for every
field, however.
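
The "discard everything but the first/last token" idea could look roughly like this when expressed over a generic token stream. This is a plain-Java sketch, not a Lucene TokenFilter: the class and method names (EdgeTokenSketch, firstToken, lastToken) are hypothetical, and an Iterator<String> stands in for whatever stream the n-gram tokenizer produces.

```java
import java.util.Iterator;
import java.util.Optional;

// Illustration only: an Iterator<String> stands in for a token stream.
// Names are hypothetical; this is not a Lucene TokenFilter.
class EdgeTokenSketch {

    // Keeps only the first token (what a "firstgramN" type would index).
    static Optional<String> firstToken(Iterator<String> tokens) {
        return tokens.hasNext() ? Optional.of(tokens.next()) : Optional.empty();
    }

    // Keeps only the last token (what a "lastgramN" type would index).
    // The whole stream has to be consumed to find it.
    static Optional<String> lastToken(Iterator<String> tokens) {
        String last = null;
        while (tokens.hasNext()) {
            last = tokens.next();
        }
        return Optional.ofNullable(last);
    }
}
```

Because each method consumes its own stream, a separate pass over the tokens is needed per field, which is the re-analysis cost mentioned above.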

-Mike
Re: Help with spellchecker integration

Otis Gospodnetic-2
Hi Mike,

Thanks, that (what you said at the end) is precisely what I ended up doing.  I'll post a new patch to SOLR-81 shortly.

Otis
