Spellchecker design was Re: Solr 3.1 back compat


Spellchecker design was Re: Solr 3.1 back compat

Grant Ingersoll-2

On Oct 25, 2010, at 10:14 PM, Robert Muir wrote:

> On Mon, Oct 25, 2010 at 9:42 PM, Grant Ingersoll <[hidden email]> wrote:
>> As part of https://issues.apache.org/jira/browse/SOLR-2080, I'd like to rework the SpellCheckComponent just a bit to be more generic.  I think I can maintain the URL APIs (i.e. &spellcheck.*) in a back-compatible way, but I would like to change some of the Java classes a bit, namely SolrSpellChecker and related classes, to be reusable and to reflect the commonality of the solutions.  The way I see it, spell checking, auto-suggest and related search suggestions are all just suggestions.  We have much of the framework for this in place, other than a few things at the Java level being named after spell checking.  I know we generally don't worry too much about Java interfaces in Solr, but this seems like one area where people do sometimes write their own.  The changes will mostly be renaming commonalities from "spellcheck" to "suggester" (or something similar), so I don't see it as particularly hard to make the change, but it would require some code changes.  What do people think?  My other option would be to factor out as much commonality as possible into helper classes, but that doesn't feel as clean.
>>
>>
>
> Almost certainly not what you are looking for,

Yeah, that pretty much doesn't answer a single question I asked, but nonetheless I'm happy to discuss a better design.  We really should discuss it on another thread.

> but I'm gonna complain
> anyway from my experience of trying to write a Solr spellchecker
> recently.
> Note: I didn't take the time to actually try to learn these APIs a lot,
> so maybe I'm completely off-base, but this is what it looked like to
> me:
>
> I felt the entire framework in Solr is built around the idea of  "take
> stuff from one field in an index, shove it into another field of an
> index", but my spellchecker doesn't need any of this.
>

Not really, but...



> Configuring it for different fields is a pain in the ass, if you have
> many, but really the field could and should be a query-time parameter.

In fact, SpellingOptions allows this.  You should look at the customParams piece.  You can pass in arbitrary query-time parameters.

>
> The spellchecking APIs have a weird response format "Map<Token,
> LinkedHashMap<String, Integer>>", which really just means you can only
> provide text and docfreq, but I wanted to return the score, too... so
> for now it just gets discarded.

That kind of stuff can and should be changed.  Those are internal APIs.  If you want score in there, then we should change it to something like Map<Token, Map<String, SuggestionInfo>>, where SuggestionInfo (or whatever you want to call it) contains freq, score, etc.
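As a rough sketch of that richer response shape (SuggestionInfo is a hypothetical name from this discussion, and a plain String stands in for Lucene's Token here):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical richer suggestion payload: carries more than just docfreq,
// so a spellchecker's own score doesn't have to be discarded.
class SuggestionInfo {
    final int freq;     // document frequency of the suggested term
    final float score;  // the spellchecker's similarity/confidence score

    SuggestionInfo(int freq, float score) {
        this.freq = freq;
        this.score = score;
    }
}

public class SuggestionShape {
    public static void main(String[] args) {
        // Map<original token text, Map<suggested text, SuggestionInfo>>
        Map<String, Map<String, SuggestionInfo>> suggestions = new LinkedHashMap<>();
        Map<String, SuggestionInfo> forTerm = new LinkedHashMap<>();
        forTerm.put("lucene", new SuggestionInfo(1042, 0.91f));
        forTerm.put("lucent", new SuggestionInfo(7, 0.83f));
        suggestions.put("lucen", forTerm);
        System.out.println(suggestions.get("lucen").get("lucene").freq); // prints 1042
    }
}
```

Adding fields to SuggestionInfo later (collation hints, per-suggestion metadata) would then not change the outer map shape.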

>
> we are still using Token everywhere, again, which is bad news if we
> want to do more complex things later... like it would really make
> sense to switch to the attributes API if this stuff needs to be
> flexible.

I guess no one has upgraded it yet.  This is 1.3 stuff.  I don't have any problem with upgrading it.

>
> Even the input format that comes into the spellchecker in
> getSuggestions(SpellingOptions options) is just Tokens, but this is
> pretty limiting. For instance, I think it makes way more sense for a
> spellchecker API to take a Query and return corrected Querys, and in my
> situation I could give better results, but the Solr APIs stop me.

And you are then going to do Query.toString() to display that back to the user?  

>
> Apparently the whole Collator thing is designed to "do this for me",
> but I have my own ideas (since my impl is new and different), only I'm
> not able to implement them... I don't know how the hell it could be
> doing this, since I can't return the score.
>
> I realize I could have completely discarded all the spellchecking
> APIs, written a ton of code/re-invented wheels, and probably gotten
> what I wanted, but I just wimped out and committed a shitty
> spellchecker instead.

Or you could ask questions and we could discuss how to improve it.  We probably could get you what you want without that much of a change.
       
---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


Re: Spellchecker design was Re: Solr 3.1 back compat

Robert Muir
On Tue, Oct 26, 2010 at 6:59 AM, Grant Ingersoll <[hidden email]> wrote:
>> I felt the entire framework in Solr is built around the idea of  "take
>> stuff from one field in an index, shove it into another field of an
>> index", but my spellchecker doesn't need any of this.
>>
>
> Not really, but...

I think really? I can only "see" part of the query (I think one field
at a time) via Tokens...

>
> I guess no one has upgraded it yet.  This is 1.3 stuff.  I don't have any problem with upgrading it.

I'm not saying we have to use the Attributes API, it was just an idea.
But we really have to move the stuff in this component from
"solr-makes-the-decisions" to "user-makes-the-decisions". This is
the number 1 problem with the current spellchecker (ok, maybe #2, #1
being that the index-based one doesn't close its IndexReader).

>
>>
>> Even the input format that comes into the spellchecker in
>> getSuggestions(SpellingOptions options) is just Tokens, but this is
>> pretty limiting. For instance, I think it makes way more sense for a
>> spellchecker API to take a Query and return corrected Querys, and in my
>> situation I could give better results, but the Solr APIs stop me.
>
> And you are then going to do Query.toString() to display that back to the user?

Why do you care? Maybe that works fine for me; I don't use the dismax
parser that generates horrific queries, so everything is fine... and
that's my point... something more like a pipeline/attributes-based
thing would work much better here; it's up to the user.

Certainly it makes sense to keep the original query around... why hide
it? And the hairy mess of code that converts it into tokens needs
to be something like a pipeline, because some people don't want
it, or want to do it their own way.

And, let's say I have a hunspell dictionary for my language... how do I
plug this in? I don't want it to implement Dictionary, because I'm not
stupid enough to return something that's not in my index (see below);
maybe I only want to use it as a 'filter' to prevent suggestions that
are spelled incorrectly...


We really need to seriously clean house on the spellchecker stuff
(Lucene too). And to answer your question: if we can fix these APIs in
any way, I'm all for just doing a backwards break, because I think the
existing APIs are completely broken.

For example, the whole index-based spellchecker in Lucene has bad
performance because its APIs were made overly generic:
I think it's important that it doesn't call docFreq() on every single
term in the Dictionary when rebuilding; it should walk a TermEnum in
parallel.
But it can't do this, because it can't assume the Dictionary is in
sorted order!?
I guess that's because the "Dictionary" idea was made overly generic,
abstracted into useless PlainTextDictionary and LuceneDictionary.
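The parallel walk described here is essentially a merge-join over two sorted streams. A toy sketch, with plain iterators standing in for the Dictionary and the index's term enumeration (countMissing and everything around it are illustrative, not Lucene APIs):

```java
import java.util.Iterator;
import java.util.List;

// Sketch of the optimization: if the dictionary is sorted, walk it and the
// index's sorted term list in parallel (a merge-join) in O(n+m), instead of
// doing a random-access docFreq()/exists lookup for every dictionary word.
public class SortedMerge {
    // Returns how many dictionary words are NOT already among the index terms,
    // i.e. the words that would need to be added to the spellcheck index.
    static int countMissing(Iterator<String> dict, Iterator<String> terms) {
        int missing = 0;
        String term = terms.hasNext() ? terms.next() : null;
        while (dict.hasNext()) {
            String word = dict.next();
            // Advance the term cursor until it is >= the current dictionary word.
            while (term != null && term.compareTo(word) < 0) {
                term = terms.hasNext() ? terms.next() : null;
            }
            if (term == null || !term.equals(word)) {
                missing++;
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        List<String> dict = List.of("apple", "banana", "cherry", "date");
        List<String> indexTerms = List.of("apple", "cherry");
        System.out.println(countMissing(dict.iterator(), indexTerms.iterator())); // prints 2
    }
}
```

Both cursors only ever move forward, which is exactly what an unsorted Dictionary makes impossible.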

PlainTextDictionary? useless... why the hell would you return
something that isn't in your index?!


Re: Spellchecker design was Re: Solr 3.1 back compat

Grant Ingersoll-2
Some thoughts inline...

On Oct 26, 2010, at 7:24 AM, Robert Muir wrote:

> On Tue, Oct 26, 2010 at 6:59 AM, Grant Ingersoll <[hidden email]> wrote:
>>> I felt the entire framework in Solr is built around the idea of  "take
>>> stuff from one field in an index, shove it into another field of an
>>> index", but my spellchecker doesn't need any of this.
>>>
>>
>> Not really, but...
>
> I think really? I can only "see" part of the query (I think one field
> at a time) via Tokens...
>
>>
>> I guess no one has upgraded it yet.  This is 1.3 stuff.  I don't have any problem with upgrading it.
>
> I'm not saying we have to use the Attributes API, it was just an idea.

I think that is a reasonable idea.  I was just pointing out that this stuff predates the move away from Token.  At the time, I would argue, Token made sense.  FWIW, I still dread the day when I have to start explaining BytesRefs to new Lucene programmers when I really mean a Token, but heh, I'll get over it.  For all of its inflexibility, Token was quite nice in that the word conveys its meaning quite nicely to most programmers.

> but we really have to move the stuff from this component from
> "solr-makes-the-decisions" into "user-makes-the-decisions". This is
> the number 1 problem with the current spellchecker (ok, maybe #2, #1
> being that the index-based one doesn't close its IndexReader).

I would suggest that the current architecture was aimed at making it easy for users to plug in their own capabilities, and it allows you to do so at pretty much every step.  Did it hit that mark 100%?  Of course not.  But I do know there are plenty of people who have implemented their own pieces for it using their own logic.

>
>>
>>>
>>> Even the input format that comes into the spellchecker in
>>> getSuggestions(SpellingOptions options) is just Tokens, but this is
>>> pretty limiting. For instance, I think it makes way more sense for a
>>> spellchecker API to take a Query and return corrected Querys, and in my
>>> situation I could give better results, but the Solr APIs stop me.
>>
>> And you are then going to do Query.toString() to display that back to the user?
>
> why do you care?

I don't.  The SpellCheckComponent was meant for spellchecking a string and rendering it back to the user in a meaningful way, i.e. something that they would recognize.  To me, at the time, that meant operating on the string that the user passed in, not a Query object that has potentially been rewritten and is not mappable back to the user in a meaningful way.  Given my requirements at the time, I thought it was a reasonable decision.  In light of your requirements, we can likely satisfy both.  In fact, with the proposal I'm putting forth about refactoring this stuff, I think it would likely be easier for you to implement your own component that does what you need, while reusing as much as you want.

> maybe that works fine for me; I don't use the dismax
> parser that generates horrific queries, so everything is fine... and
> that's my point... something more like a pipeline/attributes-based
> thing would work much better here; it's up to the user.
>
> certainly it makes sense to keep the original query around... why hide
> it?

Let's just add it to the SpellingOptions.

> and the hairy mess of code that converts it into tokens, this
> needs to be something like a pipeline, because some people don't want
> it, or want to do it their own way.

The QueryConverter was designed to be pluggable right from the get-go.  I don't see this as not fitting into that model, other than the Token issue, which we can change.
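To illustrate the plug-in point being discussed: a simplified stand-in for Solr's QueryConverter (the real one deals in Lucene Tokens; SimpleQueryConverter and both implementations below are made up for illustration):

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Sketch of the pluggable query-converter idea: the step that turns the raw
// query string into units for the spellchecker is just an interface, so a
// user can swap in their own logic, or a pass-through that skips tokenization.
interface SimpleQueryConverter {
    List<String> convert(String original);
}

public class ConverterDemo {
    public static void main(String[] args) {
        // Default-style behavior: whitespace split.
        SimpleQueryConverter whitespace = q -> Arrays.stream(q.trim().split("\\s+"))
                .collect(Collectors.toList());
        // A user who doesn't want tokenization at all: pass the query through whole.
        SimpleQueryConverter passthrough = List::of;

        System.out.println(whitespace.convert("quck brown fxo"));  // [quck, brown, fxo]
        System.out.println(passthrough.convert("quck brown fxo")); // [quck brown fxo]
    }
}
```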

>
> And, let's say I have a hunspell dictionary for my language... how do I
> plug this in? I don't want it to implement Dictionary, because I'm not
> stupid enough to return something that's not in my index (see below);
> maybe I only want to use it as a 'filter' to prevent suggestions that
> are spelled incorrectly...

Implement an index-backed Dictionary that filters by Hunspell and feeds into the spellchecker.  I've seen that done on more than one occasion.

>
>
> we really need to seriously clean house on the spellchecker stuff
> (lucene too)

+1.  

> and to answer your question, if we can fix these APIs in
> any way, I'm all for just doing a backwards break, because I think the
> existing APIs are completely broken.
>
> For example, the whole index-based spellchecker in Lucene has bad
> performance because its APIs were made overly generic:
> I think it's important that it doesn't call docFreq() on every single
> term in the Dictionary when rebuilding; it should walk a TermEnum in
> parallel.

Sounds great.  I also think the notion of onlyMorePopular is screwed up and needs to be revisited.

> But, it can't do this because it can't assume the Dictionary is in
> sorted order!?
> I guess that's because the "Dictionary" idea was made overly generic,
> abstracted into useless PlainTextDictionary and LuceneDictionary.
>
> PlainTextDictionary? useless... why the hell would you return
> something that isn't in your index?!

It can be quite useful to have an external source of tokens, and I've seen it in action on several occasions.  Just because they are fed in from an external source doesn't mean they aren't in the index.  For instance, dump your terms from the index, do some downstream processing according to user logs or whatever (or Hunspell, if you want), and then load them back into the spellchecker.
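A toy sketch of that dump/process/reload workflow, with query-log counts as the external signal (rebuild and all names here are illustrative, not Solr APIs):

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Sketch of the workflow: terms are dumped from the index, re-weighted by an
// external signal (user query logs here), and only the survivors are kept as
// the dictionary -- so the externally-sourced dictionary still contains
// nothing that isn't in the index.
public class DictionaryPipeline {
    static List<String> rebuild(List<String> indexTerms,
                                Map<String, Integer> queryLogCounts,
                                int minCount) {
        return indexTerms.stream()
                .filter(t -> queryLogCounts.getOrDefault(t, 0) >= minCount)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> indexTerms = List.of("solr", "lucene", "xyzzy");
        Map<String, Integer> logs = Map.of("solr", 120, "lucene", 80, "xyzzy", 1);
        System.out.println(rebuild(indexTerms, logs, 10)); // [solr, lucene]
    }
}
```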



Re: Spellchecker design was Re: Solr 3.1 back compat

Andrzej Białecki-2
In reply to this post by Robert Muir
On 2010-10-26 13:24, Robert Muir wrote:

> PlainTextDictionary? useless... why the hell would you return
> something that isn't in your index?!

This is a legitimate question, but there's no need to shout :)

Sometimes you want a dictionary that is cleaned up and re-weighted by an
external process (human-based or other), even if it originally came from
your index. So it's not either/or - you can have a file-based dictionary
that nonetheless gives you stuff that _is_ in your index.

(Yeah, and sorted vs. unsorted ... I tried to hack it by tagging some
classes with a SortedIterator, but it was indeed a half-hearted
attempt... it needs to be fixed, not worked around).

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Spellchecker design was Re: Solr 3.1 back compat

Robert Muir
In reply to this post by Grant Ingersoll-2
On Tue, Oct 26, 2010 at 8:11 AM, Grant Ingersoll <[hidden email]> wrote:

>>
>> And, let's say I have a hunspell dictionary for my language... how do I
>> plug this in? I don't want it to implement Dictionary, because I'm not
>> stupid enough to return something that's not in my index (see below);
>> maybe I only want to use it as a 'filter' to prevent suggestions that
>> are spelled incorrectly...
>
> Implement an Index backed Dictionary that filters by Hunspell and feeds into the Spellchecker.  I've seen that done on more than one occasion.
>

Again though, I don't think it should be at the Dictionary level. For
example, my spellchecker (DirectSpellChecker) uses no dictionary... so
if I want to filter its results with Hunspell, I mean, this is
perfectly reasonable... and maybe I want to filter the results from
AutoSuggest with Hunspell?!

Certainly I can add Hunspell support to DirectSpellChecker myself, but
you see how this is sorta silly: if someone wants it with the
IndexBasedSpellChecker then it has to be implemented there too. Yet I
think we could add some idea like SpellCheckFilter (filters spellcheck
results) where people could plug this stuff in themselves, and it works
with all these checkers/suggesters/whatever.
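A minimal sketch of that SpellCheckFilter idea (the interface name comes from this thread, and the Hunspell-backed check is faked with a toy word list here):

```java
import java.util.List;
import java.util.stream.Collectors;

// Sketch of the SpellCheckFilter idea: a post-processing hook that any
// checker/suggester could apply to its raw suggestions, so e.g. a Hunspell
// check is written once instead of inside each spellchecker implementation.
interface SpellCheckFilter {
    boolean accept(String suggestion);
}

public class FilteredSuggestions {
    static List<String> filter(List<String> raw, SpellCheckFilter f) {
        return raw.stream().filter(f::accept).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Stand-in for a real Hunspell-backed check: just a toy word list.
        List<String> known = List.of("search", "searcher", "searching");
        SpellCheckFilter hunspellLike = known::contains;
        System.out.println(filter(List.of("search", "saerch", "searcher"), hunspellLike));
        // [search, searcher]
    }
}
```

The same filter instance could then be handed to DirectSpellChecker, IndexBasedSpellChecker, or an auto-suggester without any of them knowing about Hunspell.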

I felt other things were at the dictionary level that shouldn't be, for
example HighFrequencyDictionary (which is only in Solr, and should
probably be factored into Lucene).
In my case I wanted to provide this to Lucene users, so I just do it
at runtime via thresholdFrequency, since the docfreq is free from the
TermsEnum anyway.

> Sounds great.  I also think the notion of onlyMorePopular is screwed up and needs to be revisited.

Yes, I don't really understand this... and some of the behavior around it!


>> PlainTextDictionary? useless... why the hell would you return
>> something that isn't in your index?!
>
> It can be quite useful to have an external source of tokens, and I've seen it in action on several occasions.  Just because they are fed in from an external source doesn't mean they aren't in the index.  For instance, dump your terms from the index, do some downstream processing according to user logs or whatever (or Hunspell, if you want), and then load them back into the spellchecker.

Right, but see above: in my case I don't "load anything", since I have
no data structure... So I think the API can/should be flexible enough
to do these kinds of things without the notion of taking data from one
index and shoving it into another.

And this special use case shouldn't slow down the common use case
where it's a LuceneDictionary.

In general, I know I sound like a big whiner, but I actually think we
have a huge opportunity here. It looked to me (at a glance) that now
that Lucene/Solr are merged, we can fix this stuff across both Lucene
and Solr more easily.


Re: Spellchecker design was Re: Solr 3.1 back compat

Robert Muir
In reply to this post by Andrzej Białecki-2
On Tue, Oct 26, 2010 at 8:19 AM, Andrzej Bialecki <[hidden email]> wrote:
> Sometimes you want a dictionary that is cleaned up and re-weighted by an
> external process (human-based or other), even if it originally came from
> your index. So it's not either/or - you can have a file-based dictionary
> that nonetheless gives you stuff that _is_ in your index.

Right, and I would like to possibly support this in my spellchecker
via DFA intersection at runtime (intersect the special cleaned-up DFA
with the Levenshtein query DFA).
But the underlying "dictionary" (the Lucene index) is unchanged;
instead, this would act like a filter.

It would be nice if the concept were somehow more general and, for the
other spellcheckers, *implemented* via Dictionary; but that shouldn't
be the only way.

>
> (Yeah, and sorted vs. unsorted ... I tried to hack it by tagging some
> classes with a SortedIterator, but it was indeed a half-hearted
> attempt... it needs to be fixed, not worked around).
>

It would be cool to add this to Lucene in the short term, so we could
mark the LuceneDictionary as being in sorted order... then we could
explore the TermEnum optimization I spoke of, rather than calling
IndexReader.docFreq() on the spellcheck index for every term in the
dictionary to see if it already exists.

Yeah, I know if they are sorted they will tend to be in the same TII
block, and the term dictionary cache will generally work, but I think
it would still end up faster... and there'd be no need to completely
hose the term dictionary cache to rebuild a spellcheck index.
