Bug in Lucene Sandbox SpellChecker class (fix included)

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
14 messages Options
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Bug in Lucene Sandbox SpellChecker class (fix included)

Michael Harhen
While debugging a Lucene application using the Sandbox SpellChecker, I
have identified a problem with SpellChecker's suggestSimilar(...) method.

Lines 175-177 are currently:
            if (sugword.string==word) {
                continue; // don't suggest a word for itself, that would
be silly
            }

I believe that "==" should be replaced by equals(), so the code should read:
            if (sugword.string.equals(word)) {
                continue; // don't suggest a word for itself, that would
be silly
            }

I cannot think of any case where using "==" would give the expected
result. Also, making the change eliminated my bug!

What is the procedure to get this fix incorporated in the Sandbox?

Regards
Michael H



---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Bug in Lucene Sandbox SpellChecker class (fix included)

Daniel Naber-5
On Samstag 28 Januar 2006 17:06, Michael Harhen wrote:

> I believe that "==" should be replaced by equals(), so the code should
> read: if (sugword.string.equals(word)) {
>                 continue; // don't suggest a word for itself, that would

> What is the procedure to get this fix incorporated in the Sandbox?

Thanks, I have committed your fix and extended the test case to include
this case.

Regards
 Daniel

--
http://www.danielnaber.de

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Bug in Lucene Sandbox SpellChecker class (fix included)

davekor@gmail.com
In reply to this post by Michael Harhen
I am not sure about that particular piece of code, but Lucene always
use interned strings, which makes string compares a straightforward
and very fast "==" object reference comparison operation rather than
the use the slow equals() character by character compare function.

If spellchecker's strings all use intern() strings, then using
equals() will simply slow the spell checker's performance without
fixing any bugs.  (Of course, I could be totally wrong too)

On 1/29/06, Michael Harhen <[hidden email]> wrote:

> While debugging a Lucene application using the Sandbox SpellChecker, I
> have identified a problem with SpellChecker's suggestSimilar(...) method.
>
> Lines 175-177 are currently:
>             if (sugword.string==word) {
>                 continue; // don't suggest a word for itself, that would
> be silly
>             }
>
> I believe that "==" should be replaced by equals(), so the code should read:
>             if (sugword.string.equals(word)) {
>                 continue; // don't suggest a word for itself, that would
> be silly
>             }
>
> I cannot think of any case where using "==" would give the expected
> result. Also, making the change eliminated my bug!
>
> What is the procedure to get this fix incorporated in the Sandbox?
>
> Regards
> Michael H
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]
>
>


--
Dave Kor, Research Assistant
Center for Information Mining and Extraction
School of Computing
National University of Singapore.

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Tree based BitSet (aka IntegerSet, DocSet...)

eks dev
In reply to this post by Daniel Naber-5
might be interesting:
http://www.iis.uni-stuttgart.de/intset/

Another way to represent Bit(Integer)Set. Should
outperform nicely BitSet or HashBitSet as far as
iteration speed and memory is concern. In Lucene where
distribution of set bits is typically exponential...
usage in caching, Filter...


               
___________________________________________________________
Win a BlackBerry device from O2 with Yahoo!. Enter now. http://www.yahoo.co.uk/blackberry

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Tree based BitSet (aka IntegerSet, DocSet...)

Paul Elschot
On Saturday 28 January 2006 19:27, eks dev wrote:
> might be interesting:
> http://www.iis.uni-stuttgart.de/intset/
>
> Another way to represent Bit(Integer)Set. Should
> outperform nicely BitSet or HashBitSet as far as
> iteration speed and memory is concern. In Lucene where
> distribution of set bits is typically exponential...
> usage in caching, Filter...

This gives some context, a performance comparison program,
and indicates that the licence for the context is LGPL:

http://www.iis.uni-stuttgart.de/personen/lippold/MathCollection/index-en.html

Regards,
Paul Elschot

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Tree based BitSet (aka IntegerSet, DocSet...)

Andrzej Białecki-2
Paul Elschot wrote:

> On Saturday 28 January 2006 19:27, eks dev wrote:
>  
>> might be interesting:
>> http://www.iis.uni-stuttgart.de/intset/
>>
>> Another way to represent Bit(Integer)Set. Should
>> outperform nicely BitSet or HashBitSet as far as
>> iteration speed and memory is concern. In Lucene where
>> distribution of set bits is typically exponential...
>> usage in caching, Filter...
>>    
>
> This gives some context, a performance comparison program,
> and indicates that the licence for the context is LGPL:
>
> http://www.iis.uni-stuttgart.de/personen/lippold/MathCollection/index-en.html
>  

Unfortunately, the license distributed with the JAR (which we must
assume takes precedence over whatever is stated on the web pages) is
much more restrictive, it's the Java Research License, which
specifically disallows any commercial use. So, short of reimplementing
it from scratch it's of no use except for academic study. Pity.

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Bug in Lucene Sandbox SpellChecker class (fix included)

Michael Harhen
In reply to this post by davekor@gmail.com
The intern() method is never applied to variable "word", which is passed
directly into the suggestSimilar(...) method from user code. So in this
case, it appears that my fix is necessary.

Michael H

Dave Kor wrote:

> I am not sure about that particular piece of code, but Lucene always
> use interned strings, which makes string compares a straightforward
> and very fast "==" object reference comparison operation rather than
> the use the slow equals() character by character compare function.
>
> If spellchecker's strings all use intern() strings, then using
> equals() will simply slow the spell checker's performance without
> fixing any bugs.  (Of course, I could be totally wrong too)
>
> On 1/29/06, Michael Harhen <[hidden email]> wrote:
>  
>> While debugging a Lucene application using the Sandbox SpellChecker, I
>> have identified a problem with SpellChecker's suggestSimilar(...) method.
>>
>> Lines 175-177 are currently:
>>             if (sugword.string==word) {
>>                 continue; // don't suggest a word for itself, that would
>> be silly
>>             }
>>
>> I believe that "==" should be replaced by equals(), so the code should read:
>>             if (sugword.string.equals(word)) {
>>                 continue; // don't suggest a word for itself, that would
>> be silly
>>             }
>>
>> I cannot think of any case where using "==" would give the expected
>> result. Also, making the change eliminated my bug!
>>
>> What is the procedure to get this fix incorporated in the Sandbox?
>>
>> Regards
>> Michael H
>>
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [hidden email]
>> For additional commands, e-mail: [hidden email]
>>
>>
>>    
>
>
> --
> Dave Kor, Research Assistant
> Center for Information Mining and Extraction
> School of Computing
> National University of Singapore.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]
>
>
>  
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Tree based BitSet (aka IntegerSet, DocSet...)

psmith-3
In reply to this post by Andrzej Białecki-2
>
> Unfortunately, the license distributed with the JAR (which we must  
> assume takes precedence over whatever is stated on the web pages)  
> is much more restrictive, it's the Java Research License, which  
> specifically disallows any commercial use. So, short of  
> reimplementing it from scratch it's of no use except for academic  
> study. Pity.

Would that preclude re-implementing the same algorithm in new source  
code?  I'm not clear on whether that violates the license.

cheers,

Paul Smith

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Tree based BitSet (aka IntegerSet, DocSet...)

Andrzej Białecki-2
Paul Smith wrote:

>>
>> Unfortunately, the license distributed with the JAR (which we must
>> assume takes precedence over whatever is stated on the web pages) is
>> much more restrictive, it's the Java Research License, which
>> specifically disallows any commercial use. So, short of
>> reimplementing it from scratch it's of no use except for academic
>> study. Pity.
>
> Would that preclude re-implementing the same algorithm in new source
> code?  I'm not clear on whether that violates the license.

No, I'm pretty sure it wouldn't, so long as you don't look at this code,
lest you become "tainted" ... ;-)

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Tree based BitSet (aka IntegerSet, DocSet...)

psmith-3
>
> No, I'm pretty sure it wouldn't, so long as you don't look at this  
> code, lest you become "tainted" ... ;-)

Isn't that where the phrase "I have no recollection of that Senator"  
comes in handy? :)

Paul

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Tree based BitSet (aka IntegerSet, DocSet...)

eks dev
In reply to this post by Andrzej Białecki-2
> Unfortunately, the license distributed with the JAR
> (which we must
> assume takes precedence over whatever is stated on
> the web pages) is
> much more restrictive, it's the Java Research
> License, which
> specifically disallows any commercial use. So, short
> of reimplementing
> it from scratch it's of no use except for academic
> study. Pity.

I guess sun researche licence relates to the class
BitIntegerSet (practically taken from java 6.0 Mustang
release and slightly modified).

The rest (IntegerSet and TreeIntegerSet) represents
unique work and I would bet this part is under LGPL.
If somebody with more open-source leverage (e.g. Doug,
Apache...) could ask
nicely, I beleive nobody would mind to have his work
included in Apache project Lucene?












               
___________________________________________________________
To help you stay safe and secure online, we've developed the all new Yahoo! Security Centre. http://uk.security.yahoo.com

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Tree based BitSet (aka IntegerSet, DocSet...)

Andrzej Białecki-2
eks dev wrote:

>> Unfortunately, the license distributed with the JAR
>> (which we must
>> assume takes precedence over whatever is stated on
>> the web pages) is
>> much more restrictive, it's the Java Research
>> License, which
>> specifically disallows any commercial use. So, short
>> of reimplementing
>> it from scratch it's of no use except for academic
>> study. Pity.
>>    
>
> I guess sun researche licence relates to the class
> BitIntegerSet (practically taken from java 6.0 Mustang
> release and slightly modified).
>
> The rest (IntegerSet and TreeIntegerSet) represents
> unique work and I would bet this part is under LGPL.
> If somebody with more open-source leverage (e.g. Doug,
> Apache...) could ask
> nicely, I beleive nobody would mind to have his work
> included in Apache project Lucene?
>  

It's not about asking nicely - the ASF's policy is that no (L)GPL code
is allowed in the respositories, and for good reasons. The only thing
they/we could do would be to ask nicely to re-license it under ASL.

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Tree based BitSet (aka IntegerSet, DocSet...)

DM Smith
In reply to this post by Andrzej Białecki-2
Andrzej Bialecki wrote:

> Paul Elschot wrote:
>> On Saturday 28 January 2006 19:27, eks dev wrote:
>>  
>>> might be interesting:
>>> http://www.iis.uni-stuttgart.de/intset/
>>>
>>> Another way to represent Bit(Integer)Set. Should
>>> outperform nicely BitSet or HashBitSet as far as
>>> iteration speed and memory is concern. In Lucene where
>>> distribution of set bits is typically exponential...
>>> usage in caching, Filter...    
>>
>> This gives some context, a performance comparison program,
>> and indicates that the licence for the context is LGPL:
>>
>> http://www.iis.uni-stuttgart.de/personen/lippold/MathCollection/index-en.html 
>>
>>  
>
> Unfortunately, the license distributed with the JAR (which we must
> assume takes precedence over whatever is stated on the web pages) is
> much more restrictive, it's the Java Research License, which
> specifically disallows any commercial use. So, short of reimplementing
> it from scratch it's of no use except for academic study. Pity.
The idea is fairly simple: If most uses of BitSet are sparse over a
large universe, then using a tree of BitSets would be cheaper.

The implementation needs an Interface representation of BitSet to work,
with BitSet being derived from that implementation. Thus the code uses
IntegerSet as the fundamental interface. This is essentially all the
methods in BitSet, with BitSet being changed to IntegerSet where ever it
occurs. BitIntegerSet is a copy of BitSet from Sun but trivially
modified to implement IntegerSet.

It is this usage of BitSet that is causing the basic problem.







---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Bug in Lucene Sandbox SpellChecker class (fix included)

Michael Harhen
In reply to this post by Michael Harhen
This fix has been included in the repository.
Thanks....


Michael Harhen wrote:

> While debugging a Lucene application using the Sandbox SpellChecker, I
> have identified a problem with SpellChecker's suggestSimilar(...) method.
>
> Lines 175-177 are currently:
>            if (sugword.string==word) {
>                continue; // don't suggest a word for itself, that
> would be silly
>            }
>
> I believe that "==" should be replaced by equals(), so the code should
> read:
>            if (sugword.string.equals(word)) {
>                continue; // don't suggest a word for itself, that
> would be silly
>            }
>
> I cannot think of any case where using "==" would give the expected
> result. Also, making the change eliminated my bug!
>
> What is the procedure to get this fix incorporated in the Sandbox?
>
> Regards
> Michael H
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Loading...