[jira] Created: (LUCENE-1190) a lexicon object for merging spellchecker and synonyms from stemming

[jira] Created: (LUCENE-1190) a lexicon object for merging spellchecker and synonyms from stemming

Soren Daugaard (Jira)
a lexicon object for merging spellchecker and synonyms from stemming
--------------------------------------------------------------------

                 Key: LUCENE-1190
                 URL: https://issues.apache.org/jira/browse/LUCENE-1190
             Project: Lucene - Java
          Issue Type: New Feature
          Components: contrib/*, Search
    Affects Versions: 2.3
            Reporter: Mathieu Lecarme
         Attachments: aphone+lexicon.patch

Some Lucene features need a list of referring words. Spellchecking is the basic example, but synonyms are another use. Other tools work more smoothly with a list of words, without disturbing the main index: stemming and other simplifications of a word (anagram, phonetic ...).
For that, I suggest a Lexicon object, which contains words (Term + frequency) and which can be built from a Lucene Directory or from plain text files.
Classical TokenFilters can be used with a Lexicon (LowerCaseFilter and ISOLatin1AccentFilter should be the most useful).
The Lexicon uses a Lucene Directory: each Word is a Document, each meta is a Field (word, ngram, phonetic, fields, anagram, size ...).
Above a minimum size, the number of different words used in an index can be considered stable. So a standard Lexicon (built from Wikipedia, for example) can be used.
A similarTokenFilter is provided.
A spellchecker will come soon.
A fuzzySearch implementation and a neutral synonym TokenFilter can be done.
Unused words can be removed on demand (lazy delete?).

Any criticism or suggestions?


--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


[jira] Updated: (LUCENE-1190) a lexicon object for merging spellchecker and synonyms from stemming

Soren Daugaard (Jira)

     [ https://issues.apache.org/jira/browse/LUCENE-1190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mathieu Lecarme updated LUCENE-1190:
------------------------------------

    Attachment: aphone+lexicon.patch


[jira] Updated: (LUCENE-1190) a lexicon object for merging spellchecker and synonyms from stemming

Soren Daugaard (Jira)
In reply to this post by Soren Daugaard (Jira)

     [ https://issues.apache.org/jira/browse/LUCENE-1190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mathieu Lecarme updated LUCENE-1190:
------------------------------------

    Attachment: aphone+lexicon.patch


[jira] Commented: (LUCENE-1190) a lexicon object for merging spellchecker and synonyms from stemming

Soren Daugaard (Jira)
In reply to this post by Soren Daugaard (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-1190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12573907#action_12573907 ]

Mathieu Lecarme commented on LUCENE-1190:
-----------------------------------------

New features:
A helper to extend a query with terms similar to each term:
+type:dog +name:rintint*
will become:
+type:(+dog (dogs doggy)^0.7) +name:rintint*
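
A minimal sketch of the rewriting step above, assuming the similar words have already been looked up in the lexicon. The class and method names (SimilarityExpander, expand) are hypothetical and not taken from the patch; the code only builds the query string and does not call any Lucene API.

```java
import java.util.List;

// Hypothetical helper, not from the patch: builds the expanded clause as text.
class SimilarityExpander {

    /** Turns field:term into +field:(+term (similar words)^boost). */
    static String expand(String field, String term, List<String> similar, double boost) {
        if (similar.isEmpty()) {
            // Nothing to add: keep the original required clause.
            return "+" + field + ":" + term;
        }
        // The original term stays required; the similar words are optional
        // and share one boost, so exact matches still rank higher.
        return "+" + field + ":(+" + term
                + " (" + String.join(" ", similar) + ")^" + boost + ")";
    }

    public static void main(String[] args) {
        String clause = expand("type", "dog", List.of("dogs", "doggy"), 0.7);
        System.out.println(clause + " +name:rintint*");
    }
}
```

The prefix clause (name:rintint*) is left alone here, since a wildcard term has no single word to look up in the lexicon.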

A "Did you mean" pattern packaged over IndexSearcher. If the number of search results is under a threshold, a sorted suggestion list for each term is provided, along with a rewritten query sentence:
truc:brawn
will become:
truc:brown
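
One way this could work, as a sketch only: rank lexicon words by edit distance to the term, break ties by frequency, and substitute the best one when the hit count is below the threshold. The names (DidYouMean, suggest, rewrite), the use of Levenshtein distance and the sample frequencies are all assumptions for illustration; the patch may rank suggestions differently.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not from the patch.
class DidYouMean {

    /** Classic Levenshtein edit distance, two rows at a time. */
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            int[] cur = new int[b.length() + 1];
            cur[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                cur[j] = Math.min(Math.min(cur[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            prev = cur;
        }
        return prev[b.length()];
    }

    /** Lexicon words within maxDistance, nearest first, then most frequent first. */
    static List<String> suggest(String term, Map<String, Integer> lexicon, int maxDistance) {
        List<String> out = new ArrayList<>();
        for (String word : lexicon.keySet()) {
            if (!word.equals(term) && distance(term, word) <= maxDistance) out.add(word);
        }
        Comparator<String> byDistance = Comparator.comparingInt(w -> distance(term, w));
        Comparator<String> byPopularity = Comparator.comparingInt(w -> -lexicon.get(w));
        out.sort(byDistance.thenComparing(byPopularity));
        return out;
    }

    /** Rewrites field:term only when the search returned fewer hits than the threshold. */
    static String rewrite(String field, String term, int hits, int threshold,
                          Map<String, Integer> lexicon) {
        if (hits >= threshold) return field + ":" + term;
        List<String> suggestions = suggest(term, lexicon, 1);
        return suggestions.isEmpty() ? field + ":" + term : field + ":" + suggestions.get(0);
    }

    public static void main(String[] args) {
        // Word -> frequency; the numbers are made up for the example.
        Map<String, Integer> lexicon = Map.of("brown", 10, "brawl", 3, "crown", 8);
        System.out.println(rewrite("truc", "brawn", 0, 1, lexicon));
    }
}
```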





[jira] Commented: (LUCENE-1190) a lexicon object for merging spellchecker and synonyms from stemming

Soren Daugaard (Jira)
In reply to this post by Soren Daugaard (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-1190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12574182#action_12574182 ]

Otis Gospodnetic commented on LUCENE-1190:
------------------------------------------

This sounds like something that might be interesting, but honestly I don't follow the initial description and the 300KB+ patch is a big one.

For example, I don't know what you mean by "Some Lucene features need a list of referring word".  Do you mean "a list of associated words"?

{quote}
Lexicon uses a Lucene Directory, each Word is a Document, each meta is a Field (word, ngram, phonetic, fields, anagram, size ...).
{quote}

Each meta is a Field.... what do you mean by that?  Could you please give an example?

{quote}
Above a minimum size, the number of different words used in an index can be considered stable. So a standard Lexicon (built from Wikipedia, for example) can be used.
{quote}

Hm, not sure I know what you mean.  Are you saying that once you create a sufficiently large lexicon/dictionary/index, the number of new terms starts decreasing? (Heaps' law? http://en.wikipedia.org/wiki/Heaps'_law )



[jira] Commented: (LUCENE-1190) a lexicon object for merging spellchecker and synonyms from stemming

Soren Daugaard (Jira)
In reply to this post by Soren Daugaard (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-1190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12574214#action_12574214 ]

Mathieu Lecarme commented on LUCENE-1190:
-----------------------------------------


With a FuzzyQuery, for example, you iterate over the Terms in the index,
looking for the nearest ones. PrefixQuery and regular expression queries
work in a similar way.
If you decide that fuzzy querying will never return a word whose size
differs by more than 1 (size+1 or size-1), you can restrict the list of
candidates, and an ngram index can help you even more.
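
A rough illustration of that restriction, written over an in-memory word list rather than a Lucene index. In the Lexicon this would presumably be a query on size and ngram fields; the class below (CandidateFilter) is hypothetical, and using bigrams is an assumption.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch, not from the patch.
class CandidateFilter {

    /** All two-letter substrings of the word. */
    static Set<String> bigrams(String word) {
        Set<String> grams = new HashSet<>();
        for (int i = 0; i + 2 <= word.length(); i++) {
            grams.add(word.substring(i, i + 2));
        }
        return grams;
    }

    /** Keeps words whose size is within 1 of the term and that share a bigram with it. */
    static List<String> candidates(String term, List<String> lexicon) {
        Set<String> termGrams = bigrams(term);
        List<String> kept = new ArrayList<>();
        for (String word : lexicon) {
            // Size filter: size-1, size or size+1 only.
            if (Math.abs(word.length() - term.length()) > 1) continue;
            // Ngram filter: at least one bigram in common.
            Set<String> shared = bigrams(word);
            shared.retainAll(termGrams);
            if (!shared.isEmpty()) kept.add(word);
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> lexicon = List.of("brown", "brawl", "crown", "bra", "brownies", "table");
        System.out.println(candidates("brawn", lexicon));
    }
}
```

Only the surviving candidates then need the more expensive distance computation.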

Some token filters destroy the word; a stemmer, for example. If you want
to search widely, a stemmer can help you, but you can't use PrefixQuery
with stemmed words. So you can stem the words in a lexicon and use the
stem as a synonym: you index "dog" and look for "doggy", "dogs" and "dog".
The Lexicon can use a static list of words, from a hunspell index or
Wikipedia parsing, or words extracted from your index.

For the word "Lucene":

word:lucene
pop:42
anagram.anagram:celnu
aphone.start:LS
aphone.gram:LS
aphone.gram:SN
aphone.end:SN
aphone.size:3
aphone.phonem:LSN
ngram.start:lu
ngram.gram:lu
ngram.gram:uc
ngram.gram:ce
ngram.gram:en
ngram.gram:ne
ngram.end:ne
ngram.size:6
stemmer.stem:lucen
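
A sketch of how the word, anagram and ngram fields in this listing could be derived. The rules are inferred from the example values only and are assumptions: the anagram key is taken to be the sorted distinct letters, the grams are bigrams, and ngram.size is the word length. The aphone and stemmer fields are left out because they need a phonetic encoder and a stemmer. LexiconFields is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Hypothetical sketch, not from the patch.
class LexiconFields {

    /** Sorted distinct letters, e.g. "lucene" gives "celnu" (assumed rule). */
    static String anagramKey(String word) {
        TreeSet<Character> letters = new TreeSet<>();
        for (char c : word.toCharArray()) letters.add(c);
        StringBuilder key = new StringBuilder();
        for (char c : letters) key.append(c);
        return key.toString();
    }

    /** Consecutive substrings of length n, in order of appearance. */
    static List<String> ngrams(String word, int n) {
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= word.length(); i++) {
            grams.add(word.substring(i, i + n));
        }
        return grams;
    }

    /** The field:value pairs for one word, in the order used in the listing. */
    static List<String> fields(String word, int pop) {
        List<String> out = new ArrayList<>();
        out.add("word:" + word);
        out.add("pop:" + pop);
        out.add("anagram.anagram:" + anagramKey(word));
        List<String> grams = ngrams(word, 2);
        if (!grams.isEmpty()) {
            out.add("ngram.start:" + grams.get(0));
            for (String gram : grams) out.add("ngram.gram:" + gram);
            out.add("ngram.end:" + grams.get(grams.size() - 1));
        }
        out.add("ngram.size:" + word.length());
        return out;
    }

    public static void main(String[] args) {
        for (String field : fields("lucene", 42)) System.out.println(field);
    }
}
```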


Yes.

M.



Re: [jira] Commented: (LUCENE-1190) a lexicon object for merging spellchecker and synonyms from stemming

Mathieu Lecarme
Hmm, the quotes and questions disappeared.

On 2 March 2008 at 13:32, Mathieu Lecarme (JIRA) wrote:

>
>    [ https://issues.apache.org/jira/browse/LUCENE-1190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12574214#action_12574214 ]
>
> Mathieu Lecarme commented on LUCENE-1190:
> -----------------------------------------
>
>
>> For example, I don't know what you mean by "Some Lucene features
>> need a list of referring word".  Do you mean "a list of associated
>> words"?

> With a FuzzyQuery, for example, you iterate over the Terms in the index,
> looking for the nearest ones. PrefixQuery and regular expression queries
> work in a similar way.
> If you decide that fuzzy querying will never return a word whose size
> differs by more than 1 (size+1 or size-1), you can restrict the list of
> candidates, and an ngram index can help you even more.
>
> Some token filters destroy the word; a stemmer, for example. If you want
> to search widely, a stemmer can help you, but you can't use PrefixQuery
> with stemmed words. So you can stem the words in a lexicon and use the
> stem as a synonym: you index "dog" and look for "doggy", "dogs" and "dog".
> The Lexicon can use a static list of words, from a hunspell index or
> Wikipedia parsing, or words extracted from your index.

>> Each meta is a Field.... what do you mean by that?  Could you
>> please give an example?

> For the word "Lucene":
>
> word:lucene
> pop:42
> anagram.anagram:celnu
> aphone.start:LS
> aphone.gram:LS
> aphone.gram:SN
> aphone.end:SN
> aphone.size:3
> aphone.phonem:LSN
> ngram.start:lu
> ngram.gram:lu
> ngram.gram:uc
> ngram.gram:ce
> ngram.gram:en
> ngram.gram:ne
> ngram.end:ne
> ngram.size:6
> stemmer.stem:lucen
>
>

>> Hm, not sure I know what you mean.  Are you saying that once you
>> create a sufficiently large lexicon/dictionary/index, the number of
>> new terms starts decreasing? (Heaps' law? http://en.wikipedia.org/wiki/Heaps'_law )

> Yes.

[jira] Commented: (LUCENE-1190) a lexicon object for merging spellchecker and synonyms from stemming

Soren Daugaard (Jira)
In reply to this post by Soren Daugaard (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-1190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12576415#action_12576415 ]

Mathieu Lecarme commented on LUCENE-1190:
-----------------------------------------

A simpler preview of the Lexicon features:
http://blog.garambrogne.net/index.php?post/2008/03/07/A-lexicon-approach-for-Lucene-index

