[jira] Created: (LUCENE-2522) add simple japanese tokenizer, based on tinysegmenter


[jira] Created: (LUCENE-2522) add simple japanese tokenizer, based on tinysegmenter

JIRA jira@apache.org
add simple japanese tokenizer, based on tinysegmenter
-----------------------------------------------------

                 Key: LUCENE-2522
                 URL: https://issues.apache.org/jira/browse/LUCENE-2522
             Project: Lucene - Java
          Issue Type: New Feature
          Components: contrib/analyzers
            Reporter: Robert Muir
            Priority: Minor


TinySegmenter (http://www.chasen.org/~taku/software/TinySegmenter/) is a tiny Japanese segmenter.

It was ported to Java/Lucene by Kohei TAKETA <[hidden email]>,
and is under friendly license terms (BSD; some files explicitly disclaim copyright to the source code, giving a blessing instead).

Koji knows the author, and has already contacted him about incorporating it into Lucene:
{noformat}
I've contacted Takeda-san who is the creator of the Java version of
TinySegmenter. He said he is happy if his program is part of Lucene.
He is a co-author of my book about Solr published in Japan, BTW. ;-)
{noformat}


--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


[jira] Updated: (LUCENE-2522) add simple japanese tokenizer, based on tinysegmenter

JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/LUCENE-2522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-2522:
--------------------------------

    Attachment: LUCENE-2522.patch

Here is a really quickly done patch, just to get started (not really for committing):

* converted their tests to BaseTokenStreamTestCase tests
* changed it to use CharTermAttribute instead of TermAttribute
* added clearAttributes()
* made the class final
* added a Solr factory

The code is nice: it is set up to work on Unicode code points etc., but I think we can improve
it by using CharArrayMaps for speed and by using Lucene's code point I/O stuff in CharUtils.
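
To make "work on Unicode code points" concrete, here is a minimal sketch using only the JDK. It is not taken from the patch and does not use Lucene's own code point helpers; the class name CodePointWalk and its method are hypothetical. It shows why a segmenter should step through a buffer by code point: a supplementary character occupies two chars, and splitting between them would produce a broken token.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the patch: walks a char buffer by
// Unicode code point rather than by UTF-16 char.
final class CodePointWalk {
    private CodePointWalk() {}

    /** Returns the code points found in buf[0..length), in order. */
    static List<Integer> codePoints(char[] buf, int length) {
        List<Integer> out = new ArrayList<>();
        int i = 0;
        while (i < length) {
            // Reads a surrogate pair as one code point when both halves
            // lie inside the limit; otherwise returns the single char.
            int cp = Character.codePointAt(buf, i, length);
            out.add(cp);
            // Advance by 1 char for BMP characters, 2 for supplementary ones.
            i += Character.charCount(cp);
        }
        return out;
    }
}
```

A segmenter that scores boundaries per character would then iterate over this sequence instead of over raw chars, so a boundary can never fall inside a surrogate pair.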



[jira] Updated: (LUCENE-2522) add simple japanese tokenizer, based on tinysegmenter

JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/LUCENE-2522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-2522:
--------------------------------

    Attachment: LUCENE-2522.patch

I refactored the TinySegmenterConstants to use ints/switch statements instead of all the hashmaps.

This creates a larger .java file, but it's a smaller .class, and scoring no longer has to create 24 strings per character.
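
The general idea behind that refactoring can be sketched as follows. This is an illustration only, not the code in the patch: the class name BigramWeights, the type codes, and all weights are made up. It contrasts a lookup keyed by a freshly built String with a switch over a packed int key, which allocates nothing per lookup.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: type codes and weights are invented for this sketch
// and are not taken from TinySegmenter or from the patch.
final class BigramWeights {
    static final int T_OTHER = 0, T_HIRAGANA = 1, T_KATAKANA = 2, T_KANJI = 3;
    static final int NUM_TYPES = 4;

    // Map-based table: every lookup has to build a String key first.
    private static final Map<String, Integer> BY_STRING = new HashMap<>();
    static {
        BY_STRING.put("1,3", 120);
        BY_STRING.put("3,1", -45);
        BY_STRING.put("2,2", -300);
    }

    private BigramWeights() {}

    /** Maps a code point to a small int type code. */
    static int charType(int codePoint) {
        switch (Character.UnicodeScript.of(codePoint)) {
            case HIRAGANA: return T_HIRAGANA;
            case KATAKANA: return T_KATAKANA;
            case HAN:      return T_KANJI;
            default:       return T_OTHER;
        }
    }

    static int weightViaMap(int left, int right) {
        Integer w = BY_STRING.get(left + "," + right); // allocates a String
        return w == null ? 0 : w;
    }

    static int weightViaSwitch(int left, int right) {
        // Pack the two type codes into one int; case labels are
        // compile-time constants, so no objects are created here.
        switch (left * NUM_TYPES + right) {
            case T_HIRAGANA * NUM_TYPES + T_KANJI:    return 120;
            case T_KANJI * NUM_TYPES + T_HIRAGANA:    return -45;
            case T_KATAKANA * NUM_TYPES + T_KATAKANA: return -300;
            default:                                  return 0;
        }
    }
}
```

Both methods return the same weights; the switch version trades a longer source file for lookups that create no temporary strings, which is the trade-off described above.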



[jira] Updated: (LUCENE-2522) add simple japanese tokenizer, based on tinysegmenter

JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/LUCENE-2522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-2522:
--------------------------------

        Fix Version/s: 3.1
                       4.0
    Affects Version/s: 3.0.3


[jira] Updated: (LUCENE-2522) add simple japanese tokenizer, based on tinysegmenter

JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/LUCENE-2522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-2522:
--------------------------------

    Affects Version/s:     (was: 3.0.3)
