[jira] Commented: (LUCENE-2167) Implement StandardTokenizer with the UAX#29 Standard


JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/LUCENE-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12888373#action_12888373 ]

Steven Rowe commented on LUCENE-2167:
-------------------------------------

I tried increasing the number of documents in the benchmark alg from 10k to 50k, but 50k docs apparently no longer fit in my OS filesystem cache: the run thrashed the whole time and performance was more than an order of magnitude worse.

I increased the number of rounds from 5 to 25 and the number of documents from 10k to 20k; below are three runs with these settings:

Run 1:
||Operation||recsPerRun||rec/s||elapsedSec||
|ClassicTokenizer|2467769|669,134.75|3.69|
|ICUTokenizer|2481688|548,924.56|4.52|
|RBBITokenizer|2481688|573,270.50|4.33|
|StandardTokenizer|2481687|656,704.69|3.78|
|UAX29Tokenizer|2481688|661,254.44|3.75|

Run 2:
||Operation||recsPerRun||rec/s||elapsedSec||
|ClassicTokenizer|2467769|667,867.12|3.69|
|ICUTokenizer|2481688|546,025.94|4.54|
|RBBITokenizer|2481688|576,466.44|4.30|
|StandardTokenizer|2481687|656,878.50|3.78|
|UAX29Tokenizer|2481688|665,510.31|3.73|

Run 3:
||Operation||recsPerRun||rec/s||elapsedSec||
|ClassicTokenizer|2467769|664,092.81|3.72|
|ICUTokenizer|2481688|551,486.25|4.50|
|RBBITokenizer|2481688|581,191.56|4.27|
|StandardTokenizer|2481687|655,317.38|3.79|
|UAX29Tokenizer|2481688|663,021.12|3.74|

These results are more consistent. I think the ~3% performance hit for the new StandardTokenizer relative to ClassicTokenizer is acceptable.
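
For anyone who wants a rough standalone reproduction outside the benchmark framework, a minimal sketch might look like the code below. Everything in it is hypothetical: the class name, sample text, and iteration count are made up, there is no JVM warmup or statistical rigor, and it targets the post-3.x reusable-tokenizer API (no-arg constructors plus setReader()) rather than the 3.x-era Version/Reader constructors used in the patches here. It only illustrates the per-tokenizer consume loop that a benchmark round exercises.

{code:java}
import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.Tokenizer;
// In Lucene 4.x-8.x ClassicTokenizer lives in org.apache.lucene.analysis.standard;
// in 9.x it moved to org.apache.lucene.analysis.classic.
import org.apache.lucene.analysis.standard.ClassicTokenizer;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// Hypothetical class: not part of the patch or of the benchmark alg.
public class TokenizerTiming {

  // Standard TokenStream consume cycle: setReader, reset, incrementToken loop, end, close.
  static long countTokens(Tokenizer tok, String text) throws IOException {
    tok.setReader(new StringReader(text));
    CharTermAttribute term = tok.addAttribute(CharTermAttribute.class); // exposes the token text if you want to inspect it
    long count = 0;
    tok.reset();
    while (tok.incrementToken()) {
      count++;
    }
    tok.end();
    tok.close(); // the tokenizer stays reusable afterwards via setReader()
    return count;
  }

  public static void main(String[] args) throws IOException {
    // Invented sample text and iteration count; the real benchmark streams 20k documents per round.
    String text = "The quick brown fox, XY&Z Corp., can't jump over the lazy dog at example.com.";
    int iterations = 20_000;

    for (Tokenizer tok : new Tokenizer[] { new ClassicTokenizer(), new StandardTokenizer() }) {
      long tokens = 0;
      long start = System.nanoTime();
      for (int i = 0; i < iterations; i++) {
        tokens += countTokens(tok, text);
      }
      double secs = (System.nanoTime() - start) / 1e9;
      System.out.printf("%s: %d tokens, %.0f tokens/sec%n",
          tok.getClass().getSimpleName(), tokens, tokens / secs);
    }
  }
}
{code}

For real numbers, use the benchmark alg attached to this issue rather than a loop like this; the sketch only shows the shape of the comparison.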

> Implement StandardTokenizer with the UAX#29 Standard
> ----------------------------------------------------
>
>                 Key: LUCENE-2167
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2167
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/analyzers
>    Affects Versions: 3.1
>            Reporter: Shyamal Prasad
>            Assignee: Robert Muir
>            Priority: Minor
>         Attachments: LUCENE-2167-jflex-tld-macro-gen.patch, LUCENE-2167-jflex-tld-macro-gen.patch, LUCENE-2167-jflex-tld-macro-gen.patch, LUCENE-2167-lucene-buildhelper-maven-plugin.patch, LUCENE-2167.benchmark.patch, LUCENE-2167.benchmark.patch, LUCENE-2167.benchmark.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, standard.zip
>
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>
> It would be really nice for StandardTokenizer to adhere to the standard as closely as we can with JFlex. Then its name would actually make sense.
> Such a transition would involve renaming the old StandardTokenizer to EuropeanTokenizer, as its javadoc claims:
> bq. This should be a good tokenizer for most European-language documents
> The new StandardTokenizer could then say
> bq. This should be a good tokenizer for most languages.
> All the English/Euro-centric handling such as the acronym/company/apostrophe rules can stay with that EuropeanTokenizer, and it could be used by the European analyzers.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

