[jira] Created: (LUCENE-969) Optimize the core tokenizers/analyzers & deprecate Token.termText

[jira] Created: (LUCENE-969) Optimize the core tokenizers/analyzers & deprecate Token.termText

Tim Allison (Jira)
Optimize the core tokenizers/analyzers & deprecate Token.termText
-----------------------------------------------------------------

                 Key: LUCENE-969
                 URL: https://issues.apache.org/jira/browse/LUCENE-969
             Project: Lucene - Java
          Issue Type: Improvement
          Components: Analysis
    Affects Versions: 2.3
            Reporter: Michael McCandless
            Assignee: Michael McCandless
            Priority: Minor
             Fix For: 2.3


There is some "low hanging fruit" for optimizing the core tokenizers
and analyzers:

  - Re-use a single Token instance during indexing instead of creating
    a new one for every term.  To do this, I added a new method "Token
    next(Token result)" (Doron's suggestion), which means a TokenStream
    may use the "Token result" as the returned Token, but is not
    required to (i.e., it can still return an entirely different Token
    if that is more convenient).  I added default implementations for
    both next() methods in TokenStream.java so that a TokenStream can
    choose to implement only one of the next() methods (see the sketch
    after this list).

  - Use "char[] termBuffer" in Token instead of the "String
    termText".

    Token now maintains a char[] termBuffer for holding the term's
    text.  Tokenizers & filters should retrieve this buffer and
    directly alter it to put the term text in or change the term
    text.

    I only deprecated the termText() method.  I still allow the ctors
    that pass in String termText, as well as setTermText(String), but
    added a NOTE about the performance cost of using these methods.  I
    think it's OK to keep these as convenience methods?

    After the next release, when we can remove the deprecated API, we
    should clean up Token.java to no longer maintain "either String or
    char[]" (and the initTermBuffer() private method) and always use
    the char[] termBuffer instead.

  - Re-use TokenStream instances across Fields & Documents instead of
    creating a new one for each doc.  To do this I added an optional
    "reusableTokenStream(...)" to Analyzer which just defaults to
    calling tokenStream(...), and then I implemented this for the core
    analyzers.
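
As a rough sketch of the reuse pattern above (a hypothetical filter, not
part of the patch; accessor names like termBuffer()/termLength()/
setTermLength() are assumed from this description and may differ in the
committed version):

  // Hypothetical filter illustrating the reuse pattern: implement
  // next(Token) and edit the term text in place via the char[] termBuffer
  // instead of allocating a new Token/String per term.
  import java.io.IOException;
  import org.apache.lucene.analysis.Token;
  import org.apache.lucene.analysis.TokenFilter;
  import org.apache.lucene.analysis.TokenStream;

  public final class TrimWhitespaceFilter extends TokenFilter {

    public TrimWhitespaceFilter(TokenStream input) {
      super(input);
    }

    // Reusable variant: the caller hands us a Token it may want us to fill.
    public Token next(Token result) throws IOException {
      Token t = input.next(result);  // upstream may reuse 'result' or return its own Token
      if (t == null)
        return null;
      char[] buf = t.termBuffer();   // edit the term text in place, no new String
      int end = t.termLength();
      int start = 0;
      while (start < end && Character.isWhitespace(buf[start])) start++;
      while (end > start && Character.isWhitespace(buf[end - 1])) end--;
      if (start > 0)
        System.arraycopy(buf, start, buf, 0, end - start);
      t.setTermLength(end - start);
      return t;
    }
  }

Because TokenStream supplies default implementations for both next()
variants, existing code that only calls the no-argument next() would still
work against a filter like this.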

I'm using the patch from LUCENE-967 for benchmarking just
tokenization.

The changes above give a 21% speedup (742 seconds -> 585 seconds) for
a LowerCaseTokenizer -> StopFilter -> PorterStemFilter chain,
tokenizing all of Wikipedia, on JDK 1.6 -server -Xmx1024M, Debian
Linux, RAID 5 IO system (best of 2 runs).

If I pre-break the Wikipedia docs into 100-token docs, then it's 37%
faster (1236 sec -> 774 sec), I think because of re-using TokenStreams
across docs.

I'm just running with this alg and recording the elapsed time:

  analyzer=org.apache.lucene.analysis.LowercaseStopPorterAnalyzer
  doc.tokenize.log.step=50000
  docs.file=/lucene/wikifull.txt
  doc.maker=org.apache.lucene.benchmark.byTask.feeds.LineDocMaker
  doc.tokenized=true
  doc.maker.forever=false

  {ReadTokens > : *

See this thread for discussion leading up to this:

  http://www.gossamer-threads.com/lists/lucene/java-dev/51283

I also fixed Token.toString() to work correctly when termBuffer is
used (and added a unit test).


[jira] Updated: (LUCENE-969) Optimize the core tokenizers/analyzers & deprecate Token.termText

Tim Allison (Jira)

     [ https://issues.apache.org/jira/browse/LUCENE-969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-969:
--------------------------------------

    Attachment: LUCENE-969.patch

First-cut patch.  All tests pass.  I still need to fix some javadocs
but otherwise I think this is close...



[jira] Updated: (LUCENE-969) Optimize the core tokenizers/analyzers & deprecate Token.termText

Tim Allison (Jira)
In reply to this post by Tim Allison (Jira)

     [ https://issues.apache.org/jira/browse/LUCENE-969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-969:
--------------------------------------

    Lucene Fields: [New, Patch Available]  (was: [New])


[jira] Commented: (LUCENE-969) Optimize the core tokenizers/analyzers & deprecate Token.termText

Tim Allison (Jira)
In reply to this post by Tim Allison (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12516510 ]

Michael Busch commented on LUCENE-969:
--------------------------------------

Hi Mike,

This is just an idea to keep Token.java simpler, but I haven't really thought about all the consequences. So feel free to tell me that it's a bad idea ;)

Could you add a new class TermBuffer including the char[] array and your resize() logic that would implement CharSequence? Then you could get rid of the duplicate constructors and setters for String and char[], because String also implements CharSequence. And CharSequence has the method charAt(int index), so it should be almost as fast as directly accessing the char array in case the TermBuffer is used. You would need to change the existing constructors and setter to take a CharSequence object instead of a String, but this is not an API change as users can still pass in a String object. And then you would just need to add a new constructor with offset and length and a similar setter. Thoughts?
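
To make the idea concrete, a minimal sketch of such a TermBuffer (purely
hypothetical, not in the attached patch) could look like this:

  // Hypothetical TermBuffer as suggested above: a growable char[] wrapper
  // implementing CharSequence, so Token setters could accept either a String
  // or a TermBuffer through a single CharSequence-typed signature.
  public final class TermBuffer implements CharSequence {

    private char[] chars = new char[16];
    private int length;

    // Direct access for tokenizers that want to write characters in place.
    public char[] chars() { return chars; }

    // Grow the underlying array (doubling) so at least minLength chars fit.
    public void resize(int minLength) {
      if (chars.length < minLength) {
        char[] grown = new char[Math.max(minLength, chars.length * 2)];
        System.arraycopy(chars, 0, grown, 0, length);
        chars = grown;
      }
    }

    public void setLength(int newLength) {
      resize(newLength);
      length = newLength;
    }

    // CharSequence methods, giving near-direct access to the characters.
    public int length() { return length; }
    public char charAt(int index) { return chars[index]; }
    public CharSequence subSequence(int start, int end) {
      return new String(chars, start, end - start);
    }
    public String toString() { return new String(chars, 0, length); }
  }

A Token constructor or setter taking a CharSequence would then accept
either a String or a TermBuffer.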


[jira] Commented: (LUCENE-969) Optimize the core tokenizers/analyzers & deprecate Token.termText

Tim Allison (Jira)
In reply to this post by Tim Allison (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12516517 ]

Yonik Seeley commented on LUCENE-969:
-------------------------------------

> [...] implement CharSequence
I think CharSequence is Java5


[jira] Updated: (LUCENE-969) Optimize the core tokenizers/analyzers & deprecate Token.termText

Tim Allison (Jira)
In reply to this post by Tim Allison (Jira)

     [ https://issues.apache.org/jira/browse/LUCENE-969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steven Rowe updated LUCENE-969:
-------------------------------

    Lucene Fields: [New, Patch Available]  (was: [Patch Available, New])

CharSequence was introduced in 1.4: http://java.sun.com/j2se/1.4.2/docs/api/java/lang/CharSequence.html


[jira] Commented: (LUCENE-969) Optimize the core tokenizers/analyzers & deprecate Token.termText

Tim Allison (Jira)
In reply to this post by Tim Allison (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12516670 ]

Michael McCandless commented on LUCENE-969:
-------------------------------------------

> Could you add a new class TermBuffer including the char[] array and
> your resize() logic that would implement CharSequence? Then you
> could get rid of the duplicate constructors and setters for String
> and char[], because String also implements CharSequence. And
> CharSequence has the method charAt(int index), so it should be
> almost as fast as directly accessing the char array in case the
> TermBuffer is used. You would need to change the existing
> constructors and setter to take a CharSequence object instead of a
> String, but this is not an API change as users can still pass in a
> String object. And then you would just need to add a new constructor
> with offset and length and a similar setter. Thoughts?

If I understand this correctly, consumers of the Token API would need
to separately construct/reuse their own TermBuffer in order to set the
Token to new text?  This could then slow down applications that still
need to make a new Token instance for every term in their documents,
because two class instances would now be created for every token.

Also I don't think this would make the public API simpler?  People
already understand String and char[] as normal ways to represent text
content; if we add our own new class here that's another
Lucene-specific way to represent text content that people will have to
learn.

Internally, Token looks more complex than it will be in the future,
just because we need the initTermBuffer() calls until we can remove
the deprecated attr (String termText) and method (termText()).

I believe having String and char[] variants of text-processing APIs is
fairly common practice and is reasonable.  E.g., the PorterStemmer has
4 "stem" methods (one accepting String and 3 accepting char[], with or
without offset/length).
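
For illustration (hypothetical code, not taken from the patch or from
PorterStemmer), the dual-variant pattern usually looks something like this:

  // Illustration only: a convenience String overload delegating to a
  // fast-path char[] overload, in the spirit of PorterStemmer's stem()
  // variants.  The class and method names are made up for this example.
  public class TermSink {

    private char[] buffer = new char[16];
    private int length;

    // Convenience variant: accepts a String, at the cost of an extra copy.
    public void setTerm(String text) {
      setTerm(text.toCharArray(), 0, text.length());
    }

    // Fast variant: copies directly from a caller-supplied char[] range.
    public void setTerm(char[] chars, int offset, int len) {
      if (buffer.length < len)
        buffer = new char[len];
      System.arraycopy(chars, offset, buffer, 0, len);
      length = len;
    }

    public String toString() {
      return new String(buffer, 0, length);
    }
  }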



[jira] Commented: (LUCENE-969) Optimize the core tokenizers/analyzers & deprecate Token.termText

Tim Allison (Jira)
In reply to this post by Tim Allison (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12516792 ]

Michael Busch commented on LUCENE-969:
--------------------------------------

> This could then slow down applications that still need to make a
> new Token instance for every term in their documents because now
> 2 class instances would be created for every token.

Yes, that's true.  I was thinking that in the new, optimized usage,
where people reuse the same Token and char[] instance, this wouldn't
hurt, since TermBuffer would basically just be a wrapper around a char
array.  But you are right, this would be an overhead in apps that
can't reuse the Tokens.

> if we add our own new class here that's another
> Lucene-specific way to represent text content that people will have to
> learn.

Agree. I was just thinking that the CharSequence approach would reduce the
number of setters and constructors, but you're right, we're going to remove
the ones that take Strings anyway in a future version.

OK, the API of this patch looks good to me! +1
Thanks for your detailed answer!



[jira] Updated: (LUCENE-969) Optimize the core tokenizers/analyzers & deprecate Token.termText

Tim Allison (Jira)
In reply to this post by Tim Allison (Jira)

     [ https://issues.apache.org/jira/browse/LUCENE-969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-969:
--------------------------------------

    Attachment: LUCENE-969.take2.patch

Updated patch based on recent commits; fixed up the javadocs and a few
other small things.  I think this is ready to commit but I'll wait a
few days for more comments...



[jira] Resolved: (LUCENE-969) Optimize the core tokenizers/analyzers & deprecate Token.termText

Tim Allison (Jira)
In reply to this post by Tim Allison (Jira)

     [ https://issues.apache.org/jira/browse/LUCENE-969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless resolved LUCENE-969.
---------------------------------------

       Resolution: Fixed
    Lucene Fields: [New, Patch Available]  (was: [Patch Available, New])
