[jira] Created: (LUCENE-1859) TermAttributeImpl's buffer will never "shrink" if it grows too big

TermAttributeImpl's buffer will never "shrink" if it grows too big
------------------------------------------------------------------

                 Key: LUCENE-1859
                 URL: https://issues.apache.org/jira/browse/LUCENE-1859
             Project: Lucene - Java
          Issue Type: Bug
          Components: Analysis
    Affects Versions: 2.9
            Reporter: Tim Smith


This was previously an issue with Token as well.

If a TermAttributeImpl is populated with a very long token, it will never be able to reclaim that memory.

Obviously, it can be argued that Tokenizers should never emit "large" tokens. However, it seems that TermAttributeImpl should have a reasonable static MAX_BUFFER_SIZE such that if the term buffer grows bigger than this, it shrinks back down to this size once the next token smaller than MAX_BUFFER_SIZE is set.

I don't think I have actually encountered issues with this yet, however it seems like if you have multiple indexing threads, you could end up with a char[Integer.MAX_VALUE] per thread (in the very worst case scenario).

Perhaps growTermBuffer should have the logic to shrink if the buffer is currently larger than MAX_BUFFER_SIZE and it needs less than MAX_BUFFER_SIZE.
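
For illustration, a minimal sketch of what that shrink-on-reuse behavior could look like, assuming the existing termBuffer/termLength fields; the MAX_BUFFER_SIZE constant and the shrink branch are hypothetical, not current Lucene code, and the growth path is simplified:

{code}
// Hypothetical sketch only -- MAX_BUFFER_SIZE and the shrink branch are not existing Lucene code.
private static int MAX_BUFFER_SIZE = 16 * 1024; // illustrative default

private void growTermBuffer(int newSize) {
  if (termBuffer.length < newSize) {
    // existing behavior (simplified): grow, preserving current contents
    termBuffer = java.util.Arrays.copyOf(termBuffer, newSize);
  } else if (termBuffer.length > MAX_BUFFER_SIZE && newSize <= MAX_BUFFER_SIZE) {
    // proposed behavior: once tokens fit in the bounded size again, fall back to it
    termBuffer = java.util.Arrays.copyOf(termBuffer, MAX_BUFFER_SIZE);
  }
}
{code}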



[jira] Commented: (LUCENE-1859) TermAttributeImpl's buffer will never "shrink" if it grows too big


    [ https://issues.apache.org/jira/browse/LUCENE-1859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12747942#action_12747942 ]

Uwe Schindler commented on LUCENE-1859:
---------------------------------------

This also applies to Token. If we fix it here, we should fix Token as well.



[jira] Commented: (LUCENE-1859) TermAttributeImpl's buffer will never "shrink" if it grows too big


    [ https://issues.apache.org/jira/browse/LUCENE-1859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12747943#action_12747943 ]

Tim Smith commented on LUCENE-1859:
-----------------------------------

It seems like the new TokenStream API may aggravate this issue a bit, as it encourages even more reuse of the underlying term char[] buffer (if I'm not mistaken).
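
For illustration (a hedged sketch, not a claim about any particular consumer): with the attribute-based API, the consumer typically holds one TermAttribute instance whose internal char[] is reused for every token and every document, so a single oversized token leaves the buffer large for the lifetime of the stream. The field and variable names below are made up:

{code}
TokenStream stream = analyzer.reusableTokenStream("body", reader);
TermAttribute termAtt = (TermAttribute) stream.addAttribute(TermAttribute.class);
while (stream.incrementToken()) {
  // same TermAttribute (and same underlying char[]) for every token;
  // one huge token grows it, and nothing ever shrinks it again
  int len = termAtt.termLength();
  // ... consume termAtt.termBuffer()[0..len) ...
}
{code}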



[jira] Commented: (LUCENE-1859) TermAttributeImpl's buffer will never "shrink" if it grows too big


    [ https://issues.apache.org/jira/browse/LUCENE-1859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12748064#action_12748064 ]

Marvin Humphrey commented on LUCENE-1859:
-----------------------------------------

The worst-case scenario seems kind of theoretical, since there are so many
reasons that huge tokens are impractical. (Is a priority of "major"
justified?) If there's a significant benefit to shrinking the allocation, it's
minimizing average memory usage over time.  But even that assumes a nearly
pathological distribution in field size -- it would have to be large for early
documents, then consistently small for subsequent documents.  If it's
scattered, you have to plan for worst case RAM usage as an app developer,
anyway.  Which generally means limiting token size.

I assume that, based on this report, TermAttributeImpl never gets reset or
discarded/recreated over the course of an indexing session?

-0 if the reallocation happens no more often than once per document.

-1 if the reallocation has to be performed in an inner loop.



[jira] Commented: (LUCENE-1859) TermAttributeImpl's buffer will never "shrink" if it grows too big


    [ https://issues.apache.org/jira/browse/LUCENE-1859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12748071#action_12748071 ]

Tim Smith commented on LUCENE-1859:
-----------------------------------

bq. The worst-case scenario seems kind of theoretical
100% agree, but even if one extremely large token gets added to the stream (and possibly dropped prior to indexing), the char[] grows without ever shrinking back, so memory usage can grow whenever "bad" content is thrown in (and people have no shortage of bad content).

bq. Is a priority of "major" justified?

Major is just the default priority (feel free to change it).

bq. I assume that, based on this report, TermAttributeImpl never gets reset or discarded/recreated over the course of an indexing session?
Using a reusable TokenStream will never cause the buffer to be nulled (as far as I can tell) for the lifetime of the thread (please correct me if I'm wrong on this).


I would argue for a semi-large value for MAX_BUFFER_SIZE (potentially allowing it to be updated statically), just as a means to bound the maximum memory used here.
Currently, the memory use is bounded only by Integer.MAX_VALUE (which is really big).
If someone feeds in a large text document with no spaces or other delimiting characters, a "non-intelligent" tokenizer would treat it as one big token (and grow the char[] accordingly).
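
In other words (illustrative only; hugeDocumentText is a placeholder for the whole delimiter-free document):

{code}
// One delimiter-free document becomes one token:
termAtt.setTermBuffer(hugeDocumentText);  // internal char[] grows to hugeDocumentText.length()
// every later (tiny) token reuses that same oversized array; nothing ever shrinks it
{code}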



[jira] Commented: (LUCENE-1859) TermAttributeImpl's buffer will never "shrink" if it grows too big


    [ https://issues.apache.org/jira/browse/LUCENE-1859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12748072#action_12748072 ]

Uwe Schindler commented on LUCENE-1859:
---------------------------------------

The problem is that the buffer could really only be shrunk once per document, when TokenStream's reset() is called (which is done before each new document). To achieve this, all TokenStreams would have to notify the TermAttribute in reset() to shrink its buffer, which is impractical.

On the other hand, doing the reallocation in growTermBuffer would mean doing it for every token (what you call the inner loop).

I agree that normally the tokens will not grow very large (if they do, you are doing something wrong during tokenization). Even things like KeywordTokenizer, which creates only one token, have an upper limit on the term size (as far as I know).

I would set this to minor and would not take care of it before 2.9. The problem of possibly large buffers existed even in older versions, with Token as the attribute implementation. It is the same problem as keeping an ArrayList around for a very long time: it also only grows and never automatically shrinks.



[jira] Updated: (LUCENE-1859) TermAttributeImpl's buffer will never "shrink" if it grows too big


     [ https://issues.apache.org/jira/browse/LUCENE-1859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Uwe Schindler updated LUCENE-1859:
----------------------------------

    Priority: Minor  (was: Major)



[jira] Commented: (LUCENE-1859) TermAttributeImpl's buffer will never "shrink" if it grows too big


    [ https://issues.apache.org/jira/browse/LUCENE-1859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12748077#action_12748077 ]

Tim Smith commented on LUCENE-1859:
-----------------------------------

bq. I would set this to minor and would not take care of it before 2.9.

I would agree with this.

I just reported the issue as it has the potential to cause memory problems (and I think something should be done about it, in the long term at least).
Also, the AttributeSource stuff does result in TermAttributeImpl being held onto pretty much forever if using a reusableTokenStream (correct?).
Wasn't a new Token() created by the indexer for each doc/field in 2.4? If so, the oversized buffer would only last for the duration of indexing that one document.
With Attribute caching in the TokenStream, the oversized buffer now lasts for the lifetime of the TokenStream (or its underlying AttributeSource), which could remain until shutdown.



[jira] Commented: (LUCENE-1859) TermAttributeImpl's buffer will never "shrink" if it grows too big


    [ https://issues.apache.org/jira/browse/LUCENE-1859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12748079#action_12748079 ]

Robert Muir commented on LUCENE-1859:
-------------------------------------

bq. If someone feeds in a large text document with no spaces or other delimiting characters, a "non-intelligent" tokenizer would treat it as one big token (and grow the char[] accordingly)

Which non-intelligent tokenizers are you referring to? Nearly all the Lucene tokenizers have 255 as a limit.
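
For comparison, the usual pattern those tokenizers follow is to cap the token length at the source, so the term buffer never needs to grow past the cap. A hedged sketch of that pattern in a custom tokenizer's incrementToken() (the 255 constant mirrors the limit mentioned above; the surrounding Tokenizer class and its termAtt field are assumed, not shown):

{code}
private static final int MAX_TOKEN_LEN = 255;

public boolean incrementToken() throws IOException {
  clearAttributes();
  int c = input.read();
  while (c != -1 && Character.isWhitespace((char) c)) {
    c = input.read();                                  // skip delimiters
  }
  if (c == -1) return false;
  char[] buffer = termAtt.termBuffer();                // write into the attribute's own buffer
  int length = 0;
  while (c != -1 && !Character.isWhitespace((char) c)) {
    if (length < MAX_TOKEN_LEN) {
      if (length == buffer.length) {
        buffer = termAtt.resizeTermBuffer(length + 1); // bounded growth, never past the cap
      }
      buffer[length++] = (char) c;
    }
    // characters beyond the cap are dropped here (a real tokenizer might split instead)
    c = input.read();
  }
  termAtt.setTermLength(length);
  return true;
}
{code}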




[jira] Commented: (LUCENE-1859) TermAttributeImpl's buffer will never "shrink" if it grows too big


    [ https://issues.apache.org/jira/browse/LUCENE-1859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12748082#action_12748082 ]

Tim Smith commented on LUCENE-1859:
-----------------------------------

bq. Which non-intelligent tokenizers are you referring to? Nearly all the Lucene tokenizers have 255 as a limit.

Perhaps this is a non-issue with regard to "Lucene tokenizers".
However, Tokenizers can be implemented by anyone (and I'm not sure there are adequate warnings about keeping tokens short).
It also may not be possible to keep tokens short; I may need to index a rather long "id" string in a TokenStream fashion, which will grow the buffer without it ever being reclaimed.

Perhaps it should be the responsibility of the Tokenizer to shrink the TermBuffer if it adds long tokens (but this will probably require some helper methods).



[jira] Commented: (LUCENE-1859) TermAttributeImpl's buffer will never "shrink" if it grows too big


    [ https://issues.apache.org/jira/browse/LUCENE-1859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12748083#action_12748083 ]

Robert Muir commented on LUCENE-1859:
-------------------------------------

bq. Perhaps it should be the responsibility of the Tokenizer to shrink the TermBuffer if it adds long tokens (but this will probably require some helper methods)

I like this idea better than having any resizing behavior that I might not be able to control.




[jira] Commented: (LUCENE-1859) TermAttributeImpl's buffer will never "shrink" if it grows too big


    [ https://issues.apache.org/jira/browse/LUCENE-1859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12748089#action_12748089 ]

Marvin Humphrey commented on LUCENE-1859:
-----------------------------------------

IMO, the benefit of adding these theoretical helper methods to lower average -- but not peak -- memory usage by non-core Tokenizers which are probably doing something impractical anyway... does not justify the complexity cost.



[jira] Commented: (LUCENE-1859) TermAttributeImpl's buffer will never "shrink" if it grows too big


    [ https://issues.apache.org/jira/browse/LUCENE-1859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12748091#action_12748091 ]

Tim Smith commented on LUCENE-1859:
-----------------------------------

I fail to see the complexity of adding one method to TermAttribute:
{code}
public void shrinkBuffer(int maxSize) {
  if ((maxSize > termLength) && (termBuffer.length > maxSize)) {
    termBuffer = new char[maxSize];
  }
}
{code}

Not having this is fine as long as it's well documented that emitting large tokens can and will result in memory growing uncontrolled (especially when using many indexing threads).
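
If something like this went in, a stream could call it once per document, e.g. from reset(), which keeps the reallocation out of the per-token inner loop objected to above. A hypothetical usage sketch (shrinkBuffer() and MAX_BUFFER_SIZE are the proposed additions, not existing API; termAtt is an assumed field):

{code}
public void reset() throws IOException {
  super.reset();
  termAtt.shrinkBuffer(MAX_BUFFER_SIZE);  // at most one shrink per document, never per token
}
{code}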



[jira] Issue Comment Edited: (LUCENE-1859) TermAttributeImpl's buffer will never "shrink" if it grows too big


    [ https://issues.apache.org/jira/browse/LUCENE-1859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12748091#action_12748091 ]

Tim Smith edited comment on LUCENE-1859 at 8/26/09 12:18 PM:
-------------------------------------------------------------

I fail to see the complexity of adding one method to TermAttribute:
{code}
public void shrinkBuffer(int maxSize) {
  if ((maxSize > termLength) && (termBuffer.length > maxSize)) {
    termBuffer = java.util.Arrays.copyOf(termBuffer, maxSize);
  }
}
{code}

Not having this is fine as long as it's well documented that emitting large tokens can and will result in memory growing uncontrolled (especially when using many indexing threads).

      was (Author: tsmith):
    I fail to see the complexity of adding one method to TermAttribute:
{code}
public void shrinkBuffer(int maxSize) {
  if ((maxSize > termLength) && (termBuffer.length > maxSize)) {
    termBuffer = new char[maxSize];
  }
}
{code}

Not having this is fine as long as it's well documented that emitting large tokens can and will result in memory growing uncontrolled (especially when using many indexing threads).
 



[jira] Commented: (LUCENE-1859) TermAttributeImpl's buffer will never "shrink" if it grows too big


    [ https://issues.apache.org/jira/browse/LUCENE-1859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12748102#action_12748102 ]

Marvin Humphrey commented on LUCENE-1859:
-----------------------------------------

> I fail to see the complexity of adding one method to TermAttribute:

Death by a thousand cuts.  This is one cut.

I wouldn't even add the note to the documentation.  If you emit large tokens,
you have to plan for obscene peak memory usage anyway, and if you're not
prepared for that, you deserve what you get.  Keeping the average down
doesn't help that.

The only reason to do this is to keep average memory usage down for
the hell of it, and if it goes in, it should be an implementation detail.



[jira] Commented: (LUCENE-1859) TermAttributeImpl's buffer will never "shrink" if it grows too big


    [ https://issues.apache.org/jira/browse/LUCENE-1859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12748103#action_12748103 ]

Tim Smith commented on LUCENE-1859:
-----------------------------------

bq. Death by a thousand cuts. This is one cut.

By this logic, nothing new can ever be added.
The thing that brought this to my attention was the new TokenStream API (one rather big cut, but I like the new API, so I'm happy with the blood loss (it makes me dizzy and happy)).
The new TokenStream API holds onto these char[] much longer (if not forever), so memory grows unbounded unless there is some facility to truncate/null out the char[].

bq. I wouldn't even add the note to the documentation.

I don't believe there is ever any valid argument against adding documentation.
If someone can shoot themselves in the foot with the gun you gave them, at least tell them not to point the gun at their foot with the safety off.

bq. The only reason to do this is to keep average memory usage down for the hell of it.

Keeping average memory usage down prevents those wonderful OutOfMemory exceptions (which are difficult at best to recover from).



[jira] Commented: (LUCENE-1859) TermAttributeImpl's buffer will never "shrink" if it grows too big


    [ https://issues.apache.org/jira/browse/LUCENE-1859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12748109#action_12748109 ]

Marvin Humphrey commented on LUCENE-1859:
-----------------------------------------

> I don't believe there is ever any valid argument against adding
> documentation.

The more that documentation grows, the harder it is to absorb.  The more
bells and whistles on an API, the harder it is to grok and to use effectively.
The more a code base bloats, the harder it is to maintain or to evolve.

> keeping average memory usage down prevents those wonderful OutOfMemory
> Exceptions

No, it won't.  If someone is emitting large tokens regularly, it is likely
that several threads will require large RAM footprints simultaneously, and an
OOM will occur.  That would be the common case.

If someone is emitting large tokens periodically, well, this doesn't prevent
the OOM, it just makes it less likely.  That's not worthless, but it's not
something anybody should count on when assessing required RAM usage.

Keeping average memory usage down is good for the system at large.  If this is
implemented, that should be the justification.




[jira] Commented: (LUCENE-1859) TermAttributeImpl's buffer will never "shrink" if it grows too big


    [ https://issues.apache.org/jira/browse/LUCENE-1859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12748122#action_12748122 ]

Tim Smith commented on LUCENE-1859:
-----------------------------------

On documentation:
Any warnings/precautions should always be called out (with a link to an external reference (wiki/etc.) for in-depth details).
In-depth descriptions of the details can be pushed off to wiki pages or external references, as long as a link is provided for the curious, but I would still argue that they should exist.

bq. this doesn't prevent the OOM, it just makes it less likely

All you can ever do for OOM issues is make them less likely (short of just fixing a bug that holds onto memory like mad).
If you accept arbitrary content, there will always be a possibility of the content forcing OOM issues. In general, everything possible should be done to reduce the likelihood of such OOM issues (IMO).



[jira] Commented: (LUCENE-1859) TermAttributeImpl's buffer will never "shrink" if it grows too big


    [ https://issues.apache.org/jira/browse/LUCENE-1859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12786680#action_12786680 ]

Mark Miller commented on LUCENE-1859:
-------------------------------------

Without a proposed patch from someone, I'm tempted to close this issue...
