[jira] Created: (LUCENE-1052) Add an "termInfosIndexDivisor" to IndexReader

classic Classic list List threaded Threaded
22 messages Options
12
Reply | Threaded
Open this post in threaded view
|

[jira] Created: (LUCENE-1052) Add an "termInfosIndexDivisor" to IndexReader

Shalin Shekhar Mangar (Jira)
Add an "termInfosIndexDivisor" to IndexReader
---------------------------------------------

                 Key: LUCENE-1052
                 URL: https://issues.apache.org/jira/browse/LUCENE-1052
             Project: Lucene - Java
          Issue Type: Improvement
          Components: Index
    Affects Versions: 2.2
            Reporter: Michael McCandless
            Assignee: Michael McCandless
            Priority: Minor
             Fix For: 2.3


The termIndexInterval, set during indexing time, let's you tradeoff
how much RAM is used by a reader to load the indexed terms vs cost of
seeking to the specific term you want to load.

But the downside is you must set it at indexing time.

This issue adds an indexDivisor to TermInfosReader so that on opening
a reader you could further sub-sample the the termIndexInterval to use
less RAM.  EG a setting of 2 means every 2 * termIndexInterval is
loaded into RAM.

This is particularly useful if your index has a great many terms (eg
you accidentally indexed binary terms).

Spinoff from this thread:

  http://www.gossamer-threads.com/lists/lucene/java-dev/54371



--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Updated: (LUCENE-1052) Add an "termInfosIndexDivisor" to IndexReader

Shalin Shekhar Mangar (Jira)

     [ https://issues.apache.org/jira/browse/LUCENE-1052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-1052:
---------------------------------------

    Attachment: LUCENE-1052.patch

Attached patch derived from Doug's & Chuck's patches from the thread.  All tests pass.

> Add an "termInfosIndexDivisor" to IndexReader
> ---------------------------------------------
>
>                 Key: LUCENE-1052
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1052
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.2
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.3
>
>         Attachments: LUCENE-1052.patch
>
>
> The termIndexInterval, set during indexing time, let's you tradeoff
> how much RAM is used by a reader to load the indexed terms vs cost of
> seeking to the specific term you want to load.
> But the downside is you must set it at indexing time.
> This issue adds an indexDivisor to TermInfosReader so that on opening
> a reader you could further sub-sample the the termIndexInterval to use
> less RAM.  EG a setting of 2 means every 2 * termIndexInterval is
> loaded into RAM.
> This is particularly useful if your index has a great many terms (eg
> you accidentally indexed binary terms).
> Spinoff from this thread:
>   http://www.gossamer-threads.com/lists/lucene/java-dev/54371

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Resolved: (LUCENE-1052) Add an "termInfosIndexDivisor" to IndexReader

Shalin Shekhar Mangar (Jira)
In reply to this post by Shalin Shekhar Mangar (Jira)

     [ https://issues.apache.org/jira/browse/LUCENE-1052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless resolved LUCENE-1052.
----------------------------------------

    Resolution: Fixed

I just committed this.  Thanks Chuck & Doug!

> Add an "termInfosIndexDivisor" to IndexReader
> ---------------------------------------------
>
>                 Key: LUCENE-1052
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1052
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.2
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.3
>
>         Attachments: LUCENE-1052.patch
>
>
> The termIndexInterval, set during indexing time, let's you tradeoff
> how much RAM is used by a reader to load the indexed terms vs cost of
> seeking to the specific term you want to load.
> But the downside is you must set it at indexing time.
> This issue adds an indexDivisor to TermInfosReader so that on opening
> a reader you could further sub-sample the the termIndexInterval to use
> less RAM.  EG a setting of 2 means every 2 * termIndexInterval is
> loaded into RAM.
> This is particularly useful if your index has a great many terms (eg
> you accidentally indexed binary terms).
> Spinoff from this thread:
>   http://www.gossamer-threads.com/lists/lucene/java-dev/54371

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (LUCENE-1052) Add an "termInfosIndexDivisor" to IndexReader

Shalin Shekhar Mangar (Jira)
In reply to this post by Shalin Shekhar Mangar (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-1052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12543306 ]

Chuck Williams commented on LUCENE-1052:
----------------------------------------

Michael, thanks for creating an excellent production version of this idea and committing it!

I'd like to take it one step further to eliminate the need to call IndexReader.setTermInfosIndexDivisor up front.  The idea is to instead specify a maximum number of index terms to cache in memory.  This could then allow TermInfosReader to set indexDivisor automatically to the smallest value that yields a cache size less than the maximum.

This seems a simple and extremely useful extension.  Unfortunately, I'm still on an older Lucene, but will post my update.  If you like this idea, you may want to just add the feature directly to your implementation in the trunk.


> Add an "termInfosIndexDivisor" to IndexReader
> ---------------------------------------------
>
>                 Key: LUCENE-1052
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1052
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.2
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.3
>
>         Attachments: LUCENE-1052.patch
>
>
> The termIndexInterval, set during indexing time, let's you tradeoff
> how much RAM is used by a reader to load the indexed terms vs cost of
> seeking to the specific term you want to load.
> But the downside is you must set it at indexing time.
> This issue adds an indexDivisor to TermInfosReader so that on opening
> a reader you could further sub-sample the the termIndexInterval to use
> less RAM.  EG a setting of 2 means every 2 * termIndexInterval is
> loaded into RAM.
> This is particularly useful if your index has a great many terms (eg
> you accidentally indexed binary terms).
> Spinoff from this thread:
>   http://www.gossamer-threads.com/lists/lucene/java-dev/54371

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (LUCENE-1052) Add an "termInfosIndexDivisor" to IndexReader

Shalin Shekhar Mangar (Jira)
In reply to this post by Shalin Shekhar Mangar (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-1052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12543366 ]

Michael McCandless commented on LUCENE-1052:
--------------------------------------------

{quote}
I'd like to take it one step further to eliminate the need to call IndexReader.setTermInfosIndexDivisor up front.  The idea is to instead specify a maximum number of index terms to cache in memory.  This could then allow TermInfosReader to set indexDivisor automatically to the smallest value that yields a cache size less than the maximum.

This seems a simple and extremely useful extension.  Unfortunately, I'm still on an older Lucene, but will post my update.  If you like this idea, you may want to just add the feature directly to your implementation in the trunk.
{quote}

Good idea!  This allows you to simply outright cap the memory usage,
rather than having memory usage be a fraction of the number of terms
and thus grow as your term count grows.

So you would propose replacing IndexReader.setTermInfosIndexDivisor
with IndexReader.setTermInfosIndexMaxCount or some such?  Ie you would
still need to call this on creating your reader...

> Add an "termInfosIndexDivisor" to IndexReader
> ---------------------------------------------
>
>                 Key: LUCENE-1052
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1052
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.2
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.3
>
>         Attachments: LUCENE-1052.patch
>
>
> The termIndexInterval, set during indexing time, let's you tradeoff
> how much RAM is used by a reader to load the indexed terms vs cost of
> seeking to the specific term you want to load.
> But the downside is you must set it at indexing time.
> This issue adds an indexDivisor to TermInfosReader so that on opening
> a reader you could further sub-sample the the termIndexInterval to use
> less RAM.  EG a setting of 2 means every 2 * termIndexInterval is
> loaded into RAM.
> This is particularly useful if your index has a great many terms (eg
> you accidentally indexed binary terms).
> Spinoff from this thread:
>   http://www.gossamer-threads.com/lists/lucene/java-dev/54371

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (LUCENE-1052) Add an "termInfosIndexDivisor" to IndexReader

Shalin Shekhar Mangar (Jira)
In reply to this post by Shalin Shekhar Mangar (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-1052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12543383 ]

Chuck Williams commented on LUCENE-1052:
----------------------------------------

I believe this needs to be a formula as a reasonable bound on the number of terms is in general a function of the number of documents in the segment and the nature of the index (e.g., types of fields).  A common thing to do would be to enforce that RAM usage for cached terms grows no faster than logarithmically in the number of documents.  The specific formula that is appropriate will depend on the index, i.e. on the application.  It might be of the form:  c*ln(numdocs+k), wnere c and k are constants dependent on the index.

One consequence of this approach, or any approach along these lines, is that the indexDivisor will vary across the segments, both in a single index and across indexes.  It seems to me from the code that this should work fine.

This leaves the issue of how to best specify an arbitrary formula.  This requires a method to compute the max cached terms allowed for a segment based on the number of docs in the segment, the number of terms in the segment's index, and possibly other factors.  The most direct way to do this is to introduce an interface, e.g. TermInfosConfigurer, to define the method signature, and to add setTermInfosConfigurer as an alternative to setTermInfosIndexDivisor.  It would need to be in all the same places.

A more general approach would be to introduce an IndexConfigurer class which over time could hold additional methods like this.  It could even replace the current setters on IndexReader (as well as IndexWriter, etc.) with a more general mechanism that would allow dynamic parameters used to configure any classes in the index structure.  Each constructor would be passed the IndexConfigurer and call getters or other methods on it to obtain its config.  The methods could provide constant values or dynamic formulas.

I'm going to implement the straightforward solution at the moment in our older version of Lucene, then will sync up to whatever you guys decide is best for the trunk later.
 

> Add an "termInfosIndexDivisor" to IndexReader
> ---------------------------------------------
>
>                 Key: LUCENE-1052
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1052
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.2
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.3
>
>         Attachments: LUCENE-1052.patch
>
>
> The termIndexInterval, set during indexing time, let's you tradeoff
> how much RAM is used by a reader to load the indexed terms vs cost of
> seeking to the specific term you want to load.
> But the downside is you must set it at indexing time.
> This issue adds an indexDivisor to TermInfosReader so that on opening
> a reader you could further sub-sample the the termIndexInterval to use
> less RAM.  EG a setting of 2 means every 2 * termIndexInterval is
> loaded into RAM.
> This is particularly useful if your index has a great many terms (eg
> you accidentally indexed binary terms).
> Spinoff from this thread:
>   http://www.gossamer-threads.com/lists/lucene/java-dev/54371

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Updated: (LUCENE-1052) Add an "termInfosIndexDivisor" to IndexReader

Shalin Shekhar Mangar (Jira)
In reply to this post by Shalin Shekhar Mangar (Jira)

     [ https://issues.apache.org/jira/browse/LUCENE-1052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chuck Williams updated LUCENE-1052:
-----------------------------------

    Attachment: termInfosConfigurer.patch

termInfosConfigurer.patch extends the termInfoIndexDivisor mechanism to allow dynamic management of this parameter.  A new new interface, TermInfosConfigurer, allows specification of a method, getMaxTermsCached(), that bounds the size of the in-memory term infos as a function of the segment name, segment numDocs, and total segment terms.  This bound is then used to automatically set termInfosIndexDivisor whenever a TermInfosReader reads the term index.  This mechanism provides a simple way to ensure that the total amount of memory consumed by the term cache is bounded by, say, O(log(numDocs)).

All Lucene core tests pass.  I'm using another version of this same patch in Lucene 2.1+ in an application that has indexes with binary term pollution, using the TermInfosConfigurer to dynamically bound the term cache in the polluted segments.

Tried to test contrib, but it appears gdata-server needs external libraries I don't have to compile.

Michael, this patch applies cleanly to today's Lucene trunk.  I'd appreciate if you could verify one thing.  Lucene 2.3 has the incremental reopen mechanism (can't wait to get that!), new since Lucene 2.1.  It appears that reopen of a segment reuses the same TermInfosReader and thus does not need to configure a new one.  I've implemented that part of the patch with this assumption.


> Add an "termInfosIndexDivisor" to IndexReader
> ---------------------------------------------
>
>                 Key: LUCENE-1052
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1052
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.2
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.3
>
>         Attachments: LUCENE-1052.patch, termInfosConfigurer.patch
>
>
> The termIndexInterval, set during indexing time, let's you tradeoff
> how much RAM is used by a reader to load the indexed terms vs cost of
> seeking to the specific term you want to load.
> But the downside is you must set it at indexing time.
> This issue adds an indexDivisor to TermInfosReader so that on opening
> a reader you could further sub-sample the the termIndexInterval to use
> less RAM.  EG a setting of 2 means every 2 * termIndexInterval is
> loaded into RAM.
> This is particularly useful if your index has a great many terms (eg
> you accidentally indexed binary terms).
> Spinoff from this thread:
>   http://www.gossamer-threads.com/lists/lucene/java-dev/54371

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (LUCENE-1052) Add an "termInfosIndexDivisor" to IndexReader

Shalin Shekhar Mangar (Jira)
In reply to this post by Shalin Shekhar Mangar (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-1052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12543854 ]

Michael McCandless commented on LUCENE-1052:
--------------------------------------------

Thanks Chuck for such a wonderfully thorough patch & unit tests, and
for adding the methods to ParallelReader, too (I had missed it the
first time around)!  The patch looks good.

Should we use an abstract base class instead of interface for
TermInfosConfigurer so we can add additional methods in the future
without breaking back compatibility?

Also I think we should mark this API as advanced, somewhat
experimental and subject to change?


> Add an "termInfosIndexDivisor" to IndexReader
> ---------------------------------------------
>
>                 Key: LUCENE-1052
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1052
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.2
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.3
>
>         Attachments: LUCENE-1052.patch, termInfosConfigurer.patch
>
>
> The termIndexInterval, set during indexing time, let's you tradeoff
> how much RAM is used by a reader to load the indexed terms vs cost of
> seeking to the specific term you want to load.
> But the downside is you must set it at indexing time.
> This issue adds an indexDivisor to TermInfosReader so that on opening
> a reader you could further sub-sample the the termIndexInterval to use
> less RAM.  EG a setting of 2 means every 2 * termIndexInterval is
> loaded into RAM.
> This is particularly useful if your index has a great many terms (eg
> you accidentally indexed binary terms).
> Spinoff from this thread:
>   http://www.gossamer-threads.com/lists/lucene/java-dev/54371

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Reopened: (LUCENE-1052) Add an "termInfosIndexDivisor" to IndexReader

Shalin Shekhar Mangar (Jira)
In reply to this post by Shalin Shekhar Mangar (Jira)

     [ https://issues.apache.org/jira/browse/LUCENE-1052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless reopened LUCENE-1052:
----------------------------------------


> Add an "termInfosIndexDivisor" to IndexReader
> ---------------------------------------------
>
>                 Key: LUCENE-1052
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1052
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.2
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.3
>
>         Attachments: LUCENE-1052.patch, termInfosConfigurer.patch
>
>
> The termIndexInterval, set during indexing time, let's you tradeoff
> how much RAM is used by a reader to load the indexed terms vs cost of
> seeking to the specific term you want to load.
> But the downside is you must set it at indexing time.
> This issue adds an indexDivisor to TermInfosReader so that on opening
> a reader you could further sub-sample the the termIndexInterval to use
> less RAM.  EG a setting of 2 means every 2 * termIndexInterval is
> loaded into RAM.
> This is particularly useful if your index has a great many terms (eg
> you accidentally indexed binary terms).
> Spinoff from this thread:
>   http://www.gossamer-threads.com/lists/lucene/java-dev/54371

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (LUCENE-1052) Add an "termInfosIndexDivisor" to IndexReader

Shalin Shekhar Mangar (Jira)
In reply to this post by Shalin Shekhar Mangar (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-1052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12544032 ]

Doug Cutting commented on LUCENE-1052:
--------------------------------------

I think we should be cautious about adding a new public interface or abstract class to support just this feature.  If we want to add a generic configuration API for Lucene, then I'd prefer something fully general, like what I proposed on the mailing list, not something specific to configuring TermInfosReader.  Otherwise we'll keep adding new configuration interfaces and adding more parameters to IndexReader constructors each time we wish to make some obscure feature configurable.

http://www.gossamer-threads.com/lists/lucene/java-dev/54421#54421

In the model proposed there, adding a new configuration parameter involves just adding a new static method to the public class that implements a new configurable feature.


> Add an "termInfosIndexDivisor" to IndexReader
> ---------------------------------------------
>
>                 Key: LUCENE-1052
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1052
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.2
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.3
>
>         Attachments: LUCENE-1052.patch, termInfosConfigurer.patch
>
>
> The termIndexInterval, set during indexing time, let's you tradeoff
> how much RAM is used by a reader to load the indexed terms vs cost of
> seeking to the specific term you want to load.
> But the downside is you must set it at indexing time.
> This issue adds an indexDivisor to TermInfosReader so that on opening
> a reader you could further sub-sample the the termIndexInterval to use
> less RAM.  EG a setting of 2 means every 2 * termIndexInterval is
> loaded into RAM.
> This is particularly useful if your index has a great many terms (eg
> you accidentally indexed binary terms).
> Spinoff from this thread:
>   http://www.gossamer-threads.com/lists/lucene/java-dev/54371

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (LUCENE-1052) Add an "termInfosIndexDivisor" to IndexReader

Shalin Shekhar Mangar (Jira)
In reply to this post by Shalin Shekhar Mangar (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-1052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12544055 ]

Chuck Williams commented on LUCENE-1052:
----------------------------------------

I agree a general configuration system would be much better.  Doug. we use a similar method to what you described in our application.

TermInfosConfigurer is slightly different though since the desired config is a method that implements a formula, rather than just a value.  This could still be done more generally by allowing methods as well as properties or setters on a higher level configuration object.

I didn't want to take on the broader issue just for this feature.

Michael I agree with both of your points.

I'd be happy to clean up this patch if you guys provide some guidance for what would make it acceptable to commit.


> Add an "termInfosIndexDivisor" to IndexReader
> ---------------------------------------------
>
>                 Key: LUCENE-1052
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1052
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.2
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.3
>
>         Attachments: LUCENE-1052.patch, termInfosConfigurer.patch
>
>
> The termIndexInterval, set during indexing time, let's you tradeoff
> how much RAM is used by a reader to load the indexed terms vs cost of
> seeking to the specific term you want to load.
> But the downside is you must set it at indexing time.
> This issue adds an indexDivisor to TermInfosReader so that on opening
> a reader you could further sub-sample the the termIndexInterval to use
> less RAM.  EG a setting of 2 means every 2 * termIndexInterval is
> loaded into RAM.
> This is particularly useful if your index has a great many terms (eg
> you accidentally indexed binary terms).
> Spinoff from this thread:
>   http://www.gossamer-threads.com/lists/lucene/java-dev/54371

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (LUCENE-1052) Add an "termInfosIndexDivisor" to IndexReader

Shalin Shekhar Mangar (Jira)
In reply to this post by Shalin Shekhar Mangar (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-1052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12544076 ]

Michael McCandless commented on LUCENE-1052:
--------------------------------------------


Maybe, instead, we should simply make it "easy" to subclass
TermInfosReader whenever a SegmentReader wants to instantiate it?

Ie, the formula is such an advanced use case that it seems appropriate
to subclass instead of trying to break it out into a special
interface/abstract class?

Of course, we need to know this class at SegmentReader construction
time, so I think to specify it we should in fact take Doug's suggested
approach using generic properties.

The challenge with Lucene (and Hadoop) is how can you reach deep down
into a complex IndexReader.open static method call to change various
details of the embedded *Readers while they are being constructed,
and, after they are constructed... I agree it is messy now that we
must propogate the setTermInfosIndexInterval method up the *Reader
hierarchy when not all Readers would even use a TermInfosReader.

So ... maybe we 1) implement generic Lucene properties w/ static
classes/methods to set/get these properties, then 2) remove
set/getTermInfosIndexInterval from *Reader and make a generic property
for it instead, and 3) add another property that allows you to specify
the Class (or String name) of that is your TermInfosReader subclass
(and make it non-final)?


> Add an "termInfosIndexDivisor" to IndexReader
> ---------------------------------------------
>
>                 Key: LUCENE-1052
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1052
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.2
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.3
>
>         Attachments: LUCENE-1052.patch, termInfosConfigurer.patch
>
>
> The termIndexInterval, set during indexing time, let's you tradeoff
> how much RAM is used by a reader to load the indexed terms vs cost of
> seeking to the specific term you want to load.
> But the downside is you must set it at indexing time.
> This issue adds an indexDivisor to TermInfosReader so that on opening
> a reader you could further sub-sample the the termIndexInterval to use
> less RAM.  EG a setting of 2 means every 2 * termIndexInterval is
> loaded into RAM.
> This is particularly useful if your index has a great many terms (eg
> you accidentally indexed binary terms).
> Spinoff from this thread:
>   http://www.gossamer-threads.com/lists/lucene/java-dev/54371

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (LUCENE-1052) Add an "termInfosIndexDivisor" to IndexReader

Shalin Shekhar Mangar (Jira)
In reply to this post by Shalin Shekhar Mangar (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-1052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12544125 ]

Doug Cutting commented on LUCENE-1052:
--------------------------------------

What class would we put TermInfosReader-specific setters & getters on, since that class is not public?  Do we make TermInfosReader public or leave it package-private?  My intuition is to leave it package-private for now, in order to retain freedom to re-structure w/o breaking applications, and because making it public would drag a lot of other stuff into the public.  We could consider making SegmentReader public, so that there's a public class that corresponds to the concrete index implementation, but that'd also drag more stuff public (like DirectoryIndexReader).

I'm also not yet convinced that it is critical to support arbitrary formulae for this feature.  Sure, it would be nice, but it has costs, like increasing public APIs that must be supported.  Folks have done fine without this feature for many years.  Adding a simple integer divisor is a sufficient initial step here.

So, even if we add a configuration system, I think the setter methods could still end up on IndexReader.  The difference is primarily whether the methods are:

public void setTermIndexInterval(int interval);
public void setTermIndexDivisor(int divisor);

or

public static void setTermIndexInterval(LuceneProps props, int interval);
public static void setTermIndexDivisor(LuceneProps props, int divisor);

With the latter just a façade that uses package-private stuff.  I think the latter style will be handy as we start adding parameters to, e.g., Query classes.  In those cases we'll probably want façade's too, since a Query setter will probably really tweak something for a private Scorer class.  In the case of indexes, however, we don't have a public, concrete class.

Another option is to make a public class whose purpose is just to only such parameters, something like SegmentIndexParameters.  That'd be my first choice and was the direction I pointed in my initial proposal, but with considerably less explanation.

> Add an "termInfosIndexDivisor" to IndexReader
> ---------------------------------------------
>
>                 Key: LUCENE-1052
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1052
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.2
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.3
>
>         Attachments: LUCENE-1052.patch, termInfosConfigurer.patch
>
>
> The termIndexInterval, set during indexing time, let's you tradeoff
> how much RAM is used by a reader to load the indexed terms vs cost of
> seeking to the specific term you want to load.
> But the downside is you must set it at indexing time.
> This issue adds an indexDivisor to TermInfosReader so that on opening
> a reader you could further sub-sample the the termIndexInterval to use
> less RAM.  EG a setting of 2 means every 2 * termIndexInterval is
> loaded into RAM.
> This is particularly useful if your index has a great many terms (eg
> you accidentally indexed binary terms).
> Spinoff from this thread:
>   http://www.gossamer-threads.com/lists/lucene/java-dev/54371

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (LUCENE-1052) Add an "termInfosIndexDivisor" to IndexReader

Shalin Shekhar Mangar (Jira)
In reply to this post by Shalin Shekhar Mangar (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-1052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12544136 ]

Chuck Williams commented on LUCENE-1052:
----------------------------------------

I can report that in our application having a formula is critical.  We have no control over the content our users index, nor in fact do they.  These are arbitrary documents.  We find a surprising number of them contain embedded encoded binary data.  When those are indexed, lucene's memory consumption skyrockets, either bringing the whole app down with an OOM or slowing performance to a crawl due to excessive GC's reclaiming a tiny remaining working memory space.

Our users won't accept a solution like, wait until the problem occurs and then increment your termIndexDivisor.  They expect our app to manage this automatically.

I agree that making TermInfosReader, SegmentReader, etc. public classes is not a great solution  The current patch does not do that.  It simply adds a configurable class that can be used to provide formula parameters as opposed to just value parameters.  At least for us, this special case is sufficiently important to outweigh any considerations of the complexity of an additional class.

A single configuration class could be used at the IndexReader level that provides for both static and dynamically-varying properties through getters, some of which take parameters.

Here is another possible solution.  My current thought is that the bound should always be a multiple of sqrt(numDocs).  E.g., see Heap's Law here:  http://nlp.stanford.edu/IR-book/html/htmledition/heaps-law-estimating-the-number-of-terms-1.html

I'm currently using this formula in my TermInfosConfigurer:

            int bound = (int) (1+TERM_BOUNDING_MULTIPLIER*Math.sqrt(1+segmentNumDocs)/TERM_INDEX_INTERVAL);

This has Heap's law as foundation.  I provide TERM_BOUNDING_MULTIPLIER as the config parameter, with 0 meaning don't do this.  I also provide a TERM_INDEX_DIVISOR_OVERRIDE that overrides the dynamic bounding with a manually specified constant amount.

If that approach would be acceptable to lucene in general, then we just need two static parameters.  However, I don't have enough experience with how well this formula works in our user base yet to know whether or not we'll tune it further.




> Add an "termInfosIndexDivisor" to IndexReader
> ---------------------------------------------
>
>                 Key: LUCENE-1052
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1052
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.2
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.3
>
>         Attachments: LUCENE-1052.patch, termInfosConfigurer.patch
>
>
> The termIndexInterval, set during indexing time, let's you tradeoff
> how much RAM is used by a reader to load the indexed terms vs cost of
> seeking to the specific term you want to load.
> But the downside is you must set it at indexing time.
> This issue adds an indexDivisor to TermInfosReader so that on opening
> a reader you could further sub-sample the the termIndexInterval to use
> less RAM.  EG a setting of 2 means every 2 * termIndexInterval is
> loaded into RAM.
> This is particularly useful if your index has a great many terms (eg
> you accidentally indexed binary terms).
> Spinoff from this thread:
>   http://www.gossamer-threads.com/lists/lucene/java-dev/54371

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (LUCENE-1052) Add an "termInfosIndexDivisor" to IndexReader

Shalin Shekhar Mangar (Jira)
In reply to this post by Shalin Shekhar Mangar (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-1052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12544540 ]

Doug Cutting commented on LUCENE-1052:
--------------------------------------

> We find a surprising number of them contain embedded encoded binary data.

It sounds like a detector for this would be very useful.  It would, e.g., substantially speed updates of such indexes, and not slow searches of them like a divisor does.  At Excite we evolved effective heuristics for wordness to keep our dictionaries from exploding.  Perhaps you should look into that?  Also, it sounds like you might increase your default term index interval, since it sounds like you have big indexes with noisy data.

> Our users won't accept a solution like, wait until the problem occurs and then increment your termIndexDivisor. They expect our app to manage this automatically.

You could look at the size of the .tii files before you open an index, and, if they're too large, set the divisor automatically as you see fit.

> int bound = (int) (1+TERM_BOUNDING_MULTIPLIER*Math.sqrt(1+segmentNumDocs)/TERM_INDEX_INTERVAL);

This sounds like a fine approach.

> Add an "termInfosIndexDivisor" to IndexReader
> ---------------------------------------------
>
>                 Key: LUCENE-1052
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1052
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.2
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.3
>
>         Attachments: LUCENE-1052.patch, termInfosConfigurer.patch
>
>
> The termIndexInterval, set during indexing time, let's you tradeoff
> how much RAM is used by a reader to load the indexed terms vs cost of
> seeking to the specific term you want to load.
> But the downside is you must set it at indexing time.
> This issue adds an indexDivisor to TermInfosReader so that on opening
> a reader you could further sub-sample the the termIndexInterval to use
> less RAM.  EG a setting of 2 means every 2 * termIndexInterval is
> loaded into RAM.
> This is particularly useful if your index has a great many terms (eg
> you accidentally indexed binary terms).
> Spinoff from this thread:
>   http://www.gossamer-threads.com/lists/lucene/java-dev/54371

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Updated: (LUCENE-1052) Add an "termInfosIndexDivisor" to IndexReader

Shalin Shekhar Mangar (Jira)
In reply to this post by Shalin Shekhar Mangar (Jira)

     [ https://issues.apache.org/jira/browse/LUCENE-1052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-1052:
---------------------------------------

    Attachment: LUCENE-1052.patch


{quote}
What class would we put TermInfosReader-specific setters & getters on, since that class is not public? Do we make TermInfosReader public or leave it package-private? My intuition is to leave it package-private for now, in order to retain freedom to re-structure w/o breaking applications, and because making it public would drag a lot of other stuff into the public. We could consider making SegmentReader public, so that there's a public class that corresponds to the concrete index implementation, but that'd also drag more stuff public (like DirectoryIndexReader).
{quote}

Agreed: package private.  People who do advanced things should be fine
with that.

{quote}
Another option is to make a public class whose purpose is just to only such parameters, something like SegmentIndexParameters. That'd be my first choice and was the direction I pointed in my initial proposal, but with considerably less explanation.
{quote}

So I took a closer look at making generic properties by coding up
Doug's approach (attached patch).

I replaced *#setTermInfosIndexDivisor with a separate
SegmentIndexProperties class that has static methods to set/get
termIndexDivisor, and added/threaded down ctors that allow you to pass
a LuceneProperties when opening an IndexReader.

I came up with a number of questions along the way:

  * Who should know/store the default value for a given property?
    TermIndexDivisor defaults to 1.
    .
    Is this stored in that static facade class (a)?  Or, passed in as
    defaultValue arg by TermInfosReader when it looks up the property
    (b)?  Or, do we make a base DefaultLuceneProperties that has the
    default set for all properties (c)?
    .
    (b) is nice because I feel like the default should live in the
    class that uses it, but then that's bad because the outside world
    can't see the default value.

  * Every property must clearly define when it will be looked at.  So
    for termIndexDivisor in the javadoc we would say "it's used only
    when the termInfos index is loaded (once)".  This means changing
    that property after termInfos index is loaded has no effect.

  * We should presumably create a default LuceneProperties to save
    checking for props != null everywhere when user didn't make their
    own props.  This favors option (c) in the first bullet above.

  * Presumably once you've created a class, passing in your props
    instance, you cannot later install a new props instance.  The
    LuceneProperties class is "write once".

  * We would need guidelines for when something should be an arg to
    the ctor, setter/getter on the class.  I think there are shades of
    gray here.

After this, I suddenly realized if we indeed make termIndexDivisor a
generic property, it's actually hard for Chuck to then do his formula
by looking at the size of the .tii file: when the index has multiple
segments, you would presumably need to set different indexDivisors for
each segment, but the properties only lets you set one global value.

You could carefully set the property, then somehow get ahold of just
that one SegmentReader and have it load the term index, then move onto
the next one, etc, but that's quite messy.

Note that this limitation is also the case with the top-level
setTermInfosIndexDivisor as it now stands in trunk -- it's not easy to
set different index divisors per segment.

It almost feels like we should have "hooks" that are invoked at
certain times, like when we are about to load the term infos index,
that give the application a chance to change something...




> Add an "termInfosIndexDivisor" to IndexReader
> ---------------------------------------------
>
>                 Key: LUCENE-1052
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1052
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.2
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.3
>
>         Attachments: LUCENE-1052.patch, LUCENE-1052.patch, termInfosConfigurer.patch
>
>
> The termIndexInterval, set during indexing time, let's you tradeoff
> how much RAM is used by a reader to load the indexed terms vs cost of
> seeking to the specific term you want to load.
> But the downside is you must set it at indexing time.
> This issue adds an indexDivisor to TermInfosReader so that on opening
> a reader you could further sub-sample the the termIndexInterval to use
> less RAM.  EG a setting of 2 means every 2 * termIndexInterval is
> loaded into RAM.
> This is particularly useful if your index has a great many terms (eg
> you accidentally indexed binary terms).
> Spinoff from this thread:
>   http://www.gossamer-threads.com/lists/lucene/java-dev/54371

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (LUCENE-1052) Add an "termInfosIndexDivisor" to IndexReader

Shalin Shekhar Mangar (Jira)
In reply to this post by Shalin Shekhar Mangar (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-1052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12544644 ]

Chuck Williams commented on LUCENE-1052:
----------------------------------------

> It almost feels like we should have "hooks" that are invoked at
> certain times, like when we are about to load the term infos index,
> that give the application a chance to change something...

I agree with the need for some kind of hook.  This is what TermInfosConfigurer is.  It calls a method whenever a SegmentReader reads an index to obtains parameters (termIndexDivisor) that should be used to configure the TermInfosReader.

Why not make the setters/getters on SegmentIndexProperties regular non-static methods, and allow hook methods as well?  E.g., setTermIndexDivsior(), getTermIndexDivisor(), getMaxTermsCached(String segmentName, int segmentNumDocs, long segmentNumTerms).  Non-static methods make the defaulting straightforward and allow for subclassing to override hook methods.

> It sounds like a detector for this would be very useful. It would, e.g., substantially
> speed updates of such indexes, and not slow searches of them like a divisor does.
> At Excite we evolved effective heuristics for wordness to keep our dictionaries from exploding.

Yes, we are pursuing that approach as well, but we have some stringent requirements in our market.  E.g., we cannot filter *any* valid content, because searches must be guaranteed to find all matching results.  As of result of this, we cannot impose any maximum length for documents.

Any type of binary content recognizer would either need to be 100% accurate, which is impossible, or require human intervention to validate filtering.  For a human intervention approach to be viable the false positive rate must be tiny.  To be effective the false negative rate must be tiny.  Although invalid content is pretty easy for people tor recognize, I've found so far that high-accuracy recognition rules are surprising subtle.

Do you by chance no of any quality work in this area?

> > int bound = (int) (1+TERM_BOUNDING_MULTIPLIER*Math.sqrt(1+segmentNumDocs)/TERM_INDEX_INTERVAL);

> This sounds like a fine approach.

It seems to be working ok, but there is one issue.  Heap's Law is based on the total number of tokens in the content, not the total number of documents.  I.e., longer documents will generate more distinct terms than shorter documents.  For large segments the use of numDocs works ok due to statistical averaging, but for smaller segments there are errors.  I may loosen the bound somewhat on smaller segments in order to allow for their larger standard deviation.

If Lucene indexes tracked totalTokens (with duplicates, i.e. not numDistinctTokens) that would be perfect, but they don't.  I don't know whether or not there would be other good uses for totalTokens but mention its relevance here in case there are.


> Add an "termInfosIndexDivisor" to IndexReader
> ---------------------------------------------
>
>                 Key: LUCENE-1052
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1052
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.2
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.3
>
>         Attachments: LUCENE-1052.patch, LUCENE-1052.patch, termInfosConfigurer.patch
>
>
> The termIndexInterval, set during indexing time, let's you tradeoff
> how much RAM is used by a reader to load the indexed terms vs cost of
> seeking to the specific term you want to load.
> But the downside is you must set it at indexing time.
> This issue adds an indexDivisor to TermInfosReader so that on opening
> a reader you could further sub-sample the the termIndexInterval to use
> less RAM.  EG a setting of 2 means every 2 * termIndexInterval is
> loaded into RAM.
> This is particularly useful if your index has a great many terms (eg
> you accidentally indexed binary terms).
> Spinoff from this thread:
>   http://www.gossamer-threads.com/lists/lucene/java-dev/54371

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Updated: (LUCENE-1052) Add an "termInfosIndexDivisor" to IndexReader

Shalin Shekhar Mangar (Jira)
In reply to this post by Shalin Shekhar Mangar (Jira)

     [ https://issues.apache.org/jira/browse/LUCENE-1052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-1052:
---------------------------------------

    Fix Version/s:     (was: 2.3)
                   2.4

> Add an "termInfosIndexDivisor" to IndexReader
> ---------------------------------------------
>
>                 Key: LUCENE-1052
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1052
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.2
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.4
>
>         Attachments: LUCENE-1052.patch, LUCENE-1052.patch, termInfosConfigurer.patch
>
>
> The termIndexInterval, set during indexing time, let's you tradeoff
> how much RAM is used by a reader to load the indexed terms vs cost of
> seeking to the specific term you want to load.
> But the downside is you must set it at indexing time.
> This issue adds an indexDivisor to TermInfosReader so that on opening
> a reader you could further sub-sample the the termIndexInterval to use
> less RAM.  EG a setting of 2 means every 2 * termIndexInterval is
> loaded into RAM.
> This is particularly useful if your index has a great many terms (eg
> you accidentally indexed binary terms).
> Spinoff from this thread:
>   http://www.gossamer-threads.com/lists/lucene/java-dev/54371

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (LUCENE-1052) Add an "termInfosIndexDivisor" to IndexReader

Shalin Shekhar Mangar (Jira)
In reply to this post by Shalin Shekhar Mangar (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-1052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12549019 ]

Michael McCandless commented on LUCENE-1052:
--------------------------------------------

I think for 2.3 we should go with the approach as currently committed, and take this ongoing debate into 2.4?  I'll mark this as 2.4 target.

> Add an "termInfosIndexDivisor" to IndexReader
> ---------------------------------------------
>
>                 Key: LUCENE-1052
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1052
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.2
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.4
>
>         Attachments: LUCENE-1052.patch, LUCENE-1052.patch, termInfosConfigurer.patch
>
>
> The termIndexInterval, set during indexing time, let's you tradeoff
> how much RAM is used by a reader to load the indexed terms vs cost of
> seeking to the specific term you want to load.
> But the downside is you must set it at indexing time.
> This issue adds an indexDivisor to TermInfosReader so that on opening
> a reader you could further sub-sample the the termIndexInterval to use
> less RAM.  EG a setting of 2 means every 2 * termIndexInterval is
> loaded into RAM.
> This is particularly useful if your index has a great many terms (eg
> you accidentally indexed binary terms).
> Spinoff from this thread:
>   http://www.gossamer-threads.com/lists/lucene/java-dev/54371

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Updated: (LUCENE-1052) Add an "termInfosIndexDivisor" to IndexReader

Shalin Shekhar Mangar (Jira)
In reply to this post by Shalin Shekhar Mangar (Jira)

     [ https://issues.apache.org/jira/browse/LUCENE-1052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-1052:
---------------------------------------

    Fix Version/s:     (was: 2.4)

Changing fix version from 2.4 to Unknown.

> Add an "termInfosIndexDivisor" to IndexReader
> ---------------------------------------------
>
>                 Key: LUCENE-1052
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1052
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.2
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>         Attachments: LUCENE-1052.patch, LUCENE-1052.patch, termInfosConfigurer.patch
>
>
> The termIndexInterval, set during indexing time, let's you tradeoff
> how much RAM is used by a reader to load the indexed terms vs cost of
> seeking to the specific term you want to load.
> But the downside is you must set it at indexing time.
> This issue adds an indexDivisor to TermInfosReader so that on opening
> a reader you could further sub-sample the the termIndexInterval to use
> less RAM.  EG a setting of 2 means every 2 * termIndexInterval is
> loaded into RAM.
> This is particularly useful if your index has a great many terms (eg
> you accidentally indexed binary terms).
> Spinoff from this thread:
>   http://www.gossamer-threads.com/lists/lucene/java-dev/54371

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

12