huge tii files

huge tii files

tsuraan
I have a collection of indices with a total of about 7,000,000
documents between them all.  When I attempt to run a search over these
indices, the searching process's memory usage increases to ~1.7GB if I
allow Java to use that much memory.  If I don't (my normal memory cap
is 512MB), I get the following exception:

Exception in thread "Thread-2" java.lang.OutOfMemoryError: Java heap space
        at java.util.Arrays.copyOfRange(Arrays.java:3209)
        at java.lang.String.<init>(String.java:216)
        at org.apache.lucene.index.TermBuffer.toTerm(TermBuffer.java:104)
        at org.apache.lucene.index.SegmentTermEnum.term(SegmentTermEnum.java:159)
        at org.apache.lucene.index.TermInfosReader.ensureIndexIsRead(TermInfosReader.java:119)
        at org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:157)
        at org.apache.lucene.index.SegmentReader.docFreq(SegmentReader.java:419)
        at org.apache.lucene.search.IndexSearcher.docFreq(IndexSearcher.java:87)
        at org.apache.lucene.search.Searcher.docFreqs(Searcher.java:178)
        at org.apache.lucene.search.MultiSearcher.createWeight(MultiSearcher.java:311)
        at org.apache.lucene.search.Searcher.search(Searcher.java:118)
        at org.apache.lucene.search.Searcher.search(Searcher.java:97)
        at SearchThread.run(SearchThread.java:54)

So it looks like simply reading the .tii files from the indices takes
huge amounts of RAM.  This is only happening on one machine; other
machines with similar data run fine under 256-512MB memory caps, so
I'm trying to figure out what could cause the .tii files to become so
bloated.  Is there anything I can do to fix these indices?  Searching
is also very slow on this machine; many machines with tens of millions
of documents answer searches in under a second, whereas this machine
takes many seconds before it calls its HitCollector's collect function
for the first time.

Any suggestions about how to slim down the .tii files on this machine
(or any workarounds) would be much appreciated.
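
A quick way to see which index carries the oversized term index is to total the .tii bytes per index directory and compare the numbers across machines. Below is a minimal sketch in plain Java. The class name TiiSizeReport is made up for illustration, and it assumes the indices are not stored in compound-file format (with .cfs files the .tii data is packed inside and will not show up as separate files).

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

// Hypothetical helper, not part of Lucene: reports how many bytes of
// .tii (term index) files sit under each index directory.
public class TiiSizeReport {

    /** Sums the sizes, in bytes, of all *.tii files found under indexDir. */
    public static long totalTiiBytes(Path indexDir) throws IOException {
        try (Stream<Path> files = Files.walk(indexDir)) {
            return files
                .filter(Files::isRegularFile)
                .filter(p -> p.getFileName().toString().endsWith(".tii"))
                .mapToLong(p -> p.toFile().length())
                .sum();
        }
    }

    public static void main(String[] args) throws IOException {
        // Each command-line argument is taken to be one index directory.
        for (String arg : args) {
            Path dir = Paths.get(arg);
            System.out.println(dir + ": " + totalTiiBytes(dir) + " bytes of .tii");
        }
    }
}
```

Run it with one index directory per argument; an index whose total is far out of line with the others is the first one to look at.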

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


RE: huge tii files

Alex-412

You can invoke IndexReader.setTermInfosIndexDivisor before any search to control what fraction of the .tii file is read into memory.
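
For a rough sense of what the divisor buys, here is a back-of-the-envelope sketch in plain Java (no Lucene needed on the classpath). It assumes the behaviour described in the Lucene 2.3 apidoc: with a divisor of N, one in every N * termIndexInterval terms is loaded into memory. The unique-term count is a made-up figure, 128 is Lucene's default termIndexInterval, and the class and method names are hypothetical.

```java
// Hypothetical estimator, not part of Lucene: approximates how many
// index terms a reader keeps in memory for a given divisor.
public class TermIndexEstimate {

    /**
     * Approximate number of in-memory index terms, assuming one term is
     * kept for every (termIndexInterval * indexDivisor) terms in the index.
     */
    public static long loadedIndexTerms(long uniqueTerms,
                                        int termIndexInterval,
                                        int indexDivisor) {
        if (termIndexInterval < 1 || indexDivisor < 1) {
            throw new IllegalArgumentException("interval and divisor must be >= 1");
        }
        long step = (long) termIndexInterval * indexDivisor;
        return (uniqueTerms + step - 1) / step; // ceiling division
    }

    public static void main(String[] args) {
        long uniqueTerms = 500_000_000L; // assumed figure, for illustration only
        int interval = 128;              // Lucene's default termIndexInterval
        for (int divisor : new int[] {1, 2, 4, 8}) {
            System.out.println("divisor=" + divisor + " -> about "
                + loadedIndexTerms(uniqueTerms, interval, divisor)
                + " index terms in memory");
        }
    }
}
```

In a real searcher on 2.3.x the corresponding call would be something like reader.setTermInfosIndexDivisor(4) on each IndexReader, made before that reader loads its term index; as far as I know the setting is rejected once the index terms are already in memory.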




Re: huge tii files

tsuraan
That's really nice.  Thanks!

I'm guessing the answer is no, but is there an equivalent to that for
lucene-2.2.0?  Upgrading shouldn't be much of a problem anyhow (we've
been doing it since 1.9), but out of curiosity...

On 17/06/2008, Alex <[hidden email]> wrote:

>
> you can invoke IndexReader.setTermInfosIndexDivisor prior to any search to
> control the fraction of .tii file read into memory.

RE: huge tii files

steve_rowe
Hi tsuraan,

On 06/17/2008 at 2:31 PM, tsuraan wrote:
> I'm guessing the answer is no, but is there an equivalent to that for
> lucene-2.2.0?

Not exactly equivalent, but: from the apidoc for the 2.3.2 version of setTermInfosIndexDivisor(int)
<http://lucene.apache.org/java/2_3_2/api/core/org/apache/lucene/index/IndexReader.html#setTermInfosIndexDivisor(int)>:

     For IndexReader implementations that use TermInfosReader to read terms,
     this sets the indexDivisor to subsample the number of indexed terms
     loaded into memory. This has the same effect as
     IndexWriter.setTermIndexInterval(int) except that setting must be done
     at indexing time while this setting can be set per reader. [....]

So on 2.2.0 the available knob is the index-time one. The apidoc for the 2.2.0 version of IndexWriter.setTermIndexInterval(int) is here:

<http://lucene.apache.org/java/2_2_0/api/org/apache/lucene/index/IndexWriter.html#setTermIndexInterval(int)>
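
To see why either setting trades memory against lookup time, here is a toy model in plain Java rather than Lucene code. A sorted array stands in for the full term dictionary, and only every interval-th term counts as "in memory"; a lookup binary-searches those entries and then scans forward through the dictionary. The class SparseTermIndex and its methods are invented for illustration and only approximate what TermInfosReader does.

```java
// Toy model, not Lucene code: a sparse index over a sorted term array.
public class SparseTermIndex {
    private final String[] allTerms; // stands in for the on-disk term dictionary
    private final int interval;      // every interval-th term is "in memory"

    public SparseTermIndex(String[] sortedTerms, int interval) {
        if (interval < 1) {
            throw new IllegalArgumentException("interval must be >= 1");
        }
        this.allTerms = sortedTerms;
        this.interval = interval;
    }

    /** Number of terms that would be held in memory. */
    public int indexSize() {
        return (allTerms.length + interval - 1) / interval;
    }

    /** Terms scanned sequentially to find target, or -1 if it is absent. */
    public int scanCost(String target) {
        // Binary search the in-memory entries for the last one <= target.
        int lo = 0, hi = indexSize() - 1, start = 0;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (allTerms[mid * interval].compareTo(target) <= 0) {
                start = mid * interval;
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        // Scan forward from that entry through the full dictionary.
        int scanned = 0;
        for (int i = start; i < allTerms.length; i++) {
            scanned++;
            int c = allTerms[i].compareTo(target);
            if (c == 0) return scanned;
            if (c > 0) break; // passed the spot where target would be
        }
        return -1;
    }

    public static void main(String[] args) {
        String[] terms = new String[1000];
        for (int i = 0; i < terms.length; i++) {
            terms[i] = String.format("t%04d", i);
        }
        for (int interval : new int[] {128, 512}) {
            SparseTermIndex idx = new SparseTermIndex(terms, interval);
            System.out.println("interval=" + interval
                + " in-memory terms=" + idx.indexSize()
                + " scan cost for t0511=" + idx.scanCost("t0511"));
        }
    }
}
```

A larger interval shrinks the in-memory part but lengthens the worst-case scan, which is the same trade the divisor makes at read time.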

Steve
