Optimizing/minimizing memory usage of memory-based indexes


Optimizing/minimizing memory usage of memory-based indexes

Tatu Saloranta
I am building a simple classifier system, using Lucene
essentially to efficiently+incrementally calculate
term frequencies.
(due to input variations, I am currently creating a
separate index for each attribute, although I guess I
could (should?) just use different field for each
attribute)

Now, one potential problem I have is that although
memory usage is probably sub-linear (I only index
terms, don't store them; the vocabulary grows
sub-linearly), and thus actual memory used should not
grow too fast, the way Lucene builds and merges indexes
makes usage fluctuate: I assume memory usage mostly
changes when merging segments. I have simple
diagnostics for memory usage that force a gc every 1000
documents processed [yes, I know that System.gc() does
not strictly guarantee a collection, but in practice it
is good enough], and I notice usage fluctuating a bit,
with an overall increase but a roughly 10% drop every
12,000 documents or so, with default settings.
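For reference, a diagnostic like the one described can be sketched in plain Java (the class and method names here are illustrative, not from the original code):

```java
public class MemoryDiagnostics {
    // Returns approximate heap usage in bytes. System.gc() is only a
    // hint, as noted above, but in practice it makes Runtime-based
    // measurements much more stable.
    public static long usedHeapBytes() {
        Runtime rt = Runtime.getRuntime();
        rt.gc();
        return rt.totalMemory() - rt.freeMemory();
    }

    public static void main(String[] args) {
        int interval = 1000;  // sample every N documents
        for (int doc = 1; doc <= 5000; doc++) {
            // ... index one document here ...
            if (doc % interval == 0) {
                System.out.println("docs=" + doc
                        + " usedBytes=" + usedHeapBytes());
            }
        }
    }
}
```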

So... I am essentially wondering whether there are good
techniques for tuning memory usage (minimizing index
structure size) adaptively, to avoid running out of
memory in cases where compacting the index would
prevent the out-of-memory condition.

Further, is it possible to trade reduced memory usage
for slightly slower indexing? (Or, even better, slower
searching -- in my case, I only traverse term indexes
to get counts.) IndexWriter.optimize() probably does
not really help here, does it?
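[Editor's note: the IndexWriter of this era does expose knobs for roughly this trade-off. A sketch, assuming Lucene 1.9-style setters; verify the names against your version:]

```java
// Sketch: trade some indexing speed for lower peak memory.
// Assumes Lucene 1.9-style IndexWriter setters; verify against your version.
RAMDirectory dir = new RAMDirectory();
IndexWriter writer = new IndexWriter(dir, new WhitespaceAnalyzer(), true);

// Flush buffered documents into a segment more often (default is 10).
// Smaller values lower peak memory during indexing but slow it down.
writer.setMaxBufferedDocs(5);

// A lower merge factor merges segments more eagerly, keeping fewer
// live segments (and less transient merge state) around at once.
writer.setMergeFactor(2);
```

This is a configuration fragment against an external library, so it is not runnable stand-alone; the numbers are illustrative starting points, not recommendations.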

-+ Tatu +-



---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


Re: Optimizing/minimizing memory usage of memory-based indexes

Wolfgang Hoschek-2
Hi Tatu,

I take it that simply maintaining the frequencies in a hashmap,
similar to
org.apache.lucene.index.memory.AnalyzerUtil.getMostFrequentTerms(),
isn't sufficient for your use cases?
If it isn't, are you using
org.apache.lucene.store.RAMDirectory or
org.apache.lucene.index.memory.MemoryIndex?
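[Editor's note: the plain-hashmap approach mentioned here can be sketched without any Lucene dependency; whitespace tokenization stands in for a real Analyzer, and the class name is illustrative:]

```java
import java.util.HashMap;
import java.util.Map;

public class TermFrequencies {
    private final Map<String, Integer> counts = new HashMap<String, Integer>();

    // Incrementally add one document's text. Lowercasing plus
    // whitespace splitting stands in for a real Lucene Analyzer.
    public void addDocument(String text) {
        for (String token : text.toLowerCase().split("\\s+")) {
            if (token.length() == 0) continue;
            Integer old = counts.get(token);
            counts.put(token, old == null ? 1 : old + 1);
        }
    }

    // Current frequency of a term across all added documents.
    public int frequency(String term) {
        Integer n = counts.get(term);
        return n == null ? 0 : n;
    }
}
```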

Wolfgang.

On Feb 10, 2006, at 12:29 PM, Tatu Saloranta wrote:





Re: Optimizing/minimizing memory usage of memory-based indexes

Tatu Saloranta
--- Wolfgang Hoschek <[hidden email]> wrote:
> Hi Tatu,
>
> I take it that simply maintaining the frequencies in a hashmap
> similar to org.apache.lucene.index.memory.AnalyzerUtil.getMostFrequentTerms()
> isn't sufficient for your usecases?

Initially it might, but probably not eventually. I was
thinking Lucene's formats might also be a bit more
compact than vanilla hash maps, but I guess that
depends on many factors. And I will probably want to
play with actual queries later on, based on frequencies.

> In the latter case, are you using  
> org.apache.lucene.store.RAMDirectory or  
> org.apache.lucene.index.memory.MemoryIndex?

I'm using RAMDirectory. Should I maybe be using
MemoryIndex instead? (I'll check it out.)

Thanks!

-+ Tatu +-






Re: Optimizing/minimizing memory usage of memory-based indexes

Wolfgang Hoschek-2
>
> Initially it might, but probably eventually not. I was
> thinking Lucene formats might also be bit more compact
> than vanilla hash maps, but I guess that depends on
> many factors. But I will probably want to play with
> actual queries later on, based on frequencies.

OK.

>
>> In the latter case, are you using
>> org.apache.lucene.store.RAMDirectory or
>> org.apache.lucene.index.memory.MemoryIndex?
>
> I'm using RAMDirectory. Should I be using MemoryIndex
> maybe instead (I'll check it out)?
>

The main constraint is that a MemoryIndex instance can only hold
*one* Lucene document (though it can have any number of fields).
MemoryIndex is designed to be a transient, throw-away data
structure for streaming / publish-subscribe use cases. Where it is
applicable, MemoryIndex has better performance but worse memory
consumption than RAMDirectory. I can't tell whether that may or
may not be an issue for your case.
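[Editor's note: the one-document-at-a-time usage looks roughly like this; a sketch against the contrib MemoryIndex API of the time, so verify the signatures against your Lucene version:]

```java
// Sketch: score a single transient document against a query.
// Assumes the contrib MemoryIndex API of this era; verify signatures.
MemoryIndex index = new MemoryIndex();
index.addField("content", "some sample text to match", new SimpleAnalyzer());

QueryParser parser = new QueryParser("content", new SimpleAnalyzer());
float score = index.search(parser.parse("sample"));
// A score greater than 0.0f means the document matched the query.
// The index is then typically discarded and a new one built for
// the next document.
```

This is a fragment against an external library, shown for orientation only.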

Wolfgang.

