large term vectors


large term vectors

marc.dumontier
Hi,

I have a large index which is around 275 GB. As I search different parts
of the index, the memory footprint grows with large byte arrays being
stored. They never seem to get unloaded or GC'ed. Is there any way to
control this behavior so that I can periodically unload cached
information?

The nature of the data being indexed doesn't allow me to reduce the
number of terms per field, although I might be able to reduce the number
of overall fields (I have some that aren't currently being searched on).

I've just begun investigating and profiling the problem, so I don't have
a lot of details at this time. Any help would be extremely welcome.

Thanks,

Marc Dumontier
Manager, Software Development
Thomson Scientific (Canada)
1 Yonge Street, Suite 1801
Toronto, Ontario M5E 1W7

Direct +1 416 214 3448
Mobile +1 416 454 3147


Re: large term vectors

Cedric Ho
Is it a single index? My index is also in the 200 GB range overall, but
I never managed to get a single index of size > 20 GB and still get
acceptable performance (in both searching and updating), so I split my
indexes into chunks of < 10 GB each.

I am curious how you manage such a large single index.

Cedric


Re: large term vectors

Briggs
So, I have a question about 'splitting indexes'. I see people mention
this all over, but how have people been handling it? I'm going to start
a new thread (there probably was one back in the day, but I'll fire it
up again). But, how did you do it?

--
"Conscious decisions by conscious minds are what make reality real"



Re: large term vectors

Cedric Ho
I guess it would be quite different for different apps.

For me, I do index updates on a single machine: each incoming document
is indexed into one chunk according to some rule that ensures even
distribution. Then I copy all the updated indexes to other machines for
searching, and each machine reopens the updated index.
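
Something like this, as a sketch (not my actual rule; any deterministic
key works):

import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;

class ShardRouter {
    // Hash a stable document key and take it modulo the shard count so
    // documents spread evenly and the same document always lands in
    // the same chunk when it is updated.
    static int pickShard(String docKey, int numShards) {
        int h = docKey.hashCode() & 0x7fffffff; // clear the sign bit
        return h % numShards;
    }
}

So indexing becomes something like
writers[ShardRouter.pickShard(id, writers.length)].addDocument(doc),
where writers is one IndexWriter per chunk (names made up here).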

For searching you can look at RemoteSearchable + ParallelMultiSearcher.
But if you need redundancy, failover, etc., you will probably need to
build that yourself.
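
A minimal sketch of the search side, assuming one RemoteSearchable
bound per shard host via RMI (the host and binding names below are made
up, and exception handling is omitted):

import java.rmi.Naming;
import org.apache.lucene.search.ParallelMultiSearcher;
import org.apache.lucene.search.Searchable;

// Look up one remote searcher per shard, then fan each query out to
// all of them in parallel behind a single Searcher facade.
Searchable[] shards = {
    (Searchable) Naming.lookup("//search1/shard0"),
    (Searchable) Naming.lookup("//search2/shard1"),
};
ParallelMultiSearcher searcher = new ParallelMultiSearcher(shards);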

Cedric




Re: large term vectors

Grant Ingersoll
In reply to this post by marc.dumontier
Hi Marc,

Can you give more info about your field properties? Your subject line
implies you are storing term vectors; is that the case?

Also, what version of Lucene are you using?

Cheers,
Grant


--------------------------
Grant Ingersoll
http://lucene.grantingersoll.com
http://www.lucenebootcamp.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ


RE: large term vectors

marc.dumontier
In reply to this post by Cedric Ho
No, it's split into about 100 individual indexes. But I'm running my
64-bit JVM with around 10 GB max memory in order to avoid running out
of memory when running all my unit tests (I have some other indexes
running as part of this application as well).

Upon further investigation, it seems to have something to do with the
norms (SegmentReader.norms).
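
If I understand the file format right, norms are one byte per document
per indexed field, loaded lazily and then cached for the life of the
reader. Back-of-envelope, with made-up counts:

  bytes per warmed reader ~= numDocs x numIndexedFields
  e.g. 10,000,000 docs x 30 indexed fields ~= 300 MB per reader

and with ~100 indexes open at once that would add up quickly.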

Marc




RE: large term vectors

marc.dumontier
In reply to this post by Grant Ingersoll
Hi Grant,

Lucene 2.2.0

I'm not actually explicitly storing term vectors. It seems the huge
number of byte arrays is actually coming from SegmentReader.norms.
Maybe that cache constantly grows, as I read somewhere that it's loaded
on demand. I'm not using any field or document boosting... is there
some way to optimize around this?
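
The only workaround I can see so far is to recycle the reader
periodically, since the cache belongs to the open reader. A sketch of
what I mean (dir being the index Directory):

import org.apache.lucene.index.IndexReader;

// Open a fresh reader, move searches over to it, then close the old
// one; closing the old reader is what actually frees its norms cache.
IndexReader fresh = IndexReader.open(dir);
// ... swap searchers over to 'fresh' ...
oldReader.close();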

Marc




Re: large term vectors

Karl Wettin

http://lucene.apache.org/java/2_3_0/api/org/apache/lucene/document/Field.Index.html#NO_NORMS

?
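
i.e. something like this at index time (a sketch against the 2.3 Field
API; field names made up):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

Document doc = new Document();
// NO_NORMS indexes the value un-analyzed and writes no norm byte:
doc.add(new Field("id", "12345", Field.Store.YES, Field.Index.NO_NORMS));

// for analyzed fields, norms can be switched off per field instead:
Field body = new Field("body", text, Field.Store.NO, Field.Index.TOKENIZED);
body.setOmitNorms(true);
doc.add(body);

You lose length normalization for those fields, the documents have to
be reindexed for it to take effect, and (if I remember right) a merged
segment keeps norms for a field as long as any document in it has them.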

