ways to minimize index size?


ways to minimize index size?

jmuguruza
Hi,

I want to make my index as small as possible. I noticed
field.setOmitNorms(true); I read on the list that the difference is 1 byte
per field per doc, not huge, but hey... Is the only effect that the scores
come out different? I hardly care about the score, so that would be ok.

And can I add documents without norms to an index that already has documents with norms?

Any other way to minimize the size of an index? All of my fields but one are
Field.Store.NO, Field.Index.TOKENIZED and Field.TermVector.NO; the remaining
one is Field.Store.YES, Field.Index.UN_TOKENIZED and Field.TermVector.NO. I
tried compressing that one and the size was reduced by around 1% (it's a
small field), but I guess compression means worse performance, so I am not
sure about applying it.

thanks

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


Re: ways to minimize index size?

Erick Erickson
Store as little as possible, index as little as possible <G>.....

How big is your index, and how much do you expect it to grow?
I ask this because it's probably not worth your time to try to
reduce the index size below some threshold... I found that
reducing my index from 8G to 4G (through not stemming) gave
me about a 10% performance improvement, so at some point
it's just not worth the effort. Also, if you posted the index size,
it would give folks a chance to say "there's not much you can
gain by reducing things more". As it is, I don't have a clue
whether your index is 100M or 100T. The former is in the
"don't waste your time" class, and the latter is...er...
different....

I wouldn't bother compressing for 1%....
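(Compressed stored fields are, as far as I know, plain deflate from
java.util.zip under the hood, so a tiny field mostly pays the stream's fixed
overhead. A rough stdlib-only sketch, class name made up, of why a small
field barely shrinks while a large repetitive one does:)

```java
import java.util.zip.Deflater;

public class CompressSketch {
    // One-shot deflate at best compression, roughly what compressed
    // stored fields do (a sketch, not Lucene's actual code).
    static byte[] deflate(byte[] input) {
        Deflater d = new Deflater(Deflater.BEST_COMPRESSION);
        d.setInput(input);
        d.finish();
        byte[] buf = new byte[input.length * 2 + 64];
        int len = d.deflate(buf);
        d.end();
        byte[] out = new byte[len];
        System.arraycopy(buf, 0, out, 0, len);
        return out;
    }

    public static void main(String[] args) {
        // A short field value barely compresses: the deflate stream's
        // fixed overhead eats the savings, so tiny fields can even grow.
        byte[] small = "9198408365809".getBytes();
        System.out.println(small.length + " -> " + deflate(small).length);

        // A long, repetitive value compresses very well.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 1000; i++) sb.append("9198408365809 ");
        byte[] big = sb.toString().getBytes();
        System.out.println(big.length + " -> " + deflate(big).length);
    }
}
```

So a 1% saving on a small field is about what you'd expect; compression only
pays off for large stored values.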

Question for "the guys" so I can check an assumption....
Is there any difference between these two (per the Field javadoc)?
Field(name, value, store, index)
Field(name, value, store, index, Field.TermVector.NO)


Best
Erick


RE: ways to minimize index size?

Jeff-188
In reply to this post by jmuguruza
>I found that reducing my index from 8G to 4G (through not stemming) gave me
about a 10% performance improvement.

How did you do this? I don't see this as an option.

Jeff

Re: ways to minimize index size?

jmuguruza
In reply to this post by Erick Erickson
hi Erick,


Well, typically my application will start with some hundreds of
indexes... and then grow at a rate of several per day, forever. At
some point I know I can do some merging etc. if needed.

Size depends on the customer; it could be up to 1G per index. That
is why I would like to minimize them. I am not worried about search
performance.

I don't understand how not stemming can reduce the size of an index... I
would have thought it was the other way around: doesn't stemming make the
words shorter? (I don't stem, so I never looked into it.)

thanks


Re: ways to minimize index size?

Erick Erickson
OK, I caused more confusion than rendered help by my stemming
statement. The only reason I mentioned it was to illustrate that
performance is not linearly related to size.

It took some effort to put stemming into the index, see
PorterStemmer etc. This is NOT the default. So I took it out
to see what the effect would be.

Why removing stemming made the index smaller: because we also
have the requirement that phrases (i.e. words in double quotes)
do NOT match the stemmed version. Thus if we index
"running watching", the following searches have the
indicated results:
run - hits
watch - hits
running - hits
"run watch" - does NOT hit
"running watching" - hits

So I indexed the following terms...

run
running$
watch
watching$

with the two forms of run indexed in the same position (0)
and the two forms of watch in the same position (1).

I agree that if we didn't have the exact-phrase-match requirement
the stemmed version of the index should be smaller....
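In Lucene terms, putting the second token at the same position is done with
Token.setPositionIncrement(0). A rough plain-Java sketch of the bookkeeping
described above (no Lucene; the toy stem table, the '$' marker and all names
are invented for illustration):

```java
import java.util.*;

public class StemPositionSketch {
    // Toy stem table for illustration; a real index would use a
    // stemmer such as PorterStemmer.
    static final Map<String, String> STEMS = new HashMap<>();
    static {
        STEMS.put("running", "run");
        STEMS.put("watching", "watch");
    }

    // Index each word twice at the SAME position: the stemmed form
    // (for bare term queries) and a '$'-marked exact form (for
    // phrase queries). The second token is what a position
    // increment of 0 expresses in Lucene.
    static Set<String> index(String text) {
        Set<String> postings = new TreeSet<>();
        int pos = 0;
        for (String word : text.split("\\s+")) {
            postings.add(STEMS.getOrDefault(word, word) + "@" + pos);
            postings.add(word + "$@" + pos);
            pos++;
        }
        return postings;
    }

    // Bare term queries are stemmed and look up unmarked terms.
    static boolean termHits(Set<String> postings, String term) {
        String stem = STEMS.getOrDefault(term, term);
        for (int p = 0; p <= postings.size(); p++) {
            if (postings.contains(stem + "@" + p)) return true;
        }
        return false;
    }

    // Phrase queries look up the exact '$'-marked terms at adjacent
    // positions, so stemmed variants never satisfy a phrase.
    static boolean phraseHits(Set<String> postings, String phrase) {
        String[] words = phrase.split("\\s+");
        for (int start = 0; start <= postings.size(); start++) {
            boolean all = true;
            for (int i = 0; i < words.length; i++) {
                if (!postings.contains(words[i] + "$@" + (start + i))) {
                    all = false;
                    break;
                }
            }
            if (all) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        Set<String> doc = index("running watching");
        System.out.println(termHits(doc, "run"));                 // hits
        System.out.println(termHits(doc, "running"));             // hits
        System.out.println(phraseHits(doc, "run watch"));         // does NOT hit
        System.out.println(phraseHits(doc, "running watching"));  // hits
    }
}
```

The cost shows up in index(): each word yields two terms instead of one,
which is why dropping the scheme roughly halved the index.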


Sorry for the confusion
Erick


RE: ways to minimize index size?

Sebastin
In reply to this post by Jeff-188
Hi, can anyone give me an idea of how to reduce the index size further? Right now I am getting 42% compression in my index store; I want to get up to 70%. I use StandardAnalyzer to write the documents. When I use SimpleAnalyzer it reduces the size by up to 58%, but then I couldn't search the documents. Please help me achieve this.

 Thanks in advance

Re: ways to minimize index size?

Erick Erickson
Show us the code you use to index. Are you storing the fields?
omitting norms? Throwing out stop words?

Best
Erick


Re: ways to minimize index size?

Sebastin
                       
String outgoingNumber = "9198408365809";
String incomingNumber = "9840861114";
String dateSc = "070601";
String imsiNumber = "444021365987";
String callType = "1";

// Search fields (indexed, not stored)
String contents = outgoingNumber + " " + incomingNumber + " " + dateSc
        + " " + imsiNumber + " " + callType;

// Display fields (stored, not indexed)
String records = callingPartyNumber + " " + calledPartyNumber + " " + dateSc
        + " " + chargDur + " " + incomingRoute + " " + outgoingRoute + " " + timeSc;

IndexWriter indexWriter = new IndexWriter(indexDir, new StandardAnalyzer(), true);
indexWriter.setUseCompoundFile(true);

Document document = new Document();
document.add(new Field("contents", contents, Field.Store.NO, Field.Index.TOKENIZED));
document.add(new Field("records", records, Field.Store.YES, Field.Index.NO));
indexWriter.addDocument(document);

Please help me achieve the minimum size.





Re: ways to minimize index size?

Sebastin
When I use StandardAnalyzer the storage size increases. How can I minimize the index store?

Re: ways to minimize index size?

Steve Liles
Compression aside, you could index the "contents" values as terms in separate
fields instead of as tokenized text, and disable storing of norms:

String outgoingNumber="9198408365809";
String incomingNumber="9840861114";

_doc.add(new Field("outgoingNumber", outgoingNumber, Store.NO,
Index.NO_NORMS));
_doc.add(new Field("incomingNumber", incomingNumber, Store.NO,
Index.NO_NORMS));

According to the docs, "Index.NO_NORMS" will save you one byte per
field per document in the index.

Or you could index all of the data as separate terms in the same
"contents" field if you wanted (make the first param "contents" for all
of the terms), which is more comparable to what you are currently doing.
(Another advantage is that the Analyzer will not be used for fields
which are untokenized, and indexing should be faster.)

...

One way to compress numerical data (possibly not the best; I'm no
expert) is to change the base of the number that is indexed / stored in
the index.

java.lang.Long and java.math.BigInteger have methods for converting from
one radix to another. Taking your "outgoingNumber" as an example:

//compression
BigInteger  _bi = new java.math.BigInteger("9198408365809", 10);
System.out.println(_bi.toString(36));

 > 39douufap

//decompression
BigInteger _bi = new java.math.BigInteger("39douufap", 36);
System.out.println(_bi.toString(10));

 >9198408365809

Converting to a higher radix will give you better compression but you'll
have to do it yourself as the jdk classes only work up to base 36
<http://en.wikipedia.org/wiki/Base_36>.
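Past base 36 you're on your own; a minimal base-62 sketch of that
do-it-yourself conversion (the alphabet and class name are arbitrary choices
of mine):

```java
import java.math.BigInteger;

public class Base62 {
    static final String ALPHABET =
        "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";
    static final BigInteger BASE = BigInteger.valueOf(ALPHABET.length());

    // Encode a non-negative decimal string in base 62.
    static String encode(String decimal) {
        BigInteger n = new BigInteger(decimal, 10);
        if (n.signum() == 0) return "0";
        StringBuilder sb = new StringBuilder();
        while (n.signum() > 0) {
            BigInteger[] qr = n.divideAndRemainder(BASE);
            sb.append(ALPHABET.charAt(qr[1].intValue()));
            n = qr[0];
        }
        return sb.reverse().toString();
    }

    // Decode a base-62 string back to the decimal string.
    static String decode(String encoded) {
        BigInteger n = BigInteger.ZERO;
        for (char c : encoded.toCharArray()) {
            n = n.multiply(BASE).add(BigInteger.valueOf(ALPHABET.indexOf(c)));
        }
        return n.toString(10);
    }

    public static void main(String[] args) {
        String packed = encode("9198408365809");
        System.out.println(packed);          // shorter than the base-36 form
        System.out.println(decode(packed));  // round-trips to the original
    }
}
```

One caveat: base 62 is case-sensitive, so an analyzer that lowercases (as
StandardAnalyzer does) will corrupt the terms; use it only with untokenized
fields or a case-preserving analyzer.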

It's worth compressing your unstored "contents" field as well as your
stored "records" field, as the unique terms in the "contents" field will
effectively be stored.

Also don't forget to convert the terms when you search too, otherwise
you won't find anything ;)

Steve.





Re: ways to minimize index size?

Sebastin
In reply to this post by jmuguruza
Hi Erick, do you have any idea on this?

Re: ways to minimize index size?

Sebastin
In reply to this post by Steve Liles
Hi Steve,
     Thanks a lot for your reply. It now compresses to about 50% of the original size. Is there any other possibility, using this code, of compressing up to 80%?

Re: ways to minimize index size?

Sebastin
Steve,
          I used your idea and it works great for me; thanks again. But when I use Index.NO_NORMS it increases the size, while Index.TOKENIZED reduces it.

          I used the code given by you:
BigInteger _bi = new java.math.BigInteger("9198408365809", 10);
System.out.println(_bi.toString(36));

Other radixes increase the size.

          The modifications I made in my code are below:

String outgoingNumber = "9198408365809";
String incomingNumber = "9840861114";
String dateSc = "070601";
String imsiNumber = "444021365987";
String callType = "1";

String outgoingRoute = "DJZ01";
String incomingRoute = "BSC01";

BigInteger _on = new java.math.BigInteger(outgoingNumber, 10);
String compOutgoingNumber = _on.toString(36);

BigInteger _in = new java.math.BigInteger(incomingNumber, 10);
String compIncomingNumber = _in.toString(36);

BigInteger _ds = new java.math.BigInteger(dateSc, 10);
String compDateSc = _ds.toString(36);

BigInteger _im = new java.math.BigInteger(imsiNumber, 10);
String compImsiNumber = _im.toString(36);

String contents = compOutgoingNumber + " " + compIncomingNumber + " "
        + compDateSc + " " + compImsiNumber + " " + callType;

String records = compOutgoingNumber + " " + compIncomingNumber + " "
        + compDateSc + " " + outgoingRoute + " " + incomingRoute;

File indexDir = new File("/home/Mediation/Index");
IndexWriter indexWriter = new IndexWriter(indexDir, new StandardAnalyzer(), true);
Document doc = new Document();
doc.add(new Field("contents", contents, Field.Store.NO, Field.Index.TOKENIZED));
doc.add(new Field("records", records, Field.Store.YES, Field.Index.NO));
indexWriter.addDocument(doc);

Please help me achieve that.

Sebastin wrote
Hi Steve,
     thanks for your reply a lot.its now compress upto 50% of the original size.is there any other possiblity using this code compress upto 80%.
Steve Liles wrote
Compression aside you could index the "contents" as terms in separate
fields instead of tokenized text, and disable storing of norms:

String outgoingNumber="9198408365809";
String incomingNumber="9840861114";

_doc.add(new Field("outgoingNumber", outgoingNumber, Store.NO,
Index.NO_NORMS));
_doc.add(new Field("incomingNumber", incomingNumber, Store.NO,
Index.NO_NORMS));

According to the docs "Index.NO_NORMS" will save you one byte per
document in the index.

Or you could index all of the data as separate terms in the same
"contents" field if you wanted (make the first param "contents" for all
of the terms), which is more comparable to what you are currently doing.
(Another advantage is that the Analyzer will not be used for fields
which are untokenized, and indexing should be faster.)

...

One way to compress numerical data (possibly not the best - i'm no
expert) is to change the base of the number that is indexed / stored in
the index.

java.lang.Long and java.math.BigInteger have methods for converting from
one radix to another. Taking your "outgoingNumber" as an example:

//compression
BigInteger  _bi = new java.math.BigInteger("9198408365809", 10);
System.out.println(_bi.toString(36));

 > 39douufap

//decompression
BigInteger _bi = new java.math.BigInteger("39douufap", 36);
System.out.println(_bi.toString(10));

 >9198408365809

Converting to a higher radix will give you better compression but you'll
have to do it yourself as the jdk classes only work up to base 36
<http://en.wikipedia.org/wiki/Base_36>.
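
(As a sketch of going beyond base 36: a hypothetical base-62 codec built on BigInteger.divideAndRemainder. The alphabet and class name are made up for illustration; any set of characters that survive your analyzer untouched would do:)

```java
import java.math.BigInteger;

public class Base62 {
    // 0-9, A-Z, a-z: 62 symbols
    private static final String ALPHABET =
        "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";
    private static final BigInteger BASE = BigInteger.valueOf(62);

    static String encode(BigInteger n) {
        if (n.signum() == 0) return "0";
        StringBuilder sb = new StringBuilder();
        while (n.signum() > 0) {
            // Peel off the least significant base-62 digit each iteration
            BigInteger[] qr = n.divideAndRemainder(BASE);
            sb.append(ALPHABET.charAt(qr[1].intValue()));
            n = qr[0];
        }
        return sb.reverse().toString();
    }

    static BigInteger decode(String s) {
        BigInteger n = BigInteger.ZERO;
        for (int i = 0; i < s.length(); i++) {
            n = n.multiply(BASE).add(BigInteger.valueOf(ALPHABET.indexOf(s.charAt(i))));
        }
        return n;
    }

    public static void main(String[] args) {
        BigInteger outgoing = new BigInteger("9198408365809");
        String term = encode(outgoing);
        // Base 62 yields 8 characters here versus 9 in base 36
        System.out.println(term + " -> " + decode(term));
    }
}
```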

It's worth compressing your unstored "contents" field as well as your
stored "records" field, as the unique terms in the "contents" field will
effectively be stored.

Also don't forget to convert the terms when you search too, otherwise
you won't find anything ;)
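
For instance, a user-entered number would be run through the same conversion before building the query term. A minimal sketch — the query construction is only indicated in a comment, since it depends on your Lucene setup:

```java
import java.math.BigInteger;

public class QueryTermConversion {
    public static void main(String[] args) {
        String userInput = "9840861114"; // decimal number as the user types it
        // Apply the same base-36 conversion used at index time
        String indexedForm = new BigInteger(userInput, 10).toString(36);
        System.out.println(indexedForm);
        // The query would then be built from the converted term, e.g.:
        // Query q = new TermQuery(new Term("contents", indexedForm));
    }
}
```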

Steve.


Sebastin wrote:
> When I use the StandardAnalyzer the storage size increases. How can I minimize
> the index store?
>
> Sebastin wrote:
>  
>>                        
>> String outgoingNumber="9198408365809";
>> String incomingNumber="9840861114";
>> String dateSc="070601";
>> String imsiNumber="444021365987";
>> String callType="1";
>>
>> //Search Fields
>>  String contents=(outgoingNumber+" "+incomingNumber+" "+dateSc+"
>> "+imsiNumber+" "+callType );
>>
>> //Display Fields
>>                      
>>                           String records=(callingPartyNumber+"
>> "+calledPartyNumber+" "+dateSc+" "+chargDur+" "+incomingRoute+"
>> "+outgoingRoute+" "+timeSc);
>>                          
>>                      
>>                        IndexWriter indexWriter = new
>> IndexWriter(indexDir,new StandardAnalyzer(),true);  
>>                        
>>                           Document document = new Document();
>>  
>>                              document.add(new
>> Field("contents",contents,Field.Store.NO,Field.Index.TOKENIZED));
>>                              
>>                      
>>                      
>>                 document.add(new
>> Field("records",records,Field.Store.YES,Field.Index.NO));
>>                              
>>                            
>>                              indexWriter.setUseCompoundFile(true);
>>                              indexWriter.addDocument(document);
>>                           }
>>
>> Please help me to achieve the minimum size.
>>
>>
>>
>>
>>
>> Erick Erickson wrote:
>>    
>>> Show us the code you use to index. Are you storing the fields?
>>> omitting norms? Throwing out stop words?
>>>
>>> Best
>>> Erick
>>>
>>> On 6/19/07, Sebastin <sebasmtech@gmail.com> wrote:
>>>      
>>>> Hi, can anyone give me an idea of how to reduce the index size? Right now I
>>>> am getting 42% compression in my index store, and I want to reach up to 70%.
>>>> I use StandardAnalyzer to write the documents. When I use SimpleAnalyzer it
>>>> reduces up to 58%, but then I couldn't search the documents. Please help me
>>>> achieve this.
>>>>
>>>> Thanks in advance
>>>>
>>>> Jeff-188 wrote:
>>>>        
>>>>>> I found that reducing my index from 8G to 4G (through not stemming)
>>>>>>            
>>>> gave
>>>> me
>>>>        
>>>>> about a 10% performance improvement.
>>>>>
>>>>> How did you do this? I don't see this as an option.
>>>>>
>>>>> Jeff
>>>>>
>>>>>
>>>>>          
>>>> --
>>>> View this message in context:
>>>> http://www.nabble.com/ways-to-minimize-index-size--tf3401213.html#a11195406
>>>> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>>>>
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>>
>>>>
>>>>        
>>>      
>>    
>
>  


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org