Lucene Compression

classic Classic list List threaded Threaded
3 messages Options
Reply | Threaded
Open this post in threaded view
|

Lucene Compression

Sebastin
Hi All,
       is there any possibility to create compression store for the following types of string in lucene index store?


String str = "II0264.D05|00022745|ABCDE|03/01/2008 00:23:12|00035| 9840836588| 129382152520| 04F4243B600408|04F4243B600408| |11919898456123|354943011025810L| "CPTBS2I"| "ABCD3E"|11|1234510003243219I|"


I try to store these fields as Field.Store.COMPRESSION  but it exceeds the original size of the file?

Reply | Threaded
Open this post in threaded view
|

Re: Lucene Compression

Grant Ingersoll-2
It's generally considered best practice to compress things first in  
your app and then add them as a binary field.   That being said, I  
don't see why that would blow up on it's own.  Have you tried  
compressing it outside of Lucene to see what happens?  If you can  
reproduce it as a test case for Lucene, that would be great.

 From FieldsWriter, Lucene's compression code looks like:
private final byte[] compress (byte[] input) {

       // Create the compressor with highest level of compression
       Deflater compressor = new Deflater();
       compressor.setLevel(Deflater.BEST_COMPRESSION);

       // Give the compressor the data to compress
       compressor.setInput(input);
       compressor.finish();

       /*
        * Create an expandable byte array to hold the compressed data.
        * You cannot use an array that's the same size as the orginal  
because
        * there is no guarantee that the compressed data will be  
smaller than
        * the uncompressed data.
        */
       ByteArrayOutputStream bos = new  
ByteArrayOutputStream(input.length);

       // Compress the data
       byte[] buf = new byte[1024];
       while (!compressor.finished()) {
         int count = compressor.deflate(buf);
         bos.write(buf, 0, count);
       }

       compressor.end();

       // Get the compressed data
       return bos.toByteArray();
     }


There is an interesting comment in that code about how the compressed  
data won't necessarily be smaller, so maybe you have entered the  
compression twilight zone.

HTH
-Grant


On Apr 2, 2008, at 12:51 AM, Sebastin wrote:

>
> Hi All,
>       is there any possibility to create compression store for the
> following types of string in lucene index store?
>
>
> String str = "II0264.D05|00022745|ABCDE|03/01/2008 00:23:12|00035|
> 9840836588| 129382152520| 04F4243B600408|04F4243B600408|
> |11919898456123|354943011025810L| "CPTBS2I"| "ABCD3E"|11|
> 1234510003243219I|"
>
>
> I try to store these fields as Field.Store.COMPRESSION  but it  
> exceeds the
> original size of the file?
>
>
> --
> View this message in context: http://www.nabble.com/Lucene-Compression-tp16442112p16442112.html
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]
>

--------------------------
Grant Ingersoll
http://www.lucenebootcamp.com
Next Training: April 7, 2008 at ApacheCon Europe in Amsterdam

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ






---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: Lucene Compression

eks dev
In reply to this post by Sebastin
the example you have sent is too small for the type of compression implemented in lucene. The problem is that you have to store decoding symbol table , header ...* for each* document you compress.  
The best you can do for this would be to use some compressor with static decoding table (some entropy codec, huffman or so) and maybe some static dictionary compressor. As Grant said, the best way to do it would be to compress your field before storing it into lucene

----- Original Message ----

> From: Grant Ingersoll <[hidden email]>
> To: [hidden email]
> Sent: Wednesday, 2 April, 2008 2:09:37 PM
> Subject: Re: Lucene Compression
>
> It's generally considered best practice to compress things first in  
> your app and then add them as a binary field.   That being said, I  
> don't see why that would blow up on it's own.  Have you tried  
> compressing it outside of Lucene to see what happens?  If you can  
> reproduce it as a test case for Lucene, that would be great.
>
>  From FieldsWriter, Lucene's compression code looks like:
> private final byte[] compress (byte[] input) {
>
>        // Create the compressor with highest level of compression
>        Deflater compressor = new Deflater();
>        compressor.setLevel(Deflater.BEST_COMPRESSION);
>
>        // Give the compressor the data to compress
>        compressor.setInput(input);
>        compressor.finish();
>
>        /*
>         * Create an expandable byte array to hold the compressed data.
>         * You cannot use an array that's the same size as the orginal  
> because
>         * there is no guarantee that the compressed data will be  
> smaller than
>         * the uncompressed data.
>         */
>        ByteArrayOutputStream bos = new  
> ByteArrayOutputStream(input.length);
>
>        // Compress the data
>        byte[] buf = new byte[1024];
>        while (!compressor.finished()) {
>          int count = compressor.deflate(buf);
>          bos.write(buf, 0, count);
>        }
>
>        compressor.end();
>
>        // Get the compressed data
>        return bos.toByteArray();
>      }
>
>
> There is an interesting comment in that code about how the compressed  
> data won't necessarily be smaller, so maybe you have entered the  
> compression twilight zone.
>
> HTH
> -Grant
>
>
> On Apr 2, 2008, at 12:51 AM, Sebastin wrote:
>
> >
> > Hi All,
> >       is there any possibility to create compression store for the
> > following types of string in lucene index store?
> >
> >
> > String str = "II0264.D05|00022745|ABCDE|03/01/2008 00:23:12|00035|
> > 9840836588| 129382152520| 04F4243B600408|04F4243B600408|
> > |11919898456123|354943011025810L| "CPTBS2I"| "ABCD3E"|11|
> > 1234510003243219I|"
> >
> >
> > I try to store these fields as Field.Store.COMPRESSION  but it  
> > exceeds the
> > original size of the file?
> >
> >
> > --
> > View this message in context:
> http://www.nabble.com/Lucene-Compression-tp16442112p16442112.html
> > Sent from the Lucene - Java Users mailing list archive at Nabble.com.
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: [hidden email]
> > For additional commands, e-mail: [hidden email]
> >
>
> --------------------------
> Grant Ingersoll
> http://www.lucenebootcamp.com
> Next Training: April 7, 2008 at ApacheCon Europe in Amsterdam
>
> Lucene Helpful Hints:
> http://wiki.apache..org/lucene-java/BasicsOfPerformance
> http://wiki.apache.org/lucene-java/LuceneFAQ
>
>
>
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]
>
>




      __________________________________________________________
Sent from Yahoo! Mail.
A Smarter Inbox http://uk.docs.yahoo.com/nowyoucan.html


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]