Lucene's VInt for lengths/counts/sizes

Andrzej Białecki-2
Hi,

I wonder, would it be a good idea to replace the (rather wasteful)
4-byte ints with Lucene's variable-byte int encoding, in all places
where size matters? We could "borrow" the code from Lucene and create a
VIntWritable for this purpose. I'm thinking specifically about the
following places:

* UTF8 (2-byte string length)

* ArrayWritable/BytesWritable/TwoDArrayWritable (4-byte length)

* Properties and derived maps (like ContentProperties): all lengths are
written as 4-byte ints.

* any Writable that consists of lists of values is currently serialized
using 4-byte ints for the size of list, e.g. ParseData.outlinks

Overall I think the size savings could be considerable, at the cost of
some CPU.

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
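
For reference, the variable-byte encoding proposed above can be sketched
roughly as follows. This is a minimal illustration of the format Lucene
uses for VInt (7 data bits per byte, low-order group first, high bit set
when another byte follows), not code taken from Lucene or Nutch. The class
name VIntSketch and the encode/decode helpers are hypothetical and exist
only for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical helper for illustration; not an existing Nutch or Lucene class.
final class VIntSketch {

    // Writes 7 bits per byte, low-order group first. A set high bit means
    // "another byte follows". Intended for non-negative values.
    static void writeVInt(DataOutput out, int value) throws IOException {
        while ((value & ~0x7F) != 0) {
            out.writeByte((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.writeByte(value);
    }

    // Reads a value written by writeVInt.
    static int readVInt(DataInput in) throws IOException {
        byte b = in.readByte();
        int value = b & 0x7F;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = in.readByte();
            value |= (b & 0x7F) << shift;
        }
        return value;
    }

    // Convenience wrapper: encode a single value to a byte array.
    static byte[] encode(int value) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            writeVInt(new DataOutputStream(buffer), value);
            return buffer.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Convenience wrapper: decode a single value from a byte array.
    static int decode(byte[] bytes) {
        try {
            return readVInt(new DataInputStream(new ByteArrayInputStream(bytes)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        for (int n : new int[] {0, 127, 128, 16383, 16384, 1_000_000}) {
            System.out.println(n + " -> " + encode(n).length
                    + " byte(s), decodes to " + decode(encode(n)));
        }
    }
}
```

Values below 128 take one byte and values below 16384 take two, which is
where the savings over a fixed 4-byte int would come from for typical
lengths and counts.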


Re: Lucene's VInt for lengths/counts/sizes

Stefan Groschupf-2
+1 :-)

On 31.01.2006 at 22:06, Andrzej Bialecki wrote:

> Hi,
>
> I wonder, would it be a good idea to replace the (rather wasteful)  
> 4-byte ints with Lucene's variable-byte int encoding, in all places  
> where size matters? We could "borrow" the code from Lucene and  
> create a VIntWritable for this purpose. I'm thinking specifically  
> about the following places:
>
> * UTF8 (2-byte string length)
>
> * ArrayWritable/BytesWritable/TwoDArrayWritable (4-byte length)
>
> * Properties and derived maps (like ContentProperties): all lengths  
> are written as 4-byte ints.
>
> * any Writable that consists of lists of values is currently  
> serialized using 4-byte ints for the size of list, e.g.  
> ParseData.outlinks
>
> Overall I think the size savings could be considerable, at the cost  
> of some CPU.

Re: Lucene's VInt for lengths/counts/sizes

Doug Cutting-2
Andrzej Bialecki wrote:
> I wonder, would it be a good idea to replace the (rather wasteful)
> 4-byte ints with Lucene's variable-byte int encoding, in all places
> where size matters?

I'm not sure there are that many places where it could make a big
difference.

> * UTF8 (2-byte string length)

Currently Nutch uses Java's DataOutput format for UTF8, so this would
mean departing from that format, which is not a bad thing.  But most
strings in Nutch (urls, anchors, etc.) are significantly longer than 4
bytes, so this won't provide a huge savings.

> * ArrayWritable/BytesWritable/TwoDArrayWritable (4-byte length)

Are there particular space-sensitive usages of these?

> Overall I think the size savings could be considerable, at the cost of
> some CPU.

I'd be interested to see what the size savings really amount to.
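
One rough way to get a feel for the numbers is to compare the serialized
size of a record under both schemes. The sketch below is only a
back-of-the-envelope estimate: the class SizeEstimate, its method names,
and the sample URL are made up for illustration, and it counts plain UTF-8
bytes, which matches Java's DataOutput string format only for simple text
such as ASCII.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical estimator for illustration; not part of Nutch.
final class SizeEstimate {

    // Number of bytes a non-negative value occupies in 7-bits-per-byte encoding.
    static int vintSize(int value) {
        int size = 1;
        while ((value & ~0x7F) != 0) {
            value >>>= 7;
            size++;
        }
        return size;
    }

    // String with a fixed 2-byte length prefix (the current UTF8 layout).
    static int stringBytesFixed(String s) {
        return 2 + s.getBytes(StandardCharsets.UTF_8).length;
    }

    // String with a variable-byte length prefix (the proposed layout).
    static int stringBytesVInt(String s) {
        int n = s.getBytes(StandardCharsets.UTF_8).length;
        return vintSize(n) + n;
    }

    public static void main(String[] args) {
        String url = "http://www.example.com/index.html"; // made-up sample
        int fixed = stringBytesFixed(url);
        int vint = stringBytesVInt(url);
        System.out.println("fixed prefix: " + fixed + " bytes");
        System.out.println("vint prefix:  " + vint + " bytes");
        System.out.printf("saved: %.1f%%%n", 100.0 * (fixed - vint) / fixed);

        // A list count stored as a 4-byte int versus a VInt.
        int outlinkCount = 10; // made-up sample
        System.out.println("count field: 4 bytes vs " + vintSize(outlinkCount));
    }
}
```

For a string the prefix shrinks by at most one byte, so the relative gain
falls as strings get longer. For list sizes the gain is up to three bytes
per count field, which only matters where such records are small and
numerous.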

More substantial savings might be had if we developed a version of
MapFile which writes keys as differences from the previous key.  That
could make, e.g., all of the url-keyed files smaller.
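
One possible reading of "keys as differences from the previous key" is
prefix (front) coding: since the keys are sorted, each key stores only the
number of leading characters it shares with its predecessor plus the
remaining suffix. The sketch below assumes that interpretation. The class
FrontCodingSketch and its Entry type are hypothetical, and it works on
in-memory strings rather than on the actual MapFile byte layout.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of prefix-compressed keys; not MapFile code.
final class FrontCodingSketch {

    // One encoded key: how many leading chars to reuse, plus the new tail.
    static final class Entry {
        final int shared;
        final String suffix;

        Entry(int shared, String suffix) {
            this.shared = shared;
            this.suffix = suffix;
        }
    }

    // Encodes keys that are already in sorted order.
    static List<Entry> encode(List<String> sortedKeys) {
        List<Entry> entries = new ArrayList<>();
        String previous = "";
        for (String key : sortedKeys) {
            int limit = Math.min(previous.length(), key.length());
            int shared = 0;
            while (shared < limit && previous.charAt(shared) == key.charAt(shared)) {
                shared++;
            }
            entries.add(new Entry(shared, key.substring(shared)));
            previous = key;
        }
        return entries;
    }

    // Rebuilds the original keys by replaying the entries in order.
    static List<String> decode(List<Entry> entries) {
        List<String> keys = new ArrayList<>();
        String previous = "";
        for (Entry e : entries) {
            String key = previous.substring(0, e.shared) + e.suffix;
            keys.add(key);
            previous = key;
        }
        return keys;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList(   // made-up sample keys
                "http://a.com/", "http://a.com/x", "http://b.com/");
        for (Entry e : encode(keys)) {
            System.out.println(e.shared + " + \"" + e.suffix + "\"");
        }
    }
}
```

The trade-off is that a key can no longer be read in isolation: decoding
has to start from a key stored in full, so a real format would presumably
write a complete key at regular intervals to keep random access workable.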

Another good way to save space would be to use a faster compression
algorithm in SequenceFile.  The LZO algorithm is many times faster than
the gzip we use now.

Doug