VInt block length in Lucene 4.1 postings format

VInt block length in Lucene 4.1 postings format

Aleksandra Woźniak
Hi all,

recently I wanted to try out some modifications of Lucene's postings format (namely, copying blocks that have no deletions without int-decoding/encoding -- this is similar to what was described here: https://issues.apache.org/jira/browse/LUCENE-2082). I started with changing Lucene 4.1 postings format to check what can be done there.

I came across the following problem: in Lucene41PostingsReader, the length (number of bytes) of the last, vInt-encoded block of postings is not known until all of its individual postings have been read and decoded. When reading this block we only know the number of postings to read and decode, not the number of bytes, since vInts are variable-length by definition.
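To make the problem concrete, here is a minimal, self-contained Java sketch of the variable-length encoding (7 data bits per byte, high bit set on every byte except the last of a value). It is not the Lucene reader code: the class and method names (`VIntDemo`, `encode`, `vIntByteLength`) are made up for illustration, and the real tail block in the Lucene 4.1 format also interleaves frequency information, which is left out here. It only shows that two blocks with the same number of values can occupy different numbers of bytes, so the byte length cannot be derived from the posting count alone.

```java
import java.io.ByteArrayOutputStream;

// Illustrative sketch only; names are hypothetical, not Lucene API.
final class VIntDemo {

    // Appends one non-negative int as a vInt: low-order 7 bits first,
    // high bit set on every byte that is followed by another byte.
    static void writeVInt(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    // Encodes all values back to back and returns the raw bytes.
    static byte[] encode(int[] values) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int v : values) {
            writeVInt(out, v);
        }
        return out.toByteArray();
    }

    // Returns how many bytes `count` consecutive vInts occupy, starting at
    // `offset`. The values are not reconstructed, but every byte still has
    // to be inspected for its continuation bit.
    static int vIntByteLength(byte[] buf, int offset, int count) {
        int pos = offset;
        for (int i = 0; i < count; i++) {
            while ((buf[pos++] & 0x80) != 0) {
                // continuation bit set: the current value has more bytes
            }
        }
        return pos - offset;
    }

    public static void main(String[] args) {
        byte[] small = encode(new int[] {1, 2, 3, 4});
        byte[] mixed = encode(new int[] {1, 127, 128, 16384});
        // Same number of values, different number of bytes.
        System.out.println(small.length + " bytes vs " + mixed.length + " bytes");
        System.out.println(vIntByteLength(mixed, 0, 4) + " bytes scanned");
    }
}
```

Scanning for continuation bits avoids rebuilding the values, but it still touches every byte, so it does not by itself give the block length up front.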

If I want to copy the whole block without vInt decoding/encoding, I need to know how many bytes to read from the postings index input. So, my question is: is there a clean way to determine the length of this block (i.e. the number of bytes it occupies)? Is the number of bytes in a posting list tracked anywhere in the Lucene 4.1 postings format?

Thanks,
Aleksandra

Re: VInt block length in Lucene 4.1 postings format

Han Jiang
Hi Aleksandra,

The PostingsReader uses a skip list to determine the start file
pointer of each block (both FOR-packed and vInt-encoded). This
information is currently maintained by Lucene41SkipReader.

The tricky part is that, for each term, the skip data sits exactly at
the end of the TermFreqs blocks. So, if you fetch the startFP of the
vInt block and know the docTermStartOffset & skipOffset for the
current term, you can calculate what you need.

http://lucene.apache.org/core/4_4_0/core/org/apache/lucene/codecs/lucene41/Lucene41PostingsFormat.html#Frequencies
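The calculation described above can be sketched as plain arithmetic. This is a hedged illustration, not code from Lucene: the class and method names are hypothetical, the parameter names follow the wording of the reply, and it assumes the term actually has skip data and that skipOffset is the distance in bytes from the term's first TermFreqs byte to the start of its skip data, as the linked format description states for SkipFPDelta.

```java
// Illustrative sketch only; names are hypothetical, not Lucene API.
final class VIntBlockLength {

    // docTermStartOffset: file pointer where this term's TermFreqs data begins
    // skipOffset:         length of the term's TermFreqs data, i.e. the distance
    //                     from docTermStartOffset to the term's skip data
    // vIntBlockStartFP:   file pointer where the trailing vInt block begins
    static long vIntBlockLength(long docTermStartOffset,
                                long skipOffset,
                                long vIntBlockStartFP) {
        // The skip data starts right after the last TermFreqs byte.
        long skipDataStartFP = docTermStartOffset + skipOffset;
        if (vIntBlockStartFP < docTermStartOffset
                || vIntBlockStartFP > skipDataStartFP) {
            throw new IllegalArgumentException(
                "vInt block start lies outside this term's TermFreqs data");
        }
        // Everything between the block start and the skip data is the vInt block.
        return skipDataStartFP - vIntBlockStartFP;
    }

    public static void main(String[] args) {
        // Made-up numbers: TermFreqs data occupies [1000, 1400),
        // and the vInt block starts at 1350.
        System.out.println(vIntBlockLength(1000L, 400L, 1350L));
    }
}
```

With these made-up numbers the vInt block spans 50 bytes. A term too short to have skip data has no skipOffset to use, so that case would need to be handled separately.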

On Thu, Aug 1, 2013 at 4:20 PM, Aleksandra Woźniak
<[hidden email]> wrote:

> Hi all,
>
> recently I wanted to try out some modifications of Lucene's postings format
> (namely, copying blocks that have no deletions without int-decoding/encoding
> -- this is similar to what was described here:
> https://issues.apache.org/jira/browse/LUCENE-2082). I started with changing
> Lucene 4.1 postings format to check what can be done there.
>
> I came across the following problem: in Lucene41PostingsReader the length
> (number of bytes) of the last, vInt-encoded, block of postings is not known
> before all individual postings are read and decoded. When reading this block
> we only know the number of postings that should be read and decoded -- since
> vInts have different sizes by definition.
>
> If I want to copy the whole block without vInt decoding/encoding, I need
> to know how many bytes I have to read from postings index input. So, my
> question is: is there a clean way to determine the length of this block (i.e.
> the number of bytes that this block has)? Is the number of bytes in a
> posting list tracked somewhere in Lucene 4.1 postings format?
>
> Thanks,
> Aleksandra



--
Han Jiang

Team of Search Engine and Web Mining,
School of Electronic Engineering and Computer Science,
Peking University, China
