Renaud Delbru

I am currently benchmarking various compression algorithms using the Sep
Codec, but I am getting an index corruption exception during the merge
process, and I would appreciate your help in debugging it.

I have reimplemented various algorithms (FOR, Simple9, VInt, PFor)
for the Sep IntBlock Codec, and I am now benchmarking them on the Wikipedia
dataset. With some algorithms (FOR, Simple9, etc.) I do not encounter any
problems. With the PFor algorithms, however, I get a CorruptIndexException
during the merge process (in SepPostingsWriterImpl#startDoc),
because documents are out of order:

Exception in thread "Lucene Merge Thread #0"
org.apache.lucene.index.CorruptIndexException: docs out of order (153 <=
153 )
Caused by: org.apache.lucene.index.CorruptIndexException: docs out of
order (153 <= 153 )
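
To make the failure concrete, here is a minimal sketch of the kind of
ordering check I understand to be behind this message. It is not the
actual Lucene code: the class DocOrderCheck and its methods are
hypothetical, and I am assuming that doc IDs are stored as deltas, so
that a delta decoding to 0 would give exactly the "153 <= 153" case.

```java
class DocOrderCheck {

    /** Rebuilds absolute doc IDs from deltas (assumption: the first delta is relative to 0). */
    static int[] decodeDeltas(int[] deltas) {
        int[] docIds = new int[deltas.length];
        int current = 0;
        for (int i = 0; i < deltas.length; i++) {
            current += deltas[i];
            docIds[i] = current;
        }
        return docIds;
    }

    /** Returns the index of the first doc ID not strictly greater than its predecessor, or -1. */
    static int firstOutOfOrder(int[] docIds) {
        for (int i = 1; i < docIds.length; i++) {
            if (docIds[i] <= docIds[i - 1]) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Well-formed deltas: every delta after the first is >= 1.
        int[] good = decodeDeltas(new int[] {150, 3, 1, 4});
        // A delta that wrongly decodes to 0 repeats the previous doc ID.
        int[] bad = decodeDeltas(new int[] {150, 3, 0, 4});

        System.out.println("good: first violation at index " + firstOutOfOrder(good));
        int i = firstOutOfOrder(bad);
        System.out.println("bad: docs out of order (" + bad[i] + " <= " + bad[i - 1] + " )");
    }
}
```

Under that assumption, a single wrongly decoded value in a block would
be enough to trigger the exception, even if the rest of the block is
intact.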

However, this happens only when I index the Wikipedia
dataset using the PFor algorithm. I have tried to reproduce the error
in a unit test that creates random documents and performs a merge, but
in that case the error does not appear.

After some debugging, I noticed that the document ID at the end of a
segment is the same as (or lower than) the document ID of the next
segment to be merged. However, even with Codec.DEBUG=true, I am
unable to determine which segments are faulty, or which terms inside
these segments are affected. Could you point me to an easy way to get
this information, so that I can check these segments and their encoded
blocks in order to find and understand the problem?

Thanks in advance,
Renaud Delbru

