Flex API - Debugging Segment Merge

classic Classic list List threaded Threaded
1 message Options
Reply | Threaded
Open this post in threaded view
|

Flex API - Debugging Segment Merge

Renaud Delbru
Hi,

I am currently benchmarking various compression algorithms using the Sep
Codec, but I got index corruption exception during the merge process,
and I would need your help to debug it.

I have reimplemented various algorithms like FOR, Simple9, VInt, PFor
for the Sep IntBlock Codec. I am benchmarking them now on the wikipedia
dataset. For some algorithms, FOR, Simple9, etc., I don't encounter
problems. But using the PFor algorithms, I get a CorruptedIndex
exception during the merge process (in SepPostingsWriterImpl#startDoc),
because document are out of order:

Exception in thread "Lucene Merge Thread #0"
org.apache.lucene.index.MergePolicy$MergeException:
org.apache.lucene.index.CorruptIndexException: docs out of order (153 <=
153 )
         at
org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:471)
         at
org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:435)
Caused by: org.apache.lucene.index.CorruptIndexException: docs out of
order (153 <= 153 )
         at
org.apache.lucene.index.codecs.sep.SepPostingsWriterImpl.startDoc(SepPostingsWriterImpl.java:177)

However, this is happening only when I tried to index the wikipedia
dataset using the PFor algorithm. I have tried to recreate the error
using a unit test, creating random document, and performing a merge, but
in this case the error does not appear.

After some debug, I have noticed that the document id at the end of a
segment is the same than (or inferior to) the document id of the next
segment to be merged. However, even by activating Codec.DEBUG=true, I am
unable to know which are the faulty segments, and the faulty terms
inside these segments. Could you indicate me a easy way to get this
information, so I will be able to check these segments and their encoded
blocks in order to find and understand the problem ?

Thanks in advance,
--
Renaud Delbru

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]