Data Integrity Rules


In an earlier article, Doug Cutting described a method
to verify a segment's integrity by simply merging the segment
to a NullDirectory.

We have found several instances where the segment corrupts
even if it passes the NullDirectory test. Merging with a
NullDirectory only protects us against disk errors.  There are
structural errors that make a segment corrupt after a
few iterations of merges.

I would like to define a simple rule:

"A segment has data integrity if and only if
the segment is readable and successively mergeable
without any errors."
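The rule above can be checked mechanically by merging a segment with itself several times and verifying it stays readable. A minimal sketch, assuming a hypothetical `Segment` abstraction standing in for a real Lucene segment (`isReadable` and `mergeWith` are illustrative names, not actual Lucene API):

```java
// Hypothetical stand-in for a Lucene segment, for illustration only.
interface Segment {
    boolean isReadable();
    Segment mergeWith(Segment other);   // throws on structural corruption
}

public class IntegrityCheck {
    // Returns true only if the segment stays readable through
    // `rounds` successive self-merges -- the rule proposed above.
    static boolean hasIntegrity(Segment s, int rounds) {
        try {
            for (int i = 0; i < rounds; i++) {
                if (!s.isReadable()) return false;
                s = s.mergeWith(s);     // structural errors surface here
            }
            return s.isReadable();
        } catch (RuntimeException e) {  // e.g. "term out of order"
            return false;
        }
    }
}
```

The point of iterating is that some structural errors only surface after a few merge generations, so a single merge pass is not a sufficient test.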

For example, in the current version, you can add an empty string
into the DocumentWriter.  This is not a problem so
long as the segment is readable and successively mergeable. But, after a
few merge iterations, the merge fails with a
"term out of order" exception in TermInfosWriter.  Now you have an
inoperable search engine. Granted, the tokenizer is at fault, but a
simple issue like that should not bring the search engine down.
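The "term out of order" failure arises because the term dictionary writer requires terms in strictly increasing order, and an empty term sorts before every non-empty term, so it violates that invariant once merging reorders the streams. A minimal sketch of the ordering check (names are illustrative, not the actual TermInfosWriter code):

```java
public class TermOrderCheck {
    // Sketch of the strictly-increasing term invariant that the
    // term dictionary writer enforces during a merge.
    static void addTerm(String lastTerm, String newTerm) {
        if (newTerm.compareTo(lastTerm) <= 0) {
            throw new IllegalStateException("term out of order: \""
                + newTerm + "\" after \"" + lastTerm + "\"");
        }
    }

    public static void main(String[] args) {
        addTerm("apple", "banana");  // fine: strictly increasing
        addTerm("banana", "");       // empty term sorts first: throws
    }
}
```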

Similarly, we have found instances of term postings with zero
frequency (not sure how they got into that state) and with document ids
greater than the max doc of the segment. See the earlier posting or
Bug (a).
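Both conditions can be screened with a simple posting-level validity check. A minimal sketch, assuming valid document ids lie below the segment's maxDoc (names here are illustrative, not the actual DocumentWriter code):

```java
public class PostingCheck {
    // A posting is structurally valid only if its frequency is positive
    // and every document id falls inside the segment's bounds.
    static boolean isValidPosting(int freq, int[] docIds, int maxDoc) {
        if (freq <= 0) {
            return false;              // zero-frequency posting: corrupt
        }
        for (int id : docIds) {
            if (id >= maxDoc) {
                return false;          // doc id beyond the segment's maxDoc
            }
        }
        return true;
    }
}
```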

Therefore I suggest a few more checks in DocumentWriter, right after
line "283":

       if (posting.term.text.length() == 0) {
         continue;                    // skip empty terms entirely
       }

       // add an entry to the freq file
       int postingFreq = posting.freq;
       if (postingFreq <= 0) {
         continue;                    // skip corrupt zero-frequency postings
       }

Also, please apply the changes to SegmentMerger as suggested in bug 23650.

I also think we should create test cases that keep the segments robust and
not derailed by edge cases.


