[jira] Commented: (LUCENE-140) docs out of order

[jira] Commented: (LUCENE-140) docs out of order

JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/LUCENE-140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12463176 ]

Doron Cohen commented on LUCENE-140:
------------------------------------

Amazed by this long-lasting bug report, I went down similar routes to Mike, and I noticed 3 things:

(1) The sequence of operations described by Jason is wrong:
 -a- Open an IndexReader (#1) over an existing index (this reader is used for searching while updating the index)
 -b- Using this reader (#1) do a search for the document(s) that you would like to update; obtain their document ID numbers
 -c- Create an IndexWriter and add several new documents to the index (for me, this writing is done in other threads) (*)
 -d- Close the IndexWriter (*)
 -e- Open another IndexReader (#2) over the index
 -f- Delete the previously found documents by their document ID numbers using reader #2
 -g- Close the #2 reader
 -h- Create another IndexWriter (#2) and re-add the updated documents
 -i- Close the IndexWriter #2
 -j- Close the original IndexReader (#1) and open a new reader for general searching

The problem here is that the docIDs found in (b) may be changed by step (d) (merging segments that contain deletions renumbers documents), so step (f) would delete the wrong docs. In particular, it might attempt to delete IDs that are out of range. This might expose exactly the BitVector problem, and would explain the whole thing, but I too cannot see how it explains the delete-by-term case.
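
To make the stale-ID hazard concrete, here is a minimal sketch in plain Java. It is a toy model, not Lucene code: ToyIndex, compact() and the document keys are hypothetical names, and "doc ID" is simply a position in a list that gets renumbered when deleted entries are dropped.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model (not Lucene code): a doc ID is just a position in a list. */
final class StaleDocIdDemo {

    /** Hypothetical stand-in for an index; each entry is a document key. */
    static final class ToyIndex {
        private final List<String> docs = new ArrayList<>();
        private final List<Boolean> deleted = new ArrayList<>();

        void add(String key) {
            docs.add(key);
            deleted.add(false);
        }

        /** Returns the current ID (position) of a live document, or -1. */
        int find(String key) {
            for (int i = 0; i < docs.size(); i++) {
                if (!deleted.get(i) && docs.get(i).equals(key)) {
                    return i;
                }
            }
            return -1;
        }

        /** Marks a position as deleted; throws if the ID is out of range. */
        void deleteById(int id) {
            deleted.set(id, true);
        }

        /** Models a merge: drops deleted docs and renumbers the survivors. */
        void compact() {
            List<String> kept = new ArrayList<>();
            for (int i = 0; i < docs.size(); i++) {
                if (!deleted.get(i)) {
                    kept.add(docs.get(i));
                }
            }
            docs.clear();
            deleted.clear();
            for (String key : kept) {
                add(key);
            }
        }
    }

    /** Replays steps (b), (c)-(d) and (f); returns the key that was lost. */
    static String victimOfStaleDelete() {
        ToyIndex index = new ToyIndex();
        index.add("a");
        index.add("b");
        index.add("c");
        index.deleteById(0);             // an earlier delete leaves a hole at ID 0

        int staleId = index.find("b");   // step (b): "b" currently has ID 1
        index.add("d");                  // step (c): new documents arrive
        index.compact();                 // step (d): renumbering, "b" moves to ID 0
        index.deleteById(staleId);       // step (f): ID 1 now belongs to "c"

        if (index.find("b") < 0) {
            return "b";
        }
        return index.find("c") < 0 ? "c" : "none";
    }

    public static void main(String[] args) {
        System.out.println("document lost to the stale ID: " + victimOfStaleDelete());
    }
}
```

In this model the delete meant for "b" removes "c" instead. Looking the document up again by a stable key just before deleting it avoids the problem, which is why a failure in the delete-by-term case would need a different explanation.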

(2) BitVector silently ignores attempts to delete slightly-out-of-bounds docs that fall in the highest byte - this is the problem that Mike fixed. I think the fix is okay - though some applications might now get exceptions they did not get in the past - but I believe this is for their own good.
However, when I first ran into this I didn't notice that BitVector.size() would become wrong as a result of this - nice catch, Mike!
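
As an illustration of how such a silent acceptance can happen, here is a simplified bit vector written for this note. It is an assumption-laden sketch, not Lucene's actual BitVector source: the class and method names are hypothetical, and the only property it relies on is a byte array rounded up past the logical size with no range check in set().

```java
/** Simplified bit vector without a range check; an illustration, not Lucene's source. */
final class UncheckedBitVectorDemo {

    static final class ToyBitVector {
        private final byte[] bits;
        private final int size;

        ToyBitVector(int size) {
            this.size = size;
            // Storage is rounded up to whole bytes, so spare bits exist past 'size'.
            this.bits = new byte[(size >> 3) + 1];
        }

        /** Sets a bit; the only protection is the array bounds check of the JVM. */
        void set(int bit) {
            bits[bit >> 3] |= 1 << (bit & 7);
        }

        /** Counts every set bit in the backing array, including spare ones. */
        int count() {
            int total = 0;
            for (byte b : bits) {
                total += Integer.bitCount(b & 0xFF);
            }
            return total;
        }

        int size() {
            return size;
        }
    }

    /** Returns true if set(bit) completes without an exception. */
    static boolean silentlyAccepted(int size, int bit) {
        ToyBitVector v = new ToyBitVector(size);
        try {
            v.set(bit);
            return true;
        } catch (ArrayIndexOutOfBoundsException e) {
            return false;
        }
    }

    /** Number of set bits after marking a single out-of-range position. */
    static int countAfterOutOfRangeSet(int size, int bit) {
        ToyBitVector v = new ToyBitVector(size);
        v.set(bit);
        return v.count();
    }

    public static void main(String[] args) {
        // With size 10 the array holds 16 bits: 12 is out of range but fits.
        System.out.println("bit 12 accepted: " + silentlyAccepted(10, 12));
        // Bit 16 falls in a byte that does not exist, so the JVM rejects it.
        System.out.println("bit 16 accepted: " + silentlyAccepted(10, 16));
        System.out.println("count after set(12): " + countAfterOutOfRangeSet(10, 12));
    }
}
```

In this model a delete of an ID just past the end is recorded as if it were a real deletion, so any document count derived from the number of set bits is off, while IDs one byte further out fail loudly with ArrayIndexOutOfBoundsException.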

I think, however, that the test Mike added does not expose the docs-out-of-order bug: I tried this test without the fix and it only fails on the "gotException" assert. If you comment out this assert, the test passes.

The following test would expose the out-of-order bug: it fails with "docs out of order" before the fix, and succeeds with the fix applied.

  // Assumes the enclosing TestCase has an Analyzer field named anlzr and
  // imports org.apache.lucene.document.Field.Store and Field.Index.
  public void testOutOfOrder() throws IOException {
    String tempDir = System.getProperty("java.io.tmpdir");
    if (tempDir == null) {
      throw new IOException("java.io.tmpdir undefined, cannot run test: " + getName());
    }

    File indexDir = new File(tempDir, "lucenetestindexTemp");
    Directory dir = FSDirectory.getDirectory(indexDir, true);

    boolean create = true;
    int numDocs = 0;
    int maxDoc = 0;
    while (numDocs < 100) {
      // Add two docs, then optimize so the index is a single segment
      // with no pending deletions.
      IndexWriter iw = new IndexWriter(dir, anlzr, create);
      create = false;
      iw.setUseCompoundFile(false);
      for (int i = 0; i < 2; i++) {
        Document d = new Document();
        d.add(new Field("body", "body" + i, Store.NO, Index.UN_TOKENIZED));
        iw.addDocument(d);
      }
      iw.optimize();
      iw.close();

      IndexReader ir = IndexReader.open(dir);
      numDocs = ir.numDocs();
      maxDoc = ir.maxDoc();
      assertEquals(numDocs, maxDoc);
      // Try to delete IDs maxDoc+7 down to maxDoc-1. Only maxDoc-1 is a
      // valid document; every ID >= maxDoc is out of range.
      for (int i = 7; i >= -1; i--) {
        try {
          ir.deleteDocument(maxDoc + i);
        } catch (ArrayIndexOutOfBoundsException e) {
          // out-of-range delete was rejected; ignore and keep going
        }
      }
      ir.close();
    }
  }

Mike, do you agree?

(3) The maxDoc() computation in SegmentReader is based (on some paths) on RandomAccessFile.length(). IIRC I saw cases (in a previous project) where File.length() or RAF.length() (not sure which of the two) did not always reflect the real length when the system was very busy I/O-wise, unless FD.sync() was called (with a performance hit).

This post seems relevant - RAF.length over 2GB in NFS - http://forum.java.sun.com/thread.jspa?threadID=708670&messageID=4103657 

Not sure if this can be the case here but at least we can discuss whether it is better to always store the length.
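
To give the "always store the length" idea something concrete to discuss, here is a small sketch in plain Java. It is not Lucene's file format: the 4-byte header, the 8-byte records and all class and method names are assumptions made up for the example. It writes a document count explicitly and compares it with the count derived from RandomAccessFile.length().

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;

/** Toy file layout (an assumption, not Lucene's): int doc count, then 8 bytes per doc. */
final class StoredCountDemo {
    static final int HEADER_BYTES = 4;
    static final int RECORD_BYTES = 8;

    /** Writes a header holding the doc count, followed by one record per doc. */
    static File writeToyFile(int docCount) {
        try {
            File file = File.createTempFile("toy-index", ".bin");
            file.deleteOnExit();
            try (RandomAccessFile raf = new RandomAccessFile(file, "rw")) {
                raf.writeInt(docCount);
                for (int i = 0; i < docCount; i++) {
                    raf.writeLong(i);
                }
            }
            return file;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Derives the count from the file length, trusting what the OS reports. */
    static int countFromLength(File file) {
        try (RandomAccessFile raf = new RandomAccessFile(file, "r")) {
            return (int) ((raf.length() - HEADER_BYTES) / RECORD_BYTES);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads the count that the writer stored explicitly. */
    static int countFromHeader(File file) {
        try (RandomAccessFile raf = new RandomAccessFile(file, "r")) {
            return raf.readInt();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        File file = writeToyFile(5);
        System.out.println("from length: " + countFromLength(file));
        System.out.println("from header: " + countFromHeader(file));
    }
}
```

On a healthy local file system the two values agree. The stored count only pays off if length() can really misreport, in which case a reader could also compare the two and fail early on a mismatch.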




> docs out of order
> -----------------
>
>                 Key: LUCENE-140
>                 URL: https://issues.apache.org/jira/browse/LUCENE-140
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Index
>    Affects Versions: unspecified
>         Environment: Operating System: Linux
> Platform: PC
>            Reporter: legez
>         Assigned To: Michael McCandless
>         Attachments: bug23650.txt, corrupted.part1.rar, corrupted.part2.rar
>
>
> Hello,
>   I can not find out, why (and what) it is happening all the time. I got an
> exception:
> java.lang.IllegalStateException: docs out of order
>         at
> org.apache.lucene.index.SegmentMerger.appendPostings(SegmentMerger.java:219)
>         at
> org.apache.lucene.index.SegmentMerger.mergeTermInfo(SegmentMerger.java:191)
>         at
> org.apache.lucene.index.SegmentMerger.mergeTermInfos(SegmentMerger.java:172)
>         at org.apache.lucene.index.SegmentMerger.mergeTerms(SegmentMerger.java:135)
>         at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:88)
>         at org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java:341)
>         at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:250)
>         at Optimize.main(Optimize.java:29)
> It happens either in 1.2 and 1.3rc1 (anyway what happened to it? I can not find
> it neither in download nor in version list in this form). Everything seems OK. I
> can search through index, but I can not optimize it. Even worse after this
> exception every time I add new documents and close IndexWriter new segments is
> created! I think it has all documents added before, because of its size.
> My index is quite big: 500.000 docs, about 5gb of index directory.
> It is _repeatable_. I drop index, reindex everything. Afterwards I add a few
> docs, try to optimize and receive above exception.
> My documents' structure is:
>   static Document indexIt(String id_strony, Reader reader, String data_wydania,
> String id_wydania, String id_gazety, String data_wstawienia)
> {
>     Document doc = new Document();
>     doc.add(Field.Keyword("id", id_strony ));
>     doc.add(Field.Keyword("data_wydania", data_wydania));
>     doc.add(Field.Keyword("id_wydania", id_wydania));
>     doc.add(Field.Text("id_gazety", id_gazety));
>     doc.add(Field.Keyword("data_wstawienia", data_wstawienia));
>     doc.add(Field.Text("tresc", reader));
>     return doc;
> }
> Sincerely,
> legez

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Re: [jira] Commented: (LUCENE-140) docs out of order

Robert Engels
The Java discussion that is cited is not valid (at least in terms of  
the test case provided).

The javadoc for RandomAccessFile.seek() states:

/**
 * Sets the file-pointer offset, measured from the beginning of this
 * file, at which the next read or write occurs.  The offset may be
 * set beyond the end of the file. Setting the offset beyond the end
 * of the file does not change the file length.  The file length will
 * change only by writing after the offset has been set beyond the end
 * of the file.
 */

so the seeking does not affect the file length, meaning that all of  
the lengths should be 0.

But since both of these methods are native, there is a real
possibility that some JVM or OS combination is not adhering to the
specification.
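
The documented behaviour is easy to check on a given JVM and OS. The following is a minimal sketch (the class and method names are made up for this note) that seeks past the end of an empty temporary file and reports length() before the seek, after the seek, and after a one-byte write.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;

/** Checks that seek() past EOF leaves length() alone until something is written. */
final class SeekLengthDemo {

    /** Returns { initial length, length after seek(1000), length after one write }. */
    static long[] observedLengths() {
        try {
            File file = File.createTempFile("seek-demo", ".bin");
            file.deleteOnExit();
            try (RandomAccessFile raf = new RandomAccessFile(file, "rw")) {
                long initial = raf.length();
                raf.seek(1000);                 // move the pointer beyond the end
                long afterSeek = raf.length();
                raf.write(1);                   // one byte written at offset 1000
                long afterWrite = raf.length();
                return new long[] { initial, afterSeek, afterWrite };
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        long[] lengths = observedLengths();
        System.out.println("initial=" + lengths[0]
                + " afterSeek=" + lengths[1]
                + " afterWrite=" + lengths[2]);
    }
}
```

Per the specification the first two values should be 0 and the third 1001. A platform that reports anything else would be the kind of non-conforming JVM or OS combination mentioned above.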


(Reply to Doron Cohen's JIRA comment of Jan 8, 2007, 7:27 PM, which appears in full above.)

Re: [jira] Commented: (LUCENE-140) docs out of order

Doron Cohen
robert engels <[hidden email]> wrote on 08/01/2007 17:39:45:

> The Java discussion that is cited is not valid (at least in terms of
> the test case provided).

... right ... that discussion really is irrelevant - I read it too quickly,
sorry about that.

>
> The javadoc for RandomAccessFile states:
>
> /**
>  * Sets the file-pointer offset, measured from the beginning of this
>  * file, at which the next read or write occurs.  The offset may be
>  * set beyond the end of the file. Setting the offset beyond the end
>  * of the file does not change the file length.  The file length will
>  * change only by writing after the offset has been set beyond the end
>  * of the file.
>  */
>
> so the seeking does not affect the file length, meaning that all of
> the lengths should be 0.
>
> But since both of these methods are native, there is the real
> possibility that some JVM or OS combination is not adhering to the
> specification.

Actually I don't see any use of RandomAccessFile.setLength() in Lucene so
this is not an issue.

...I think I now remember what it was - File.length() -
http://forum.java.sun.com/thread.jspa?forumID=31&threadID=262446 That was
a while ago - JRE 1.3 - and I saw this with JRE 1.4 too, but not yet with 1.5.
But - again - it is RandomAccessFile.length() that is used in Lucene, not
File.length()... So I am not sure what the conclusion should be.


Re: [jira] Commented: (LUCENE-140) docs out of order

Robert Engels
I just meant that the discussion (on the Java board) included an  
incorrect testcase, since the code should not have worked according  
to the specification.

It may still be relevant; I was just pointing out that it is suspect.

Unless the bug has been posted & verified in the Java Bug Database, I  
find most "user reports" to not be valid, and usually their "bug" is  
caused by incorrect code.


(Reply to Doron Cohen's message of Jan 8, 2007, 11:10 PM, which appears in full above.)

