[jira] Commented: (LUCENE-140) docs out of order

JIRA <jira@apache.org>

    [ https://issues.apache.org/jira/browse/LUCENE-140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12463524 ]

Michael McCandless commented on LUCENE-140:
-------------------------------------------

OK, from that indexing-failure.log (thanks, Jed!) I can see that
there are indeed segments whose maxDoc() is much smaller than
deletedDocs.count().  This then leads to negative doc numbers when
merging these segments.

Jed, when you say "there are old files (_*.cfs & _*.del) in this
directory with updated timestamps that are months old", what do you
mean by "with updated timestamps"?  Which timestamp is months old,
and which one is updated?

OK, assuming, Jed, that you are indeed passing create=false when
creating the Directory and then handing that Directory to an
IndexWriter with create=true, I think we now have this case fully
explained (thanks, Doron): your old _*.del files are being
incorrectly opened and re-used by Lucene when they should not be.

Lucene (all released versions, but not the trunk version; see below)
does a simple fileExists("_XXX.del") call to determine whether a
segment _XXX has deletions.
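
A minimal sketch of that heuristic, paraphrased (the helper name is
mine, not the actual Lucene source):

    // Pre-lockless Lucene decides "does segment _XXX have deletions?"
    // purely by probing the filesystem, so a leftover _XXX.del from an
    // earlier index passes the test too.
    static boolean segmentHasDeletions(org.apache.lucene.store.Directory dir,
                                       String segmentName)
        throws java.io.IOException {
      return dir.fileExists(segmentName + ".del");
    }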

But when that _XXX.del is left over from a previous index, it very
likely doesn't "match" the newly created _XXX segment.  (Especially
if the merge factor has changed, but also if the order of operations
has changed, which I would expect in this use case.)

If that file exists, Lucene assumes it belongs to this segment, so it
opens and uses it.  If that _XXX.del file happens to contain more
documents than the newly created _XXX.cfs segment, negative doc
numbers result (and later cause the "docs out of order" exception).
If it happens to contain fewer documents, you'll hit
ArrayIndexOutOfBoundsException in calls to isDeleted(...).  If the
counts are exactly equal, some of your documents will randomly appear
deleted.
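
To make the first failure mode concrete, here is the arithmetic with
made-up counts (the numbers are purely illustrative):

    int maxDoc   = 10;                  // docs in the newly created _XXX.cfs
    int delCount = 25;                  // bits set in the leftover _XXX.del
    int numDocs  = maxDoc - delCount;   // -15: doc renumbering during the
                                        // merge underflows and goes negative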

Note that the trunk version of Lucene has already fixed this bug (as
part of lockless commits):

  * Whether a segment has deletions or not is now explicitly stored
    in the segments file rather than inferred from a "fileExists(...)"
    call.  So, if an old _XXX.del existed in the filesystem, the newly
    created _XXX segment would not open it.

  * Furthermore, the trunk version of Lucene uses a new
    IndexFileDeleter class to remove any unreferenced index files.
    This means it would have removed these old _*.cfs and _*.del files
    even in the case where the Directory was created with create=false
    and the IndexWriter was created with create=true.
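
A hedged sketch of the first change (the field and method names here
are illustrative; the real trunk SegmentInfo differs in detail):

    class SegmentInfoSketch {
      String name;       // e.g. "_xxx"
      int docCount;      // number of docs in the segment
      long delGen;       // -1 means "this segment has no deletions"

      boolean hasDeletions() {
        // Decided from what the segments file recorded, never from a
        // filesystem probe, so stray .del files are simply ignored.
        return delGen != -1;
      }
    }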

To summarize:

  * There was one case where, if you passed slightly illegal doc
    numbers (within 7 of the actual maxDoc), Lucene might silently
    accept the call but corrupt your index; the corruption would only
    surface later as a "docs out of order" IllegalStateException when
    the segment was merged.  This was just a missing boundary-case
    check, now fixed in the trunk (you get an
    ArrayIndexOutOfBoundsException if the doc number is too large).

  * There is also another case, which only happens if you have old
    _*.del files left over from a previous index while re-creating a
    new index.

    The workaround here is simple: always open the Directory with
    create=true, or remove the directory contents yourself beforehand.
    (IndexWriter does this for you if you give it a String or File
    with create=true.)  A sketch follows this list.

    This is really a bug in Lucene, but given that it's already fixed
    in the trunk, and the workaround is simple, I'm inclined not to
    fix it in prior releases and instead publicize the issue (I will
    do so on java-user).

    But, I will commit two additional IllegalStateException checks to
    the trunk, run when a segment is first initialized: 1) check that
    the two different sources of "maxDoc" (fieldsReader.size() and
    si.docCount) agree, and 2) check that the number of pending
    deletions does not exceed maxDoc().  When an index has an
    inconsistency, I think the earlier it's detected the better.  Both
    checks are sketched after this list.
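
A sketch of the workaround (the index path is illustrative; the
classes are the usual org.apache.lucene ones from that era):

    // Opening the FSDirectory with create=true wipes the old files, so
    // no stale _*.del can be picked up by the freshly created segments.
    Directory dir = FSDirectory.getDirectory("/path/to/index", true);
    IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true);
    // ... add documents ...
    writer.close();

And a paraphrased sketch of the two new initialization checks (the
variable names follow the comment above; this is not the exact trunk
diff):

    // 1) The two independent sources of "maxDoc" must agree.
    if (fieldsReader.size() != si.docCount) {
      throw new IllegalStateException(
          "doc counts differ for segment " + si.name + ": fieldsReader shows "
          + fieldsReader.size() + " but segmentInfo shows " + si.docCount);
    }
    // 2) Pending deletions may never exceed the segment's maxDoc().
    if (deletedDocs != null && deletedDocs.count() > maxDoc()) {
      throw new IllegalStateException(
          "number of pending deletions (" + deletedDocs.count()
          + ") exceeds maxDoc() (" + maxDoc() + ") for segment " + si.name);
    }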


> docs out of order
> -----------------
>
>                 Key: LUCENE-140
>                 URL: https://issues.apache.org/jira/browse/LUCENE-140
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Index
>    Affects Versions: unspecified
>         Environment: Operating System: Linux
> Platform: PC
>            Reporter: legez
>         Assigned To: Michael McCandless
>         Attachments: bug23650.txt, corrupted.part1.rar, corrupted.part2.rar, indexing-failure.log, LUCENE-140-2007-01-09-instrumentation.patch
>
>
> Hello,
>   I cannot figure out why (or what) is happening all the time.  I get
> this exception:
> java.lang.IllegalStateException: docs out of order
>         at org.apache.lucene.index.SegmentMerger.appendPostings(SegmentMerger.java:219)
>         at org.apache.lucene.index.SegmentMerger.mergeTermInfo(SegmentMerger.java:191)
>         at org.apache.lucene.index.SegmentMerger.mergeTermInfos(SegmentMerger.java:172)
>         at org.apache.lucene.index.SegmentMerger.mergeTerms(SegmentMerger.java:135)
>         at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:88)
>         at org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java:341)
>         at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:250)
>         at Optimize.main(Optimize.java:29)
> It happens in both 1.2 and 1.3rc1 (by the way, what happened to that release?
> I can find it neither in the downloads nor in the version list in this form).
> Everything seems OK: I can search the index, but I cannot optimize it.  Even
> worse, after this exception, every time I add new documents and close the
> IndexWriter a new segment is created!  I think it contains all the documents
> added before, judging by its size.
> My index is quite big: 500,000 docs, about 5 GB in the index directory.
> It is _repeatable_: I drop the index and reindex everything.  Afterwards I add
> a few docs, try to optimize, and receive the above exception.
> My documents' structure is:
>   static Document indexIt(String id_strony, Reader reader, String data_wydania,
> String id_wydania, String id_gazety, String data_wstawienia)
> {
>     Document doc = new Document();
>     doc.add(Field.Keyword("id", id_strony ));
>     doc.add(Field.Keyword("data_wydania", data_wydania));
>     doc.add(Field.Keyword("id_wydania", id_wydania));
>     doc.add(Field.Text("id_gazety", id_gazety));
>     doc.add(Field.Keyword("data_wstawienia", data_wstawienia));
>     doc.add(Field.Text("tresc", reader));
>     return doc;
> }
> Sincerely,
> legez

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

       
