flushRamSegments() is "over merging"?


Doron Cohen

Hi, I ran into this while reviewing the patch for LUCENE-565.

It appears that closing an index writer with non-empty RAM segments (at
least one doc was added) causes a merge with the last (most recent)
on-disk segment.

This seems problematic for applications that do a lot of interleaving -
adding/removing documents, or even switching indexes - and therefore close
the IndexWriter often.

The test case below demonstrates this behavior: maxBufferedDocs,
maxMergeDocs, and mergeFactor are all assigned very large values, and in a
loop a few documents are added and the IndexWriter is closed and re-opened.

Surprisingly (at least for me), the number of segments on disk remains 1.
In other words, each time the IndexWriter is closed, the single disk
segment is merged with the current RAM segments and re-written as a new
disk segment.

The "blame" is in the second line here:
    if (minSegment < 0 ||                   // add one FS segment?
        (docCount + segmentInfos.info(minSegment).docCount) > mergeFactor
||
        !(segmentInfos.info(segmentInfos.size()-1).dir == ramDirectory))

This code in flushRamSegments() merges the (temporary) RAM segments with
the most recent non-temporary (on-disk) segment.

I can see how this can make sense in some cases. Perhaps an additional
constraint should be added on the ratio of the size of this non-temporary
segment to that of all the temporary segments, or on the difference, or both.
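
Just to sketch the kind of extra guard I have in mind - this is illustrative
only; the helper method and the "ratioLimit" knob below are made up for the
example, they are not part of the actual IndexWriter code:

    // Decide whether the last on-disk segment should join the flush-time
    // merge. Today it effectively always does (when it exists); the idea is
    // to skip it when it dwarfs the buffered ram docs being flushed.
    // "ratioLimit" is a hypothetical knob - mergeFactor could play that role.
    static boolean mergeWithLastDiskSegment(int lastDiskSegmentDocs,
                                            int bufferedRamDocs,
                                            int ratioLimit) {
        return lastDiskSegmentDocs <= ratioLimit * bufferedRamDocs;
    }

    // e.g. mergeWithLastDiskSegment(100000, 2, 10) == false, so the 2
    // buffered docs would be written as their own small segment instead of
    // re-writing the 100000-doc disk segment.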

Here is the test case,
Thanks,
Doron
------------------------------------
package org.apache.lucene.index;

import java.io.IOException;

import junit.framework.TestCase;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.RAMDirectory;

/**
 * Test that the number of segments is as expected.
 * I.e. that there were not too many / too few merges.
 *
 * @author Doron Cohen
 */
public class TestNumSegments extends TestCase {

      protected int nextDocNum = 0;
      protected Directory dir = null;
      protected IndexWriter iw = null;
      protected IndexReader ir = null;

      /* (non-Javadoc)
       * @see junit.framework.TestCase#setUp()
       */
      protected void setUp() throws Exception {
            super.setUp();
            //dir = new RAMDirectory();
            dir = FSDirectory.getDirectory("test.num.segments",true);
            iw = new IndexWriter(dir, new StandardAnalyzer(), true);
            setLimits(iw);
            addSomeDocs(); // some docs in index
      }

      // for now, take these limits out of the "game"
      protected void setLimits(IndexWriter iw) {
            iw.setMaxBufferedDocs(Integer.MAX_VALUE-1);
            iw.setMaxMergeDocs(Integer.MAX_VALUE-1);
            iw.setMergeFactor(Integer.MAX_VALUE-1);
      }

      /* (non-Javadoc)
       * @see junit.framework.TestCase#tearDown()
       */
      protected void tearDown() throws Exception {
            closeW();
            if (dir!=null) {
                  dir.close();
            }
            super.tearDown();
      }

      // count how many segments are in the directory - the index writer must be closed
      protected int countDirSegments() throws IOException {
            assertNull(iw);
            SegmentInfos segmentInfos = new SegmentInfos();
            segmentInfos.read(dir);
            int nSegs = segmentInfos.size();
            segmentInfos.clear();
            return nSegs;
      }

      // open writer
      private void openW() throws IOException {
            iw = new IndexWriter(dir, new StandardAnalyzer(), false);
            setLimits(iw);
      }

      private void closeW() throws IOException {
            if (iw!=null) {
                  iw.close();
                  iw=null;
            }
      }

      public void testNumSegments() throws IOException {
            int numExceptions = 0;
            for (int i=1; i<30; i++) {
                  closeW();
                  try {
                        assertEquals("Oops - wrong number of segments!", i,
countDirSegments());
                  } catch (Throwable t) {
                        numExceptions++;
                        System.err.println(i+":  "+t.getMessage());
                  }
                  openW();
                  addSomeDocs();
            }
            assertEquals("Oops!, so many times numbr of egments was
\"wrong\"",0,numExceptions);
      }

      private void addSomeDocs() throws IOException {
            for (int i=0; i<2; i++) {
                  iw.addDocument(getDoc());
            }
      }

      protected Document getDoc() {
            Document doc = new Document();
            doc.add(new Field("body", new Integer(nextDocNum).toString(),
Field.Store.YES, Field.Index.UN_TOKENIZED));
            doc.add(new Field("all", "x", Field.Store.YES,
Field.Index.UN_TOKENIZED));
            nextDocNum ++;
            return doc;
      }

}




Re: flushRamSegments() is "over merging"?

Yonik Seeley-2
Yes, that's counter-intuitive.... a high merge factor is more likely
to cause a merge with the last disk-based segment.

On the other hand... if you have a high maxBufferedDocs and a normal
mergeFactor (much more likely), you could end up with way too many
segments if you didn't merge.

Hmmm, I'm thinking of another case where you could end up with far too
many segments... if you have a low merge factor and a high
maxBufferedDocs (a common scenario), then if you keep adding docs it
will keep creating separate small segments.

Consider the following settings:
mergeFactor=10
maxBufferedDocs=10000

Now add 11 docs at a time to an existing index, closing in between.
Segment sizes: 100000, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11,
11, 11, ...

It seems like the merge logic somewhere should also take into account
the number of segments at a certain level, not just the number of
documents in those segments.
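
To put rough numbers on that last point - the helper below is only a model of
a purely doc-count based trigger, not the actual maybeMergeSegments code:

    // With a doc-count based trigger, the trailing small segments are only
    // merged once their combined doc count reaches the buffered-docs target.
    static int segmentsBeforeFirstMerge(int docsPerSmallSegment, int maxBufferedDocs) {
        // ceiling division
        return (maxBufferedDocs + docsPerSmallSegment - 1) / docsPerSmallSegment;
    }

    // segmentsBeforeFirstMerge(11, 10000) == 910: roughly 910 eleven-doc
    // segments pile up before anything merges, versus the 10 segments per
    // level you might expect from mergeFactor=10.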

-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search server



Re: flushRamSegments() is "over merging"?

Yonik Seeley-2
Related to merging more often than one would expect, check out my last
comment in this bug:
http://issues.apache.org/jira/browse/LUCENE-388

-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search server




Re: flushRamSegments() is "over merging"?

Doron Cohen
Thanks Yonik, you're right, I got confused with the merge factor.

My (corrected) interpretation of mergeFactor is the fan-out of an imaginary
merge tree - it controls how many segments are merged to create a larger
segment. This way it balances the I/O spent on merging during indexing
against the I/O spent during search.
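
As a toy illustration of that reading (the numbers are just an example, not
taken from any real index): with maxBufferedDocs=B and mergeFactor=M, a
segment at level n of the tree holds roughly B * M^n documents, because M
level-(n-1) segments are merged into one.

    // Toy calculation of the merge-tree view of mergeFactor.
    static long docsAtLevel(long maxBufferedDocs, long mergeFactor, int level) {
        long docs = maxBufferedDocs;
        for (int i = 0; i < level; i++) {
            docs *= mergeFactor;   // each level merges mergeFactor smaller segments
        }
        return docs;
    }

    // e.g. docsAtLevel(100, 10, 0) == 100, docsAtLevel(100, 10, 2) == 10000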

You are saying (in my words :-) that 'over-merging' is not an issue,
because setting a large merge factor means that many documents may be
merged at once, and that you are more worried about too few merges, as in
the 10 vs. 11 example you gave for flushRamSegments(), and as in the
LUCENE-388 discussion you pointed to.

Under-merging would hurt search unless optimize is called explicitly, but
the index should "behave" without requiring the user to call optimize.
LUCENE-388 deals with this.

Over-merging - in the current flushRamSegments() code - would merge at most
mergeFactor documents prematurely. Since mergeFactor is usually not very
large, this might be a minor issue - but still, if an index is growing in
small doses, does it make sense to re-merge with the last disk segment each
time the index is closed? Why not let it simply be controlled by
maybeMergeSegments?

Thanks,
Doron



Re: flushRamSegments() is "over merging"?

Yonik Seeley-2
On 8/16/06, Doron Cohen <[hidden email]> wrote:
> Under-merging would hurt search unless optimize is called explicitly, but
> the index should "behave" without requiring the user to call optimize.
> LUCENE-388 deals with this.

Depends on what you mean by "behave" :-)
More segments than expected can cause failure because of file
descriptor exhaustion.  It's nice to have a calculable cap on the
number of segments. It also depends on exactly what one thinks the
index invariants should be w.r.t. mergeFactor.

> Over-merging - in the current flushRamSegments() code - would merge at most
> mergeFactor documents prematurely.

Right.

> Since mergeFactor is usually not very
> large, this might be a minor issue - but still, if an index is growing in
> small doses, does it make sense to re-merge with the last disk segment each
> time the index is closed? Why not let it simply be controlled by
> maybeMergeSegments?

I personally see mergeFactor as the maximum number of segments at any
level in the index, with level defined by
docsInSegment/maxBufferedDocs.

maybeMergeSegments doesn't enforce this in the presence of partially
filled segments because it counts documents and not segments.  Since
partially filled segments aren't written during a single IndexWriter
session - they only appear when the writer is closed with buffered docs
still in RAM - this only needs to be checked for on close().
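
A rough sketch of what a segment-count check at close() could look like,
under that reading of mergeFactor - the level computation and the helper
below are guesses for illustration, not Lucene code:

    // Flag an index where some level already holds more than mergeFactor
    // segments; a close()-time check could then merge that level.
    static boolean tooManySegmentsAtSomeLevel(int[] segmentDocCounts,
                                              int maxBufferedDocs,
                                              int mergeFactor) {
        int[] perLevel = new int[31];        // segments seen at each level
        for (int i = 0; i < segmentDocCounts.length; i++) {
            int level = 0;
            long size = maxBufferedDocs;
            while (segmentDocCounts[i] > size && level < perLevel.length - 1) {
                size *= mergeFactor;         // level ~ log_mergeFactor(docs / maxBufferedDocs)
                level++;
            }
            if (++perLevel[level] > mergeFactor) {
                return true;                 // too many segments at this level
            }
        }
        return false;
    }

    // With mergeFactor=10, maxBufferedDocs=10000 and segment sizes
    // {100000, 11, 11, ...}, the eleventh 11-doc segment trips the check.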

-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search server
