[jira] Created: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents


[jira] Created: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

Markus Jelsma (Jira)
improve how IndexWriter uses RAM to buffer added documents
----------------------------------------------------------

                 Key: LUCENE-843
                 URL: https://issues.apache.org/jira/browse/LUCENE-843
             Project: Lucene - Java
          Issue Type: Improvement
          Components: Index
    Affects Versions: 2.2
            Reporter: Michael McCandless
         Assigned To: Michael McCandless
            Priority: Minor


I'm working on a new class (MultiDocumentWriter) that writes more than
one document directly into a single Lucene segment, more efficiently
than the current approach.

This only affects the creation of an initial segment from added
documents.  I haven't changed anything after that, eg how segments are
merged.

The basic ideas are:

  * Write stored fields and term vectors directly to disk (don't
    use up RAM for these).

  * Gather posting lists & term infos in RAM, but periodically do
    in-RAM merges.  Once RAM is full, flush buffers to disk (and
    merge them later when it's time to make a real segment).

  * Recycle objects/buffers to reduce time/stress in GC.

  * Other various optimizations.

Some of these changes are similar to how KinoSearch builds a segment.
But, I haven't made any changes to Lucene's file format nor added
requirements for a global fields schema.

So far the only externally visible change is a new method
"setRAMBufferSize" in IndexWriter (and setMaxBufferedDocs is
deprecated) so that it flushes according to RAM usage and not a fixed
number of documents added.
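The effect of flushing by RAM usage rather than by a fixed document count can be sketched without any Lucene code at all; the class below is a toy model (class and field names are illustrative, not part of the patch):

```java
/** Toy model of flush-by-RAM buffering: documents of varying size are
 *  buffered until estimated RAM crosses a threshold, then flushed.
 *  Illustrative only; this is not Lucene's implementation. */
class RamFlushModel {
    private final long ramThresholdBytes;
    private long ramUsed = 0;
    int flushes = 0;

    RamFlushModel(long ramThresholdBytes) {
        this.ramThresholdBytes = ramThresholdBytes;
    }

    void addDocument(long estimatedDocBytes) {
        ramUsed += estimatedDocBytes;        // buffer this doc's postings in RAM
        if (ramUsed >= ramThresholdBytes) {  // flush on RAM used, not doc count
            ramUsed = 0;
            flushes++;
        }
    }
}
```

With this policy, small documents buffer in larger batches and large documents in smaller ones, which is the point of deprecating setMaxBufferedDocs.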


--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


[jira] Updated: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

Markus Jelsma (Jira)

     [ https://issues.apache.org/jira/browse/LUCENE-843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-843:
--------------------------------------

    Attachment: LUCENE-843.patch

I'm attaching a patch with my current state.  NOTE: this is very rough
and very much a work in progress and nowhere near ready to commit!  I
wanted to get it out there sooner rather than later to get feedback,
maybe entice some daring early adopters, iterate, etc.

It passes all unit tests except the disk-full tests.

There are some big issues yet to resolve:

  * Merge policy has problems when you "flush by RAM" (this is true
    even before my patch).  Not sure how to fix yet.

  * Thread safety and thread concurrency aren't there yet.

  * Norms are not flushed (just use up RAM until you close the
    writer).

  * Many other things on my TODO list :)





RE: [jira] Updated: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

steven_parkes
  * Merge policy has problems when you "flush by RAM" (this is true
    even before my patch).  Not sure how to fix yet.

Do you mean where one would be trying to use RAM usage to determine when
to do a flush?



RE: [jira] Updated: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

Michael McCandless-2

"Steven Parkes" <[hidden email]> wrote:
>   * Merge policy has problems when you "flush by RAM" (this is true
>     even before my patch).  Not sure how to fix yet.
>
> Do you mean where one would be trying to use RAM usage to determine when
> to do a flush?

Right, if you have your indexer measure RAM usage
(writer.ramSizeInBytes()) after each added doc and flush whenever that
crosses X MB then depending on the value of maxBufferedDocs, you may
over-merge.

EG if you set maxBufferedDocs to say 10000 but then it turns out based
on RAM usage you actually flush every 300 docs then the merge policy
will incorrectly merge a level 1 segment (with 3000 docs) in with the
level 0 segments (with 300 docs).  This is because the merge policy
looks at the current value of maxBufferedDocs to compute the levels
so a 3000 doc segment and a 300 doc segment all look like "level 0".
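To make this concrete, here is a simplified sketch of how a logarithmic policy assigns levels from the current maxBufferedDocs (an illustration of the idea, not Lucene's actual merge-policy code):

```java
/** Simplified level assignment: level n holds segments of up to
 *  maxBufferedDocs * mergeFactor^n docs.  Illustrative only. */
class MergeLevels {
    static int level(int numDocs, int maxBufferedDocs, int mergeFactor) {
        int level = 0;
        long boundary = maxBufferedDocs;
        while (numDocs > boundary) {   // climb levels until the segment fits
            boundary *= mergeFactor;
            level++;
        }
        return level;
    }
}
```

With maxBufferedDocs=10000 and mergeFactor=10, both a 300 doc and a 3000 doc segment come out as level 0, so the policy happily merges them together.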

(I'm doing this to try to do apples to apples performance comparison
of current Lucene trunk vs my patch and "flushing by RAM" seems like
the fair comparison but then I have to carefully pick maxBufferedDocs
to make sure I don't hit this).

I will open a separate issue for this.

Mike



RE: [jira] Updated: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

steven_parkes
> EG if you set maxBufferedDocs to say 10000 but then it turns out based
> on RAM usage you actually flush every 300 docs then the merge policy
> will incorrectly merge a level 1 segment (with 3000 docs) in with the
> level 0 segments (with 300 docs).  This is because the merge policy
> looks at the current value of maxBufferedDocs to compute the levels
> so a 3000 doc segment and a 300 doc segment all look like "level 0".

Are you calling the 3K segment a level 1 segment because it was created
from level 0 segments? Because based on size, it is a level 0 segment,
right? With the current merge policy, you can merge level n segments and
get a level n segment. Deletes will do this, plus other things like
changing merge policy parameters and combining indexes.

Leads to the question of what is "over merging". The current merge
policy doesn't consider the size of the result, it simply counts the
number of segments at a level. Do you think this qualifies as over
merging? It still should only merge when there are mergeFactor segments
at a level, so you shouldn't be doing too terribly much merging.  And
you have to be careful not to do less, right? By bounding the number of
segments at each level, you ensure that your file descriptor usage only
grows logarithmically.



RE: [jira] Updated: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

Michael McCandless-2

On Thu, 22 Mar 2007 13:34:39 -0700, "Steven Parkes" <[hidden email]> said:

> > EG if you set maxBufferedDocs to say 10000 but then it turns out based
> > on RAM usage you actually flush every 300 docs then the merge policy
> > will incorrectly merge a level 1 segment (with 3000 docs) in with the
> > level 0 segments (with 300 docs).  This is because the merge policy
> > looks at the current value of maxBufferedDocs to compute the levels
> > so a 3000 doc segment and a 300 doc segment all look like "level 0".
>
> Are you calling the 3K segment a level 1 segment because it was created
> from level 0 segments? Because based on size, it is a level 0 segment,
> right? With the current merge policy, you can merge level n segments and
> get a level n segment. Deletes will do this, plus other things like
> changing merge policy parameters and combining indexes.

Right I'm calling a newly created segment (ie flushed from RAM) level
0 and then a level 1 segment is created when you merge 10 level 0
segments, level 2 is created when merge 10 level 1 segments, etc.

> Because based on size, it is a level 0 segment, right?

Well, I don't think it's right to call something level 0 just because
it's under the current maxBufferedDocs.

Backing up a bit ... I think the lowest amortized cost merge policy
should always try to merge roughly equal sized segments, subject to two
restrictions: 1) a max # of segments that can be merged at once
(mergeFactor), presumably due to file descriptor limits and/or
substantial degradation in merge performance as mergeFactor increases,
eg due to lack of concurrency in the IO system (??), and 2) that you
must merge adjacent segments, I think (so docIDs, though changing,
remain "monotonic").

Actually is #2 a hard requirement?  Do the loose ports of Lucene
(KinoSearch, Ferret, etc.) also follow this restriction?  We say that
developers should not rely on docIDs but people still seem to rely on
their monotonic ordering (even though they change).

Merging is costly because you read all data in then write all data
out, so you want to minimize, for each byte of data in the index, how
many times it will be "serviced" (read in, written out) as part of a
merge.  I think if N equal sized segments are always merged then the #
of copies for each byte of data will be minimized?

So, the fact that due to this bug we will merge a 3000 doc segment
with 9 300 doc segments is not efficient (amortized) because those
3000 docs in the first segment will net/net have to get merged again
far sooner than they would have had they been merged with 9 3000 doc
segments.

I think instead of calling segments "level N" we should just measure
their net sizes and merge on that basis?

> Leads to the question of what is "over merging". The current merge
> policy doesn't consider the size of the result, it simply counts the
> number of segments at a level. Do you think this qualifies as over
> merging? It still should only merge when there are mergeFactor segments
> at a level, so you shouldn't be doing too terribly much merging.  And
> you have to be careful not to do less, right? By bounding the number of
> segments at each level, you ensure that your file descriptor usage only
> grows logarithmically.

Yes, at no time should you merge more than mergeFactor segments at once.

Mike



RE: [jira] Updated: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

steven_parkes
> Right I'm calling a newly created segment (ie flushed from RAM) level
> 0 and then a level 1 segment is created when you merge 10 level 0
> segments, level 2 is created when merge 10 level 1 segments, etc.

This isn't the way the current code treats things. I'm not saying it's
the only way to look at things, just making sure I/we are clear on the
current state. The current code effectively computes level as a function
of size of the segment, independent of how the segment was created and
of the size of its neighbors.

Hmmm ... that's not completely true. Let's say it's mostly true. Okay,
kinda true.

> Backing up a bit ... I think the lowest amortized cost merge policy
> should always try to merge roughly equal sized segments

Well, logarithmic does do this, it just clumps logarithmically, with, in
this case, the boundary condition (size of the lowest level) being
defined by maxBufferedDocs. Do you consider 99999 and 100001 to be
roughly equal sized?

Of course, without deletes and not worrying about edge cases caused by
flushes/closes, everything will be of the form
maxBufferedDocs*mergeFactor^n. I'm curious to know if they stay that
way: in particular what happens to the sizes of big old segments. Can
everybody afford an optimize?

I think it'd be interesting to play with merge policies and see what we
can do. One thing I'm interested in is tweaking the behavior on the big
end of the scale. Exponentials get big fast and I think in some cases,
treating the big segments differently from the smaller segments makes
sense, for example, self-merging an old segment as deletes build up, or
merging small subsequences of segments for a similar reason. This is
mostly what motivated me to work on factoring out the merge policy from
the rest of IndexWriter. I've got something reasonably stable (though by
no means perfect) that I'll post, hopefully later today.

Anyway, back to some of your comments:

It's really interesting to think about this. I'm not an algorithms
expert, but I still find it interesting. It's appealing to think of a
"best" policy but is it possible to have a best policy without making
assumptions on the operations sequence (adds, deletes)?

Also, "best" has to trade off multiple constraints. You mentioned
copying bytes as rarely as possible. But the number of segments has both
hard and soft impacts on lookups, too, right? The hard limit is the
number of file descriptors: every reader is opening every segment:
that's the limiting case for file descriptors (mergers only need to open
mergeFactor+1). The soft impact is doing the join when walking posting
lists from each segment.

My impression is that the logarithmic policy (in size), was chosen as a
nice closed form that does an "okay" job in all these areas.

> Merging is costly because you read all data in then write all data
> out, so, you want to minimize for byte of data in the index in the
> index how many times it will be "serviced" (read in, written out) as
> part of a merge.

I've thought about this, as I looked through the existing merge code. I
think you can get cases where you'll get a perfect storm of bad merges,
copying a large segment multiple times to merge it with small segments.
Ick. But an edge condition? I guess the question is what is the expected
value of the cost, i.e., the absolute cost but mitigated by the
unlikelihood of it occurring.

Interesting.

> I think instead of calling segments "level N" we should just measure
> their net sizes and merge on that basis?

I think this is a great candidate, given that you factor in the number
of segments and sometimes merge just to keep the total number down.

> Yes, at no time should you merge more than mergeFactor segments at
once.

Why is this? File descriptors? Like I said before, readers are more an
issue there, right? I don't see that there's a huge difference between
mergeFactor and mergeFactor+1 if for other reasons, one might think
mergeFactor+1 was better than mergeFactor for a particular merge.

Hmmm ... I guess I think of mergeFactor as more setting the size
boundaries of segments than I do the n-way-ness of a merge, though the
word itself lends itself to the latter interpretation.



Re: [jira] Updated: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

Marvin Humphrey
In reply to this post by Michael McCandless-2

On Mar 22, 2007, at 3:18 PM, Michael McCandless wrote:

> Actually is #2 a hard requirement?

A lot of Lucene users depend on having document number correspond to  
age, I think.  ISTR Hatcher at least recommending techniques that  
require it.

> Do the loose ports of Lucene
> (KinoSearch, Ferret, etc.) also follow this restriction?

KS: Nope.  So you can't use those tricks.

> I think instead of calling segments "level N" we should just measure
> their net sizes and merge on that basis?

Here's the fibonacci-series-based algorithm used in KinoSearch, taken  
from MultiReader:

sub segreaders_to_merge {
     my ( $self, $all ) = @_;
     return unless @{ $self->{sub_readers} };
     return @{ $self->{sub_readers} } if $all;

     # sort by ascending size in docs
     my @sorted_sub_readers
         = sort { $a->num_docs <=> $b->num_docs } @{ $self->{sub_readers} };

     # find sparsely populated segments
     my $total_docs = 0;
     my $threshold  = -1;
     for my $i ( 0 .. $#sorted_sub_readers ) {
         $total_docs += $sorted_sub_readers[$i]->num_docs;
         if ( $total_docs < fibonacci( $i + 5 ) ) {
             $threshold = $i;
         }
     }

     # if any of the segments are sparse, return their readers
     if ( $threshold > -1 ) {
         return @sorted_sub_readers[ 0 .. $threshold ];
     }
     else {
         return;
     }
}
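For comparison, the same selection logic can be sketched in Java (names are mine, not KinoSearch's; this assumes fib(1) = fib(2) = 1, so fib(5) = 5):

```java
import java.util.Arrays;

/** Java rendering of the KinoSearch merge-candidate selection above:
 *  sort segment doc counts ascending and flag the "sparse" prefix whose
 *  running total stays under fib(i + 5).  Illustrative sketch only. */
class FibonacciMergeSelect {
    static int fib(int n) {                 // fib(1)=fib(2)=1, fib(5)=5, ...
        int a = 1, b = 1;
        for (int i = 2; i < n; i++) { int t = a + b; a = b; b = t; }
        return n <= 2 ? 1 : b;
    }

    /** Returns the index (in ascending-size order) of the last segment
     *  to include in the merge, or -1 if nothing is sparse enough. */
    static int mergeThreshold(int[] docCounts) {
        int[] sorted = docCounts.clone();
        Arrays.sort(sorted);
        long total = 0;
        int threshold = -1;
        for (int i = 0; i < sorted.length; i++) {
            total += sorted[i];
            if (total < fib(i + 5)) threshold = i;
        }
        return threshold;
    }
}
```

So a run of tiny segments gets merged together, while a single big segment is left alone.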

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/





RE: [jira] Updated: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

Michael McCandless-2
In reply to this post by steven_parkes
Steven Parkes wrote:

>> Right I'm calling a newly created segment (ie flushed from RAM)
>> level 0 and then a level 1 segment is created when you merge 10
>> level 0 segments, level 2 is created when merge 10 level 1 segments,
>> etc.
>
> This isn't the way the current code treats things. I'm not saying
> it's the only way to look at things, just making sure I/we are clear
> on the current state. The current code effectively computes level
> as a function of size of the segment, independent of how the
> segment was created and of the size of its neighbors.

Yes the code re-computes the level of a given segment from the current
values of maxBufferedDocs & mergeFactor.  But when these values have
changed (or, segments were flushed by RAM not by maxBufferedDocs) then
the way it computes level no longer results in the logarithmic policy
that it's trying to implement, I think.

>> Backing up a bit ... I think the lowest amortized cost merge policy
>> should always try to merge roughly equal sized segments
>
> Well, logarithmic does do this, it just clumps logarithmically,
> with, in this case, the boundary condition (size of the lowest
> level) being defined by maxBufferedDocs. Do you consider 99999 and
> 10001 to be roughly equal sized?

Exactly, when logarithmic works "correctly" (you don't change
mergeFactor/maxBufferedDocs and your docs are all uniform in size), it
does achieve this "merge roughly equal sized segments" goal in bytes
(yes, those two numbers are roughly equal).  Though now I have to go
ponder KS's Fibonacci series approach!

> Of course, without deletes and not worrying about edge cases caused
> by flushes/closes, everything will be of the form
> maxBufferedDocs*mergeFactor^n. I'm curious to know if they stay that
> way: in particular what happens to the sizes of big old
> segments. Can everybody afford an optimize?

I think for high-turnover content old segments will become mostly
deleted, but at some point the merge will fully cascade (basically
an optimize) and they will be cleaned up?

> I think it'd be interesting to play with merge policies and see what
> we can do. One thing I'm interested in is tweaking the behavior on
> the big end of the scale. Exponentials get big fast and I think in
> some cases, treating the big segments differently from the smaller
> segments makes sense, for example, self-merging an old segment as
> deletes build up, or merging small subsequences of segments for a
> similar reason. This is mostly what motivated me to work on factoring
> out the merge policy from the rest of IndexWriter. I've got
> something reasonably stable (though by no means perfect) that I'll
> post, hopefully later today.

Agreed!  There are lots of interesting things to explore here.

> It's really interesting to think about this. I'm not a algorithms
> expert, but I still find it interesting. It's appealing to think of
> a "best" policy but is it possible to have a best policy without
> making assumptions on the operations sequence (adds, deletes)?

I think there is no global "best"; it does depend on adds/deletes and
also what's important to you (search performance vs index
performance).

> Also, "best" has to trade off multiple constraints. You mentioned
> copying bytes as rarely as possible. But the number of segments has
> both hard and soft impacts on lookups, too, right? The hard limit is
> the number of file descriptors: every reader is opening every
> segment: that's the limiting case for file descriptors (mergers only
> need to open mergeFactor+1). The soft impact is doing the join when
> walking posting lists from each segment.

You're right, it's searching that has to open all the segments and
that's a hard limit once you start bumping up to max # descriptors and
soft, in slowing down your search.

>> Merging is costly because you read all data in then write all data
>> out, so you want to minimize, for each byte of data in the index, how
>> many times it will be "serviced" (read in, written out) as part of a
>> merge.

> I've thought about this, as I looked through the existing merge
> code. I think you can get cases where you'll get a perfect storm of
> bad merges, copying a large segment multiple times to merge it with
> small segments.  Ick. But an edge condition? I guess the question is
> what is the expected value of the cost, i.e., the absolute cost but
> mitigated by the unlikelihood of it occurring.

Hmmm.  What cases lead to this?

>> I think instead of calling segments "level N" we should just measure
>> their net sizes and merge on that basis?

> I think this is a great candidate, given that you factor in the number
> of segments and sometimes merge just to keep the total number down.

Basically, this would keep the same logarithmic approach now, but
derive levels somehow from the net size in bytes.
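A minimal sketch of what "levels from net size in bytes" could look like (illustrative only; the minSegmentBytes parameter is a made-up stand-in for the boundary role maxBufferedDocs plays today):

```java
/** Level assignment keyed off net segment size in bytes: level n holds
 *  segments of up to minSegmentBytes * mergeFactor^n bytes.  A sketch
 *  of the idea discussed above, not Lucene code. */
class ByteLevels {
    static int level(long sizeBytes, long minSegmentBytes, int mergeFactor) {
        int level = 0;
        long boundary = minSegmentBytes;
        while (sizeBytes > boundary) {  // climb until the segment fits
            boundary *= mergeFactor;
            level++;
        }
        return level;
    }
}
```

This keeps the logarithmic clumping but makes a RAM-flushed 300 doc segment and a merged 3000 doc segment land on different levels whenever their byte sizes differ by more than a factor of mergeFactor.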

>> Yes, at no time should you merge more than mergeFactor segments at
>> once.
>
> Why is this? File descriptors? Like I said before, readers are more
> an issue there, right? I don't see that there's a huge difference
> between mergeFactor and mergeFactor+1 if for other reasons, one
> might think mergeFactor+1 was better than mergeFactor for a
> particular merge.

Good point and good question.  I'm not sure.  The only thing I wonder
is whether reading from say 20 segments at once is slower (more
than 2X slower) than reading from 10?  Or it could actually be
faster (since OS/disk heads can better schedule IO access to a
larger number of spots)?  I don't know.

> Hmmm ... I guess I think of mergeFactor as more setting the size
> boundaries of segments than I do the n-way-ness of a merge, though
> the word itself lends itself to the latter interpretation.

True.  It means both!

Mike



RE: [jira] Updated: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

steven_parkes
> But when these values have
> changed (or, segments were flushed by RAM not by maxBufferedDocs) then
> the way it computes level no longer results in the logarithmic policy
> that it's trying to implement, I think.

That's right. Parts of the implementation assume that the segments are
logarithmically related and if they aren't, the result is kind of
undefined, in the sense that it does whatever it does. It doesn't try to
restore the logarithmic invariant.

In some cases, where we know we're combining multiple indexes, we try to
restore the invariant, but it's a little weakly defined what that
invariant is, so the results aren't canonically defined.

> Though now I have to go ponder
> KS's Fibonacci series approach!.

Yeah, me too. I roughed out a version using the factored merge policy
... not that that means I completely understand the implications. I
still have to teach IndexWriter how to deal with non-contiguous merge
specs, which I didn't originally anticipate.

But a very interesting exercise.

> I think for high turnover content the way old segments will be mostly
> deleted but at some point the merge will fully cascade (basically
> an optimize) and they will be cleaned up?

Well, as it stands, the only way to get rid of deleted docs at level n
is either to merge them into a level n+1 segment or optimize. As n gets
big, the chances of getting to a level n+1 segment go down. So those
level n segments live a long time, a manual optimize notwithstanding:
hence my question: do we see big segments that are mostly sparse? And a
merge there is one of those kinda bad things: if you have two level n
segments, one pretty sparse and the next fairly full, you have to merge
left to right, copying all of the full segment just to recover the
deleted space in the sparse segment.

> Hmmm.  What cases lead to this?

Well, it's kinda artificial, but combining indexes will do this. They're
combined in series, so the little segments of one index are followed by
the big segments of the next. So if you need to do some merging, and
given the current ordering constraint, you can merge a bunch of little
segments with one big segment to create only a slightly bigger segment.
Similarly with lots of deleted documents ...

But I don't know if these cases are observed in the wild ...

> The only thing I wonder
> is whether reading from say 20 segments at once is slower (more
> than 2X slower) than reading from 10?  Or it could actually be
> faster (since OS/disk heads can better schedule IO access to a
> larger number of spots)?  I don't know.

Me neither. I can imagine that order of magnitude differences (binary or
otherwise) would show changes but that small changes like +/- one
wouldn't, so that we might want to give ourselves that freedom.

Should be pretty easy to experiment with this stuff with the factored
merge, which I'll post Real Soon Now.

I haven't tried any performance testing. I can't remember: I need to
look and see if the performance framework measures indexing as well as
querying.



Re: [jira] Updated: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

Chris Hostetter-3
In reply to this post by Marvin Humphrey
: > Actually is #2 a hard requirement?
:
: A lot of Lucene users depend on having document number correspond to
: age, I think.  ISTR Hatcher at least recommending techniques that
: require it.

"Correspond to age" may be misleading as it implies that the actual
docid has meaning ... it's more that the relative order of addition is
preserved regardless of deletions/merging.

A trivial example of using this is getting the N newest documents
matching a search using a HitCollector: it's just a bounded queue that
only remembers the last N things you put in it.

A more complicated example is duplicate unique-field detection: iterating
over a TermDocs enumeration and knowing that the doc with the highest
docId is the last one added, so the earlier ones can be ignored/deleted.
(As I recall, Solr takes advantage of this.)
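The duplicate-detection trick can likewise be sketched abstractly (DuplicatePruner is a hypothetical name, not Solr's code): a TermDocs enumeration yields docids in increasing order, so for a unique-key term, every docid before the last one is a stale duplicate.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of duplicate detection over a unique-key field.
class DuplicatePruner {
    // Given the docids containing one unique-key term, in increasing
    // order (as a TermDocs enumeration yields them), return the stale
    // ones: everything but the highest docid, which is the newest.
    static List<Integer> staleDocs(int[] docsForTerm) {
        List<Integer> stale = new ArrayList<>();
        for (int i = 0; i < docsForTerm.length - 1; i++) {
            stale.add(docsForTerm[i]);
        }
        return stale;
    }
}
```

This only works if merging preserves the relative order of docids, which is exactly the invariant under discussion.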



-Hoss



Re: [jira] Updated: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

Michael McCandless-2

"Chris Hostetter" <[hidden email]> wrote:

> : > Actually is #2 a hard requirement?
> :
> : A lot of Lucene users depend on having document number correspond to
> : age, I think.  ISTR Hatcher at least recommending techniques that
> : require it.
>
> "Correspond to age" may be misleading, as it implies that the actual
> docid has meaning ... it's more that the relative order of addition is
> preserved regardless of deletions/merging
>
> A trivial example of using this is getting the N newest documents
> matching a search using a HitCollector: it's just a bounded queue that
> only remembers the last N things you put in it.
>
> A more complicated example is duplicate unique-field detection:
> iterating over a TermDocs enumeration and knowing that the doc with the
> highest docId is the last one added, so the earlier ones can be
> ignored/deleted.  (As I recall, Solr takes advantage of this.)

Got it, so we need to preserve this invariant.  So this puts the
general restriction on the Lucene merge policy that only adjacent
segments (ie, when ordered by segment number) can be merged.

Mike


Re: [jira] Updated: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

Erik Hatcher
In reply to this post by Marvin Humphrey

On Mar 22, 2007, at 8:13 PM, Marvin Humphrey wrote:
> On Mar 22, 2007, at 3:18 PM, Michael McCandless wrote:
>
>> Actually is #2 a hard requirement?
>
> A lot of Lucene users depend on having document number correspond  
> to age, I think.  ISTR Hatcher at least recommending techniques  
> that require it.

I may have recommended it only as "if you can guarantee you index in
age order then you can ...", but given FunctionQuery's ability to
rank based on age given a date field per document, it's not needed.
Guaranteeing any kind of order of document insertion (given updates
that delete and re-add) is not really possible in most cases anyway.

        Erik



Re: [jira] Updated: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

Yonik Seeley-2
In reply to this post by Michael McCandless-2
On 3/22/07, Michael McCandless <[hidden email]> wrote:
> We say that
> developers should not rely on docIDs but people still seem to rely on
> their monotonic ordering (even though they change).

Yes.  If the benefits of removing that guarantee are large enough, we
could consider dumping it... but not in Lucene 2.x IMO.  We should see
how people are using this, and if there are acceptable workarounds.
Solr's current dependence on document ordering could be removed.

-Yonik


Re: [jira] Updated: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

Yonik Seeley-2
In reply to this post by Michael McCandless-2
On 3/22/07, Michael McCandless <[hidden email]> wrote:
> Merging is costly because you read all data in then write all data
> out, so, you want to minimize, for each byte of data in the index,
> how many times it will be "serviced" (read in, written out) as
> part of a merge.

Avoiding the re-writing of stored fields might be nice:
http://www.nabble.com/Re%3A--jira--Commented%3A-%28LUCENE-565%29-Supporting-deleteDocuments-in-IndexWriter-%28Code-and-Performance-Results-Provided%29-p6177280.html

-Yonik


RE: [jira] Updated: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

steven_parkes
In reply to this post by Michael McCandless-2
Given history, perhaps the default merge policy should preserve this,
but with pluggable merge policies, I don't see that all merge policies
need to.

-----Original Message-----
From: Michael McCandless [mailto:[hidden email]]
Sent: Friday, March 23, 2007 1:53 AM
To: [hidden email]
Subject: Re: [jira] Updated: (LUCENE-843) improve how IndexWriter uses
RAM to buffer added documents


"Chris Hostetter" <[hidden email]> wrote:

> : > Actually is #2 a hard requirement?
> :
> : A lot of Lucene users depend on having document number correspond to
> : age, I think.  ISTR Hatcher at least recommending techniques that
> : require it.
>
> "Correspond to age" may be misleading, as it implies that the actual
> docid has meaning ... it's more that the relative order of addition is
> preserved regardless of deletions/merging
>
> A trivial example of using this is getting the N newest documents
> matching a search using a HitCollector: it's just a bounded queue that
> only remembers the last N things you put in it.
>
> A more complicated example is duplicate unique-field detection:
> iterating over a TermDocs enumeration and knowing that the doc with the
> highest docId is the last one added, so the earlier ones can be
> ignored/deleted.  (As I recall, Solr takes advantage of this.)

Got it, so we need to preserve this invariant.  So this puts the
general restriction on the Lucene merge policy that only adjacent
segments (ie, when ordered by segment number) can be merged.

Mike


Re: [jira] Updated: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

Michael McCandless-2
In reply to this post by Yonik Seeley-2

"Yonik Seeley" <[hidden email]> wrote:
> > We say that
> > developers should not rely on docIDs but people still seem to rely on
> > their monotonic ordering (even though they change).
>
> Yes.  If the benefits of removing that guarantee are large enough, we
> could consider dumping it... but not in Lucene 2.x IMO.  We should see
> how people are using this, and if there are acceptable workarounds.
> Solr's current dependence on document ordering could be removed.

I don't think we should change this "feature" of the current merge
policy.  I think it may break too many people in hard-to-figure-out
ways.  It's also not clear that we'd have much to gain if we were
allowed to break it.  (The current "logarithmic" merge policy I
think works quite well.)

And as Steve said, once merge policy is decoupled from the writer then
apps have the freedom to pick or build a different merge policy
(leaving current one as the default).

However, I do think we should fix the bug with the current merge
policy when you "flush by RAM" (LUCENE-845).  Since the recommended
way (I think?) to maximize indexing performance is to "flush by RAM
usage" I expect people will start hitting this bug fairly often.

Mike


Re: [jira] Updated: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

Michael McCandless-2
In reply to this post by Yonik Seeley-2

"Yonik Seeley" <[hidden email]> wrote:
> On 3/22/07, Michael McCandless <[hidden email]> wrote:
> > Merging is costly because you read all data in then write all data
> > out, so, you want to minimize, for each byte of data in the index,
> > how many times it will be "serviced" (read in, written out) as
> > part of a merge.
>
> Avoiding the re-writing of stored fields might be nice:
> http://www.nabble.com/Re%3A--jira--Commented%3A-%28LUCENE-565%29-Supporting-deleteDocuments-in-IndexWriter-%28Code-and-Performance-Results-Provided%29-p6177280.html

That's exactly the approach I'm taking in LUCENE-843: stored fields and term
vectors are immediately written to disk.  Only frq, prx and tis use up
memory.  This greatly extends how many docs you can buffer before
having to flush (assuming your docs have stored fields and term
vectors).

When memory is full, I either flush a segment to disk (when the writer
is in autoCommit=true mode) or flush the data to tmp files, which are
finally merged into a segment when the writer is closed.  This merging
is less costly because the bytes in/out are just frq, prx and tis, so
this improves performance of autoCommit=false mode vs autoCommit=true
mode.

But, this is only for the segment created from buffered docs (ie the
segment created by a "flush").  Subsequent merges still must copy
bytes in/out and in LUCENE-843 I haven't changed anything about how
segments are merged.

Mike


Re: [jira] Updated: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

Ning Li-3
In reply to this post by Michael McCandless-2
On 3/22/07, Michael McCandless <[hidden email]> wrote:
> Right I'm calling a newly created segment (ie flushed from RAM) level
> 0 and then a level 1 segment is created when you merge 10 level 0
> segments, level 2 is created when merge 10 level 1 segments, etc.

That is not how the current merge policy works. There are two
orthogonal aspects to this problem:
  1 the measurement of a segment's size
  2 the merge behaviour given that measurement

In the current code:
  1 The measurement of a segment's size is the document count in the
segment, not the actual RAM or file size. Levels are defined according
to this measurement.
  2 The behaviour is captured by two invariants, which hold when
mergeFactor (M) does not change and segment doc counts do not reach
maxMergeDocs. With B = maxBufferedDocs and f(n) = ceil(log_M(ceil(n/B))):
     1) If i (left*) and i+1 (right*) are two consecutive segments of
doc counts x and y, then f(x) >= f(y).
     2) The number of committed segments on the same level (f(n)) <= M.
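The level function f(n) = ceil(log_M(ceil(n/B))) can be computed with plain integer arithmetic; this standalone sketch (MergeLevels is a hypothetical class name, not Lucene's code) shows it.

```java
// Sketch of the level function from the two invariants above:
// f(n) = ceil(log_M(ceil(n/B))) for a segment of n docs, with
// merge factor M and maxBufferedDocs B.
class MergeLevels {
    static int level(int n, int mergeFactor, int maxBufferedDocs) {
        // ceil(n/B): how many "base" flush units this segment holds.
        int units = (n + maxBufferedDocs - 1) / maxBufferedDocs;
        // Smallest level such that M^level >= units.
        int level = 0;
        long span = 1;
        while (span < units) {
            span *= mergeFactor;
            level++;
        }
        return level;
    }
}
```

With M = 10 and B = 10, a freshly flushed 10-doc segment is level 0, merging ten of them gives a 100-doc level-1 segment, and so on, matching the logarithmic structure described here.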

The document counts are an approximation of segment sizes, and thus an
approximation of merge cost. Sometimes, however, they do not correctly
reflect segment sizes. So it is probably a good idea to use RAM or
file size as the measurement of a segment's size, as Mike suggested.
But the behaviour does not have to change: the two invariants can still
be guaranteed, with the definitions of sizes and levels modified
according to the new measurement.

Ning


Re: [jira] Updated: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

Ning Li-3
In reply to this post by Michael McCandless-2
On 3/22/07, Michael McCandless <[hidden email]> wrote:
> Yes the code re-computes the level of a given segment from the current
> values of maxBufferedDocs & mergeFactor.  But when these values have
> changed (or, segments were flushed by RAM not by maxBufferedDocs) then
> the way it computes level no longer results in the logarithmic policy
> that it's trying to implement, I think.

The algorithm gradually re-adjusts toward the latest maxBufferedDocs &
mergeFactor - see case 3 of the "Overview of merge policy" comment in
the code.

With the modification of using RAM or file size as the segment size,
the algorithm would work by maxBufferedSize & mergeFactor. Let's say
maxBufferedDocs or maxBufferedSize is the base size. LUCENE-845
complains that the merge behaviour for segments <= base size is in some
cases not logarithmic. It's a tradeoff: we always keep small
segments in check. The algorithm reflects the tradeoff made when
segments <= base size.


> Exactly, when logarithmic works "correctly" (you don't change
> mergeFactor/maxBufferedDocs and your docs are all uniform in size), it
> does achieve this "merge roughly equal size in bytes" segments (yes
> those two numbers are roughly equal).  Though now I have to go ponder
> KS's Fibonacci series approach!

It doesn't have to be a Fibonacci series. Logarithmic would work well
too. The main difference is that KS can choose any segments to merge,
not just adjacent segments. Thus it may find better candidates for merge.


> Basically, this would keep the same logarithmic approach now, but
> derive levels somehow from the net size in bytes.

Exactly! Levels defined in size in bytes.

Cheers,
Ning

