[lucy-user] input 47 too high

Thomas den Braber
I have a problem when adding a lot of records to the index. After about 1000 records I get
this error: input 47 too high.

I use the Light merger with a background merger as described in the documentation.

$indxobj = Lucy::Index::Indexer->new(
    index   => 'lcyindx1',
    manager => LightMergeManager->new,
);

......

$indxobj->add_doc( $doc  );


After every 20 records I do a commit to prevent the process from locking things up.


The Light merger looks like:

package LightMergeManager;

use base qw( Lucy::Index::IndexManager );
   
sub recycle {
  my $self = shift;
  my $seg_readers = $self->SUPER::recycle(@_);
  @$seg_readers = grep { $_->doc_max < 10 } @$seg_readers;
  return $seg_readers;
}

I commit every 20 records instead of after every record to improve performance.
Is this a good idea, or must I commit after every record added?

----
Thomas den Braber



Re: [lucy-user] input 47 too high

Peter Karman
On 3/1/13 8:26 AM, Thomas den Braber wrote:

>
> I commit every 20 records instead of after every record to improve performance.
> Is this a good idea, or must I commit after every record added?
>

$indexer->commit() should be called once per $indexer object. You are
right to batch up as many docs as possible per $indexer, but only one
commit() call is needed.


--
Peter Karman  .  http://peknet.com/  .  [hidden email]

Re: [lucy-user] input 47 too high

Thomas den Braber

> On 3/1/13 8:26 AM, Thomas den Braber wrote:
>
>
> $indexer->commit() should be called once per $indexer object. You are
> right to batch up as many docs as possible per $indexer, but only one
> commit() call is needed.

That is what I do. After every 20 ->add_doc(..) calls I do one commit, then create a new
Indexer object and add another 20. This works fine most of the time, but sometimes when
I add a lot of records I get the 'input 47 too high' error.

---
Thomas den Braber



Re: [lucy-user] input 47 too high

Marvin Humphrey
On Fri, Mar 1, 2013 at 6:26 AM, Thomas den Braber <[hidden email]> wrote:
> I have a problem when adding a lot of records to the index. After about 1000
> records I get this error: input 47 too high.

This is a bug, which the following patch should address:

--- a/core/Lucy/Index/IndexManager.c
+++ b/core/Lucy/Index/IndexManager.c
@@ -122,7 +122,7 @@ static uint32_t
 S_fibonacci(uint32_t n) {
     uint32_t result = 0;
     if (n > 46) {
-        THROW(ERR, "input %u32 too high", n);
+        return UINT32_MAX;
     }
     else if (n < 2) {
         result = n;

> I commit every 20 records instead of after every record to improve
> performance.  Is this a good idea, or must I commit after every record
> added?

In general, you should batch up docs together, as that will result in less
file churn and make indexing more efficient.  The only reason to commit more
frequently is to make data available to searches sooner and meet
application-specific requirements for responsiveness during near-real-time
updates.

Marvin Humphrey

Re: [lucy-user] input 47 too high

Thomas den Braber


On Mon, Mar 4, 2013 at 03:17 PM, Marvin Humphrey <[hidden email]> wrote:

>
> This is a bug, which the following patch should address:
>
> --- a/core/Lucy/Index/IndexManager.c
> +++ b/core/Lucy/Index/IndexManager.c
> @@ -122,7 +122,7 @@ static uint32_t
>  S_fibonacci(uint32_t n) {
>      uint32_t result = 0;
>      if (n > 46) {
> -        THROW(ERR, "input %u32 too high", n);
> +        return UINT32_MAX;
>      }
>      else if (n < 2) {
>          result = n;

I am trying to understand this code and the bug.
Does this mean that this warning is shown after 47 commits without an index merge?
What was the original idea behind this exception handling ("input %u32 too high")?
Was it for testing only?


> > I commit every 20 records instead of after every record to improve
> > performance.  Is this a good idea, or must I commit after every record
> > added?
>
> In general, you should batch up docs together, as that will result in less
> file churn and make indexing more efficient.  The only reason to commit more
> frequently is to make data available to searches sooner and meet
> application-specific requirements for responsiveness during near-real-time
> updates.
>

This is exactly the case. I have several processes that can take a long time to complete,
and I don't want them to lock up high-priority index changes that take only a short time.
That parts of the data become available sooner is a welcome side effect.

---
Thomas den Braber



Re: [lucy-user] input 47 too high

Marvin Humphrey
On Mon, Mar 4, 2013 at 7:11 AM, Thomas den Braber <[hidden email]> wrote:

> On Mon, Mar 4, 2013 at 03:17 PM, Marvin Humphrey <[hidden email]> wrote:
>>
>> This is a bug, which the following patch should address:
>>
>> --- a/core/Lucy/Index/IndexManager.c
>> +++ b/core/Lucy/Index/IndexManager.c
>> @@ -122,7 +122,7 @@ static uint32_t
>>  S_fibonacci(uint32_t n) {
>>      uint32_t result = 0;
>>      if (n > 46) {
>> -        THROW(ERR, "input %u32 too high", n);
>> +        return UINT32_MAX;
>>      }
>>      else if (n < 2) {
>>          result = n;
>
> I am trying to understand this code and the bug.

Thanks for the feedback.  It is useful for us to know which parts of the
codebase are easy to grok and which parts are not.  Ideally, everything would
be simple and transparent.

> Does this mean that this warning is shown after 47 commits without an index
> merge?  What was the original idea behind this exception handling ("input
> %u32 too high")?  Was it for testing only?

I wrote that line.  The goal is to avoid overflowing a 32-bit integer.  I
didn't think too hard about the failure case because I didn't expect it to
trigger under normal circumstances -- and when I don't want to think too hard,
my habit is to insert an exception so that we fail noisily rather than
silently.

In this algorithm, we have an array of SegReader objects sorted by
Doc_Max(), and we're trying to figure out which ones to recycle.  The goal is
to end up with a list of segments whose sizes roughly approximate the
Fibonacci series.

Here's the relevant block from IndexManager.c:

    // Find sparsely populated segments.
    for (uint32_t i = 0; i < num_candidates; i++) {
        uint32_t num_segs_when_done = num_candidates - threshold + 1;
        total_docs += I32Arr_Get(doc_counts, i);
        if (total_docs < S_fibonacci(num_segs_when_done + 5)) {
            threshold = i + 1;
        }
    }

Now I've had to think about the failure case. :)  Here's the reasoning behind
the patch:

The `threshold` variable -- an array index -- starts at 0, so
`num_segs_when_done` starts high.  As `threshold` grows, the number passed
through S_fibonacci() drops.  It's fine if we clip S_fibonacci() early in the
loop because `total_docs` will never approach `UINT32_MAX`[1] and we really
only care what happens once `num_segs_when_done` drops to a reasonable number,
late in the loop.  Until then, we'll continue to accumulate small segments
that we want to recycle.
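
To make the reasoning concrete, here is a small self-contained C sketch that mirrors the quoted loop using plain arrays. The names `fib_clamped` and `find_threshold` are hypothetical, the Fibonacci helper is an iterative stand-in rather than Lucy's code, and the sketch assumes the loop is exactly as quoted. On the first iteration `threshold` is 0, so the argument passed to the Fibonacci function is `num_candidates + 6`; under that assumption, an error reporting "input 47" would correspond to 41 candidate segments rather than 47 commits.

```c
#include <stdint.h>

/* Hypothetical clamped Fibonacci: returns UINT32_MAX for n > 46. */
static uint32_t
fib_clamped(uint32_t n) {
    if (n > 46) {
        return UINT32_MAX;
    }
    uint32_t a = 0, b = 1;  /* F(0), F(1) */
    for (uint32_t i = 0; i < n; i++) {
        uint32_t next = a + b;
        a = b;
        b = next;
    }
    return a;
}

/* Mirrors the quoted loop.  `doc_counts` must be sorted in ascending
 * order.  Returns the final value of `threshold`; the assumption here
 * is that segments at indexes [0, threshold) are the sparsely
 * populated ones selected for recycling. */
static uint32_t
find_threshold(const uint32_t *doc_counts, uint32_t num_candidates) {
    uint32_t threshold  = 0;
    uint32_t total_docs = 0;
    for (uint32_t i = 0; i < num_candidates; i++) {
        uint32_t num_segs_when_done = num_candidates - threshold + 1;
        total_docs += doc_counts[i];
        /* Early in the loop the argument can exceed 46; clamping makes
         * the comparison succeed so small segments keep accumulating. */
        if (total_docs < fib_clamped(num_segs_when_done + 5)) {
            threshold = i + 1;
        }
    }
    return threshold;
}
```

For example, with 41 segments of 20 documents each, the first call receives 47 and is clamped, and `find_threshold` returns 32: the running total of 660 documents at index 32 is no longer below F(15) = 610, so the threshold stops advancing there.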

Marvin Humphrey

[1] Doc ids are signed 32-bit integers, so even if we ignore practical
    performance considerations, we can't exceed INT32_MAX -- and UINT32_MAX is
    out of reach.