32 bit CentOS Indexing Question


32 bit CentOS Indexing Question

Nick D.
Hi all,

I am having trouble indexing large files. The input is a syslog-formatted file that is pretty large, around 4.4 GB. I create one document per line of the log file, adding docs as I go, and commit once at the very end. During indexing the index grows to a relatively enormous size (around 14 GB), and (I'm guessing) during the commit it uses huge amounts of RAM, slowing the machine to a crawl. Once the commit is done, the index shrinks to 4.1 GB on a 64-bit system; on a 32-bit system I instead get a malloc error saying it can't allocate more space. Both boxes have the same amount of RAM and run the same OS; the only difference is that one is 32-bit and the other 64-bit.

Questions:

Why do I get a memory allocation error on a 32 bit OS and not a 64 bit OS?
Are there any 32 bit limitations of Lucy?
Why does the index grow so large and then shrink after the commit is done? Should I commit more often?
Would committing often slow down the indexing process?
Would committing often make the overgrowth of the index go away?

Any help would be greatly appreciated,

Nick D.


Code Snippet:
# Create Schema.
my $schema = Lucy::Plan::Schema->new;
my $case_folder  = Lucy::Analysis::CaseFolder->new;
my $tokenizer    = Lucy::Analysis::RegexTokenizer->new; # purposely leave out the Stemmer
my $polyanalyzer = Lucy::Analysis::PolyAnalyzer->new(
      analyzers => [ $case_folder, $tokenizer ],  
      );  
my $unstored_full_text_type = Lucy::Plan::FullTextType->new(
                analyzer => $polyanalyzer,
                stored => 0,
                );  
my $unindexed_int_type = Lucy::Plan::Int64Type->new( indexed => 0, sortable => 1, );
my $unindexed_string_type = Lucy::Plan::StringType->new( indexed => 0, sortable => 1, );

$schema->spec_field( name => 'line', type => $unstored_full_text_type );
$schema->spec_field( name => 'offset',     type => $unindexed_int_type );
$schema->spec_field( name => 'time_sec',     type => $unindexed_string_type );

.........................

open(my $fh, '<', $filename ) or die "Can't open '$filename': $!";
my $offset = 0;
my $time = 0;
while ( my $line = <$fh> ) {

   # Guard the match: on a non-matching line, $1..$3 would otherwise
   # silently keep the captures from a previous line.
   if ( $line =~ /^\w+\s+\d+\s+(\d+):(\d+):(\d+)/ ) {
      $time = ( $1 * 60 * 60 ) + ( $2 * 60 ) + $3;
   }

   my %doc = (
         line     => $line,
         offset   => $offset,
         time_sec => sprintf( "%05d", $time ),
         );

   #print Dumper(\%doc);
   $indexer->add_doc(\%doc);  # ta-da!
   $offset = tell($fh);
}

$indexer->commit;
-------------------------------------end of snippet---------------------------------------

Example format of file to be indexed

Mar 12 12:27:00 server3 named[32172]: lame server resolving 'jakarta5.wasantara.net.id' (in 'wasantara.net.id'?): 202.159.65.171#53
Mar 12 12:27:03 server3 named[32173]: lame server resolving 'jakarta5.wasantara.net.id' (in 'wasantara.net.id'?): 202.159.65.171#

Re: [lucy-user] 32 bit CentOS Indexing Question

Marvin Humphrey
On Tue, Jan 28, 2014 at 11:26 AM, Nick D. <[hidden email]> wrote:
> Why do I get a memory allocation error on a 32 bit OS and not a 64 bit OS?

It's probably a known architectural flaw in SortWriter which makes it consume
too much RAM.

> Are there any 32 bit limitations of Lucy?

In theory, there should not be.  We have expended considerable effort to
provide compatibility with 32-bit systems, though our optimization target
remains 64-bit.

> Why does the index file grow so large and then shrinks after commit is done?

There is a lot of temporary data produced during indexing.  Before you can
search a large amount of material, you have to sort it.  That takes a lot of
space.

> Should I commit more often?

If you are only generating this index in a single shot, that should be an
adequate workaround for the SortWriter problem.  However, you must also
override IndexManager#recycle to return an empty arrayref.  Check out
Lucy::Docs::Cookbook::FastUpdates.
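The recycle override could be sketched roughly like this (a minimal sketch based on the pattern in Lucy::Docs::Cookbook::FastUpdates; the subclass name and index path are made up for illustration):

```perl
package NonMergingIndexManager;
use base qw( Lucy::Index::IndexManager );

# Return an empty arrayref so commit never recycles (merges) existing
# segments, which keeps per-commit memory use low.
sub recycle {
    return [];
}

package main;
use Lucy::Index::Indexer;

my $indexer = Lucy::Index::Indexer->new(
    index   => '/path/to/index',
    schema  => $schema,
    manager => NonMergingIndexManager->new,
    create  => 1,
);
```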

> Would committing often slow down the indexing process?

I don't think the difference would be unreasonable.

> Would committing often make the over growth of the index go away?

If you override IndexManager#recycle, yes.

This is assuming you don't need to modify the index later, which I'm guessing
based on the script that you supplied.

Marvin Humphrey

Re: [lucy-user] 32 bit CentOS Indexing Question

Nick Wellnhofer
On Jan 29, 2014, at 02:59 , Marvin Humphrey <[hidden email]> wrote:

> On Tue, Jan 28, 2014 at 11:26 AM, Nick D. <[hidden email]> wrote:
>> Why do I get a memory allocation error on a 32 bit OS and not a 64 bit OS?
>
> It's probably a known architectural flaw in SortWriter which makes it consume
> too much RAM.

This issue should be resolved in the sortfieldwriter branch. The following two commits are the crucial ones:

https://git-wip-us.apache.org/repos/asf?p=lucy.git;a=commitdiff;h=0e49ac6f6ca45860d5598060b89bdac3fbfed2db

https://git-wip-us.apache.org/repos/asf?p=lucy.git;a=commitdiff;h=a5aa40a93d0b2542dc04afd387619b015cf273b5

It would be interesting to know whether they make a difference in Nick D.’s case. If they solve his problem, we should consider backporting the fix to the 0.3 branch.

Nick W.


Re: [lucy-user] 32 bit CentOS Indexing Question

Nick D.
Thanks Nick (cool name, by the way). If I continue to have problems with this, I will grab those two commits and see if they make a difference.

Would these commits help with indexing speed, mainly the add_doc and commit functions that write/rewrite segments?

Re: [lucy-user] 32 bit CentOS Indexing Question

Nick Wellnhofer
On Jan 31, 2014, at 21:43 , Nick D. <[hidden email]> wrote:

> Thanks Nick (cool name by the way). If I continue to have problems with this
> I will get those 2 commits and see if there is a difference.
>
> Would these commits help with speed of indexing? mainly add_doc and commit
> functions that write/re-write segments?

That’s hard to tell. The first commit should make things a bit faster. The second commit helps with memory usage when indexing many documents with sortable fields; it should actually make things slower, but there’s a tunable which might help:

    Lucy::Index::SortWriter::set_default_mem_thresh($bytes);

The default is 4MB (0x400000). Larger values should speed up indexing at the expense of memory.

The sortfieldwriter branch also contains another commit which might improve performance noticeably:

https://git-wip-us.apache.org/repos/asf?p=lucy.git;a=commitdiff;h=ad178f10692659b4ed8b170ebfa42d13fd3eed20

If you check out the sortfieldwriter branch, you’ll get all of these commits. If you’re using the 0.3 branch, you have to apply them one by one. There’s a good chance that this will work without conflicts.
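Applied against a 0.3 checkout, that could look something like this (an untested sketch; it assumes the branch is literally named `0.3` and uses `git cherry-pick` as one way of applying the commits one by one):

```
git clone https://git-wip-us.apache.org/repos/asf/lucy.git
cd lucy
git checkout 0.3
git cherry-pick 0e49ac6f6ca45860d5598060b89bdac3fbfed2db
git cherry-pick a5aa40a93d0b2542dc04afd387619b015cf273b5
```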

Nick


Re: [lucy-user] 32 bit CentOS Indexing Question

Nick D.
Does the Lucy::Index::SortWriter::set_default_mem_thresh($bytes) function exist in the latest public 0.3.3 version of Lucy?

Is there a similar function for SegWriter (I'm assuming this is used for writing segments that are not sortable)? If so, what is the default?

Are there any downsides to increasing this threshold to say 40MB?

Re: [lucy-user] 32 bit CentOS Indexing Question

Nick Wellnhofer
On Jan 31, 2014, at 23:18 , Nick D. <[hidden email]> wrote:

> Does the Lucy::Index::SortWriter::set_default_mem_thresh($bytes); function
> exist in the latest public 0.3.3 version of lucy?

Yes, but it’s ineffective due to a bug which the sortfieldwriter branch should fix.

> Is there a function like this for SegWriter (I'm assuming this is used for
> writing segments that are not sortable)? if so what is the default?

Yes, there’s

    Lucy::Index::PostingListWriter::set_default_mem_thresh($bytes);

with a default of 16MB. This affects segment merging for indexed fields.

(A segment contains data for all the fields of your schema. PostingListWriter creates the posting lists for indexed fields. SortWriter creates the sort cache for sortable fields. Both posting lists and sort caches are contained in a segment.)

> Are there any downsides to increasing this threshold to say 40MB?

No, if you have enough memory, you can probably use a much higher value. Maybe Marvin can give some additional details.
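Putting the two tunables together, raising both thresholds before constructing the Indexer might look like this (a sketch; 40 MB is just the value from the question, not a recommendation):

```perl
use Lucy::Index::PostingListWriter;
use Lucy::Index::SortWriter;

# Raise both memory thresholds to 40 MB before creating the Indexer.
# Larger values trade RAM for fewer flushes while writing segments.
Lucy::Index::PostingListWriter::set_default_mem_thresh( 40 * 1024 * 1024 );
Lucy::Index::SortWriter::set_default_mem_thresh( 40 * 1024 * 1024 );
```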

Nick



Re: [lucy-user] 32 bit CentOS Indexing Question

Marvin Humphrey
On Fri, Jan 31, 2014 at 4:33 PM, Nick Wellnhofer <[hidden email]> wrote:
>> Are there any downsides to increasing this threshold to say 40MB?
>
> No, if you have enough memory, you can probably use a much higher value.
> Maybe Marvin can give some additional details.

It's hard to say, it might depend on CPU cache behavior.

The primary reason that global setting exists is not performance tweakery,
it's testing.

From perl/lib/Lucy/Test.pm:

    # Set the default memory threshold for PostingListWriter to a low number
    # so that we simulate large indexes by performing a lot of PostingPool
    # flushes.
    Lucy::Index::PostingListWriter::set_default_mem_thresh(0x1000);

Marvin Humphrey

Re: [lucy-user] 32 bit CentOS Indexing Question

Nick D.
If I have the current Lucy-0.3.3 version that is on cpan how do I go about getting those two commits mentioned earlier into the source that I have?

Re: [lucy-user] 32 bit CentOS Indexing Question

Nick D.
In reply to this post by Nick Wellnhofer
Nick Wellnhofer wrote
On Jan 31, 2014, at 21:43 , Nick D. <[hidden email]> wrote:

If you checkout the sortfieldwriter branch, you’ll get all these commits. If you’re using the 0.3 branch, you have apply them one-by-one. There’s a good chance that this will work without conflicts.

Nick
I've checked out the sortfieldwriter branch like so:
git clone https://git-wip-us.apache.org/repos/asf/lucy.git
git checkout -b test  origin/sortfieldwriter

And when I do a `git log` I see the commits:
commit ad178f10692659b4ed8b170ebfa42d13fd3eed20
Author: Nick Wellnhofer <wellnhofer@aevum.de>
Date:   Thu Sep 26 19:43:42 2013 +0200

    Use counting sort to sort doc_ids in SortFieldWriter#Refill
   
    Since we already have the ordinals for each doc_id, we can use a
    counting sort. This uses a temporary array of size run_cardinality but
    runs in linear time.

commit 98a960ed16c601569ab1b78e5a3e1e9302065180
Author: Nick Wellnhofer <wellnhofer@aevum.de>
Date:   Thu Sep 26 02:28:04 2013 +0200

    Free sorted_ids in SortFieldWriter a little earlier

commit 393723d354d8ce44841cd006a26d03894315088d
Author: Nick Wellnhofer <wellnhofer@aevum.de>
Date:   Thu Sep 26 02:13:49 2013 +0200

    Initialize SortFieldWriter#run_tick to 1
   
    Make sure we never use a run_tick of 0.

commit a5aa40a93d0b2542dc04afd387619b015cf273b5
Author: Nick Wellnhofer <wellnhofer@aevum.de>
Date:   Thu Sep 26 02:04:42 2013 +0200

    Make SortFieldWriter#Refill obey the memory limit
   
    The old logic was broken.

commit 0e49ac6f6ca45860d5598060b89bdac3fbfed2db
Author: Nick Wellnhofer <wellnhofer@aevum.de>
Date:   Thu Sep 26 01:25:52 2013 +0200

    Don't sort documents twice in SortFieldWriter#Refill
   
    The doc_ids are already sorted in S_lazy_init_sorted_ids. We only have
    to make sure that S_lazy_init_sorted_ids uses the doc_id as secondary
    sort key.
But when I look at commit you replied with "https://git-wip-us.apache.org/repos/asf?p=lucy.git;a=commitdiff;h=0e49ac6f6ca45860d5598060b89bdac3fbfed2db" and looking at the file SortFieldWriter.c here: https://git-wip-us.apache.org/repos/asf?p=lucy.git;a=blob;f=core/Lucy/Index/SortFieldWriter.c;h=6ae42d10a3e62f6e99058d668cb0b90fd91b53b1;hb=0e49ac6f6ca45860d5598060b89bdac3fbfed2db

I don't see the function "S_compare_doc_ids_by_ord_rev" in my branch. I've made sure to do a `git remote update` and merge, but it was up to date. Is there something extra that I need to do?

Attached is my SortFieldWriter.c file SortFieldWriter.c

Any help is always appreciated.

Re: [lucy-user] 32 bit CentOS Indexing Question

Nick Wellnhofer
On 05/02/2014 00:10, Nick D. wrote:
> I've checked out the sortfieldwriter branch like so:
> git clone https://git-wip-us.apache.org/repos/asf/lucy.git
> git checkout -b test  origin/sortfieldwriter
>
> And when I do a `git log` I see the commits:

Looks good.

> But when I look at commit you replied with
> "https://git-wip-us.apache.org/repos/asf?p=lucy.git;a=commitdiff;h=0e49ac6f6ca45860d5598060b89bdac3fbfed2db"
> and looking at the file SortFieldWriter.c here:
> https://git-wip-us.apache.org/repos/asf?p=lucy.git;a=blob;f=core/Lucy/Index/SortFieldWriter.c;h=6ae42d10a3e62f6e99058d668cb0b90fd91b53b1;hb=0e49ac6f6ca45860d5598060b89bdac3fbfed2db
>
> I don't see the function "S_compare_doc_ids_by_ord_rev" in my branch. I've
> made sure to do a `git remote update` and merge, but it was up to date. Is
> there something extra that I need to do?

That's OK. The function S_compare_doc_ids_by_ord_rev was removed in a later
commit in the branch.

Nick

Re: [lucy-user] 32 bit CentOS Indexing Question

Nick D.
In reply to this post by Nick Wellnhofer
Nick Wellnhofer wrote
On Jan 29, 2014, at 02:59 , Marvin Humphrey <[hidden email]> wrote:

> On Tue, Jan 28, 2014 at 11:26 AM, Nick D. <[hidden email]> wrote:
>> Why do I get a memory allocation error on a 32 bit OS and not a 64 bit OS?
>
> It's probably a known architectural flaw in SortWriter which makes it consume
> too much RAM.


It would be interesting to know whether they make a difference in Nick D.’s case. If they solve his problem, we should consider backporting the fix to the 0.3 branch.

Nick W.
The commits did not fix the issue, but committing every 50k records or so (syslog-style records) did, and it sped up indexing a bit. I was able to install the sortfieldwriter branch, but unfortunately set_default_mem_thresh did not speed up indexing (possibly the I/O device is the bottleneck).
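The periodic-commit workaround described above could be sketched roughly like this (a hypothetical adaptation of the original loop; `$index_path`, `%doc`, and the 50k batch size come from earlier in the thread, and it assumes an IndexManager whose recycle() returns an empty arrayref per the FastUpdates cookbook):

```perl
use Lucy::Index::Indexer;

my $batch = 0;
while ( my $line = <$fh> ) {
    # ... build %doc from $line as in the original snippet ...
    $indexer->add_doc(\%doc);
    $offset = tell($fh);

    # An Indexer is finished after commit, so open a fresh one per batch.
    if ( ++$batch % 50_000 == 0 ) {
        $indexer->commit;
        $indexer = Lucy::Index::Indexer->new(
            index  => $index_path,
            schema => $schema,
        );
    }
}
$indexer->commit;   # final commit for the last partial batch
```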