Benchmarking results


Benchmarking results

Marvin Humphrey
RESULTS A: 'body' neither stored nor vectorized
===========================================================================
configuration               avg secs       max memory consumed
---------------------------------------------------------------------------
Lucene / JVM 1.4               50.14               79 MB
Lucene / JVM 1.5               51.86               93 MB
KinoSearch / Perl 5.8.8        70.25               29 MB
KinoSearch / Perl 5.8.6        83.43               31 MB


RESULTS B: 'body' stored and vectorized
===========================================================================
configuration               avg secs       max memory consumed
---------------------------------------------------------------------------
KinoSearch / Perl 5.8.8        76.01               29 MB
Lucene / JVM 1.4               86.70              178 MB
KinoSearch / Perl 5.8.6        88.79               31 MB
Lucene / JVM 1.5               89.28              147 MB
Plucene / Perl 5.8.6         2014.00*            skipped


DISCUSSION
===========================================================================

1) Lucene performs better than KinoSearch when there is less data to  
be stored, while KinoSearch does better when there is a lot of data  
to be stored.  This may be because Lucene rewrites the stored field  
data and the term vector data whenever segments are merged, while  
KinoSearch writes that data only once (twice if you count the fact  
that KinoSearch only supports the compound file format, which we've  
disabled in Lucene for the sake of speed).  It probably also helps  
that KinoSearch stores term vector data with the stored field data in  
the .fdx file.

2) The memory consumed by Lucene is due to the generous value (1000)  
assigned to maxBufferedDocs, which is critical for indexing  
performance.  KinoSearch's memory consumption is primarily dependent  
on the mem_threshold argument to the KinoSearch::Util::SortExternal  
constructor, which isn't accessible from the public API at present.  
Increasing this from the default of 16 MB to 256 MB improves speed by  
another 15% or so.

3) The difference between Perl 5.8.8 and 5.8.6 probably has less to  
do with the version number and more to do with the fact that the  
5.8.6 install has threads enabled, while the 5.8.8 install does not.  
The 5.8.6 install is the Perl that Apple ships with OS X 10.4.  The  
5.8.8 install is compiled from source using all the Configure  
script's suggestions/defaults except for the two pertaining to  
installation location.

4) While Plucene is written in pure Perl and KinoSearch is written in  
Perl and C/XS, there are also substantial algorithmic differences  
between them.  These have been covered in depth elsewhere.

METHODOLOGY
===========================================================================

Source code for the experiment can be found at
<http://www.rectangular.com/svn/kinosearch/trunk/t/benchmarks/>.  The
tests were run using subversion repository revision 762.

The test corpus was Reuters-21578, Distribution 1.0.  Reuters-21578  
is available from David D. Lewis' professional home page, currently:

     http://www.research.att.com/~lewis

The times for KinoSearch and Lucene are 5-run averages.  OS X is a  
busy operating system, which injects some noise into the results.  
It's crucial that iters occur one right after another, as a second  
run immediately following another is often faster, but even a few  
seconds lag between them can slow the second run.  (Presumably this  
is due to cache reassignment.)  Therefore, the same command was  
issued on the command line 6 times, separated by semicolons.  The  
first iter was discarded, and the rest were averaged.

The maximum memory consumption was measured during auxiliary passes  
(i.e. not averaged in), using the crude method of eyeballing RPRVT in  
the output of top.

* The sole Plucene stat isn't an average, it's just one run, as there  
wasn't time to perform multiple runs.

HARDWARE
===========================================================================

     PowerBook G4 17" 1.67 GHz
     Mac OS X 10.4.5
     1.5 GB RAM
     Seagate 5400 rpm, 100 GB ATA HD


SOFTWARE
===========================================================================

Lucene 1.9.1
KinoSearch 0.09_03
Plucene 1.24

JVM 1.4.2_09
JVM 1.5.0_02
Apple's Perl 5.8.6 (shipped with OS X 10.4)
Perl 5.8.8 from source


RAW DATA
===========================================================================

slothbear:~/Desktop/ks/t/benchmarks marvin$ javac -d . indexers/LuceneIndexer.java
slothbear:~/Desktop/ks/t/benchmarks marvin$ java -server -Xmx500M LuceneIndexer; java -server -Xmx500M LuceneIndexer; java -server -Xmx500M LuceneIndexer; java -server -Xmx500M LuceneIndexer; java -server -Xmx500M LuceneIndexer; java -server -Xmx500M LuceneIndexer
Java Lucene 1.9.1 DOCS: 19043 SECS: 50.99
Java Lucene 1.9.1 DOCS: 19043 SECS: 50.42
Java Lucene 1.9.1 DOCS: 19043 SECS: 50.08
Java Lucene 1.9.1 DOCS: 19043 SECS: 49.54
Java Lucene 1.9.1 DOCS: 19043 SECS: 50.48
Java Lucene 1.9.1 DOCS: 19043 SECS: 50.18
slothbear:~/Desktop/ks/t/benchmarks marvin$ javac15 -d . indexers/LuceneIndexer.java
Note: indexers/LuceneIndexer.java uses unchecked or unsafe operations.
Note: Recompile with -Xlint:unchecked for details.
slothbear:~/Desktop/ks/t/benchmarks marvin$ java15 -server -Xmx500M LuceneIndexer; java15 -server -Xmx500M LuceneIndexer; java15 -server -Xmx500M LuceneIndexer; java15 -server -Xmx500M LuceneIndexer; java15 -server -Xmx500M LuceneIndexer; java15 -server -Xmx500M LuceneIndexer
Java Lucene 1.9.1 DOCS: 19043 SECS: 52.26
Java Lucene 1.9.1 DOCS: 19043 SECS: 51.91
Java Lucene 1.9.1 DOCS: 19043 SECS: 52.19
Java Lucene 1.9.1 DOCS: 19043 SECS: 51.80
Java Lucene 1.9.1 DOCS: 19043 SECS: 51.23
Java Lucene 1.9.1 DOCS: 19043 SECS: 52.19
slothbear:~/Desktop/ks/t/benchmarks marvin$ vim indexers/LuceneIndexer.java
slothbear:~/Desktop/ks/t/benchmarks marvin$ javac -d . indexers/LuceneIndexer.java
slothbear:~/Desktop/ks/t/benchmarks marvin$ java -server -Xmx500M LuceneIndexer; java -server -Xmx500M LuceneIndexer; java -server -Xmx500M LuceneIndexer; java -server -Xmx500M LuceneIndexer; java -server -Xmx500M LuceneIndexer; java -server -Xmx500M LuceneIndexer
Java Lucene 1.9.1 DOCS: 19043 SECS: 87.50
Java Lucene 1.9.1 DOCS: 19043 SECS: 87.42
Java Lucene 1.9.1 DOCS: 19043 SECS: 86.29
Java Lucene 1.9.1 DOCS: 19043 SECS: 86.74
Java Lucene 1.9.1 DOCS: 19043 SECS: 86.11
Java Lucene 1.9.1 DOCS: 19043 SECS: 86.96
slothbear:~/Desktop/ks/t/benchmarks marvin$ javac15 -d . indexers/LuceneIndexer.java
Note: indexers/LuceneIndexer.java uses unchecked or unsafe operations.
Note: Recompile with -Xlint:unchecked for details.
slothbear:~/Desktop/ks/t/benchmarks marvin$ java15 -server -Xmx500M LuceneIndexer; java15 -server -Xmx500M LuceneIndexer; java15 -server -Xmx500M LuceneIndexer; java15 -server -Xmx500M LuceneIndexer; java15 -server -Xmx500M LuceneIndexer; java15 -server -Xmx500M LuceneIndexer
Java Lucene 1.9.1 DOCS: 19043 SECS: 90.43
Java Lucene 1.9.1 DOCS: 19043 SECS: 90.52
Java Lucene 1.9.1 DOCS: 19043 SECS: 90.06
Java Lucene 1.9.1 DOCS: 19043 SECS: 89.69
Java Lucene 1.9.1 DOCS: 19043 SECS: 87.87
Java Lucene 1.9.1 DOCS: 19043 SECS: 88.24
slothbear:~/Desktop/ks/t/benchmarks marvin$ perl -Mblib indexers/kinosearch_indexer.plx; perl -Mblib indexers/kinosearch_indexer.plx; perl -Mblib indexers/kinosearch_indexer.plx; perl -Mblib indexers/kinosearch_indexer.plx; perl -Mblib indexers/kinosearch_indexer.plx; perl -Mblib indexers/kinosearch_indexer.plx
KinoSearch 0.09_03 DOCS: 19043  SECS: 87.20
KinoSearch 0.09_03 DOCS: 19043  SECS: 82.55
KinoSearch 0.09_03 DOCS: 19043  SECS: 82.38
KinoSearch 0.09_03 DOCS: 19043  SECS: 81.86
KinoSearch 0.09_03 DOCS: 19043  SECS: 87.79
KinoSearch 0.09_03 DOCS: 19043  SECS: 82.52
slothbear:~/Desktop/ks/t/benchmarks marvin$ vim indexers/kinosearch_indexer.plx
slothbear:~/Desktop/ks/t/benchmarks marvin$ perl -Mblib indexers/kinosearch_indexer.plx; perl -Mblib indexers/kinosearch_indexer.plx; perl -Mblib indexers/kinosearch_indexer.plx; perl -Mblib indexers/kinosearch_indexer.plx; perl -Mblib indexers/kinosearch_indexer.plx; perl -Mblib indexers/kinosearch_indexer.plx
KinoSearch 0.09_03 DOCS: 19043  SECS: 88.16
KinoSearch 0.09_03 DOCS: 19043  SECS: 87.70
KinoSearch 0.09_03 DOCS: 19043  SECS: 92.67
KinoSearch 0.09_03 DOCS: 19043  SECS: 87.32
KinoSearch 0.09_03 DOCS: 19043  SECS: 88.35
KinoSearch 0.09_03 DOCS: 19043  SECS: 87.92
slothbear:~/Desktop/ks/t/benchmarks marvin$ cd ~/Desktop/ks588/t/benchmarks/
slothbear:~/Desktop/ks588/t/benchmarks marvin$ /usr/local/perl588/bin/perl -Mblib indexers/kinosearch_indexer.plx; /usr/local/perl588/bin/perl -Mblib indexers/kinosearch_indexer.plx; /usr/local/perl588/bin/perl -Mblib indexers/kinosearch_indexer.plx; /usr/local/perl588/bin/perl -Mblib indexers/kinosearch_indexer.plx; /usr/local/perl588/bin/perl -Mblib indexers/kinosearch_indexer.plx; /usr/local/perl588/bin/perl -Mblib indexers/kinosearch_indexer.plx
KinoSearch 0.09_03 DOCS: 19043  SECS: 69.67
KinoSearch 0.09_03 DOCS: 19043  SECS: 70.44
KinoSearch 0.09_03 DOCS: 19043  SECS: 72.87
KinoSearch 0.09_03 DOCS: 19043  SECS: 69.94
KinoSearch 0.09_03 DOCS: 19043  SECS: 69.16
KinoSearch 0.09_03 DOCS: 19043  SECS: 68.82
slothbear:~/Desktop/ks588/t/benchmarks marvin$ vim indexers/kinosearch_indexer.plx
slothbear:~/Desktop/ks588/t/benchmarks marvin$ /usr/local/perl588/bin/perl -Mblib indexers/kinosearch_indexer.plx; /usr/local/perl588/bin/perl -Mblib indexers/kinosearch_indexer.plx; /usr/local/perl588/bin/perl -Mblib indexers/kinosearch_indexer.plx; /usr/local/perl588/bin/perl -Mblib indexers/kinosearch_indexer.plx; /usr/local/perl588/bin/perl -Mblib indexers/kinosearch_indexer.plx; /usr/local/perl588/bin/perl -Mblib indexers/kinosearch_indexer.plx
KinoSearch 0.09_03 DOCS: 19043  SECS: 87.58
KinoSearch 0.09_03 DOCS: 19043  SECS: 75.17
KinoSearch 0.09_03 DOCS: 19043  SECS: 75.86
KinoSearch 0.09_03 DOCS: 19043  SECS: 75.05
KinoSearch 0.09_03 DOCS: 19043  SECS: 78.55
KinoSearch 0.09_03 DOCS: 19043  SECS: 75.41
slothbear:~/Desktop/ks588/t/benchmarks marvin$ cd ~/Desktop/ks/t/benchmarks/
slothbear:~/Desktop/ks/t/benchmarks marvin$ perl indexers/plucene_indexer.plx; perl indexers/plucene_indexer.plx; perl indexers/plucene_indexer.plx; perl indexers/plucene_indexer.plx; perl indexers/plucene_indexer.plx
Plucene 1.24 DOCS: 19043  SECS: 2013.70
^C
Couldn't get lock at indexers/plucene_indexer.plx line 56
^C
^C
slothbear:~/Desktop/ks/t/benchmarks marvin$





RE: Benchmarking results

Pasha Bizhan-2
Hi,

> From: Marvin Humphrey [mailto:[hidden email]]


> The test corpus was Reuters-21578, Distribution 1.0.  
> Reuters-21578 is available from David D. Lewis' professional
> home page, currently:
>
>      http://www.research.att.com/~lewis

The correct link is
http://www.daviddlewis.com/resources/testcollections/reuters21578/
 
Pasha Bizhan



Re: Benchmarking results

Tatu Saloranta
In reply to this post by Marvin Humphrey

> The times for KinoSearch and Lucene are 5-run
...
> is due to cache reassignment.)  Therefore, the same
> command was  
> issued on the command line 6 times, separated by
> semicolons.  The  
> first iter was discarded, and the rest were
> averaged.
...
> The maximum memory consumption was measured during
> auxiliary passes  
> (i.e. not averaged in), using the crude method of
> eyeballing RPRVT in  
> the output of top.

Marvin, I think it is great that different implementations are
compared, and your results are interesting. However, I think that the
above methodology does not work well with Java (it may work better
for/with Perl, but might have problems there as well). In this case it
is maybe not quite as big a difference as for some other tests (since
the test runs were almost a minute long), i.e. no order-of-magnitude
difference, but it will be noticeable.

The reason is that it is crucial NOT to run consecutive tests by
restarting the JVM, unless you really want to measure one-shot,
single-run command-line total times. The startup overhead and warmup
of HotSpot essentially mean that if you ran a second indexing pass
right after the first one, it would be significantly faster, and not
just due to caching effects. Consecutive runs have run times that
converge towards sustainable long-term performance -- in this case the
second run may already be as fast as it'll get, since it runs for a
significant amount of time (I have noticed 30 or even 10 seconds of
warm-up is often sufficient). HotSpot only compiles Java bytecode when
it determines a need, and figuring that out takes a while.

So in this case, what would give more comparable results (assuming you
are interested in measuring the likely server-side usage scenario,
which is usually what Lucene is used for) would be to run all runs
within the same JVM / execution (for Perl), and either take the
fastest runs, or discard the first one and take the median or average.

Would this be possible? I am not really concerned about "whose
language is faster" here, but about the relevancy of the results,
using a methodology that gives realistic numbers for the usual use
case. Chances are, the Perl-based version would also perform better
(depending on how the Perl runtime optimizes things) if tests were run
under a single process.

Anyway, the above is intended as constructive criticism, so once again
thank you for doing these tests!

-+ Tatu +-

ps. Regarding memory usage: it is also quite tricky to measure
reliably, since garbage collection only kicks in when it has to... so
Java uses as much memory as it can (without expanding the heap)...
plus, JVMs do not necessarily (or even usually) return unused chunks
later on.
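
Tatu's suggestion -- run every rep inside one JVM so HotSpot compiles
the hot paths once, then discard the warm-up rep -- can be sketched as
follows. This is only an illustration: the class name and the
`indexOnce()` dummy workload are stand-ins of mine, not the benchmark
app's actual code.

```java
// Sketch: N reps in a single JVM; rep 0 absorbs HotSpot warm-up cost.
public class WarmBench {
    // Dummy stand-in for one full indexing pass.
    static long indexOnce() {
        long checksum = 0;
        for (int i = 0; i < 1000000; i++)
            checksum += Integer.toString(i).hashCode();
        return checksum;
    }

    public static void main(String[] args) {
        int reps = 6;
        double[] secs = new double[reps];
        for (int r = 0; r < reps; r++) {
            long t0 = System.nanoTime();
            indexOnce();
            secs[r] = (System.nanoTime() - t0) / 1e9;
        }
        double sum = 0.0;
        for (int r = 1; r < reps; r++)   // discard rep 0 (warm-up)
            sum += secs[r];
        System.out.printf("mean of reps 2-%d: %.3f secs%n",
                          reps, sum / (reps - 1));
    }
}
```

The same structure works for the Perl indexer by looping over reps
inside one interpreter process instead of re-invoking perl.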



Re: Benchmarking results

Igor Bolotin
For faster Hotspot warm-up you can use Hotspot VM option:
-XX:CompileThreshold=NN

This option controls number of method invocations/branches before
(re-)compiling. Defaults are: 10,000 -server, 1,500 -client.
See documentation here: http://java.sun.com/docs/hotspot/VMOptions.html

In one of my previous benchmarking projects we used -XX:CompileThreshold=100
to force compilation to happen as soon as possible.

However - you still have to warm up JVM before measuring performance. What
we did - we indexed relatively small corpus in different directory without
taking measurments before running actual benchmark in the same JVM session.

Igor


Re: Benchmarking results

Marvin Humphrey
In reply to this post by Tatu Saloranta

On Apr 4, 2006, at 10:23 AM, Tatu Saloranta wrote:
> So in this case, what would give more comparable results (assuming
> you are interested in measuring likely server-side
> usage scenario, which is usually what Lucene is used for)

My main interest with these tests is algorithmic performance.  How  
much time it takes to start up or warm up a JVM isn't something I  
want to be measuring.  There are startup issues I'm concerned about,  
but they mostly relate to file format design.  The load time for  
field norms is a significant concern.  So is the IndexInterval, which  
is set to 1024 by default instead of 128 as in Lucene.  So is the  
locality of reference issue for where the term vector data gets  
stored.  All of those things affect the total time it takes for a  
KinoSearch app to launch, load, search, and return results, which  
needs to be as small as possible so that e.g. website search apps  
indexing up to [some large number of] documents can be run as simple  
CGI scripts.  I'm considering further modifications to the file  
format to keep that total time down...

Actually, I think the benchmark results illustrate that everyone  
should be at least mildly concerned about where the Term Vector data  
gets stored.  KinoSearch only writes that data once.  Lucene,  
however, has to read/write that data during each merge, and the more  
streams you have, the more complex the merge.  It stands to reason  
that storing term vector data with the stored fields data would speed  
up the merge process.

I brought this issue up a few weeks ago, but in a search-time  
context.  The two primary applications for Term Vector data that I am  
aware of are excerpting/highlighting and "more like this" searches,  
both of which would benefit from having the term vectors stored with  
the documents, because each search would require fewer disk seeks.    
Term Vectors might also be used to build a pure vector space search
engine, like the one described in this article
<http://www.perl.com/pub/a/2003/02/19/engine.html>, but that's
impractical for indexes larger than a handful of documents and of
academic interest only.
Are there any other significant applications?  If not, I submit that  
term vectors belong in the .fdx file.

> would be to run all runs within same JVM / execution (for Perl),

Thanks for the critique.  I've updated the indexer apps to accept two  
command line arguments.  They're now run like so:

     java [ARGS] LuceneIndexer -reps 6 -docs 1000
     perl indexers/kinosearch_indexer.plx --reps=6 --docs=1000

With the new methodology, the numbers are slightly better for  
Lucene.  They're actually worse for KinoSearch.  I've isolated the
code that's responsible for the slowdown, and I speculate that
it's a memory fragmentation issue, as I can solve it by forcing
KinoSearch to consume more memory at that point.  However, having  
established that KinoSearch is in Lucene's league with regards to  
indexing speed, I'm not worried about absolute numbers, and the new  
benchmarker interface is slightly more stable, allowing more accurate  
comparative analysis of algorithmic efficiency.  The trends are still  
apparent: KinoSearch gains ground when there's stored and vectorized  
content.

Raw data is below.

> and either take the fastest runs, or discard the first one and take  
> median or
> average.

As you'll see in the raw data, the apps now produce two aggregate
numbers: a mean, and a truncated mean
<http://en.wikipedia.org/wiki/Truncated_mean>.
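
The truncated mean used here can be computed like so -- a sketch,
assuming "2 discarded" means dropping the single fastest and single
slowest of the 6 runs (the class name is mine, not the benchmark
app's). Fed the six Lucene run times from the raw data below, it
reproduces the reported 84.80-second figure.

```java
import java.util.Arrays;

public class TruncatedMean {
    // Sort the run times, drop the min and the max, average the rest
    // (6 runs -> 4 kept, 2 discarded).
    static double truncatedMean(double[] secs) {
        double[] sorted = secs.clone();
        Arrays.sort(sorted);
        double sum = 0.0;
        for (int i = 1; i < sorted.length - 1; i++)
            sum += sorted[i];
        return sum / (sorted.length - 2);
    }

    public static void main(String[] args) {
        double[] runs = {87.02, 84.56, 85.04, 83.83, 84.75, 84.84};
        System.out.printf("%.2f%n", truncatedMean(runs));  // prints 84.80
    }
}
```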

> ps. Regarding memory usage: it is also quite tricky to measure
>  reliably, since Garbage Collection only kicks in when it has to...
>  so Java uses as much memory as it can (without expanding heap)...
>  plus, JVMs do not necessarily (or even usually) return unused
>  chunks later on.

Yes.  Still, there is a correlation  between maxBufferedDocs and max  
memory consumption by the process.  So Java must be reusing something...

     maxBufferedDocs   max memory (1 rep)   truncated mean time (6 reps)
     -------------------------------------------------------------------
         10                69 MB                124.89 secs
        100                91 MB                 88.17 secs
       1000               169 MB                 84.80 secs

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

RAW DATA - JVM warmup / truncated mean experiment
===================================================

slothbear:~/Desktop/ks/t/benchmarks marvin$ java -server -Xmx500M -XX:CompileThreshold=100 LuceneIndexer -reps 6
---------------------------------------------------
1   Secs: 87.02  Docs: 19043
2   Secs: 84.56  Docs: 19043
3   Secs: 85.04  Docs: 19043
4   Secs: 83.83  Docs: 19043
5   Secs: 84.75  Docs: 19043
6   Secs: 84.84  Docs: 19043
---------------------------------------------------
Lucene 1.9.1
JVM 1.4.2_09 (Apple Computer, Inc.)
Mac OS X 10.4.5 ppc
Mean: 85.01 secs
Truncated mean (4 kept, 2 discarded): 84.80 secs
---------------------------------------------------
slothbear:~/Desktop/ks/t/benchmarks marvin$ cd ~/Desktop/ks588/t/benchmarks/
slothbear:~/Desktop/ks588/t/benchmarks marvin$ /usr/local/perl588/bin/perl -Mblib indexers/kinosearch_indexer.plx --reps 6
------------------------------------------------------------
1    Secs: 75.51  Docs: 19043
2    Secs: 80.79  Docs: 19043
3    Secs: 81.12  Docs: 19043
4    Secs: 84.68  Docs: 19043
5    Secs: 81.78  Docs: 19043
6    Secs: 79.65  Docs: 19043
------------------------------------------------------------
KinoSearch 0.09_03
Perl 5.8.8
Thread support: no
Darwin 8.5.0 Power Macintosh
Mean: 80.59 secs
Truncated mean (4 kept, 2 discarded): 80.83 secs
------------------------------------------------------------
slothbear:~/Desktop/ks588/t/benchmarks marvin$


RAW DATA - mergefactor experiment
============================================================

slothbear:~/Desktop/ks/t/benchmarks marvin$ java -server -Xmx500M -XX:CompileThreshold=100 LuceneIndexer -reps 6
---------------------------------------------------
1   Secs: 127.05  Docs: 19043
2   Secs: 125.50  Docs: 19043
3   Secs: 125.44  Docs: 19043
4   Secs: 124.53  Docs: 19043
5   Secs: 124.10  Docs: 19043
6   Secs: 121.57  Docs: 19043
---------------------------------------------------
Lucene 1.9.1
JVM 1.4.2_09 (Apple Computer, Inc.)
Mac OS X 10.4.5 ppc
Mean: 124.70 secs
Truncated mean (4 kept, 2 discarded): 124.89 secs
---------------------------------------------------
slothbear:~/Desktop/ks/t/benchmarks marvin$ vim indexers/LuceneIndexer.java
slothbear:~/Desktop/ks/t/benchmarks marvin$ javac -d . indexers/LuceneIndexer.java
slothbear:~/Desktop/ks/t/benchmarks marvin$ java -server -Xmx500M -XX:CompileThreshold=100 LuceneIndexer -reps 6
---------------------------------------------------
1   Secs: 89.91  Docs: 19043
2   Secs: 87.59  Docs: 19043
3   Secs: 88.51  Docs: 19043
4   Secs: 88.59  Docs: 19043
5   Secs: 87.97  Docs: 19043
6   Secs: 86.75  Docs: 19043
---------------------------------------------------
Lucene 1.9.1
JVM 1.4.2_09 (Apple Computer, Inc.)
Mac OS X 10.4.5 ppc
Mean: 88.22 secs
Truncated mean (4 kept, 2 discarded): 88.17 secs
---------------------------------------------------
slothbear:~/Desktop/ks/t/benchmarks marvin$


Re: Benchmarking results

Grant Ingersoll


Marvin Humphrey wrote:

>
> On Apr 4, 2006, at 10:23 AM, Tatu Saloranta wrote:
>> So in this case, what would give more comparable results (assuming
>> you are interested in measuring likely server-side
>> usage scenario, which is usually what Lucene is used for)
>
> Actually, I think the benchmark results illustrate that everyone
> should be at least mildly concerned about where the Term Vector data
> gets stored.  KinoSearch only writes that data once.  Lucene, however,
> has to read/write that data during each merge, and the more streams
> you have, the more complex the merge.  It stands to reason that
> storing term vector data with the stored fields data would speed up
> the merge process.
>
This seems like a good idea, especially combined with the lazy
loading/retrieve specified fields approach that we are proposing, so
that we aren't getting the term vector every time we retrieve a
document.  We could deprecate the IndexReader.getTermVector methods and
move them to be accessed via the Field.  I'm not sure what the issues
are completely, but it makes sense, since the TV data is not changing.


> Are there any other significant applications?
Clustering.  Corpora analysis/browsing.  Most likely others.

--

Grant Ingersoll
Sr. Software Engineer
Center for Natural Language Processing
Syracuse University
School of Information Studies
335 Hinds Hall
Syracuse, NY 13244

http://www.cnlp.org 
Voice:  315-443-5484
Fax: 315-443-6886



Re: Benchmarking results

Doug Cutting
In reply to this post by Marvin Humphrey
Marvin Humphrey wrote:
> However, having  established that KinoSearch
> is in Lucene's league with regards to  indexing speed, I'm not worried
> about absolute numbers, and the new  benchmarker interface is slightly
> more stable, allowing more accurate  comparative analysis of algorithmic
> efficiency.  The trends are still  apparent: KinoSearch gains ground
> when there's stored and vectorized  content.

Another axis that I don't think you're yet measuring is how things
change as the index grows.  What happens with 10k, 100k and 1M and 10M
documents?  There are typically knees in search engine performance
curves when indexes get substantially larger than RAM, and behaviour on
either side of the knee may differ with different indexing strategies.

Doug


Re: Benchmarking results

Marvin Humphrey
In reply to this post by Marvin Humphrey
Hello,

I have discovered a serious bug in the LuceneIndexer benchmarking  
app.  All tests have been rerun, and the new numbers reflect a 13-15%  
improvement for Lucene.  I apologize for having reported bad data.

Here are some of the new results, both with and without the bug so  
that you can see how the numbers were affected.  They were prepared  
using subversion repository 779.

RESULTS A: 'body' neither stored nor vectorized
===========================================================================
configuration             truncated mean secs (6 reps)   max memory (1 rep)
---------------------------------------------------------------------------
Lucene / JVM 1.4                  43.68                         79 MB
Lucene / JVM 1.5                  44.95                         93 MB
Lucene / JVM 1.4 with bug         49.63                         79 MB
Lucene / JVM 1.5 with bug         50.93                         92 MB

RESULTS B: 'body' stored and vectorized
===========================================================================
configuration             truncated mean secs (6 reps)   max memory (1 rep)
---------------------------------------------------------------------------
Lucene / JVM 1.4                  71.96                        118 MB
Lucene / JVM 1.5                  73.81                        214 MB
Lucene / JVM 1.4 with bug         84.73                        182 MB
Lucene / JVM 1.5 with bug         88.96                        199 MB

The bug was in buildFileList() and resulted in a bogus list of  
filepaths.  KinoSearch and Plucene were indexing 19043 documents once  
each.  Lucene was indexing 22 documents over and over, about 900  
times each.

   // Return a lexically sorted list of all article files from all subdirs.
   static String[] buildFileList () throws Exception {
     File[] articleDirs = corpusDir.listFiles();
     Vector filePaths = new Vector();
     for (int i = 0; i < articleDirs.length; i++) {
       File[] articles = articleDirs[i].listFiles();
       for (int j = 0; j < articles.length; j++) {
         String path = articles[i].getPath();   // <-- BUG: should be j, not i
         if (path.indexOf("article") == -1)
           continue;
         filePaths.add(path);
       }
     }
     Collections.sort(filePaths);
     return (String[])filePaths.toArray(new String[filePaths.size()]);
   }

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/


Re: Benchmarking results

Marvin Humphrey
In reply to this post by Doug Cutting

On Apr 7, 2006, at 10:05 AM, Doug Cutting wrote:

 > Another axis that I don't think you're yet measuring is how things
 > change as the index grows.

There are lots of axes that I haven't measured yet, but I do have to  
move on
to other things sooner or later.  :)  Running decent scientific  
experiments is
painstaking work.  I have thrown away many, many hours of invalid data.

Besides, my laptop deserves a break.  I've been running it flat out  
for so
long, it thinks it's a space heater.

 > What happens with 10k, 100k and 1M and 10M documents?

There are 19043 docs in the extracted Reuters corpus, comprising a  
little over
16 MB of content.  I've set up the benchmarking apps so that if you  
specify
more than 19043 docs on the command line, it loops back through the
collection, allowing an arbitrarily large number of docs to be indexed.
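The loop-back arrangement can be sketched as follows; this is a
hypothetical illustration of the wrap-around logic, not the actual
benchmarking code (names are invented):

```java
// Hypothetical sketch: reuse a finite corpus so that an arbitrarily
// large doc count can be requested on the command line.
public class CorpusLooper {
    // Map the Nth requested doc onto a path in the (smaller) corpus.
    static String pathFor(String[] filePaths, int docNum) {
        return filePaths[docNum % filePaths.length];
    }

    public static void main(String[] args) {
        String[] corpus = { "article01.txt", "article02.txt", "article03.txt" };
        // Doc 4 wraps past the end and lands on the second file.
        System.out.println(pathFor(corpus, 4));
    }
}
```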

Here are results for 100,000 docs:

config                  truncated mean secs (6 reps)    max memory (1 rep)
--------------------------------------------------------------------------
Lucene / JVM 1.4                       356 secs              224 MB
KinoSearch / Perl 5.8.8                407 secs               30 MB

... and 1 million docs:

config                  truncated mean secs (6 reps)    max memory (1 rep)
--------------------------------------------------------------------------
Lucene / JVM 1.4                      4013                  skipped
KinoSearch / Perl 5.8.8               4776                  skipped

I'll try to get the 10 million benchmark done, but it's going to be a  
little
tough, as I'm running these on my main working laptop.  I can predict  
the
trend, though, as we pile more and more docs on KinoSearch 0.09_03.

At 19000 docs and a mem_threshold of 16 MB, the external sorting  
algorithm
flushes 6 times, so 6 sorted runs must be merged when the postings  
sort pool
gets sorted and the tis/tii/frq/prx files get written.  At 1 million  
docs
and a mem_threshold of 16 MB, the flush happens somewhere in the  
neighborhood
of 300 times, so c. 300 sorted runs must be merged.

At 10 million docs and a mem_threshold of 16 MB, we should see  
something like
3000 sorted runs.  That's probably too many, and if it isn't,  
eventually it
will be.

When the external sorter flips from feeding to fetching mode, it  
establishes a
buffer for each run.  In KinoSearch 0.09_03, each buffer is allowed  
to consume
memory according to this formula:

     sortex->run_cache_limit = (sortex->mem_threshold / 2) / sortex->num_runs;

When there's 6 runs, each buffer can eat 1.33 MB, but when there's  
300 runs,
each buffer only gets around 25 kB.
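In Java form, the formula and the two cases above work out like this
(a sketch of the arithmetic only, not KinoSearch's actual code):

```java
// Sketch of the 0.09_03 per-run buffer formula: half of mem_threshold,
// split evenly among the sorted runs.
public class RunCacheLimit {
    static int runCacheLimit(int memThreshold, int numRuns) {
        return (memThreshold / 2) / numRuns;
    }

    public static void main(String[] args) {
        int memThreshold = 16 * 1024 * 1024;  // the 16 MB default
        System.out.println(runCacheLimit(memThreshold, 6));    // ~1.33 MB per run
        System.out.println(runCacheLimit(memThreshold, 300));  // ~27 kB per run
    }
}
```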

The primary problem with external sorting algorithms that use an N-
way merge
is that they tend to be IO bound [1, 2]: between recovering data from  
the sorted
runs and writing to the final output, the disk has to seek all over  
the place.
Therefore, it's important that each run's buffer be fairly  
generous, to
minimize the number of refills.

I'll wager that at 3000 runs and 2.5 kB buffers, the external sorter is
going to seize up.  Fortunately there's an easy solution:

Use more than 16 MB of RAM.

Earlier versions of KinoSearch used a different external sorter which  
set a
fixed buffer size of 32 kB of content.  I'll fix 0.09 before the  
official
release to set a floor of 64 kB ram usage for the buffers, and sometime
hence consider how to expose mem_threshold in the public API.
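The planned fix amounts to clamping the formula from below; a minimal
sketch, assuming a 64 kB floor (again, not KinoSearch's actual code):

```java
// Sketch: same per-run formula, but never let a buffer fall below 64 kB,
// so that merges with many runs don't degenerate into tiny reads.
public class FlooredRunCacheLimit {
    static final int FLOOR = 64 * 1024;

    static int runCacheLimit(int memThreshold, int numRuns) {
        return Math.max((memThreshold / 2) / numRuns, FLOOR);
    }

    public static void main(String[] args) {
        int memThreshold = 16 * 1024 * 1024;
        // At 3000 runs the raw formula would yield under 3 kB; the floor wins.
        System.out.println(runCacheLimit(memThreshold, 3000));
    }
}
```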

 > There are typically knees in search engine performance curves when
 > indexes get substantially larger than RAM, and behaviour on either
 > side of the knee may differ with different indexing strategies.

As far as KinoSearch is concerned, the index was substantially larger
than RAM when it was only 19000 docs, so we're well past the knee.

Arguably, these tests have been comparing apples and crabapples,  
because I
haven't forced KinoSearch to use more RAM.  As mentioned earlier, the  
rough
analogue to maxMergeDocs is the RAM that the in-memory sort pool is  
allowed
to consume before a run must be written to disk -- 16 MB, by  
default.  I can
set it higher, but that's not part of the public API, so it seemed like
cheating to do so.  Here's how things shake out when KinoSearch is  
hacked to
use maximum RAM...

19043 docs, body neither stored nor vectorized:

config                  truncated mean secs (6 reps)    max memory (1 rep)
--------------------------------------------------------------------------
Lucene / JVM 1.4                      43.68 secs                79 MB
KinoSearch / Perl 5.8.8               63.89 secs               230 MB

19043 docs, body stored and vectorized:

config                  truncated mean secs (6 reps)    max memory (1 rep)
--------------------------------------------------------------------------
KinoSearch / Perl 5.8.8               69.12 secs               236 MB
Lucene / JVM 1.4                      71.96 secs               118 MB

Perhaps in my desire to give Lucene every advantage, I've gone  
overboard and
hobbled KinoSearch too much.  My main interest is in comparing  
algorithms, and
in retrospect it's a more realistic comparison if I up KinoSearch's RAM,
private API hack or no.  However, I knew that at least some of the  
people
reading these benchmarking reports would see "Cage Match! Lucene vs.
KinoSearch!" and not pay attention to the subtler, more interesting  
qualitative
differences: where one or the other does better and why.  It seemed  
important
to take the highest high road so that no controversy could ever arise  
as to
whether one engine had been given a leg up.  I'm just not worried  
about the
speed that KinoSearch is ultimately going to attain.  I'm intimately
acquainted with where the bottlenecks tend to occur in Perl, but  
KinoSearch is
not a pure Perl library -- it's an extension to Perl written in Perl  
and C,
with some of the performance-critical work done by C.  (Call it a "C  
library
with a Perl interface", if you like -- that's close enough.) This  
approach has
its drawbacks, but now that, after much experimentation and effort, all  
or nearly
all of the algorithmic issues have been solved, speed isn't one of  
them.  There
are numerous optimizations still to be done, so it's going to get  
faster.

The interesting thing to me about the unlimited-memory results above  
is how the
results change when there's stored and vectorized data.  A tricked-
out Lucene
1.9.1 indexer seems to outperform a tricked-out KinoSearch 0.09_03  
indexer with
regards to inverting documents.  However, when there's stored/
vectorized data,
Lucene appears to have some overhead that KinoSearch doesn't.  
Corroborating
data from additional independent experiments would be nice to have,  
but I think
at this point we have enough to entertain the hypothesis that the  
KinoSearch
merge model, considered in isolation, is simply a more efficient  
algorithm than
the current Lucene merge model.  I suspect that were Lucene to adopt it,
indexing speed would improve.

   * The stored field data and term vectors data are only written once,
     rather than shuffled around with each merge.
   * The serialized postings are plain old byte strings which can be  
compared
     using memcmp, so there's no OO overhead for the comparison routine.
   * The streams in an N-way merge are more predictable and therefore  
easier to
     optimize for IO efficiency.
   * External sorting is a well-studied problem and existing  
scholarship can be
     leveraged.
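The memcmp point in particular is easy to illustrate; this is a hedged
sketch of ordering serialized postings as raw byte strings, not
KinoSearch's actual comparison routine:

```java
import java.nio.charset.StandardCharsets;

// Sketch: serialized postings designed to sort correctly as plain byte
// strings can be ordered with an unsigned byte-by-byte compare
// (memcmp-style), with no object-oriented dispatch in the inner loop.
public class PostingCompare {
    static int comparePostings(byte[] a, byte[] b) {
        int len = Math.min(a.length, b.length);
        for (int i = 0; i < len; i++) {
            int diff = (a[i] & 0xFF) - (b[i] & 0xFF);  // compare as unsigned
            if (diff != 0) return diff;
        }
        return a.length - b.length;  // the shorter string sorts first
    }

    public static void main(String[] args) {
        byte[] x = "apple".getBytes(StandardCharsets.UTF_8);
        byte[] y = "apples".getBytes(StandardCharsets.UTF_8);
        System.out.println(comparePostings(x, y) < 0);  // a prefix sorts first
    }
}
```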

That's my two cents, and my contribution.

Thanks for the critique,

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

[1] "Parallel Out-of-Core Sorting: The Third Way", Geeta Chaudhry
     Section 8.1, "Merging based algorithms"
     http://www.cs.dartmouth.edu/reports/abstracts/TR2004-517/
[2] "Asynchronous Parallel Disk Sorting", Dementiev/Sanders
     Section 2.2 "Multi-way merging"
     http://i10www.ira.uka.de/dementiev/files/DS03.pdf

RAW DATA - 100,000 docs
==================================================

slothbear:~/Desktop/ks/t/benchmarks marvin$ java -server -Xmx500M -XX:CompileThreshold=100 LuceneIndexer -docs 100000 -reps 6
---------------------------------------------------
1   Secs: 384.62  Docs: 100000
2   Secs: 359.83  Docs: 100000
3   Secs: 354.13  Docs: 100000
4   Secs: 354.63  Docs: 100000
5   Secs: 355.56  Docs: 100000
6   Secs: 354.18  Docs: 100000
---------------------------------------------------
Lucene 1.9.1
JVM 1.4.2_09 (Apple Computer, Inc.)
Mac OS X 10.4.6 ppc
Mean: 360.49 secs
Truncated mean (4 kept, 2 discarded): 356.05 secs
---------------------------------------------------
slothbear:~/Desktop/ks/t/benchmarks marvin$ cd ~/Desktop/ks588/t/benchmarks/
slothbear:~/Desktop/ks588/t/benchmarks marvin$ /usr/local/perl588/bin/perl -Mblib indexers/kinosearch_indexer.plx --docs=100000 --reps=6
------------------------------------------------------------
1    Secs: 406.87  Docs: 100000
2    Secs: 409.73  Docs: 100000
3    Secs: 405.33  Docs: 100000
4    Secs: 407.35  Docs: 100000
5    Secs: 406.76  Docs: 100000
6    Secs: 409.45  Docs: 100000
------------------------------------------------------------
KinoSearch 0.09_03
Perl 5.8.8
Thread support: no
Darwin 8.6.0 Power Macintosh
Mean: 407.58 secs
Truncated mean (4 kept, 2 discarded): 407.61 secs
------------------------------------------------------------
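The "truncated mean (4 kept, 2 discarded)" reported in these transcripts
drops the fastest and slowest rep and averages the remaining four; a
sketch of that computation:

```java
import java.util.Arrays;

// Sketch of the truncated mean used in the raw data: sort the rep times,
// discard the min and the max, and average the rest.
public class TruncatedMean {
    static double truncatedMean(double[] secs) {
        double[] sorted = secs.clone();
        Arrays.sort(sorted);
        double sum = 0.0;
        for (int i = 1; i < sorted.length - 1; i++) {  // skip min and max
            sum += sorted[i];
        }
        return sum / (sorted.length - 2);
    }

    public static void main(String[] args) {
        // The six Lucene reps from the 100,000-doc run above.
        double[] reps = { 384.62, 359.83, 354.13, 354.63, 355.56, 354.18 };
        System.out.printf("%.2f%n", truncatedMean(reps));  // 356.05
    }
}
```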

RAW DATA - 1,000,000 docs
==================================================

slothbear:~/Desktop/ks588/t/benchmarks marvin$ /usr/local/perl588/bin/perl -Mblib indexers/kinosearch_indexer.plx --docs=1000000 --reps=6; cd ~/Desktop/ks/t/benchmarks/; java -server -Xmx500M -XX:CompileThreshold=100 LuceneIndexer -docs 1000000 -reps 6
------------------------------------------------------------
1    Secs: 4590.71  Docs: 1000000
2    Secs: 4694.36  Docs: 1000000
3    Secs: 4976.30  Docs: 1000000
4    Secs: 4760.99  Docs: 1000000
5    Secs: 4801.72  Docs: 1000000
6    Secs: 4834.97  Docs: 1000000
------------------------------------------------------------
KinoSearch 0.09_03
Perl 5.8.8
Thread support: no
Darwin 8.6.0 Power Macintosh
Mean: 4776.51 secs
Truncated mean (4 kept, 2 discarded): 4773.01 secs
------------------------------------------------------------
---------------------------------------------------
1   Secs: 4,023.33  Docs: 1000000
2   Secs: 4,003.18  Docs: 1000000
3   Secs: 4,012.21  Docs: 1000000
4   Secs: 4,015.97  Docs: 1000000
5   Secs: 4,010.61  Docs: 1000000
6   Secs: 4,013.39  Docs: 1000000
---------------------------------------------------
Lucene 1.9.1
JVM 1.4.2_09 (Apple Computer, Inc.)
Mac OS X 10.4.6 ppc
Mean: 4,013.12 secs
Truncated mean (4 kept, 2 discarded): 4,013.04 secs
---------------------------------------------------
slothbear:~/Desktop/ks/t/benchmarks marvin$


RAW DATA -- KinoSearch large RAM
================================

slothbear:~/Desktop/ks588/t/benchmarks marvin$ /usr/local/perl588/bin/perl -Mblib indexers/kinosearch_indexer.plx --reps=6
------------------------------------------------------------
1    Secs: 64.86  Docs: 19043
2    Secs: 68.95  Docs: 19043
3    Secs: 67.98  Docs: 19043
4    Secs: 69.92  Docs: 19043
5    Secs: 71.56  Docs: 19043
6    Secs: 69.63  Docs: 19043
------------------------------------------------------------
KinoSearch 0.09_03
Perl 5.8.8
Thread support: no
Darwin 8.6.0 Power Macintosh
Mean: 68.82 secs
Truncated mean (4 kept, 2 discarded): 69.12 secs
------------------------------------------------------------
slothbear:~/Desktop/ks588/t/benchmarks marvin$ /usr/local/perl588/bin/perl -Mblib indexers/kinosearch_indexer.plx --reps=6
------------------------------------------------------------
1    Secs: 60.00  Docs: 19043
2    Secs: 63.35  Docs: 19043
3    Secs: 66.77  Docs: 19043
4    Secs: 63.82  Docs: 19043
5    Secs: 64.04  Docs: 19043
6    Secs: 64.37  Docs: 19043
------------------------------------------------------------
KinoSearch 0.09_03
Perl 5.8.8
Thread support: no
Darwin 8.6.0 Power Macintosh
Mean: 63.72 secs
Truncated mean (4 kept, 2 discarded): 63.89 secs
------------------------------------------------------------
slothbear:~/Desktop/ks588/t/benchmarks marvin$




