[jira] Created: (LUCENE-675) Lucene benchmark: objective performance test for Lucene

classic Classic list List threaded Threaded
58 messages Options
123
Reply | Threaded
Open this post in threaded view
|

[jira] Created: (LUCENE-675) Lucene benchmark: objective performance test for Lucene

JIRA jira@apache.org
Lucene benchmark: objective performance test for Lucene
-------------------------------------------------------

                 Key: LUCENE-675
                 URL: http://issues.apache.org/jira/browse/LUCENE-675
             Project: Lucene - Java
          Issue Type: Improvement
            Reporter: Andrzej Bialecki


We need an objective way to measure the performance of Lucene, both indexing and querying, on a known corpus. This issue is intended to collect comments and patches implementing a suite of such benchmarking tests.

Regarding the corpus: one of the widely used and freely available corpora is the original Reuters collection, available from http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/news20.tar.gz or http://people.csail.mit.edu/u/j/jrennie/public_html/20Newsgroups/20news-18828.tar.gz. I propose to use this corpus as a base for benchmarks. The benchmarking suite could automatically retrieve it from known locations, and cache it locally.

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Updated: (LUCENE-675) Lucene benchmark: objective performance test for Lucene

JIRA jira@apache.org
     [ http://issues.apache.org/jira/browse/LUCENE-675?page=all ]

Andrzej Bialecki  updated LUCENE-675:
-------------------------------------

    Attachment: LuceneBenchmark.java

This is just a starting point for discussion - it's a pretty old file I found lying around, so it may not even compile with modern Lucene. Requires commons-compress.

> Lucene benchmark: objective performance test for Lucene
> -------------------------------------------------------
>
>                 Key: LUCENE-675
>                 URL: http://issues.apache.org/jira/browse/LUCENE-675
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Andrzej Bialecki
>         Attachments: LuceneBenchmark.java
>
>
> We need an objective way to measure the performance of Lucene, both indexing and querying, on a known corpus. This issue is intended to collect comments and patches implementing a suite of such benchmarking tests.
> Regarding the corpus: one of the widely used and freely available corpora is the original Reuters collection, available from http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/news20.tar.gz or http://people.csail.mit.edu/u/j/jrennie/public_html/20Newsgroups/20news-18828.tar.gz. I propose to use this corpus as a base for benchmarks. The benchmarking suite could automatically retrieve it from known locations, and cache it locally.

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (LUCENE-675) Lucene benchmark: objective performance test for Lucene

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org
    [ http://issues.apache.org/jira/browse/LUCENE-675?page=comments#action_12436437 ]
           
Paul Smith commented on LUCENE-675:
-----------------------------------

If you're looking for freely available text in bulk, what about:

http://www.gutenberg.org/wiki/Main_Page

> Lucene benchmark: objective performance test for Lucene
> -------------------------------------------------------
>
>                 Key: LUCENE-675
>                 URL: http://issues.apache.org/jira/browse/LUCENE-675
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Andrzej Bialecki
>         Attachments: LuceneBenchmark.java
>
>
> We need an objective way to measure the performance of Lucene, both indexing and querying, on a known corpus. This issue is intended to collect comments and patches implementing a suite of such benchmarking tests.
> Regarding the corpus: one of the widely used and freely available corpora is the original Reuters collection, available from http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/news20.tar.gz or http://people.csail.mit.edu/u/j/jrennie/public_html/20Newsgroups/20news-18828.tar.gz. I propose to use this corpus as a base for benchmarks. The benchmarking suite could automatically retrieve it from known locations, and cache it locally.

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (LUCENE-675) Lucene benchmark: objective performance test for Lucene

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org
    [ http://issues.apache.org/jira/browse/LUCENE-675?page=comments#action_12436442 ]
           
Andrzej Bialecki  commented on LUCENE-675:
------------------------------------------

Yes, that could be a good additional source. However, IMHO the primary corpus should be widely known and standardized, hence my proposal of the Reuters.

(I mistakenly copy&paste-d the urls in the comment above - of course the corpus they're pointing at is the "20 Newsgroups", not the Reuters one. Correct url for the Reuters corpus is  http://www.daviddlewis.com/resources/testcollections/reuters21578/ ).

> Lucene benchmark: objective performance test for Lucene
> -------------------------------------------------------
>
>                 Key: LUCENE-675
>                 URL: http://issues.apache.org/jira/browse/LUCENE-675
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Andrzej Bialecki
>         Attachments: LuceneBenchmark.java
>
>
> We need an objective way to measure the performance of Lucene, both indexing and querying, on a known corpus. This issue is intended to collect comments and patches implementing a suite of such benchmarking tests.
> Regarding the corpus: one of the widely used and freely available corpora is the original Reuters collection, available from http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/news20.tar.gz or http://people.csail.mit.edu/u/j/jrennie/public_html/20Newsgroups/20news-18828.tar.gz. I propose to use this corpus as a base for benchmarks. The benchmarking suite could automatically retrieve it from known locations, and cache it locally.

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (LUCENE-675) Lucene benchmark: objective performance test for Lucene

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org
    [ http://issues.apache.org/jira/browse/LUCENE-675?page=comments#action_12436443 ]
           
Paul Smith commented on LUCENE-675:
-----------------------------------

From a strict performance point of view, a standard set of important, but don't forget other languages.

From a tokenization point of view (seperate to this issues), perhaps the Gutenberg project would be useful to test correctness of the analysis phase.

> Lucene benchmark: objective performance test for Lucene
> -------------------------------------------------------
>
>                 Key: LUCENE-675
>                 URL: http://issues.apache.org/jira/browse/LUCENE-675
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Andrzej Bialecki
>         Attachments: LuceneBenchmark.java
>
>
> We need an objective way to measure the performance of Lucene, both indexing and querying, on a known corpus. This issue is intended to collect comments and patches implementing a suite of such benchmarking tests.
> Regarding the corpus: one of the widely used and freely available corpora is the original Reuters collection, available from http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/news20.tar.gz or http://people.csail.mit.edu/u/j/jrennie/public_html/20Newsgroups/20news-18828.tar.gz. I propose to use this corpus as a base for benchmarks. The benchmarking suite could automatically retrieve it from known locations, and cache it locally.

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (LUCENE-675) Lucene benchmark: objective performance test for Lucene

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org
    [ http://issues.apache.org/jira/browse/LUCENE-675?page=comments#action_12436502 ]
           
Karl Wettin commented on LUCENE-675:
------------------------------------

It is also interesting to know how much time is consumed to assemble an instance of Document from the storage. According to my own tests this is the major reason to why InstantiatedIndex is so much faster than a FS/RAMDirectory. I also presume it to be the bottleneck of any RDBMS-, RMI- or any other "proxy"-based storage.

> Lucene benchmark: objective performance test for Lucene
> -------------------------------------------------------
>
>                 Key: LUCENE-675
>                 URL: http://issues.apache.org/jira/browse/LUCENE-675
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Andrzej Bialecki
>         Attachments: LuceneBenchmark.java
>
>
> We need an objective way to measure the performance of Lucene, both indexing and querying, on a known corpus. This issue is intended to collect comments and patches implementing a suite of such benchmarking tests.
> Regarding the corpus: one of the widely used and freely available corpora is the original Reuters collection, available from http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/news20.tar.gz or http://people.csail.mit.edu/u/j/jrennie/public_html/20Newsgroups/20news-18828.tar.gz. I propose to use this corpus as a base for benchmarks. The benchmarking suite could automatically retrieve it from known locations, and cache it locally.

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (LUCENE-675) Lucene benchmark: objective performance test for Lucene

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org
    [ http://issues.apache.org/jira/browse/LUCENE-675?page=comments#action_12436516 ]
           
Grant Ingersoll commented on LUCENE-675:
----------------------------------------

Since this has dependencies, do you think we should put it under contrib?  I would be for a Performance directory and we could then organize it from there.  Perhaps into packages for quantitative and qualitative performance.  

> Lucene benchmark: objective performance test for Lucene
> -------------------------------------------------------
>
>                 Key: LUCENE-675
>                 URL: http://issues.apache.org/jira/browse/LUCENE-675
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Andrzej Bialecki
>         Attachments: LuceneBenchmark.java
>
>
> We need an objective way to measure the performance of Lucene, both indexing and querying, on a known corpus. This issue is intended to collect comments and patches implementing a suite of such benchmarking tests.
> Regarding the corpus: one of the widely used and freely available corpora is the original Reuters collection, available from http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/news20.tar.gz or http://people.csail.mit.edu/u/j/jrennie/public_html/20Newsgroups/20news-18828.tar.gz. I propose to use this corpus as a base for benchmarks. The benchmarking suite could automatically retrieve it from known locations, and cache it locally.

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (LUCENE-675) Lucene benchmark: objective performance test for Lucene

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org
    [ http://issues.apache.org/jira/browse/LUCENE-675?page=comments#action_12436518 ]
           
Andrzej Bialecki  commented on LUCENE-675:
------------------------------------------

The dependency on commons-compress could be avoided - I used this just to be able to unpack tar.gz files, we can use Ant for that. If you meant the dependency on the corpus - can't Ant download this too as a dependency?

Re: Project Gutenberg - good point, this is a good source for multi-lingual documents. The "Europarl" collection is another, although a bit more hefty, so that could be suitable for running large-scale benchmarks, and texts from Project Gutenberg for running small-scale tests.

> Lucene benchmark: objective performance test for Lucene
> -------------------------------------------------------
>
>                 Key: LUCENE-675
>                 URL: http://issues.apache.org/jira/browse/LUCENE-675
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Andrzej Bialecki
>         Attachments: LuceneBenchmark.java
>
>
> We need an objective way to measure the performance of Lucene, both indexing and querying, on a known corpus. This issue is intended to collect comments and patches implementing a suite of such benchmarking tests.
> Regarding the corpus: one of the widely used and freely available corpora is the original Reuters collection, available from http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/news20.tar.gz or http://people.csail.mit.edu/u/j/jrennie/public_html/20Newsgroups/20news-18828.tar.gz. I propose to use this corpus as a base for benchmarks. The benchmarking suite could automatically retrieve it from known locations, and cache it locally.

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (LUCENE-675) Lucene benchmark: objective performance test for Lucene

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org
    [ http://issues.apache.org/jira/browse/LUCENE-675?page=comments#action_12436519 ]
           
Grant Ingersoll commented on LUCENE-675:
----------------------------------------

Yeah, ANT can do this, I think.  Take a look at the DB contrib package, it downloads.  I think I can setup the necessary stuff in contrib, if people think that is a good idea.  First contribution will be this file and then we can go from there.  I think Otis has run some perf. stuff too, but I am not sure if it can be contributed.  I think someone else has really studied query perf. so it would be cool if that was added too.

> Lucene benchmark: objective performance test for Lucene
> -------------------------------------------------------
>
>                 Key: LUCENE-675
>                 URL: http://issues.apache.org/jira/browse/LUCENE-675
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Andrzej Bialecki
>         Attachments: LuceneBenchmark.java
>
>
> We need an objective way to measure the performance of Lucene, both indexing and querying, on a known corpus. This issue is intended to collect comments and patches implementing a suite of such benchmarking tests.
> Regarding the corpus: one of the widely used and freely available corpora is the original Reuters collection, available from http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/news20.tar.gz or http://people.csail.mit.edu/u/j/jrennie/public_html/20Newsgroups/20news-18828.tar.gz. I propose to use this corpus as a base for benchmarks. The benchmarking suite could automatically retrieve it from known locations, and cache it locally.

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (LUCENE-675) Lucene benchmark: objective performance test for Lucene

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org
    [ http://issues.apache.org/jira/browse/LUCENE-675?page=comments#action_12436587 ]
           
Otis Gospodnetic commented on LUCENE-675:
-----------------------------------------

I still haven't gotten my employer to sign and fax the CCLA, so I'm stuck and can't contribute my search benchmark.

I have a suggestion for a name for this - Lube, for Lucene Benchmark - contrib/lube.

> Lucene benchmark: objective performance test for Lucene
> -------------------------------------------------------
>
>                 Key: LUCENE-675
>                 URL: http://issues.apache.org/jira/browse/LUCENE-675
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Andrzej Bialecki
>         Attachments: LuceneBenchmark.java
>
>
> We need an objective way to measure the performance of Lucene, both indexing and querying, on a known corpus. This issue is intended to collect comments and patches implementing a suite of such benchmarking tests.
> Regarding the corpus: one of the widely used and freely available corpora is the original Reuters collection, available from http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/news20.tar.gz or http://people.csail.mit.edu/u/j/jrennie/public_html/20Newsgroups/20news-18828.tar.gz. I propose to use this corpus as a base for benchmarks. The benchmarking suite could automatically retrieve it from known locations, and cache it locally.

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Assigned: (LUCENE-675) Lucene benchmark: objective performance test for Lucene

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org
     [ http://issues.apache.org/jira/browse/LUCENE-675?page=all ]

Grant Ingersoll reassigned LUCENE-675:
--------------------------------------

    Assignee: Grant Ingersoll

> Lucene benchmark: objective performance test for Lucene
> -------------------------------------------------------
>
>                 Key: LUCENE-675
>                 URL: http://issues.apache.org/jira/browse/LUCENE-675
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Andrzej Bialecki
>         Assigned To: Grant Ingersoll
>         Attachments: LuceneBenchmark.java
>
>
> We need an objective way to measure the performance of Lucene, both indexing and querying, on a known corpus. This issue is intended to collect comments and patches implementing a suite of such benchmarking tests.
> Regarding the corpus: one of the widely used and freely available corpora is the original Reuters collection, available from http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/news20.tar.gz or http://people.csail.mit.edu/u/j/jrennie/public_html/20Newsgroups/20news-18828.tar.gz. I propose to use this corpus as a base for benchmarks. The benchmarking suite could automatically retrieve it from known locations, and cache it locally.

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (LUCENE-675) Lucene benchmark: objective performance test for Lucene

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org
    [ http://issues.apache.org/jira/browse/LUCENE-675?page=comments#action_12436858 ]
           
Michael McCandless commented on LUCENE-675:
-------------------------------------------

I think this is an incredibly important initiative: with every
non-trivial change to Lucene (eg lock-less commits) we must verify
performance did not get worse.  But, as things stand now, it's an
ad-hoc thing that each developer needs to do.

So (as a consumer of this), I would love to have a ready-to-use
standard test that I could run to check if I've slowed things down
with lock-less commits.

In the mean time I've been using Europarl for my testing.

Also important to realize is there are many dimensions to test.  With
lock-less I'm focusing entirely on "wall clock time to open readers
and writers" in different use cases like pure indexing, pure
searching, highly interactive mixed indexing/searching, etc.  And this
is actually hard to test cleanly because in certain cases (highly
interactive case, or many readers case), the current Lucene hits many
"commit lock" retries and/or timeouts (whereas lock-less doesn't).  So
what's a "fair" comparison in this case?

In addition to standardizing on the corpus I think we ideallly need
standardized hardware / OS / software configuration as well, so the
numbers are easily comparable across time.  Even the test process
itself is important, eg details like "you should reboot the box before
each run" and "discard results from first run then take average of
next 3 runs as your result", are important.  It would be wonderful if
we could get this into a nightly automated regression test so we could
track over time how the performance has changed (and, for example,
quickly detect accidental regressions).  We should probably open this
as a separate issue which depends first on this issue being complete.


> Lucene benchmark: objective performance test for Lucene
> -------------------------------------------------------
>
>                 Key: LUCENE-675
>                 URL: http://issues.apache.org/jira/browse/LUCENE-675
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Andrzej Bialecki
>         Assigned To: Grant Ingersoll
>         Attachments: LuceneBenchmark.java
>
>
> We need an objective way to measure the performance of Lucene, both indexing and querying, on a known corpus. This issue is intended to collect comments and patches implementing a suite of such benchmarking tests.
> Regarding the corpus: one of the widely used and freely available corpora is the original Reuters collection, available from http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/news20.tar.gz or http://people.csail.mit.edu/u/j/jrennie/public_html/20Newsgroups/20news-18828.tar.gz. I propose to use this corpus as a base for benchmarks. The benchmarking suite could automatically retrieve it from known locations, and cache it locally.

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (LUCENE-675) Lucene benchmark: objective performance test for Lucene

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org
    [ http://issues.apache.org/jira/browse/LUCENE-675?page=comments#action_12436934 ]
           
Mike Klaas commented on LUCENE-675:
-----------------------------------

A few notes on benchmarks:

First, it is important to realize that no benchmark will ever fully-capture all aspects of lucene performance, particularly since so many real-world data distributions are so varied.  That said, they are useful tools, especially if they are componentized to measure various aspects of lucene performance (the narrower the goal of the benchmark it, the better a benchmark can be created).

It is rather unrealistic to expect to standardize hardware / os ... better to compare before/after numbers on a single configuration, rather than comparing the numbers among configurations.  The test process _is_ important, but anything crucial should be built into the test (like the number of iterations; taking the average, etc).  Concerning the specifics of this: Requiring reboots is onerous and not an important criterion (at least for unix systems--I'm not sufficiently familiar with windows to comment).  Better to stipulate a relatively quiscient machine.  Or perhaps not--it might be useful to see how the machine load affects lucene performance.  Also, the arithmetic mean is a terrible way of combining results due to its emphasis on outliers.  Better is the average over minimum times of small sets of runs.  

Of course, any scheme has its problems.  In general, the most important thing when using benchmarks is being aware of the limitations of the benchmark and methodology used.

> Lucene benchmark: objective performance test for Lucene
> -------------------------------------------------------
>
>                 Key: LUCENE-675
>                 URL: http://issues.apache.org/jira/browse/LUCENE-675
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Andrzej Bialecki
>         Assigned To: Grant Ingersoll
>         Attachments: LuceneBenchmark.java
>
>
> We need an objective way to measure the performance of Lucene, both indexing and querying, on a known corpus. This issue is intended to collect comments and patches implementing a suite of such benchmarking tests.
> Regarding the corpus: one of the widely used and freely available corpora is the original Reuters collection, available from http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/news20.tar.gz or http://people.csail.mit.edu/u/j/jrennie/public_html/20Newsgroups/20news-18828.tar.gz. I propose to use this corpus as a base for benchmarks. The benchmarking suite could automatically retrieve it from known locations, and cache it locally.

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (LUCENE-675) Lucene benchmark: objective performance test for Lucene

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org
    [ http://issues.apache.org/jira/browse/LUCENE-675?page=comments#action_12436949 ]
           
Grant Ingersoll commented on LUCENE-675:
----------------------------------------

My comments are marked by GSI
-----------

In the mean time I've been using Europarl for my testing.

GSI: perhaps you can contribute once this is setup

Also important to realize is there are many dimensions to test. With
lock-less I'm focusing entirely on "wall clock time to open readers
and writers" in different use cases like pure indexing, pure
searching, highly interactive mixed indexing/searching, etc. And this
is actually hard to test cleanly because in certain cases (highly
interactive case, or many readers case), the current Lucene hits many
"commit lock" retries and/or timeouts (whereas lock-less doesn't). So
what's a "fair" comparison in this case?

GSI:  I am planning on taking Andrzej contribution and refactoring it into components that can be reused, as well as creating a "standard" benchmark which will be easy to run through a simple ant task, i.e. ant run-baseline

GSI: From here, anybody can contribute their own (I will provide interfaces to facilitate this) benchmarks which others can choose to run.


In addition to standardizing on the corpus I think we ideallly need
standardized hardware / OS / software configuration as well, so the
numbers are easily comparable across time.

GSI: Not really feasible unless you are proposing to buy us machines :-)  I think more important is the ability to do a before and after evaluation (that runs each test several times) as you make changes.  Anybody should be able to do the same.  Run the benchmark, apply the patch and then rerun the benchmark.


> Lucene benchmark: objective performance test for Lucene
> -------------------------------------------------------
>
>                 Key: LUCENE-675
>                 URL: http://issues.apache.org/jira/browse/LUCENE-675
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Andrzej Bialecki
>         Assigned To: Grant Ingersoll
>         Attachments: LuceneBenchmark.java
>
>
> We need an objective way to measure the performance of Lucene, both indexing and querying, on a known corpus. This issue is intended to collect comments and patches implementing a suite of such benchmarking tests.
> Regarding the corpus: one of the widely used and freely available corpora is the original Reuters collection, available from http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/news20.tar.gz or http://people.csail.mit.edu/u/j/jrennie/public_html/20Newsgroups/20news-18828.tar.gz. I propose to use this corpus as a base for benchmarks. The benchmarking suite could automatically retrieve it from known locations, and cache it locally.

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (LUCENE-675) Lucene benchmark: objective performance test for Lucene

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org
    [ http://issues.apache.org/jira/browse/LUCENE-675?page=comments#action_12436972 ]
           
Dawid Weiss commented on LUCENE-675:
------------------------------------

First -- I think it's a good initiative. Grant, when you're thinking about the infrastructure, it would be pretty neat to have a way of logging performance in a way so that one could draw charts from them. You know, for the visual folks :)

Anyway, my other idea is that benchmarking Lucene can be performed on two levels: one is the user level, where the entire operation counts (such as indexing, searching etc). Another aspect is measurement of atomic parts _within_ the big operation so that you know how much of the whole thing each subpart takes. I wrote an interesting piece of code once that allows measuring times for named operation (per-thread) in a recursive way. Looks something like this:

perfLogger.start("indexing");
try {
  .. code (with recursion etc)  ...
  perfLogger.start("subpart");
  try {

  } finally {
     perfLogger.stop();
  }
} finally {
  perfLogger.stop();
}

in the output you get something like this:

indexing: 5 seconds;
   ->subpart: : 2 seconds;
   -> ...

Of course everything comes at a price and the above logging costs some CPU cycles (my implementation stored a nesting stack in ThreadLocals).

One can always put that code in 'if' clauses attached to final variables and enable logging only for benchmarking targets (the compiler will get rid of logging statements then).

If folks are interested I can dig out that performance logger and maybe adopt it to what Grant comes up with.

> Lucene benchmark: objective performance test for Lucene
> -------------------------------------------------------
>
>                 Key: LUCENE-675
>                 URL: http://issues.apache.org/jira/browse/LUCENE-675
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Andrzej Bialecki
>         Assigned To: Grant Ingersoll
>         Attachments: LuceneBenchmark.java
>
>
> We need an objective way to measure the performance of Lucene, both indexing and querying, on a known corpus. This issue is intended to collect comments and patches implementing a suite of such benchmarking tests.
> Regarding the corpus: one of the widely used and freely available corpora is the original Reuters collection, available from http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/news20.tar.gz or http://people.csail.mit.edu/u/j/jrennie/public_html/20Newsgroups/20news-18828.tar.gz. I propose to use this corpus as a base for benchmarks. The benchmarking suite could automatically retrieve it from known locations, and cache it locally.

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (LUCENE-675) Lucene benchmark: objective performance test for Lucene

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org
    [ http://issues.apache.org/jira/browse/LUCENE-675?page=comments#action_12436979 ]
           
Michael McCandless commented on LUCENE-675:
-------------------------------------------

I agree: a simple ant-accessible benchmark to enable "before and
after" runs is an awesome step forward.  And that a standardized HW/SW
testing environment is not really realistic now.

> GSI: perhaps you can contribute once this is setup

I will try!

> Lucene benchmark: objective performance test for Lucene
> -------------------------------------------------------
>
>                 Key: LUCENE-675
>                 URL: http://issues.apache.org/jira/browse/LUCENE-675
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Andrzej Bialecki
>         Assigned To: Grant Ingersoll
>         Attachments: LuceneBenchmark.java
>
>
> We need an objective way to measure the performance of Lucene, both indexing and querying, on a known corpus. This issue is intended to collect comments and patches implementing a suite of such benchmarking tests.
> Regarding the corpus: one of the widely used and freely available corpora is the original Reuters collection, available from http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/news20.tar.gz or http://people.csail.mit.edu/u/j/jrennie/public_html/20Newsgroups/20news-18828.tar.gz. I propose to use this corpus as a base for benchmarks. The benchmarking suite could automatically retrieve it from known locations, and cache it locally.

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (LUCENE-675) Lucene benchmark: objective performance test for Lucene

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org
    [ http://issues.apache.org/jira/browse/LUCENE-675?page=comments#action_12436980 ]
           
Doron Cohen commented on LUCENE-675:
------------------------------------

Few things that would be nice to have in this performance package/framework -

() indexing only overall time.
() indexing only time changes as the index grows (might be the case that indexing performance starts to misbehave from a certain size or so).
() search single user while indexing
() search only single user
() search only concurrent users
() short queries
() long queries
() wild card queries
() range queries
() queries with rare words
() queries with common words
() tokenization/analysis only (above indexing measurements include tokenization, but it would be important to be able to "prove" to oneself that tokenization/analysis time is not hurt by  a recent change).

() parametric control over:
() () location of test input data.
() () location of output index.
() () location of output log/results.
() ()  total collection size (total number of bytes/characters read from collection)
() () document (average) size (bytes/chars) - test can break input data and recompose it into documents of desired size.
() () "implicit iteration size" - merge-factor, max-buffered-docs
() () "explicit iteration size" - how often the perf test calls
() () long queries text
() () short queries text
() () which parts of the test framework capabilities to run
() () number of users / threads.
() () queries pace - how many queries are fired in, say, a minute.

Additional points:
() Would help if all test run parameters are maintained in a properties (or xml config) file, so one can easily modify the test input/output without having to recompile the code.
() Output to allow easy creation of graphs or so - perhaps best would be to have an result object, so others can easily extend with additional output formats.
() index size as part of output.
() number of index files as part of output (?)
() indexing input module that can loop over the input collection. This allows to test indexing of a collection larger than the actual input collection being used.



> Lucene benchmark: objective performance test for Lucene
> -------------------------------------------------------
>
>                 Key: LUCENE-675
>                 URL: http://issues.apache.org/jira/browse/LUCENE-675
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Andrzej Bialecki
>         Assigned To: Grant Ingersoll
>         Attachments: LuceneBenchmark.java
>
>
> We need an objective way to measure the performance of Lucene, both indexing and querying, on a known corpus. This issue is intended to collect comments and patches implementing a suite of such benchmarking tests.
> Regarding the corpus: one of the widely used and freely available corpora is the original Reuters collection, available from http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/news20.tar.gz or http://people.csail.mit.edu/u/j/jrennie/public_html/20Newsgroups/20news-18828.tar.gz. I propose to use this corpus as a base for benchmarks. The benchmarking suite could automatically retrieve it from known locations, and cache it locally.

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (LUCENE-675) Lucene benchmark: objective performance test for Lucene

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org
    [ http://issues.apache.org/jira/browse/LUCENE-675?page=comments#action_12440990 ]
           
Grant Ingersoll commented on LUCENE-675:
----------------------------------------

OK, I have a preliminary implementation based on adapting Andrzej's approach.  The interesting thing about this approach, is it is easy to adapt to be more or less exhaustive (i.e. how many of the parameters does one wish to have the system alter as it runs)  Thus, you can have it change the merge factors, max buffered docs, number of documents indexed, number of different queries run, etc.  The tradeoff, of course, is the length of time it takes to run these.

So my question to those interested, is what is a good baseline running time for testing in a standard way?  My initial thought is to have something that takes between 15-30 minutes to run, but I am not sure on this.  Another approach would be to have three "baselines":  1. quick validation (5 minutes to run...) 2. standard (15-45) 3. exhaustive (1-10 hours).  

I know several others have built benchmarking suites for their internal use, what has been your strategy?

Thoughts, ideas, insights?

Thanks,
Grant

> Lucene benchmark: objective performance test for Lucene
> -------------------------------------------------------
>
>                 Key: LUCENE-675
>                 URL: http://issues.apache.org/jira/browse/LUCENE-675
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Andrzej Bialecki
>         Assigned To: Grant Ingersoll
>         Attachments: LuceneBenchmark.java
>
>
> We need an objective way to measure the performance of Lucene, both indexing and querying, on a known corpus. This issue is intended to collect comments and patches implementing a suite of such benchmarking tests.
> Regarding the corpus: one of the widely used and freely available corpora is the original Reuters collection, available from http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/news20.tar.gz or http://people.csail.mit.edu/u/j/jrennie/public_html/20Newsgroups/20news-18828.tar.gz. I propose to use this corpus as a base for benchmarks. The benchmarking suite could automatically retrieve it from known locations, and cache it locally.

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (LUCENE-675) Lucene benchmark: objective performance test for Lucene

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org
    [ http://issues.apache.org/jira/browse/LUCENE-675?page=comments#action_12440994 ]
           
Marvin Humphrey commented on LUCENE-675:
----------------------------------------

The indexing benchmarking apps I wrote take command line arguments for how many docs and how many reps.  My standard test is to do 1000 docs and 6 reps.  Within a couple seconds the first rep is done and the app is printing out results.  For rapid development, having something that speedy is really handy.

> Lucene benchmark: objective performance test for Lucene
> -------------------------------------------------------
>
>                 Key: LUCENE-675
>                 URL: http://issues.apache.org/jira/browse/LUCENE-675
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Andrzej Bialecki
>         Assigned To: Grant Ingersoll
>         Attachments: LuceneBenchmark.java
>
>
> We need an objective way to measure the performance of Lucene, both indexing and querying, on a known corpus. This issue is intended to collect comments and patches implementing a suite of such benchmarking tests.
> Regarding the corpus: one of the widely used and freely available corpora is the original Reuters collection, available from http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/news20.tar.gz or http://people.csail.mit.edu/u/j/jrennie/public_html/20Newsgroups/20news-18828.tar.gz. I propose to use this corpus as a base for benchmarks. The benchmarking suite could automatically retrieve it from known locations, and cache it locally.

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (LUCENE-675) Lucene benchmark: objective performance test for Lucene

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org
    [ http://issues.apache.org/jira/browse/LUCENE-675?page=comments#action_12441281 ]
           
Doug Cutting commented on LUCENE-675:
-------------------------------------

As Marvin points out, quick micro-benchmarks are great to have.  But other effects only show up when things get very large.  So I think we need at least two baselines: micro and macro.

> Lucene benchmark: objective performance test for Lucene
> -------------------------------------------------------
>
>                 Key: LUCENE-675
>                 URL: http://issues.apache.org/jira/browse/LUCENE-675
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Andrzej Bialecki
>         Assigned To: Grant Ingersoll
>         Attachments: LuceneBenchmark.java
>
>
> We need an objective way to measure the performance of Lucene, both indexing and querying, on a known corpus. This issue is intended to collect comments and patches implementing a suite of such benchmarking tests.
> Regarding the corpus: one of the widely used and freely available corpora is the original Reuters collection, available from http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/news20.tar.gz or http://people.csail.mit.edu/u/j/jrennie/public_html/20Newsgroups/20news-18828.tar.gz. I propose to use this corpus as a base for benchmarks. The benchmarking suite could automatically retrieve it from known locations, and cache it locally.

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

123