Porting benchmark suite

Marvin Humphrey
Greets,

Lucy needs sophisticated search-time benchmarking.  The obvious approach is to
port the Lucene contrib benchmark suite.  

However, contrib benchmark has a large number of classes, the documentation is
sparse and occasionally wrong ("Usage: java Benchmark algorithm-file"),
there's no howto or Wiki page (just package.html) ... and one obvious starting
point, the "Benchmarker" class, is deprecated.

What's actually important in the benchmark suite?  Besides "Benchmarker" being
deprecated, there appear to be multiple "stats" and "utils" directories. Are
there large chunks of obsolete code that can be safely ignored?

Marvin Humphrey



Re: Porting benchmark suite

Grant Ingersoll
The build file in contrib/benchmark has a "run" target that shows how to
run it.  The important part to port is the "by task" stuff: http://lucene.apache.org/java/2_4_0/api/contrib-benchmark/org/apache/lucene/benchmark/byTask/package-summary.html
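
If memory serves, you kick it off roughly like so (the target and property
names below are from the 2.4-era build.xml, so double-check them against
your checkout):

  cd contrib/benchmark
  ant run-task -Dtask.alg=conf/micro-standard.alg -Dtask.mem=512M

where the .alg file is the algorithm script the byTask docs describe.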





--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika) using Solr/Lucene:
http://www.lucidimagination.com/search


Re: Porting benchmark suite

Michael McCandless

You'll also need at least some of the *QueryMaker classes under feeds.

You might also want to make an improvement: change the QueryMaker API
to include both the query and the "arrival time" of that query.  And
then fix all ReadTask (and Search*Task) so that queries are executed
at their scheduled time (assuming enough threads & hardware).

This way one could play back a true search log and measure "realistic"
query latencies, or, one could concoct synthetic difficult cases (4
very hard queries suddenly running at once) and understand how
performance degrades.
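
To make that concrete, something in this direction (just a sketch of the
idea -- TimedQuery and the scheduling glue here are invented for
illustration, not the actual QueryMaker/ReadTask code):

import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

// Hypothetical pairing of a query with its scheduled arrival time,
// as the proposed QueryMaker change would hand back.
class TimedQuery {
  final Query query;
  final long arrivalMsec;  // offset from the start of the run
  TimedQuery(Query query, long arrivalMsec) {
    this.query = query;
    this.arrivalMsec = arrivalMsec;
  }
}

class ArrivalTimePlayer {

  // Fire each query at its scheduled time (given enough threads),
  // measuring latency from scheduled arrival so that queueing delay
  // under load shows up in the numbers.
  static void play(final IndexSearcher searcher, List<TimedQuery> log)
      throws InterruptedException {
    ScheduledExecutorService pool = Executors.newScheduledThreadPool(16);
    final long runStart = System.currentTimeMillis();
    for (final TimedQuery tq : log) {
      pool.schedule(new Runnable() {
        public void run() {
          try {
            searcher.search(tq.query, 10);
          } catch (Exception e) {
            e.printStackTrace();
          }
          long latency = System.currentTimeMillis() - (runStart + tq.arrivalMsec);
          System.out.println(tq.query + " latency=" + latency + " msec");
        }
      }, tq.arrivalMsec, TimeUnit.MILLISECONDS);
    }
    pool.shutdown();
    pool.awaitTermination(1, TimeUnit.HOURS);
  }
}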

Another thing I miss (which I've worked around w/ Python scripts on
top) is to be able to save a set of runs, and then use it as a
baseline when comparing to another set of runs, with the ability to
print out resulting tables in Jira's markup.
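
E.g., given two saved runs keyed by task name (how the runs would get
persisted is the open part; the maps below are stand-ins), printing the
Jira-markup comparison table is the easy bit:

import java.util.Map;

// Sketch: compare baseline vs. contender runs (task name -> rate)
// and emit a table in Jira markup.
class RunComparison {
  static void printJiraTable(Map<String, Double> baseline,
                             Map<String, Double> contender) {
    System.out.println("||Task||Baseline rec/s||Contender rec/s||Diff||");
    for (Map.Entry<String, Double> e : baseline.entrySet()) {
      Double base = e.getValue();
      Double cont = contender.get(e.getKey());
      if (cont == null) continue;  // task only in the baseline run
      double pct = 100.0 * (cont - base) / base;
      System.out.println(String.format("|%s|%.1f|%.1f|%+.1f%%|",
                                       e.getKey(), base, cont, pct));
    }
  }
}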

Mike


Re: Porting benchmark suite

Jason Rutherglen
I'm planning to work on incorporating Mike's Python scripts into the
Java benchmark code. I'd like to keep track of overall suggestions
for improvements to contrib/benchmark. Perhaps I should open an issue
so people can post suggestions? This way I can look at them and code
them up (as I'll forget otherwise or they'll be lost in the dev
list). Marvin may think of improvements in the midst of porting
(seems he already has).


Re: Porting benchmark suite

Mark Miller
Jason Rutherglen wrote:
> I'm planning to work on incorporating Mike's Python scripts into the
> Java benchmark code.
+1! I had an itch to do this myself, but it never got scratched.

> I'd like to keep track of overall suggestions
> for improvements to contrib/benchmark. Perhaps I should open an issue
> so people can post suggestions? This way I can look at them and code
> them up (as I'll forget otherwise or they'll be lost in the dev
> list). Marvin may think of improvements in the midst of porting
> (seems he already has).
I'm not against the idea, I suppose, but I think the current method of
posting an issue when you have an idea works just fine (e.g. an issue
per idea, whether related to benchmark or not). It can always be closed
if it ends up being too pie-in-the-sky or something.

- Mark


Re: Porting benchmark suite

Michael McCandless

Mark Miller wrote:

> Jason Rutherglen wrote:
>> I'm planning to work on incorporating Mike's Python scripts into the
>> Java benchmark code.
> +1! I had an itch to do this myself, but it never got scratched.

We could also robustify those scripts and just use them.  I've used  
them (modified each time) on various issues now.

Re: Porting benchmark suite

Grant Ingersoll

On Feb 9, 2009, at 12:24 PM, Jason Rutherglen wrote:

> I'm planning to work on incorporating Mike's Python scripts into the
> Java benchmark code. I'd like to keep track of overall suggestions
> for improvements to contrib/benchmark. Perhaps I should open an issue
> so people can post suggestions? This way I can look at them and code
> them up (as I'll forget otherwise or they'll be lost in the dev
> list).

I forget whether it's possible or not, but I'd love to be able to  
benchmark indexing and querying at the same time.  Basically, X  
threads querying in parallel while writes/commits are taking place.
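
Shape-wise, something like this toy (not contrib/benchmark code -- the
doc contents, thread count, and naive reopen-per-query are placeholders,
written against the 2.4 API as best I recall it):

import java.util.concurrent.atomic.AtomicBoolean;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

public class ConcurrentSearchIndexBench {
  public static void main(String[] args) throws Exception {
    final Directory dir = new RAMDirectory();
    final IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(),
        true, IndexWriter.MaxFieldLength.LIMITED);
    final AtomicBoolean done = new AtomicBoolean(false);

    // Writer thread: adds docs, committing every 1000.
    Thread indexer = new Thread() {
      public void run() {
        try {
          for (int i = 0; i < 100000; i++) {
            Document doc = new Document();
            doc.add(new Field("body", "apache lucene doc " + i,
                Field.Store.NO, Field.Index.ANALYZED));
            writer.addDocument(doc);
            if (i % 1000 == 0) writer.commit();
          }
        } catch (Exception e) { e.printStackTrace(); }
        done.set(true);
      }
    };
    indexer.start();

    // X searcher threads querying in parallel with the writes/commits.
    Thread[] searchers = new Thread[4];
    for (int t = 0; t < searchers.length; t++) {
      searchers[t] = new Thread() {
        public void run() {
          try {
            while (!done.get()) {
              // Naive: reopen per query to see committed docs; a real
              // test would also measure reopen cost separately.
              IndexSearcher s = new IndexSearcher(dir);
              s.search(new TermQuery(new Term("body", "lucene")), 10);
              s.close();
            }
          } catch (Exception e) { e.printStackTrace(); }
        }
      };
      searchers[t].start();
    }

    indexer.join();
    for (Thread t : searchers) t.join();
    writer.close();
  }
}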

As for tracking suggestions, there is a contrib/benchmark component in  
JIRA for Lucene, so if people just add a JIRA issue against that  
component, we shouldn't lose track.

Re: Porting benchmark suite

Jason Rutherglen
> to be able to benchmark indexing and querying at the same time.  Basically, X threads querying in parallel while writes/commits are taking place.

+1 Agreed, this is useful for benchmarking realtime search (which is some of the motivation for learning and improving the benchmark code). 


Re: Porting benchmark suite

Grant Ingersoll

On Feb 10, 2009, at 2:29 PM, Jason Rutherglen wrote:

> > to be able to benchmark indexing and querying at the same time.
> > Basically, X threads querying in parallel while writes/commits are
> > taking place.

I should add "reopens" to the writes/commits.

>
>
> +1 Agreed, this is useful for benchmarking realtime search (which is  
> some of the motivation for learning and improving the benchmark code).

Definitely.

The other thing I think would be useful is the ability to "induce" a
collection with certain characteristics.  For instance, perhaps I'd
like 1 million documents with a field that contains 457,893 unique
terms, spread across those documents.  Then, I'd like to be able to
sort on that field and benchmark said sorting, FieldCaching, etc.  I
know it takes the randomness out of things, but in some cases you want
specific characteristics for testing.  Of course, this should all be
easy enough with a DocMaker, so no worries on implementing it.
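
For the record, I'm picturing roughly this shape (a back-of-the-envelope
sketch, not the real DocMaker interface; the field name and term scheme
are made up):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

// Minimal generator for an "induced" collection: numDocs documents
// whose "sortField" cycles through exactly uniqueTerms distinct values,
// spread evenly.  A real version would plug into contrib/benchmark's
// DocMaker so .alg files could drive it.
public class InducedCollectionDocs {
  private final int numDocs;
  private final int uniqueTerms;
  private int docID = 0;

  public InducedCollectionDocs(int numDocs, int uniqueTerms) {
    this.numDocs = numDocs;
    this.uniqueTerms = uniqueTerms;
  }

  // Returns the next document, or null when the collection is complete.
  public Document next() {
    if (docID >= numDocs) return null;
    Document doc = new Document();
    String term = "term" + (docID % uniqueTerms);
    doc.add(new Field("sortField", term,
        Field.Store.NO, Field.Index.NOT_ANALYZED));
    docID++;
    return doc;
  }
}

With numDocs=1000000 and uniqueTerms=457893 that gives the example above,
and sorting on "sortField" then hits FieldCache with a known number of
distinct values.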


-Grant
