Parallel tests in ANT, experiment volunteers welcome :)

Dawid Weiss-2
I've been quietly working on an ANT task that would run tests in
isolated JVMs, similar to what the Lucene build files do using macros and
selectors. It's been fun and I was finally able to integrate a few
other features I've always wanted (like detailed progress listeners),
but that's another story.

If you have a multi-multi-core machine (or if you don't and want to
provide some feedback) then please run the following script/commands:

# get the code from my github fork:
git clone [hidden email]:dweiss/lucene-solr.git --depth 1 -b junit4
cd lucene-solr/lucene

# Get a baseline for core tests running on trunk/ant macros.
git checkout trunk
ant test-core -Dtestcase=compileonly
# This is a single run and it depends on the seed, but we'll consider
# it a baseline -> write down the execution time or remember it.
ant test-core -Dtests.seed=random

# Switch over to junit4 branch and recompile.
git checkout junit4
ant test-core -Dtestcase=compileonly

# An initial pass collects statistics; these can be stored with the project
# to bootstrap, but for now they're zero. Adjust the number of CPUs to your
# system: typically, you'll want the physical number of cores - 1 (reserved
# for the aggregator).
# tests.seed is set to an empty value because junit4 and ltc use a different
# format! So there will be some variability across test executions (and
# that's good because estimates will vary).
ant test-core -Dtests.seed= -Dtests.cpus=4
ant test-core -Dtests.seed= -Dtests.cpus=4
ant test-core -Dtests.seed= -Dtests.cpus=4

# With each run the estimates (shown up front) should be getting closer to
# the real execution time for each slave. They will not be exact because of
# randomness, but should be fairly close. For example I get:
#   [junit4] Expected execution time on slave 0:   233.94s
#   [junit4] Expected execution time on slave 3:   233.94s
#   [junit4] Expected execution time on slave 1:   233.95s
#   [junit4] Expected execution time on slave 2:   233.95s
#
# and the real times:

#
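
The message does not say how the task turns the collected statistics into per-slave estimates. As a rough illustration only, here is a minimal sketch of one common balancing heuristic (greedy, longest suite first). The function name, suite names and timings are hypothetical, and nothing here is taken from the junit4 task's code.

```python
import heapq

def balance_suites(estimates, slaves):
    """Assign suites to slaves greedily, longest estimated suite first.

    estimates: {suite_name: estimated_seconds}
    Returns (assignment, loads): suites per slave and estimated time per slave.
    """
    # Heap of (estimated_load, slave_index); the least-loaded slave is on top.
    heap = [(0.0, i) for i in range(slaves)]
    heapq.heapify(heap)
    assignment = {i: [] for i in range(slaves)}
    ordered = sorted(estimates.items(), key=lambda kv: kv[1], reverse=True)
    for suite, seconds in ordered:
        load, idx = heapq.heappop(heap)
        assignment[idx].append(suite)
        heapq.heappush(heap, (load + seconds, idx))
    loads = {idx: load for load, idx in heap}
    return assignment, loads

# Hypothetical suites and timings, for illustration only.
estimates = {"TestA": 40.0, "TestB": 30.0, "TestC": 20.0, "TestD": 10.0}
assignment, loads = balance_suites(estimates, slaves=2)
for idx in sorted(assignment):
    print(f"Expected execution time on slave {idx}: {loads[idx]:.2f}s {assignment[idx]}")
```

With all statistics at zero (the first pass) every suite looks equally cheap, so a heuristic like this can only balance well once real timings have been recorded.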

I would very much appreciate feedback on (including but not limited to):

1) If something is not working. The tests hung on my machine once: the
    slave JVM wasn't responsive, it didn't even dump a stack trace, and it
    didn't react to kill -QUIT, nothing.

2) Is test execution faster than the baseline? By how much? For
    multi-multi-cores, if you have time: how does execution time correlate
    with tests.cpus? (I assume memory bandwidth or disk will be the
    bottleneck at some point.)

3) Did you enjoy the sweet hum of cpu fans? For zero-noise systems:
you better crank up those pumps or put something cold on the cpu unit
:)

Thanks,
Dawid

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


Re: Parallel tests in ANT, experiment volunteers welcome :)

Dawid Weiss-2
Darn, Gmail likes to line wrap... I've put the script here too:

https://gist.github.com/1539653

Dawid


Re: Parallel tests in ANT, experiment volunteers welcome :)

Dawid Weiss-2
Robert just informed me that there is an exception coming out from ANT
if you run it with ANT 1.7.1. Don't know if it's a known issue, but I
use ANT 1.8.x and the problem is not present there.

Dawid


Re: Parallel tests in ANT, experiment volunteers welcome :)

Dawid Weiss-2
I updated the code and it works with Ant 1.7.1 now. I also noticed
parameters are parsed slightly differently (maybe it's Windows), so you
need to quote to pass an empty parameter as in:

ant test-core "-Dtests.seed=" -Dtests.cpus=7

Dawid


Re: Parallel tests in ANT, experiment volunteers welcome :)

Robert Muir
Here are a couple of runs from my machine... but I think some of the
variation comes from wild swings in the tests themselves (bad apples).

   [junit4] Slave 0:     0.15 ..    95.85 =    95.70s
   [junit4] Slave 1:     0.17 ..    76.62 =    76.45s
   [junit4] Slave 2:     0.14 ..    47.33 =    47.19s
   [junit4] Slave 3:     0.13 ..    48.20 =    48.08s
   [junit4] Execution time total: 95.90s
   [junit4] Tests summary: 278 suites, 1550 tests, 9 ignored (6 assumptions)

   [junit4] Slave 0:     0.16 ..    61.38 =    61.22s
   [junit4] Slave 1:     0.16 ..    84.89 =    84.74s
   [junit4] Slave 2:     0.16 ..    59.31 =    59.15s
   [junit4] Slave 3:     0.16 ..    77.67 =    77.51s
   [junit4] Execution time total: 84.95s
   [junit4] Tests summary: 278 suites, 1550 tests, 5 ignored (2 assumptions)

   [junit4] Slave 0:     0.17 ..    69.68 =    69.50s
   [junit4] Slave 1:     0.16 ..    67.49 =    67.33s
   [junit4] Slave 2:     0.16 ..    64.00 =    63.84s
   [junit4] Slave 3:     0.16 ..    72.68 =    72.51s
   [junit4] Execution time total: 72.73s
   [junit4] Tests summary: 278 suites, 1550 tests, 7 ignored (4 assumptions)

   [junit4] Slave 0:     0.16 ..    64.94 =    64.78s
   [junit4] Slave 1:     0.19 ..    67.69 =    67.50s
   [junit4] Slave 2:     0.16 ..    62.59 =    62.43s
   [junit4] Slave 3:     0.21 ..    66.12 =    65.91s
   [junit4] Execution time total: 67.74s
   [junit4] Tests summary: 278 suites, 1550 tests, 17 ignored (14 assumptions)

   [junit4] Slave 0:     0.19 ..    57.03 =    56.84s
   [junit4] Slave 1:     0.17 ..    65.57 =    65.40s
   [junit4] Slave 2:     0.18 ..    77.44 =    77.26s
   [junit4] Slave 3:     0.15 ..    64.90 =    64.74s
   [junit4] Execution time total: 77.48s
   [junit4] Tests summary: 278 suites, 1550 tests, 6 ignored (3 assumptions)

   [junit4] Slave 0:     0.15 ..    73.56 =    73.41s
   [junit4] Slave 1:     0.15 ..    70.84 =    70.69s
   [junit4] Slave 2:     0.15 ..    97.94 =    97.79s
   [junit4] Slave 3:     0.18 ..    66.66 =    66.47s
   [junit4] Execution time total: 97.99s
   [junit4] Tests summary: 278 suites, 1550 tests, 13 ignored (10 assumptions)
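
To compare runs like these, the per-slave lines can be parsed and the spread between the slowest and the fastest slave computed. This is a small sketch that assumes the log format shown above; the function names are made up for illustration.

```python
import re

# Matches lines such as: [junit4] Slave 0:     0.15 ..    95.85 =    95.70s
SLAVE_LINE = re.compile(
    r"\[junit4\]\s+Slave\s+(\d+):\s+([\d.]+)\s+\.\.\s+([\d.]+)\s+=\s+([\d.]+)s"
)

def slave_durations(log_text):
    """Return {slave_index: duration_seconds} parsed from junit4 summary lines."""
    return {int(m.group(1)): float(m.group(4)) for m in SLAVE_LINE.finditer(log_text)}

def imbalance(durations):
    """Difference in seconds between the slowest and the fastest slave."""
    values = list(durations.values())
    return max(values) - min(values)

# The first run quoted above.
first_run = """
   [junit4] Slave 0:     0.15 ..    95.85 =    95.70s
   [junit4] Slave 1:     0.17 ..    76.62 =    76.45s
   [junit4] Slave 2:     0.14 ..    47.33 =    47.19s
   [junit4] Slave 3:     0.13 ..    48.20 =    48.08s
"""
durations = slave_durations(first_run)
print(durations)
print(f"imbalance: {imbalance(durations):.2f}s")
```

A large spread means some slaves sat idle while one finished its share, which is the symptom of suites whose run time differs a lot from the recorded estimate.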




--
lucidimagination.com


Re: Parallel tests in ANT, experiment volunteers welcome :)

Dawid Weiss
Thanks Robert. Yes, the variation in certain suites is pretty large --
if you open the generated execution times cache you can see the
timings for each test suite. I've seen differences going into tens of
seconds depending on the seed (and the environment?). What are your
timings for ant-based splits? Roughly the same?

Dawid


Re: Parallel tests in ANT, experiment volunteers welcome :)

Robert Muir
On Fri, Dec 30, 2011 at 3:45 PM, Dawid Weiss
<[hidden email]> wrote:
> Thanks Robert. Yes, the variation in certain suites is pretty large --
> if you open the generated execution times cache you can see the
> timings for each test suite. I've seen differences going into tens of
> seconds depending on the seed (and the environment?). What are your
> timing for ant-based splits? Roughly the same?
>

I think I got to the bottom of this. Depending upon your seed, 95% of
the time a test gets "RAMDirectory" but 5% of the time it gets a
file-system backed implementation.

Because of this, depending upon environment, test times swing wildly
because of fsync(). For example in the last nightly build we fsynced
over 7,000 times in tests.

This is really crazy and I want to prolong the life of my SSD: see my
latest comment with a fix on LUCENE-3667. With that patch my times are
no longer swinging wildly.

(easy way to see what I am talking about: just run tests with
-Dtests.directory=MMapDirectory or something like that)
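
As a toy model of the effect described here (not Lucene's actual test framework code), the sketch below picks a directory kind from a seeded random generator. The 5% probability comes from the message; the function name and the "filesystem" label are placeholders.

```python
import random

def pick_directory(seed, fs_probability=0.05):
    """Pick a directory kind deterministically from the seed (toy model)."""
    rng = random.Random(seed)
    return "filesystem" if rng.random() < fs_probability else "RAMDirectory"

# The same seed always yields the same choice, so a run is reproducible...
assert pick_directory(42) == pick_directory(42)

# ...but across many seeds a small share of runs lands on the slow, fsync-heavy path.
seeds = range(10_000)
fs_share = sum(pick_directory(s) == "filesystem" for s in seeds) / len(seeds)
print(f"file-system backed in {fs_share:.1%} of seeds")
```

This is why two runs with different seeds can differ by tens of seconds even when the suite-to-slave assignment is identical.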


Re: Parallel tests in ANT, experiment volunteers welcome :)

Michael McCandless-2
This looks cool!

I ran this a few times:

  ant test-core -Dtests.seed=0:0:0 -Dtests.cpus=20 -Dtests.directory=RAMDirectory -Dtests.codec=Lucene40

I fixed seed & RAMDir to reduce variance...

   [junit4] Slave 16:     0.29 ..    24.65 =    24.36s
   [junit4] Slave 17:     0.36 ..    30.62 =    30.26s
   [junit4] Slave 18:     0.44 ..    30.84 =    30.41s
   [junit4] Slave 19:     0.50 ..    28.65 =    28.15s
   [junit4] Execution time total: 36.69s
   [junit4] Tests summary: 278 suites, 1550 tests, 3 ignored

   [junit4] Slave 16:     0.44 ..    29.61 =    29.17s
   [junit4] Slave 17:     0.55 ..    31.59 =    31.04s
   [junit4] Slave 18:     0.30 ..    25.85 =    25.54s
   [junit4] Slave 19:     0.31 ..    32.64 =    32.33s
   [junit4] Execution time total: 37.12s
   [junit4] Tests summary: 278 suites, 1550 tests, 3 ignored

   [junit4] Slave 16:     0.28 ..    25.70 =    25.42s
   [junit4] Slave 17:     0.23 ..    29.83 =    29.60s
   [junit4] Slave 18:     0.28 ..    27.50 =    27.22s
   [junit4] Slave 19:     0.37 ..    27.67 =    27.30s
   [junit4] Execution time total: 35.23s
   [junit4] Tests summary: 278 suites, 1550 tests, 1 failure, 3 ignored

   [junit4] Slave 16:     0.38 ..    28.99 =    28.61s
   [junit4] Slave 17:     0.41 ..    30.79 =    30.38s
   [junit4] Slave 18:     0.48 ..    30.05 =    29.57s
   [junit4] Slave 19:     0.35 ..    30.71 =    30.36s
   [junit4] Execution time total: 38.46s
   [junit4] Tests summary: 278 suites, 1550 tests, 3 ignored

   [junit4] Slave 16:     0.27 ..    29.56 =    29.29s
   [junit4] Slave 17:     0.44 ..    32.64 =    32.21s
   [junit4] Slave 18:     0.40 ..    31.99 =    31.60s
   [junit4] Slave 19:     0.27 ..    32.64 =    32.37s
   [junit4] Execution time total: 37.70s
   [junit4] Tests summary: 278 suites, 1550 tests, 3 ignored

Does the "Execution time total" include compilation, or is it just the
actual test runtime?

Can this change run "across" the different groups of tests we have
(core, modules/*, contrib/*, solr/*, etc.)?  I found that to be a
major bottleneck in the current "ant test"'s concurrency, ie we have a
pinch point after each group of tests (must wait for all JVMs to
finish before moving on to next group...), but I think fixing that in
ant is going to be hard?
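
The cost of such a pinch point can be illustrated with a toy model: schedule each group on its own and wait for it to finish, versus pooling all suites into one queue. The group contents and durations below are invented, and the model ignores JVM startup and per-module classpath differences.

```python
import heapq

def makespan(durations, workers):
    """Wall-clock time when jobs go, longest first, to whichever worker frees up next."""
    free_at = [0.0] * workers
    heapq.heapify(free_at)
    for seconds in sorted(durations, reverse=True):
        heapq.heappush(free_at, heapq.heappop(free_at) + seconds)
    return max(free_at)

def with_barriers(groups, workers):
    """Every group must finish completely before the next one starts."""
    return sum(makespan(group, workers) for group in groups)

def pooled(groups, workers):
    """All jobs from all groups share a single queue."""
    return makespan([d for group in groups for d in group], workers)

# Hypothetical per-suite durations (seconds) for three test groups.
groups = [[30, 5, 5], [20, 10], [25, 5]]
workers = 2
print("with barriers:", with_barriers(groups, workers))
print("pooled:       ", pooled(groups, workers))
```

In the barrier case a worker that finishes its share early idles until the slowest suite of the group completes, and that idle time is paid once per group.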

When I use the hacked up Python test runner (runAllTests.py in
luceneutil), running only core tests w/ RAMDir and Lucene40 codec it
takes ~30 seconds; I think it's doing roughly the same thing as this
change (balancing the tests across JVMs).  BUT: that's on current
trunk, vs your git clone which is somewhat old by now... so it's an
apples/pears comparison ;)

Mike McCandless

http://blog.mikemccandless.com


Re: Parallel tests in ANT, experiment volunteers welcome :)

Dawid Weiss
Thanks Mike. Answers/comments inline below.

>    [junit4] Slave 16:     0.29 ..    24.65 =    24.36s
>    [junit4] Slave 17:     0.36 ..    30.62 =    30.26s
>    [junit4] Slave 18:     0.44 ..    30.84 =    30.41s
>    [junit4] Slave 19:     0.50 ..    28.65 =    28.15s
>    [junit4] Execution time total: 36.69s
>    [junit4] Tests summary: 278 suites, 1550 tests, 3 ignored

I forgot how nasty your beast computer is... 20 slaves?! Remind me how
many actual (real) cores do you have? Did you experiment with
different slave numbers? I ask because I noticed that:

1) it makes little sense to run cpu-intense tests on hyper-cores,
doesn't yield much if anything,
2) you should leave some room for system vm threads (GC, compilers);
the more VMs, the more room you'll need.
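
The two rules of thumb above, together with the "physical cores - 1" advice from the first message, can be written down as a tiny helper. This is only a sketch: the function name is made up, and the threads-per-core value is an assumption because Python's standard library reports logical CPUs only.

```python
import os

def suggested_slaves(physical_cores, reserved=1):
    """Physical cores minus a reserve for the aggregator and VM threads, at least 1."""
    return max(1, physical_cores - reserved)

logical = os.cpu_count() or 1
threads_per_core = 2  # assumption: hyperthreading with two threads per core
physical_guess = max(1, logical // threads_per_core)
print(f"-Dtests.cpus={suggested_slaves(physical_guess)}")
```

The reserve is a starting point, not a rule; measuring a few values of tests.cpus on the actual machine is the only way to find where the gains flatten out.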

> Does the "Execution time total" include compilation, or is it just the
> actual test runtime?

The total is calculated before slave VMs are launched and after they
complete, so even launch time is included. It's here:
https://github.com/carrotsearch/randomizedtesting/blob/master/ant-junit4/src/main/java/com/carrotsearch/ant/tasks/junit4/JUnit4.java

> Can this change run "across" the different groups of tests we have
> (core, modules/*, contrib/*, solr/*, etc.)?  I found that to be a
> major bottleneck in the current "ant test"'s concurrency, ie we have a
> pinch point after each group of tests (must wait for all JVMs to
> finish before moving on to next group...), but I think fixing that in
> ant is going to be hard?

If I understand you correctly the problem is that the ANT build in
Lucene/Solr calls sub-module ANT scripts and these in turn invoke the
test macro. So running everything from a single test task would be
possible if we had a master-level test script; it's not directly related
to how the tests are actually executed. The JUnit4 task supports globbing
in suite selectors, so it could be executed with, say,
-Dtests.class=org.apache.lucene.blah.* to restrict the run to a certain
section of the tests, but include everything by default.
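
To make the selector idea concrete, here is a sketch of glob-style filtering over fully qualified suite names using Python's fnmatch. The class names are hypothetical, and the real task's matching rules (for example whether "*" crosses package boundaries) may differ from fnmatch's.

```python
from fnmatch import fnmatchcase

def select_suites(class_names, pattern="*"):
    """Keep suites whose fully qualified name matches the glob; '*' keeps everything."""
    return [name for name in class_names if fnmatchcase(name, pattern)]

# Hypothetical suite class names.
suites = [
    "org.apache.lucene.blah.TestFoo",
    "org.apache.lucene.blah.sub.TestBar",
    "org.apache.lucene.other.TestBaz",
]
print(select_suites(suites, "org.apache.lucene.blah.*"))
print(select_suites(suites))  # default pattern: include everything
```

In fnmatch, "*" also matches dots, so the first call keeps suites in sub-packages of blah as well.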

Don't know how it affects modularization though -- the tests will run
faster but they'll be more difficult to maintain I guess.

> When I use the hacked up Python test runner (runAllTests.py in luceneutil),

This was my inspiration -- Robert pointed me at that, very helpful
although you need your kind of machine to run so many SSH sessions :D

> change (balancing the tests across JVMs).  BUT: that's on current
> trunk, vs your git clone which is somewhat old by now... so it's an
> apples/pears comparison ;)

Oh, come on, my fork is only a few days behind! :) I've pulled the
current trunk and merged. I'd appreciate if you could re-run again,
this time with, say, 5, 10, 15 and 20 threads. I wonder what the
speedup/ overhead is. Thanks.

Dawid


Re: Parallel tests in ANT, experiment volunteers welcome :)

Michael McCandless-2
On Wed, Jan 4, 2012 at 5:11 PM, Dawid Weiss
<[hidden email]> wrote:
> I forgot how nasty your beast computer is... 20 slaves?! Remind me how
> many actual (real) cores do you have?

Beast has two 6-core CPUs (x5680 xeons), so 12 real cores (24 with
hyperthreading).

> Did you experiment with different slave numbers? I ask because I noticed that:
>
> 1) it makes little sense to run cpu-intense tests on hyper-cores,
> doesn't yield much if anything,
> 2) you should leave some room for system vm threads (GC, compilers);
> the more VMs, the more room you'll need.

In the past I found somewhere around 20 was good w/ the Python
runner... but I went and tested again!

With the Python runner I see these run times on just lucene core tests:

   2 cpus: 72.2 sec
   5 cpus: 35.0 sec
  10 cpus: 28.1 sec
  15 cpus: 26.2 sec
  20 cpus: 26.0 sec
  25 cpus: 27.5 sec

So seems like after 15 cores it's not helping much... but then I ran on
all tests (well minus a few intermittently failing tests):

  10 cpus: 88.3 sec
  15 cpus: 80.2 sec
  20 cpus: 77.4 sec
  25 cpus: 76.7 sec

The above were just running on beast, but the Python runner can send
jobs (hacked up, just using ssh) to other machines... I have two other
non-beasts, on each of which I ran 3 JVMs:

  25 + 3 + 3 cpus: 64.7 sec
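
One way to read the core-test numbers above is as speedup and efficiency relative to the 2-cpu run. The sketch below uses only the Python-runner timings quoted in this message; the function names are made up, and since each configuration was run once the results are indicative at best.

```python
# Python runner, lucene core tests only (seconds), as reported above.
core_times = {2: 72.2, 5: 35.0, 10: 28.1, 15: 26.2, 20: 26.0, 25: 27.5}

def relative_speedup(times, baseline_cpus):
    """Speedup of each configuration relative to the baseline configuration."""
    base = times[baseline_cpus]
    return {cpus: base / seconds for cpus, seconds in times.items()}

def relative_efficiency(times, baseline_cpus):
    """Speedup divided by the growth in CPU count, relative to the baseline."""
    speedups = relative_speedup(times, baseline_cpus)
    return {cpus: s * baseline_cpus / cpus for cpus, s in speedups.items()}

speedups = relative_speedup(core_times, 2)
efficiencies = relative_efficiency(core_times, 2)
for cpus in sorted(core_times):
    print(f"{cpus:>2} cpus: {speedups[cpus]:.2f}x vs 2 cpus, "
          f"efficiency {efficiencies[cpus]:.2f}")
```

Efficiency falling well below 1.0 as the CPU count grows is consistent with the observation that extra JVMs stop helping somewhere around 15.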
With the new ant runner:
2 cpus:
   [junit4] Slave 0:     0.16 ..    50.68 =    50.52s
   [junit4] Slave 1:     0.16 ..    49.58 =    49.42s
   [junit4] Execution time total: 50.73s
   [junit4] Tests summary: 279 suites, 1546 tests, 4 ignored

5 cpus:
   [junit4] Slave 0:     0.19 ..    21.87 =    21.68s
   [junit4] Slave 1:     0.16 ..    21.86 =    21.70s
   [junit4] Slave 2:     0.16 ..    29.31 =    29.15s
   [junit4] Slave 3:     0.16 ..    26.64 =    26.48s
   [junit4] Slave 4:     0.19 ..    29.82 =    29.63s
   [junit4] Execution time total: 29.89s
   [junit4] Tests summary: 279 suites, 1546 tests, 4 ignored

10 cpus:
   [junit4] Slave 0:     0.21 ..    14.62 =    14.41s
   [junit4] Slave 1:     0.22 ..    17.21 =    16.99s
   [junit4] Slave 2:     0.23 ..    18.79 =    18.56s
   [junit4] Slave 3:     0.23 ..    22.99 =    22.76s
   [junit4] Slave 4:     0.20 ..    27.39 =    27.19s
   [junit4] Slave 5:     0.19 ..    27.23 =    27.04s
   [junit4] Slave 6:     0.23 ..    20.40 =    20.17s
   [junit4] Slave 7:     0.19 ..    26.52 =    26.33s
   [junit4] Slave 8:     0.24 ..    26.42 =    26.18s
   [junit4] Slave 9:     0.22 ..    23.57 =    23.35s
   [junit4] Execution time total: 27.52s
   [junit4] Tests summary: 279 suites, 1546 tests, 4 ignored

15 cpus:
   [junit4] Slave 0:     0.29 ..     5.16 =     4.87s
   [junit4] Slave 1:     0.26 ..    15.36 =    15.10s
   [junit4] Slave 2:     0.26 ..    12.99 =    12.73s
   [junit4] Slave 3:     0.29 ..    24.20 =    23.92s
   [junit4] Slave 4:     0.26 ..    27.00 =    26.74s
   [junit4] Slave 5:     0.33 ..    19.97 =    19.63s
   [junit4] Slave 6:     0.31 ..    25.29 =    24.98s
   [junit4] Slave 7:     0.24 ..    28.92 =    28.68s
   [junit4] Slave 8:     0.33 ..    23.67 =    23.34s
   [junit4] Slave 9:     0.43 ..    24.43 =    24.00s
   [junit4] Slave 10:     0.40 ..    27.61 =    27.21s
   [junit4] Slave 11:     0.22 ..    21.77 =    21.56s
   [junit4] Slave 12:     0.22 ..    26.78 =    26.56s
   [junit4] Slave 13:     0.26 ..    25.92 =    25.66s
   [junit4] Slave 14:     0.35 ..    27.77 =    27.42s
   [junit4] Execution time total: 28.98s
   [junit4] Tests summary: 279 suites, 1546 tests, 4 ignored

20 cpus:
   [junit4] Slave 0:     0.35 ..    23.32 =    22.97s
   [junit4] Slave 1:     0.30 ..    24.32 =    24.02s
   [junit4] Slave 2:     0.35 ..    21.35 =    21.00s
   [junit4] Slave 3:     0.37 ..    23.63 =    23.26s
   [junit4] Slave 4:     0.38 ..    20.74 =    20.35s
   [junit4] Slave 5:     0.30 ..    19.74 =    19.44s
   [junit4] Slave 6:     0.36 ..    26.39 =    26.03s
   [junit4] Slave 7:     0.46 ..    23.64 =    23.18s
   [junit4] Slave 8:     0.43 ..    22.44 =    22.02s
   [junit4] Slave 9:     0.30 ..    24.05 =    23.76s
   [junit4] Slave 10:     0.41 ..    24.75 =    24.33s
   [junit4] Slave 11:     0.30 ..    22.66 =    22.36s
   [junit4] Slave 12:     0.30 ..    24.93 =    24.62s
   [junit4] Slave 13:     0.40 ..    24.39 =    24.00s
   [junit4] Slave 14:     0.24 ..    24.47 =    24.23s
   [junit4] Slave 15:     0.45 ..    25.23 =    24.78s
   [junit4] Slave 16:     0.34 ..    23.06 =    22.72s
   [junit4] Slave 17:     0.23 ..    23.50 =    23.28s
   [junit4] Slave 18:     0.30 ..    24.27 =    23.97s
   [junit4] Slave 19:     0.30 ..    24.91 =    24.61s
   [junit4] Execution time total: 26.52s
   [junit4] Tests summary: 279 suites, 1546 tests, 4 ignored

I only ran once each and results are likely noisy... so it's hard to
pick a best CPU count...

>> Does the "Execution time total" include compilation, or is it just the
>> actual test runtime?
>
> The total is calculated before slave VMs are launched and after they
> complete, so even launch time is included. It's here:
> https://github.com/carrotsearch/randomizedtesting/blob/master/ant-junit4/src/main/java/com/carrotsearch/ant/tasks/junit4/JUnit4.java

Hmm so does that include compile time (my numbers don't)?  Sounds
like no?  I'm also measuring from first launch to last finish.

>> Can this change run "across" the different groups of tests we have
>> (core, modules/*, contrib/*, solr/*, etc.)?  I found that to be a
>> major bottleneck in the current "ant test"'s concurrency, ie we have a
>> pinch point after each group of tests (must wait for all JVMs to
>> finish before moving on to next group...), but I think fixing that in
>> ant is going to be hard?
>
> If I understand you correctly the problem is that ANT in Lucene/ Solr
> is calling to sub-module ANT scripts and these in turn invoke the test
> macro. So running everything from a single test task would be possible
> if we had a master-level test script, it's not directly related to how
> the tests are actually executed.

Yes I think that's the problem!

Ideally ant would just gather up all "jobs" to run and then we'd
aggregate/distribute across JVMs.

> That JUnit4 task supports globbing in suite selectors so it could be
> executed with, say, -Dtests.class=org.apache.lucene.blah.* to restrict
> tests to run just a certain section of all tests, but include
> everything by default.

Cool.

> Don't know how it affects modularization though -- the tests will run
> faster but they'll be more difficult to maintain I guess.

Hmm... can we somehow keep today's directory structure but have ant
treat it as a single "module"?  Or is the problem that we need to
change the JVM settings (eg CLASSPATH) per test module we have today
so we must make separate modules for that...?

>> When I use the hacked up Python test runner (runAllTests.py in luceneutil),
>
> This was my inspiration -- Robert pointed me at that, very helpful
> although you need your kind of machine to run so many SSH sessions :D

OK cool :)  Actually it doesn't open any SSH sessions unless you give
it remote machines to use -- for the "local" JVMs it just forks.

>> change (balancing the tests across JVMs).  BUT: that's on current
>> trunk, vs your git clone which is somewhat old by now... so it's an
>> apples/pears comparison ;)
>
> Oh, come on, my fork is only a few days behind! :) I've pulled the
> current trunk and merged. I'd appreciate if you could re-run again,
> this time with, say, 5, 10, 15 and 20 threads. I wonder what the
> speedup/ overhead is. Thanks.

I re-ran above -- looks like the times came down some so the new ant
runner is basically the same as the Python runner (on core tests):
great!

Mike McCandless
http://blog.mikemccandless.com


Re: Parallel tests in ANT, experiment volunteers welcome :)

Michael McCandless-2
In reply to this post by Dawid Weiss
Trying again... hopefully this time NOT hitting this nasty Chrome bug:
    http://code.google.com/p/chromium/issues/detail?id=102407
On Wed, Jan 4, 2012 at 5:11 PM, Dawid Weiss
<[hidden email]> wrote:
> I forgot how nasty your beast computer is... 20 slaves?! Remind me how
> many actual (real) cores do you have?

Beast has two 6-core CPUs (x5680 xeons), so 12 real cores (24 with
hyperthreading).

> Did you experiment with different slave numbers? I ask because I noticed that:
>
> 1) it makes little sense to run cpu-intense tests on hyper-cores,
> doesn't yield much if anything,
> 2) you should leave some room for system vm threads (GC, compilers);
> the more VMs, the more room you'll need.

In the past I found somewhere around 20 was good w/ the Python
runner... but I went and tried again!

With the Python runner I see these run times on just lucene core tests:

   2 cpus: 72.2 sec
   5 cpus: 35.0 sec
  10 cpus: 28.1 sec
  15 cpus: 26.2 sec
  20 cpus: 26.0 sec
  25 cpus: 27.5 sec

So seems like after 15 cores it's not helping much... but then I ran on
all tests (well minus a few intermittently failing tests):

  10 cpus: 88.3 sec
  15 cpus: 80.2 sec
  20 cpus: 77.4 sec
  25 cpus: 76.7 sec

The above were just running on beast, but the Python runner can send
jobs (hacked up, just using ssh) to other machines... I have two other
non-beasts, on each of which I ran 3 JVMs:

  25 + 3 + 3 cpus: 64.7 sec
With the new ant runner:
2 cpus:
   [junit4] Slave 0:     0.16 ..    50.68 =    50.52s
   [junit4] Slave 1:     0.16 ..    49.58 =    49.42s
   [junit4] Execution time total: 50.73s
   [junit4] Tests summary: 279 suites, 1546 tests, 4 ignored

5 cpus:
   [junit4] Slave 0:     0.19 ..    21.87 =    21.68s
   [junit4] Slave 1:     0.16 ..    21.86 =    21.70s
   [junit4] Slave 2:     0.16 ..    29.31 =    29.15s
   [junit4] Slave 3:     0.16 ..    26.64 =    26.48s
   [junit4] Slave 4:     0.19 ..    29.82 =    29.63s
   [junit4] Execution time total: 29.89s
   [junit4] Tests summary: 279 suites, 1546 tests, 4 ignored

10 cpus:
   [junit4] Slave 0:     0.21 ..    14.62 =    14.41s
   [junit4] Slave 1:     0.22 ..    17.21 =    16.99s
   [junit4] Slave 2:     0.23 ..    18.79 =    18.56s
   [junit4] Slave 3:     0.23 ..    22.99 =    22.76s
   [junit4] Slave 4:     0.20 ..    27.39 =    27.19s
   [junit4] Slave 5:     0.19 ..    27.23 =    27.04s
   [junit4] Slave 6:     0.23 ..    20.40 =    20.17s
   [junit4] Slave 7:     0.19 ..    26.52 =    26.33s
   [junit4] Slave 8:     0.24 ..    26.42 =    26.18s
   [junit4] Slave 9:     0.22 ..    23.57 =    23.35s
   [junit4] Execution time total: 27.52s
   [junit4] Tests summary: 279 suites, 1546 tests, 4 ignored

15 cpus:
   [junit4] Slave 0:     0.29 ..     5.16 =     4.87s
   [junit4] Slave 1:     0.26 ..    15.36 =    15.10s
   [junit4] Slave 2:     0.26 ..    12.99 =    12.73s
   [junit4] Slave 3:     0.29 ..    24.20 =    23.92s
   [junit4] Slave 4:     0.26 ..    27.00 =    26.74s
   [junit4] Slave 5:     0.33 ..    19.97 =    19.63s
   [junit4] Slave 6:     0.31 ..    25.29 =    24.98s
   [junit4] Slave 7:     0.24 ..    28.92 =    28.68s
   [junit4] Slave 8:     0.33 ..    23.67 =    23.34s
   [junit4] Slave 9:     0.43 ..    24.43 =    24.00s
   [junit4] Slave 10:     0.40 ..    27.61 =    27.21s
   [junit4] Slave 11:     0.22 ..    21.77 =    21.56s
   [junit4] Slave 12:     0.22 ..    26.78 =    26.56s
   [junit4] Slave 13:     0.26 ..    25.92 =    25.66s
   [junit4] Slave 14:     0.35 ..    27.77 =    27.42s
   [junit4] Execution time total: 28.98s
   [junit4] Tests summary: 279 suites, 1546 tests, 4 ignored

20 cpus:
   [junit4] Slave 0:     0.35 ..    23.32 =    22.97s
   [junit4] Slave 1:     0.30 ..    24.32 =    24.02s
   [junit4] Slave 2:     0.35 ..    21.35 =    21.00s
   [junit4] Slave 3:     0.37 ..    23.63 =    23.26s
   [junit4] Slave 4:     0.38 ..    20.74 =    20.35s
   [junit4] Slave 5:     0.30 ..    19.74 =    19.44s
   [junit4] Slave 6:     0.36 ..    26.39 =    26.03s
   [junit4] Slave 7:     0.46 ..    23.64 =    23.18s
   [junit4] Slave 8:     0.43 ..    22.44 =    22.02s
   [junit4] Slave 9:     0.30 ..    24.05 =    23.76s
   [junit4] Slave 10:     0.41 ..    24.75 =    24.33s
   [junit4] Slave 11:     0.30 ..    22.66 =    22.36s
   [junit4] Slave 12:     0.30 ..    24.93 =    24.62s
   [junit4] Slave 13:     0.40 ..    24.39 =    24.00s
   [junit4] Slave 14:     0.24 ..    24.47 =    24.23s
   [junit4] Slave 15:     0.45 ..    25.23 =    24.78s
   [junit4] Slave 16:     0.34 ..    23.06 =    22.72s
   [junit4] Slave 17:     0.23 ..    23.50 =    23.28s
   [junit4] Slave 18:     0.30 ..    24.27 =    23.97s
   [junit4] Slave 19:     0.30 ..    24.91 =    24.61s
   [junit4] Execution time total: 26.52s
   [junit4] Tests summary: 279 suites, 1546 tests, 4 ignored

I only ran once each and results are likely noisy... so it's hard to
pick a best CPU count...

>> Does the "Execution time total" include compilation, or is it just the
>> actual test runtime?
>
> The total is calculated before slave VMs are launched and after they
> complete, so even launch time is included. It's here:
> https://github.com/carrotsearch/randomizedtesting/blob/master/ant-junit4/src/main/java/com/carrotsearch/ant/tasks/junit4/JUnit4.java

Hmm so does that include compile time (my numbers don't)?  Sounds like
no?  I'm also measuring from first launch to last finish.

>> Can this change run "across" the different groups of tests we have
>> (core, modules/*, contrib/*, solr/*, etc.)?  I found that to be a
>> major bottleneck in the current "ant test"'s concurrency, ie we have a
>> pinch point after each group of tests (must wait for all JVMs to
>> finish before moving on to next group...), but I think fixing that in
>> ant is going to be hard?
>
> If I understand you correctly the problem is that ANT in Lucene/ Solr
> is calling to sub-module ANT scripts and these in turn invoke the test
> macro. So running everything from a single test task would be possible
> if we had a master-level test script, it's not directly related to how
> the tests are actually executed.

Yes I think that's the problem!

Ideally ant would just gather up all "jobs" to run and then we'd
aggregate/distribute across JVMs.

> That JUnit4 task supports globbing in
> suite selectors so it could be executed with, say,
> -Dtests.class=org.apache.lucene.blah.* to restrict tests to run just a
> certain section of all tests, but include everything by default.

Cool.

> Don't know how it affects modularization though -- the tests will run
> faster but they'll be more difficult to maintain I guess.

Hmm... can we somehow keep today's directory structure but have ant
treat it as a single "module"?  Or is the problem that we need to
change the JVM settings (eg CLASSPATH) per test module we have
today so we must make separate modules for that...?

>> When I use the hacked up Python test runner (runAllTests.py in luceneutil),
>
> This was my inspiration -- Robert pointed me at that, very helpful
> although you need your kind of machine to run so many SSH sessions :D

OK cool :)  Actually it doesn't open any SSH sessions unless you give
it remote machines to use -- for the "local" JVMs it just forks.

>> change (balancing the tests across JVMs).  BUT: that's on current
>> trunk, vs your git clone which is somewhat old by now... so it's an
>> apples/pears comparison ;)
>
> Oh, come on, my fork is only a few days behind! :) I've pulled the
> current trunk and merged. I'd appreciate if you could re-run again,
> this time with, say, 5, 10, 15 and 20 threads. I wonder what the
> speedup/ overhead is. Thanks.

I re-ran above -- looks like the times came down some so the new ant
runner is basically the same as the Python runner (on core tests):
great!

Mike McCandless
http://blog.mikemccandless.com


Re: Parallel tests in ANT, experiment volunteers welcome :)

Michael McCandless-2
In reply to this post by Dawid Weiss
Maybe... 3rd time's the charm...?  (This time from Opera).

On Wed, Jan 4, 2012 at 5:11 PM, Dawid Weiss
<[hidden email]> wrote:

> I forgot how nasty your beast computer is... 20 slaves?! Remind me how
> many actual (real) cores do you have?

Beast has two 6-core CPUs (x5680 xeons), so 12 real cores (24 with
hyperthreading).

> Did you experiment with
> different slave numbers? I ask because I noticed that:
>
> 1) it makes little sense to run cpu-intense tests on hyper-cores,
> doesn't yield much if anything,
> 2) you should leave some room for system vm threads (GC, compilers);
> the more VMs, the more room you'll need.

In the past I found somewhere around 20 was good w/ the Python
runner... but I went and tried again!

With the Python runner I see these run times on just lucene core tests:

   2 cpus: 72.2 sec
   5 cpus: 35.0 sec
  10 cpus: 28.1 sec
  15 cpus: 26.2 sec
  20 cpus: 26.0 sec
  25 cpus: 27.5 sec

So seems like after 15 cores it's not helping much... but then I ran
on all tests (well minus a few intermittently failing tests):

  10 cpus: 88.3 sec
  15 cpus: 80.2 sec
  20 cpus: 77.4 sec
  25 cpus: 76.7 sec

The above were just running on beast, but the Python runner can send
jobs (hacked up, just using ssh) to other machines... I have two other
non-beasts, on which I ran 3 JVMs each:

  25 + 3 + 3 cpus: 64.7 sec

With the new ant runner:

2 cpus:

   [junit4] Slave 0:     0.16 ..    50.68 =    50.52s
   [junit4] Slave 1:     0.16 ..    49.58 =    49.42s
   [junit4] Execution time total: 50.73s
   [junit4] Tests summary: 279 suites, 1546 tests, 4 ignored


5 cpus:

   [junit4] Slave 0:     0.19 ..    21.87 =    21.68s
   [junit4] Slave 1:     0.16 ..    21.86 =    21.70s
   [junit4] Slave 2:     0.16 ..    29.31 =    29.15s
   [junit4] Slave 3:     0.16 ..    26.64 =    26.48s
   [junit4] Slave 4:     0.19 ..    29.82 =    29.63s
   [junit4] Execution time total: 29.89s
   [junit4] Tests summary: 279 suites, 1546 tests, 4 ignored

10 cpus:

   [junit4] Slave 0:     0.21 ..    14.62 =    14.41s
   [junit4] Slave 1:     0.22 ..    17.21 =    16.99s
   [junit4] Slave 2:     0.23 ..    18.79 =    18.56s
   [junit4] Slave 3:     0.23 ..    22.99 =    22.76s
   [junit4] Slave 4:     0.20 ..    27.39 =    27.19s
   [junit4] Slave 5:     0.19 ..    27.23 =    27.04s
   [junit4] Slave 6:     0.23 ..    20.40 =    20.17s
   [junit4] Slave 7:     0.19 ..    26.52 =    26.33s
   [junit4] Slave 8:     0.24 ..    26.42 =    26.18s
   [junit4] Slave 9:     0.22 ..    23.57 =    23.35s
   [junit4] Execution time total: 27.52s
   [junit4] Tests summary: 279 suites, 1546 tests, 4 ignored

15 cpus:

   [junit4] Slave 0:     0.29 ..     5.16 =     4.87s
   [junit4] Slave 1:     0.26 ..    15.36 =    15.10s
   [junit4] Slave 2:     0.26 ..    12.99 =    12.73s
   [junit4] Slave 3:     0.29 ..    24.20 =    23.92s
   [junit4] Slave 4:     0.26 ..    27.00 =    26.74s
   [junit4] Slave 5:     0.33 ..    19.97 =    19.63s
   [junit4] Slave 6:     0.31 ..    25.29 =    24.98s
   [junit4] Slave 7:     0.24 ..    28.92 =    28.68s
   [junit4] Slave 8:     0.33 ..    23.67 =    23.34s
   [junit4] Slave 9:     0.43 ..    24.43 =    24.00s
   [junit4] Slave 10:     0.40 ..    27.61 =    27.21s
   [junit4] Slave 11:     0.22 ..    21.77 =    21.56s
   [junit4] Slave 12:     0.22 ..    26.78 =    26.56s
   [junit4] Slave 13:     0.26 ..    25.92 =    25.66s
   [junit4] Slave 14:     0.35 ..    27.77 =    27.42s
   [junit4] Execution time total: 28.98s
   [junit4] Tests summary: 279 suites, 1546 tests, 4 ignored

20 cpus:

   [junit4] Slave 0:     0.35 ..    23.32 =    22.97s
   [junit4] Slave 1:     0.30 ..    24.32 =    24.02s
   [junit4] Slave 2:     0.35 ..    21.35 =    21.00s
   [junit4] Slave 3:     0.37 ..    23.63 =    23.26s
   [junit4] Slave 4:     0.38 ..    20.74 =    20.35s
   [junit4] Slave 5:     0.30 ..    19.74 =    19.44s
   [junit4] Slave 6:     0.36 ..    26.39 =    26.03s
   [junit4] Slave 7:     0.46 ..    23.64 =    23.18s
   [junit4] Slave 8:     0.43 ..    22.44 =    22.02s
   [junit4] Slave 9:     0.30 ..    24.05 =    23.76s
   [junit4] Slave 10:     0.41 ..    24.75 =    24.33s
   [junit4] Slave 11:     0.30 ..    22.66 =    22.36s
   [junit4] Slave 12:     0.30 ..    24.93 =    24.62s
   [junit4] Slave 13:     0.40 ..    24.39 =    24.00s
   [junit4] Slave 14:     0.24 ..    24.47 =    24.23s
   [junit4] Slave 15:     0.45 ..    25.23 =    24.78s
   [junit4] Slave 16:     0.34 ..    23.06 =    22.72s
   [junit4] Slave 17:     0.23 ..    23.50 =    23.28s
   [junit4] Slave 18:     0.30 ..    24.27 =    23.97s
   [junit4] Slave 19:     0.30 ..    24.91 =    24.61s
   [junit4] Execution time total: 26.52s
   [junit4] Tests summary: 279 suites, 1546 tests, 4 ignored

I only ran once each and results are likely noisy... so it's hard to
pick a best CPU count...

>> Does the "Execution time total" include compilation, or is it just the
>> actual test runtime?
>
> The total is calculated before slave VMs are launched and after they
> complete, so even launch time is included. It's here:
> https://github.com/carrotsearch/randomizedtesting/blob/master/ant-junit4/src/main/java/com/carrotsearch/ant/tasks/junit4/JUnit4.java

Hmm so does that include compile time (my numbers don't)?  Sounds like
no?  I'm also measuring from first launch to last finish.

>> Can this change run "across" the different groups of tests we have
>> (core, modules/*, contrib/*, solr/*, etc.)?  I found that to be a
>> major bottleneck in the current "ant test"'s concurrency, ie we have a
>> pinch point after each group of tests (must wait for all JVMs to
>> finish before moving on to next group...), but I think fixing that in
>> ant is going to be hard?
>
> If I understand you correctly the problem is that ANT in Lucene/ Solr
> is calling to sub-module ANT scripts and these in turn invoke the test
> macro. So running everything from a single test task would be possible
> if we had a master-level test script, it's not directly related to how
> the tests are actually executed.

Yes I think that's the problem!

Ideally ant would just gather up all "jobs" to run and then we'd
aggregate/distribute across JVMs.

> That JUnit4 task supports globbing in
> suite selectors so it could be executed with, say,
> -Dtests.class=org.apache.lucene.blah.* to restrict tests to run just a
> certain section of all tests, but include everything by default.

Cool.

> Don't know how it affects modularization though -- the tests will run
> faster but they'll be more difficult to maintain I guess.

Hmm... can we somehow keep today's directory structure but have ant
treat it as a single "module"?  Or is the problem that we need to
change the JVM settings (eg CLASSPATH) per test module we have
today so we must make separate modules for that...?

>> When I use the hacked up Python test runner (runAllTests.py in luceneutil),
>
> This was my inspiration -- Robert pointed me at that, very helpful
> although you need your kind of machine to run so many SSH sessions :D

OK cool :)  Actually it doesn't open any SSH sessions unless you give
it remote machines to use -- for the "local" JVMs it just forks.

>> change (balancing the tests across JVMs).  BUT: that's on current
>> trunk, vs your git clone which is somewhat old by now... so it's an
>> apples/pears comparison ;)
>
> Oh, come on, my fork is only a few days behind! :) I've pulled the
> current trunk and merged. I'd appreciate if you could re-run again,
> this time with, say, 5, 10, 15 and 20 threads. I wonder what the
> speedup/ overhead is. Thanks.

I re-ran above -- looks like the times came down some so the new ant
runner is basically the same as the Python runner (on core tests):
great!

Mike McCandless

http://blog.mikemccandless.com


Re: Parallel tests in ANT, experiment volunteers welcome :)

Dawid Weiss
In reply to this post by Michael McCandless-2
> With the Python runner I see these run times on just lucene core tests:
>    2 cpus: 72.2 sec
>    5 cpus: 35.0 sec
>   10 cpus: 28.1 sec
>   15 cpus: 26.2 sec
>   20 cpus: 26.0 sec
>   25 cpus: 27.5 sec

I would say this is aligned with my intuition -- after you exceed the
physical number of cores things don't speed up anymore.

>   10 cpus: 88.3 sec
>   15 cpus: 80.2 sec
>   20 cpus: 77.4 sec
>   25 cpus: 76.7 sec
> The above were just running on beast, but the Python runner can

This is probably because some tests don't add anything to CPU load
(they're disk bound or use the network)? The speedup is also not that
significant -- adding 15 cpus only yielded about 10 secs.
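
To put rough numbers on how quickly the gains flatten, here is a minimal sketch that turns the Python runner timings quoted in this thread into speedup and parallel-efficiency figures. The baseline is simply the smallest JVM count measured in each series (an assumption made for illustration; no single-JVM time was posted), and `speedup_table` is a hypothetical helper name.

```python
# Wall-clock times in seconds, copied from the Python runner numbers quoted above.
python_runner_core = {2: 72.2, 5: 35.0, 10: 28.1, 15: 26.2, 20: 26.0, 25: 27.5}
python_runner_all = {10: 88.3, 15: 80.2, 20: 77.4, 25: 76.7}


def speedup_table(timings):
    """Return (jvms, speedup, efficiency) rows relative to the smallest JVM count."""
    base_jvms = min(timings)
    base_time = timings[base_jvms]
    rows = []
    for jvms in sorted(timings):
        speedup = base_time / timings[jvms]
        # Efficiency: achieved speedup divided by the ideal (linear) speedup.
        efficiency = speedup / (jvms / base_jvms)
        rows.append((jvms, speedup, efficiency))
    return rows


for label, timings in (("core", python_runner_core), ("all", python_runner_all)):
    print(f"{label} tests:")
    for jvms, speedup, efficiency in speedup_table(timings):
        print(f"  {jvms:2d} JVMs: speedup {speedup:4.2f}x, efficiency {efficiency:4.0%}")
```

Each run was measured only once, so small differences between neighbouring rows are probably within the noise.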

> Hmm so does that include compile time (my numbers don't)?  Sounds
> like no?  I'm also measuring from first launch to last finish.

Oh, you mean ANT compile/ execution time before actual testing? No, I
don't include that -- the execution time is actual spawned jvms.

> Yes I think that's the problem!
> Ideally ant would just gather up all "jobs" to run and then
> we'd aggregate/distribute across JVMs.

Could be done by emitting test class/ classpath names from each module
and then running a final testing task that would execute whatever was
appended to the current run... but it seems clumsy to me, don't know
how to do it better though.

> to change the JVM settings (eg CLASSPATH) per test module we have today
> so we must make separate modules for that...?

Yeah, that would be one thing -- different classpaths/ vm properties
etc. This could be problematic.

> I re-ran above -- looks like the times came down some so the new
> ant runner is basically the same as the Python runner (on core tests):
> great!

Thanks. I'm still working on the rough edges (like reporting a jvm
crash, there were problems with ibm j9) and Stanislaw is preparing a
nice(r) test report. We will contribute a patch once this is done and
if there is interest we would love to contribute this in.

Dawid


Re: Parallel tests in ANT, experiment volunteers welcome :)

Dawid Weiss
In reply to this post by Michael McCandless-2
> 15 cpus:
>
>   [junit4] Slave 0:     0.29 ..     5.16 =     4.87s
...
>   [junit4] Slave 3:     0.29 ..    24.20 =    23.92s
>   [junit4] Slave 4:     0.26 ..    27.00 =    26.74s

This is weird -- such discrepancy shouldn't happen after it has some
initial timings unless there was a really skewed test case inside. I
do all per-vm suite balancing beforehand and don't adjust once the
execution is in progress (probably using job stealing); maybe this is
a mistake that should be corrected. Then the order of suites should be
reported in case of a failure and if you have 20 slaves this would be
a fairly large log ;)
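
For readers unfamiliar with up-front balancing, the following is a minimal sketch of one common heuristic, longest-processing-time-first greedy assignment. It is an assumption that something of this kind is what "per-vm suite balancing beforehand" amounts to; the junit4 task may well do it differently. The suite names and timings are hypothetical.

```python
import heapq


def balance_suites(estimates, slaves):
    """Assign suites to slaves up front, longest estimated suite first.

    estimates: dict of suite name -> estimated seconds.
    Returns one (expected_total_seconds, [suite names]) pair per slave.
    """
    # Heap of (current load, slave index); the least-loaded slave pops first.
    heap = [(0.0, i) for i in range(slaves)]
    heapq.heapify(heap)
    plan = [[] for _ in range(slaves)]
    # Place the longest suites first so the short ones can fill the gaps.
    for suite, cost in sorted(estimates.items(), key=lambda kv: (-kv[1], kv[0])):
        load, idx = heapq.heappop(heap)
        plan[idx].append(suite)
        heapq.heappush(heap, (load + cost, idx))
    loads = {idx: load for load, idx in heap}
    return [(loads[i], plan[i]) for i in range(slaves)]


# Hypothetical estimates, not taken from the Lucene test suite.
estimates = {"TestA": 12.0, "TestB": 9.0, "TestC": 7.0, "TestD": 5.0, "TestE": 3.0}
for i, (total, suites) in enumerate(balance_suites(estimates, 2)):
    print(f"Slave {i}: expected {total:.1f}s {suites}")
```

Because the plan is fixed before any JVM starts, a suite whose real runtime differs a lot from its estimate leaves one slave idle early, which is one possible reading of the 4.87s slave in the 15-cpu run.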

Dawid


Re: Parallel tests in ANT, experiment volunteers welcome :)

Michael McCandless-2
In reply to this post by Dawid Weiss
On Thu, Jan 5, 2012 at 2:37 AM, Dawid Weiss
<[hidden email]> wrote:

>> With the Python runner I see these run times on just lucene core tests:
>>    2 cpus: 72.2 sec
>>    5 cpus: 35.0 sec
>>   10 cpus: 28.1 sec
>>   15 cpus: 26.2 sec
>>   20 cpus: 26.0 sec
>>   25 cpus: 27.5 sec
>
> I would say this is aligned with my intuition -- after you exceed the
> physical number of cores things don't speed up anymore.
>
>>   10 cpus: 88.3 sec
>>   15 cpus: 80.2 sec
>>   20 cpus: 77.4 sec
>>   25 cpus: 76.7 sec
>> The above were just running on beast, but the Python runner can
>
> This is probably because some tests don't add anything to CPU load
> (they're disk bound or use the network)? The speedup is also not that
> significant -- adding 15 cpus only yielded about 10 secs.

Right... looks like most of the gains are by 10 CPUs.  Still I'll take
10 seconds ;)

>> Hmm so does that include compile time (my numbers don't)?  Sounds
>> like no?  I'm also measuring from first launch to last finish.
>
> Oh, you mean ANT compile/ execution time before actual testing? No, I
> don't include that -- the execution time is actual spawned jvms.

OK good.

>> Yes I think that's the problem!
>> Ideally ant would just gather up all "jobs" to run and then
>> we'd aggregate/distribute across JVMs.
>
> Could be done by emitting test class/ classpath names from each module
> and then running a final testing task that would execute whatever was
> appended to the current run... but it seems clumsy to me, don't know
> how to do it better though.

OK.... we lose a lot because of this.  Though, I haven't tried w/ your
git clone -- can it run a top-level "ant test" and do the load
balancing by module...?

>> to change the JVM settings (eg CLASSPATH) per test module we have today
>> so we must make separate modules for that...?
>
> Yeah, that would be one thing -- different classpaths/ vm properties
> etc. This could be problematic.

The Python runner completely cheats here, which is bad (because we may
pick up a dep we didn't intend to, and never catch it)... just takes
the union of all CLASSPATHS.

>> I re-ran above -- looks like the times came down some so the new
>> ant runner is basically the same as the Python runner (on core tests):
>> great!
>
> Thanks. I'm still working on the rough edges (like reporting a jvm
> crash, there were problems with ibm j9) and Stanislaw is preparing a
> nice(r) test report. We will contribute a patch once this is done and
> if there is interest we would love to contribute this in.

Awesome!

Mike McCandless

http://blog.mikemccandless.com


Re: Parallel tests in ANT, experiment volunteers welcome :)

Michael McCandless-2
In reply to this post by Dawid Weiss
On Thu, Jan 5, 2012 at 2:43 AM, Dawid Weiss
<[hidden email]> wrote:

>> 15 cpus:
>>
>>   [junit4] Slave 0:     0.29 ..     5.16 =     4.87s
> ...
>>   [junit4] Slave 3:     0.29 ..    24.20 =    23.92s
>>   [junit4] Slave 4:     0.26 ..    27.00 =    26.74s
>
> This is weird -- such discrepancy shouldn't happen after it has some
> initial timings unless there was a really skewed test case inside. I
> do all per-vm suite balancing beforehand and don't adjust once the
> execution is in progress (probably using job stealing); maybe this is
> a mistake that should be corrected. Then the order of suites should be
> reported in case of a failure and if you have 20 slaves this would be
> a fairly large log ;)

It is strange... because I'm running w/ fixed seed, RAMDir and
Lucene40 codec.  There shouldn't be much variance...

The Python runner pre-aggregates the tests into a JVM run, but, it
tries to put ~ 30 seconds worth of tests per JVM, and then front-loads
for any tests that take > 30 seconds (that test runs alone in the
JVM).  So then it's just pulling from that priority queue...

This is somewhat wasteful in that the Python runner is running more
JVMs than the new ant runner, but I do this because the tests can have
such variability on run time... so I think the net effect is just like
job stealing except the Python runner is launching new JVMs to
"steal".
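
The batching described above can be sketched roughly as follows. This is not the code from runAllTests.py; it is a simplified illustration that assumes per-test time estimates are available, and the `make_jobs` helper and test names are hypothetical.

```python
def make_jobs(estimates, target=30.0):
    """Group tests into JVM-sized jobs of roughly `target` seconds each.

    Tests longer than the target run alone in their own JVM; the rest are
    packed greedily. Jobs come back longest first, so the big ones are
    front-loaded and idle workers simply take the next entry.
    """
    jobs = []
    current, current_cost = [], 0.0
    for test, cost in sorted(estimates.items(), key=lambda kv: (-kv[1], kv[0])):
        if cost > target:
            jobs.append((cost, [test]))  # long test gets a JVM to itself
            continue
        if current and current_cost + cost > target:
            # The current batch is full; close it and start a new one.
            jobs.append((current_cost, current))
            current, current_cost = [], 0.0
        current.append(test)
        current_cost += cost
    if current:
        jobs.append((current_cost, current))
    return sorted(jobs, key=lambda job: -job[0])


# Hypothetical estimates in seconds.
estimates = {"TestHuge": 45.0, "TestBig": 20.0, "TestMid": 15.0,
             "TestSmall": 8.0, "TestTiny": 4.0}
for cost, tests in make_jobs(estimates):
    print(f"{cost:5.1f}s  {tests}")
```

Smaller jobs mean more JVM launches, but they also bound how much a single badly estimated test can delay the end of the run.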

Mike McCandless

http://blog.mikemccandless.com


Re: Parallel tests in ANT, experiment volunteers welcome :)

Chris Hostetter-3
In reply to this post by Michael McCandless-2

: > Yeah, that would be one thing -- different classpaths/ vm properties
: > etc. This could be problematic.
:
: The Python runner completely cheats here, which is bad (because we may
: pick up a dep we didn't intend to, and never catch it)... just takes
: the union of all CLASSPATHS.

as long as the default "ant test" does recursive testing of "ant test"
in the individual modules with their isolated classpaths to ensure no
dependency bleedover, a special case top level "ant
run-all-tests-parallel" target that unions all the classpaths seems
like it might be acceptable for things like continuously randomized
test only jenkins builds.

but i wonder if reproducibility might be a problem?  if you don't get the
same classpath, and some classes are loaded in a diff order, would you be
able to "cd modules/foo && ant test -D..." and see the same failures?


-Hoss


Re: Parallel tests in ANT, experiment volunteers welcome :)

Dawid Weiss
In reply to this post by Michael McCandless-2
> It is strange... because I'm running w/ fixed seed, RAMDir and
> Lucene40 codec.  There shouldn't be much variance...

I don't think it's running with a fixed seed. The problem is that
junit4 has the same property to control seed (tests.seed) but a
different seed format; that's why I suggested a few runs and
specifying an empty seed (which is compatible with junit4 and ltc).

> for any tests that take > 30 seconds (that test runs alone in the
> JVM).  So then it's just pulling from that priority queue...

I was thinking to do job stealing but then comes the issue of
reproducibility (the order of suites sent to a particular jvm) in case
the jvm crashes or something. Technically it's easy to do, but after
some deliberation I opted for a fixed list of suites per slave (then
you can re-run with the same list because it's on disk, passed as a
parameter).
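
The reproducibility concern can be illustrated with a small simulation. It is only a sketch with hypothetical suite names and runtimes: slaves pull the next suite from a shared queue whenever they become idle, and the order each slave ends up running depends on how long the suites actually took.

```python
import heapq


def simulate_stealing(queue, durations, slaves):
    """Simulate slaves pulling the next suite from a shared queue when idle.

    Returns the suite order each slave ended up executing.
    """
    # Heap of (time at which the slave becomes free, slave index).
    free_at = [(0.0, i) for i in range(slaves)]
    heapq.heapify(free_at)
    order = [[] for _ in range(slaves)]
    for suite in queue:
        now, idx = heapq.heappop(free_at)
        order[idx].append(suite)
        heapq.heappush(free_at, (now + durations[suite], idx))
    return order


queue = ["TestA", "TestB", "TestC", "TestD"]
# Two hypothetical runs of the same suites; only TestA's runtime differs.
run1 = simulate_stealing(queue, {"TestA": 5.0, "TestB": 3.0, "TestC": 1.0, "TestD": 1.0}, 2)
run2 = simulate_stealing(queue, {"TestA": 2.0, "TestB": 3.0, "TestC": 1.0, "TestD": 1.0}, 2)
print(run1)
print(run2)
```

With a fixed per-slave list the two runs would execute identical sequences in each JVM, at the cost of worse balance when an estimate is off.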

@Hoss:

> but i wonder if reproducibility might be a problem?  if you don't get the
> same classpath, and some classes are loaded in a diff order, would you be
> able to "cd modules/foo && ant test -D..." and see the same failures?

Most likely not. Classpath variations will be an issue. Now that I
think of it even load-balancing will be an issue if it's to be
calculated from repeatedly updated data. On the other hand, if
balancing is calculated from a fixed set of precomputed statistics the
quality may vary from system to system... again no good solutions for
this I guess.

D.


Re: Parallel tests in ANT, experiment volunteers welcome :)

Mark Miller-3
In reply to this post by Dawid Weiss

On Jan 5, 2012, at 2:37 AM, Dawid Weiss wrote:

> if there is interest we would love to contribute this in.

+1! I've been itching to work on something like this since parallel tests were first put in - can't wait to see it go in.

- Mark Miller
lucidimagination.com

Re: Parallel tests in ANT, experiment volunteers welcome :)

Dawid Weiss
I like this too and I'm sure there'll be plenty of places where
helping hands will be more than welcome :)

Side note -- Maven surefire has built-in support for parallel builds
(forked) too, didn't have the time to check how they handled some of
the issues we mentioned.

Dawid

On Thu, Jan 5, 2012 at 9:31 PM, Mark Miller <[hidden email]> wrote:

>
> On Jan 5, 2012, at 2:37 AM, Dawid Weiss wrote:
>
>> if there is interest we would love to contribute this in.
>
> +1! I've been itching to work on something like this since parallel tests were first put in - can't wait to see it go in.
>
> - Mark Miller
> lucidimagination.com