ParallelMultiSearcher reimplementation

ParallelMultiSearcher reimplementation

Gus Holcomb
Hello everyone,
  We are currently using Lucene 1.9.1 at work. Using a profiler, I
discovered that searching with a HitCollector in a ParallelMultiSearcher
is single-threaded. By extending ParallelMultiSearcher I was able to
parallelize it without a problem (and without requiring a new Lucene jar
for deployment). In addition, I re-implemented all of the existing
multithreading using a user-configurable thread pool, queue, and executor
service. The current implementation, which spawns one thread per
searchable, is not only slower but dangerous.
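
For readers who want a concrete picture, here is a minimal sketch of the idea described above: fan a collector-style search out over a caller-supplied executor and funnel all hits into one callback. It is not the patch discussed in this thread and does not use Lucene's classes; HitSink, SubIndex, PooledMultiSearch and the docStarts offsets are hypothetical stand-ins invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical per-hit callback; a stand-in, not Lucene's HitCollector. */
interface HitSink {
    void collect(int doc, float score);
}

/** Hypothetical sub-index; a stand-in, not Lucene's Searchable. */
interface SubIndex {
    void search(String query, HitSink sink);
}

class PooledMultiSearch {
    private final List<SubIndex> subIndexes;
    private final int[] docStarts;       // doc-id offset of each sub-index
    private final ExecutorService pool;  // caller-supplied, hence configurable

    PooledMultiSearch(List<SubIndex> subIndexes, int[] docStarts, ExecutorService pool) {
        this.subIndexes = subIndexes;
        this.docStarts = docStarts;
        this.pool = pool;
    }

    /** Searches every sub-index on the pool and funnels hits into one sink. */
    void search(String query, HitSink sink) {
        List<Future<Void>> pending = new ArrayList<>();
        for (int i = 0; i < subIndexes.size(); i++) {
            SubIndex sub = subIndexes.get(i);
            int start = docStarts[i];
            Callable<Void> task = () -> {
                sub.search(query, (doc, score) -> {
                    // The sink is shared between tasks, so serialize calls into it.
                    synchronized (sink) {
                        sink.collect(doc + start, score);
                    }
                });
                return null;
            };
            pending.add(pool.submit(task));
        }
        for (Future<Void> f : pending) {
            try {
                f.get(); // wait for this sub-search and surface its failure, if any
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while searching", e);
            } catch (ExecutionException e) {
                throw new IllegalStateException("sub-search failed", e.getCause());
            }
        }
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        List<SubIndex> subs = new ArrayList<>();
        for (int i = 0; i < 3; i++) {
            subs.add((query, sink) -> sink.collect(0, 1.0f)); // each reports its local doc 0
        }
        new PooledMultiSearch(subs, new int[] {0, 10, 20}, pool)
            .search("q", (doc, score) -> System.out.println("doc " + doc + " score " + score));
        pool.shutdown();
    }
}
```

Synchronizing on the sink keeps the callback safe without asking collector authors to write thread-safe code, at the cost of serializing the collection step; whether that trade-off is acceptable would depend on how expensive the collector is.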

Is this development already taking place in the trunk? I was unable to
uncover any progress in this area. I haven't contributed to Lucene (or
any open source project) before, but I would be willing to clean up a
number of things in this area if there is interest.

Looking forward to hearing from you...

Gus Holcomb

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Re: ParallelMultiSearcher reimplementation

Chris Hostetter-3

Hi Gus,

: Is this development already taking place in the trunk? I was unable to
: uncover any progress in this area. I haven't contributed to lucene (or
: any open source project) before, but I would be willing to clean up a
: number of things in this area if there was interest.

I'm not aware of any particular development going on with
ParallelMultiSearcher at the moment, but I'm also not that familiar with
ParallelMultiSearcher in general -- so I can't say for sure.

The best way to contribute these changes back to the community is using
the steps outlined here...

        http://wiki.apache.org/jakarta-lucene/HowToContribute

In a nutshell:

  1) check out the trunk using anon SVN
  2) move your code into the lucene package structure
  3) generate a patch using (svn add as necessary and) svn diff
  4) submit the patch as an attachment in Jira.


Don't worry too much about "cleaning" up the code at first ... as long as
it compiles, I would submit a patch; that way you can get eyeballs on it
sooner, and then worry about how clean it is.

-Hoss


RE: ParallelMultiSearcher reimplementation

Gus Holcomb
In reply to this post by Gus Holcomb
Is there any timeline for when Java 1.5 packages will be allowed?

Thanks,
Gus Holcomb

-----Original Message-----
From: Chris Hostetter [mailto:[hidden email]]
Sent: Thursday, November 02, 2006 8:08 PM
To: [hidden email]
Subject: Re: ParallelMultiSearcher reimplementation


Hi Gus,

: Is this development already taking place in the trunk? I was unable to
: uncover any progress in this area. I haven't contributed to lucene (or
: any open source project) before, but I would be willing to clean up a
: number of things in this area if there was interest.

I'm not aware of any particular development going on with
ParallelMultiSearcher at the moment, but I'm also not that familiar with
ParallelMultiSearcher in general -- so I can't say for sure.

The best way to contribute these changes back to the community is using
the steps outlined here...

        http://wiki.apache.org/jakarta-lucene/HowToContribute

In a nutshell:

  1) check out the trunk using anon SVN
  2) move your code into the lucene package structure
  3) generate a patch using (svn add as necessary and) svn diff
  4) submit the patch as an attachment in Jira.


Don't worry too much about "cleaning" up the code at first ... as long
as it compiles, I would submit a patch; that way you can get eyeballs on
it sooner, and then worry about how clean it is.

-Hoss


Re: ParallelMultiSearcher reimplementation

Doug Cutting
In reply to this post by Gus Holcomb
Gus Holcomb wrote:
>   We are currently using Lucene 1.9.1 at work. Using a profiler, I
> discovered that searching with a HitCollector in a ParallelMultiSearcher
> is single threaded. By extending ParallelMultiSearcher I was able to
> parallelize it without a problem (and without requiring a new lucene jar
> for deployment). In addition, I re-implemented all of the existing
> multithreading using a user configurable thread pool, queue and executor
> service, etc. The current implementation of spawning one thread per
> searchable is not only slower, but dangerous.

Please consider breaking these into separate patches, one to permit
ParallelMultiSearcher w/ HitCollector to not be single-threaded, and
another to re-implement things with a thread pool.  The latter is more
controversial, and it would be a shame to have the former wait on it.

Doug

RE: ParallelMultiSearcher reimplementation

Chris Hostetter-3
In reply to this post by Gus Holcomb

: Is there any timeline for when Java 1.5 packages will be allowed?

Since I'd rather not have someone wire my car to explode (a joke that's
particularly funny to people who know that I don't have a driver's
license, let alone a car), I'll refrain from commenting and just point to
these threads...

http://www.nabble.com/Lucene-and-Java-1.5-tf1690825.html
http://www.nabble.com/Results-%28Re%3A-Survey%3A-Lucene-and-Java-1.4-vs.-1.5%29-tf1800681.html

I don't think I'll incite too much rioting to say "no, there is no
timeline" ... I may incite some rioting by saying "my guess is 1.5
packages will be supported when the patches requiring them become highly
desired."




-Hoss


Re: ParallelMultiSearcher reimplementation

Chuck Williams-2
Chris Hostetter wrote on 11/03/2006 09:40 AM:
> : Is there any timeline for when Java 1.5 packages will be allowed?
>
> I don't think I'll incite too much rioting to say "no, there is no
> timeline" ... I may incite some rioting by saying "my guess is 1.5
> packages will be supported when the patches requiring them become
> highly desired."

Not being shy about inciting riots, the problem with this approach is
that people using Java 1.5 are discouraged from submitting patches to
begin with.


Doug Cutting wrote on 11/03/2006 08:39 AM:
> Please consider breaking these into separate patches, one to permit
> ParallelMultiSearcher w/ HitCollector to not be single-threaded, and
> another to re-implement things with a thread pool.  The latter is more
> controversial, and it would be a shame to have the former wait on it.

Why would a thread pool be more controversial?  Dynamically creating and
garbaging threads has many downsides.

Chuck


Re: ParallelMultiSearcher reimplementation

Doug Cutting
Chuck Williams wrote:
> Why would a thread pool be more controversial?  Dynamically creating and
> garbaging threads has many downsides.

The JVM already pools native threads, so mostly what's saved by thread
pools is the allocation & initialization of new Thread instances.  There
are also downsides to thread pools.  They alter ThreadLocal semantics
and generally add complexity that may not be warranted.
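
As an illustration of the ThreadLocal point, the following sketch (all names hypothetical, standard java.util.concurrent only) runs the same task twice on fresh threads and twice on a single pooled thread. On fresh threads each task starts from a clean thread-local value; on the pooled thread the second task sees what the first one left behind.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class ThreadLocalPoolDemo {
    // Per-thread counter; every new thread starts from zero.
    static final ThreadLocal<int[]> COUNTER = ThreadLocal.withInitial(() -> new int[1]);

    static int bump() {
        return ++COUNTER.get()[0];
    }

    /** Two tasks, each on its own new thread: both see a fresh counter. */
    static int[] onFreshThreads() {
        int[] seen = new int[2];
        for (int i = 0; i < 2; i++) {
            int slot = i;
            Thread t = new Thread(() -> seen[slot] = bump());
            t.start();
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return seen;
    }

    /** Two tasks on one pooled thread: the second sees the first task's state. */
    static int[] onPooledThread() {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            Callable<Integer> task = ThreadLocalPoolDemo::bump;
            int[] seen = new int[2];
            for (int i = 0; i < 2; i++) {
                seen[i] = pool.submit(task).get();
            }
            return seen;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        int[] fresh = onFreshThreads();
        int[] pooled = onPooledThread();
        System.out.println("fresh threads: " + fresh[0] + ", " + fresh[1]);   // 1, 1
        System.out.println("pooled thread: " + pooled[0] + ", " + pooled[1]); // 1, 2
    }
}
```

Code that relies on a thread-local being empty at the start of each unit of work would behave differently after a switch to pooling, which is one reading of "alter ThreadLocal semantics".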

Like most optimizations, use of thread pools should be motivated by
benchmarks.

Doug

Re: ParallelMultiSearcher reimplementation

Otis Gospodnetic-2
In reply to this post by Gus Holcomb
Gus, I encourage you to submit your patch.  Java 1.5 is allowed in many people's local Lucene repositories. :)
I use ParallelMultiSearcher in a few high-traffic places and would like to see your improvements.

Otis

----- Original Message ----
From: Gus Holcomb <[hidden email]>
To: [hidden email]
Sent: Friday, November 3, 2006 11:54:56 AM
Subject: RE: ParallelMultiSearcher reimplementation

Is there any timeline for when Java 1.5 packages will be allowed?

Thanks,
Gus Holcomb

Re: ParallelMultiSearcher reimplementation

Chuck Williams-2
In reply to this post by Doug Cutting
Doug Cutting wrote on 11/03/2006 12:18 PM:

> Chuck Williams wrote:
>> Why would a thread pool be more controversial?  Dynamically creating and
>> garbaging threads has many downsides.
>
> The JVM already pools native threads, so mostly what's saved by thread
> pools is the allocation & initialization of new Thread instances.
> There are also downsides to thread pools.  They alter ThreadLocal
> semantics and generally add complexity that may not be warranted.
>
> Like most optimizations, use of thread pools should be motivated by
> benchmarks.

I followed this same logic in ParallelWriter and got burned.  My first
implementation (still the version submitted as a patch in Jira) used
dynamic threads to add the subdocuments to the parallel subindexes
simultaneously.  This hit a problem with abnormal native heap OOMs in
the JVM.  At first I thought it was simply a thread stack size / Java
heap size configuration issue, but adjusting these did not resolve the
issue.  This was on Linux.  ps -L showed large numbers of defunct
threads.  jconsole showed enormous, growing total-ever-allocated thread
counts.  I switched to a thread pool and the issue went away with the
same config settings.

So, I'm not convinced the JVM does such a good job of pooling OS native
threads.

Re. ThreadLocals, I agree the semantics are different, but arguably they
are most useful with thread pools.  With dynamic threads, you get a
reallocation every time, while with thread pools you avoid constant
reallocations.
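
The reallocation argument can be made concrete with a small counting sketch. Everything here is hypothetical illustration code, not anything from ParallelWriter: a thread-local scratch buffer counts how often its initializer runs when the same tasks are executed on fresh threads versus on one pooled thread.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class ScratchBufferDemo {
    static final AtomicInteger ALLOCATIONS = new AtomicInteger();

    // Per-thread scratch buffer; the initializer runs once per thread that touches it.
    static final ThreadLocal<StringBuilder> SCRATCH = ThreadLocal.withInitial(() -> {
        ALLOCATIONS.incrementAndGet();
        return new StringBuilder(1024);
    });

    static void work() {
        StringBuilder sb = SCRATCH.get();
        sb.setLength(0);       // reuse the buffer instead of allocating a new one
        sb.append("subdocument");
    }

    /** Runs each task on a new thread; returns how many buffers were allocated. */
    static int allocationsWithFreshThreads(int tasks) {
        ALLOCATIONS.set(0);
        try {
            for (int i = 0; i < tasks; i++) {
                Thread t = new Thread(ScratchBufferDemo::work);
                t.start();
                t.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return ALLOCATIONS.get();
    }

    /** Runs all tasks on one pooled thread; returns how many buffers were allocated. */
    static int allocationsWithPool(int tasks) {
        ALLOCATIONS.set(0);
        ExecutorService pool = Executors.newSingleThreadExecutor();
        for (int i = 0; i < tasks; i++) {
            pool.execute(ScratchBufferDemo::work);
        }
        pool.shutdown();
        try {
            pool.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return ALLOCATIONS.get();
    }

    public static void main(String[] args) {
        System.out.println("fresh threads: " + allocationsWithFreshThreads(5) + " allocations");
        System.out.println("pooled thread: " + allocationsWithPool(5) + " allocation(s)");
    }
}
```

With one thread per task the buffer is allocated once per task; with the single pooled thread it is allocated once in total.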

Chuck


Re: ParallelMultiSearcher reimplementation

Doug Cutting
Chuck Williams wrote:

> I followed this same logic in ParallelWriter and got burned.  My first
> implementation (still the version submitted as a patch in jira) used
> dynamic threads to add the subdocuments to the parallel subindexes
> simultaneously.  This hit a problem with abnormal native heap OOM's in
> the jvm.  At first I thought it was simply a thread stack size / java
> heap size configuration issue, but adjusting these did not resolve the
> issue.  This was on linux.  ps -L showed large numbers of defunct
> threads.  jconsole showed enormous growing total-ever-allocated thread
> counts.  I switched to a thread pool and the issue went away with the
> same config settings.

Can you demonstrate the problem with a standalone program?
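
A standalone program of the kind asked for might start from a harness like the one below. This is only a rough sketch with hypothetical names: it times trivial tasks run on one new thread each against the same tasks run on a fixed pool. It makes no attempt to reproduce the native-heap behaviour described above (that would need the heap and stack settings from the original setup), and it does no JIT warm-up, so the printed timings are indicative at best.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class SpawnVsPool {
    /** Runs each task on a brand-new thread; returns how many tasks completed. */
    static int runSpawned(int tasks) {
        AtomicInteger done = new AtomicInteger();
        try {
            for (int i = 0; i < tasks; i++) {
                Thread t = new Thread(done::incrementAndGet);
                t.start();
                t.join(); // one live worker at a time keeps the harness simple
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return done.get();
    }

    /** Runs the same tasks on a fixed pool; returns how many tasks completed. */
    static int runPooled(int tasks, int poolSize) {
        AtomicInteger done = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        for (int i = 0; i < tasks; i++) {
            pool.execute(done::incrementAndGet);
        }
        pool.shutdown(); // accept no new work, let queued tasks finish
        try {
            pool.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return done.get();
    }

    public static void main(String[] args) {
        int tasks = 10_000;
        long t0 = System.nanoTime();
        runSpawned(tasks);
        long t1 = System.nanoTime();
        runPooled(tasks, 4);
        long t2 = System.nanoTime();
        System.out.println("spawned: " + (t1 - t0) / 1_000_000 + " ms");
        System.out.println("pooled:  " + (t2 - t1) / 1_000_000 + " ms");
    }
}
```

To probe the reported failure mode rather than raw overhead, one would presumably run the spawned variant with many concurrent spawners under a near-maximal heap and watch the process's native memory and thread counts.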

Way back in the 90's I implemented a system at Excite that spawned one
or more Java threads per request, and it ran for days on end, handling
20 or more requests per second.  The thread spawning overhead was
insignificant.  That was JDK 1.2 on Solaris.  Have things gotten that
much worse in the interim?  Today Hadoop's RPC allocates a thread per
connection, and we see good performance.  So I certainly have
counterexamples.

Doug


Re: ParallelMultiSearcher reimplementation

eks dev
In reply to this post by Gus Holcomb
Maybe someone is interested.
I just remembered: we tested pure Hadoop RPC a few (5+) months ago in a simple setup, a kind of balancing server receiving requests and distributing them to 3 "search units"...

We went that far because Java RMI proved to have ugly latency problems (or we did not get it right, I don't know for sure) in a well-under-a-second scenario.

All in all, spawning a new thread (Hadoop RPC) on Java 1.5, with a round trip over 2 net nodes (100MB) there and back, ran at slightly under 3k / second on a message of a few kB.


Re: ParallelMultiSearcher reimplementation

Chuck Williams-2
In reply to this post by Doug Cutting


Doug Cutting wrote on 11/13/2006 10:50 AM:

> Chuck Williams wrote:
>> I followed this same logic in ParallelWriter and got burned.  My first
>> implementation (still the version submitted as a patch in jira) used
>> dynamic threads to add the subdocuments to the parallel subindexes
>> simultaneously.  This hit a problem with abnormal native heap OOM's in
>> the jvm.  At first I thought it was simply a thread stack size / java
>> heap size configuration issue, but adjusting these did not resolve the
>> issue.  This was on linux.  ps -L showed large numbers of defunct
>> threads.  jconsole showed enormous growing total-ever-allocated thread
>> counts.  I switched to a thread pool and the issue went away with the
>> same config settings.
>
> Can you demonstrate the problem with a standalone program?
>
> Way back in the 90's I implemented a system at Excite that spawned one
> or more Java threads per request, and it ran for days on end, handling
> 20 or more requests per second.  The thread spawning overhead was
> insignificant.  That was JDK 1.2 on Solaris.  Have things gotten that
> much worse in the interim?  Today Hadoop's RPC allocates a thread per
> connection, and we see good performance.  So I certainly have
> counterexamples.

Are you pushing memory to the limit?  In my case, we need a maximally
sized Java heap (about 2.5G on Linux) and so carefully minimize the
thread stack and perm space sizes.  My suspicion is that it takes a
while after a thread is defunct before all resources are reclaimed.  We
are hitting our server with 50 simultaneous threads doing indexing, each
of which writes 6 parallel subindexes in a separate thread.  This yields
hundreds of threads created per second in tight total thread stack
space; the process continually bumped over the native heap limit.  With
the change to thread pools, and therefore no dynamic creation and
destruction of thread stacks, all works fine.
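
To make the arithmetic concrete: 50 indexing threads that each start 6 writer threads per document create up to 300 short-lived threads for every round of documents. A minimal sketch of the pooled alternative follows; SubWriter and PooledParallelAdd are hypothetical stand-ins, not the ParallelWriter patch, and the pool is assumed to be shared by all indexing threads so that the number of worker thread stacks stays bounded by the pool size.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical stand-in for a writer on one parallel subindex. */
interface SubWriter {
    void add(String subDocument);
}

class PooledParallelAdd {
    /** Adds the document to every subindex on the shared pool and waits for all of them. */
    static void addToAll(ExecutorService pool, List<SubWriter> writers, String doc) {
        List<Callable<Void>> tasks = new ArrayList<>();
        for (SubWriter writer : writers) {
            tasks.add(() -> {
                writer.add(doc);
                return null;
            });
        }
        try {
            // invokeAll blocks until every task has finished.
            for (Future<Void> f : pool.invokeAll(tasks)) {
                f.get(); // rethrows a failure from any subindex
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while adding", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("subindex add failed", e.getCause());
        }
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(6); // reused for every document
        List<SubWriter> writers = new ArrayList<>();
        for (int i = 0; i < 6; i++) {
            int index = i;
            writers.add(doc -> System.out.println("subindex " + index + " <- " + doc));
        }
        addToAll(pool, writers, "doc-1");
        addToAll(pool, writers, "doc-2");
        pool.shutdown();
    }
}
```

The calling indexing threads stay outside the pool, so a caller blocked in invokeAll never occupies a worker that its own tasks need.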

Unless you are running with a maximal Java heap, you are unlikely to
have the issue as there is plenty of space left over for the native
heap, so a delay in thread stack reclamation would yield a larger
average process size, but would not cause OOM's.

Chuck


