FieldsReader synchronized access vs. ThreadLocal ?

FieldsReader synchronized access vs. ThreadLocal ?

Robert Engels
In SegmentReader, access to FieldsReader.doc(n) is currently
synchronized (which it must be).

Does it not make sense to use a ThreadLocal implementation similar to the
TermInfosReader?

It seems that in a highly multi-threaded server this synchronized method
could lead to significant blocking when the documents are being retrieved?
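
A minimal sketch of the two access patterns being compared, assuming an in-memory ByteBuffer as a stand-in for the stored-fields file. PerThreadReaderSketch, its fixed RECORD_SIZE layout, and both method names are hypothetical and are not Lucene's FieldsReader. The point is only that a shared cursor needs a lock around seek-then-read, while a per-thread cursor does not.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical stand-in for a stored-fields reader; not Lucene code. */
final class PerThreadReaderSketch {
    private static final int RECORD_SIZE = 16;       // assumed fixed-width records

    private final ByteBuffer shared;                 // single cursor, needs the lock
    private final ThreadLocal<ByteBuffer> perThread; // one cursor per calling thread

    PerThreadReaderSketch(String[] docs) {
        ByteBuffer master = ByteBuffer.allocate(docs.length * RECORD_SIZE);
        for (String doc : docs) {
            // Pad (or truncate) each document to RECORD_SIZE bytes.
            master.put(Arrays.copyOf(doc.getBytes(StandardCharsets.UTF_8), RECORD_SIZE));
        }
        master.flip();
        // duplicate() shares the bytes but gives each view its own position.
        this.shared = master.duplicate();
        this.perThread = ThreadLocal.withInitial(master::duplicate);
    }

    /** Synchronized variant: all callers serialize on one lock and one cursor. */
    synchronized String docSynchronized(int n) {
        return readAt(shared, n);
    }

    /** ThreadLocal variant: each thread seeks its own cursor, so no lock is taken. */
    String docThreadLocal(int n) {
        return readAt(perThread.get(), n);
    }

    private static String readAt(ByteBuffer in, int n) {
        byte[] record = new byte[RECORD_SIZE];
        in.position(n * RECORD_SIZE); // seek
        in.get(record);               // read
        return new String(record, StandardCharsets.UTF_8).trim(); // drop zero padding
    }
}
```

The cost of the second variant is one extra cursor object per thread that ever calls it, which is the usual trade-off with ThreadLocal.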

Re: FieldsReader synchronized access vs. ThreadLocal ?

Doug Cutting
Robert Engels wrote:
> It seems that in a highly multi-threaded server this synchronized method
> could lead to significant blocking when the documents are being retrieved?

Perhaps, but I'd prefer to wait for someone to demonstrate this as a
performance bottleneck before adding another ThreadLocal.

Peter Keegan has recently demonstrated pretty good concurrency using
mmap directory on four and eight CPU systems:

http://www.mail-archive.com/java-user@.../msg05074.html

Peter also wondered if the SegmentReader.document(int) method might be a
bottleneck, and tried patching it to run unsynchronized:

http://www.mail-archive.com/java-user@.../msg05891.html

Unfortunately that did not improve his performance:

http://www.mail-archive.com/java-user@.../msg06163.html

Doug

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

RE: FieldsReader synchronized access vs. ThreadLocal ?

Robert Engels
The test results seem hard to believe. Doubling the CPUs only increased
throughput by 20%??? That seems rather low for a primarily "read only" test.

Peter did not seem to answer many of the follow-up questions (at least I
could not find the answers) regarding whether or not the CPU usage was 100%.
If the OS cache is insufficient to support the size of the index and the
number of queries being executed, then you will not achieve linear increases
with the number of CPUs, since you will quickly become IO bound
(especially if the queries are returning a wide variety of documents that
are scattered throughout the index).

Since reading a document is a relatively expensive operation (especially if
the data blocks are not in the OS cache), while the call is synchronized no
other thread can read a document, or even begin to read one (in the case of an
OS/hardware that supports scatter/gather multiple IO requests). This is not
just applicable to cases where lots of documents are being read. Since the
isDeleted() method uses the same synchronized lock as document(), all query
scorers that filter out deleted documents will also be impacted, as they
will block while the document is being read.

In order to test this, I wrote the attached test case. It uses 2 threads:
one reads every document in a segment, the other reads the same
document repeatedly (as many times as there are documents in the index). The
theory is that "readsame" should be able to execute rather quickly (since
the needed disk blocks will quickly become available in the OS cache), whereas
"readall" will be much slower (since almost every document retrieval
will require disk access).
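
The shape of such a two-thread test can be sketched as follows. This is not the attached MultiThreadSegmentReaderTest.java. TwoThreadReadBenchmark and its readDoc callback are hypothetical names, and the callback stands in for whatever document-loading call is under test.

```java
import java.util.function.IntConsumer;

/** Hypothetical harness: one thread re-reads doc 0, the other reads every doc. */
final class TwoThreadReadBenchmark {

    /** Runs both threads concurrently; returns elapsed millis as {readSame, readAll}. */
    static long[] run(IntConsumer readDoc, int numDocs) {
        long[] elapsedMs = new long[2];
        Thread readSame = timed("ReadSameThread", elapsedMs, 0, () -> {
            for (int i = 0; i < numDocs; i++) readDoc.accept(0); // same document every time
        });
        Thread readAll = timed("ReadAllThread", elapsedMs, 1, () -> {
            for (int i = 0; i < numDocs; i++) readDoc.accept(i); // every document once
        });
        readSame.start();
        readAll.start();
        try {
            // join() also makes the workers' writes to elapsedMs visible here.
            readSame.join();
            readAll.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("benchmark interrupted", e);
        }
        return elapsedMs;
    }

    /** Wraps work in a named thread that records its own wall-clock time. */
    private static Thread timed(String name, long[] out, int slot, Runnable work) {
        return new Thread(() -> {
            long start = System.nanoTime();
            work.run();
            out[slot] = (System.nanoTime() - start) / 1_000_000;
        }, name);
    }
}
```

Passing a synchronized reader and then an unsynchronized one as readDoc gives the two configurations to compare; as in the original test, the OS cache state before each run will dominate the numbers.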

I tested using a segment containing 100k documents. I ran the test on a
single CPU machine (1.2 GHz P4).

I used the Windows "cleanmem" tool to clear the system cache before running the
tests. (It seemed unreliable at times. Does anyone know a fool-proof method
of emptying the system cache on Windows???)

Running using the unmodified SegmentReader and FieldsReader (synchronized)
over multiple tests, I got the following:

BEST TIME
ReadSameThread, time = 2359
ReadAllThread, time = 2469

WORST TIME
ReadSameThread, time = 2671
ReadAllThread, time = 2968

Using the modified (unsynchronized using ThreadLocal) classes, I got the
following:

BEST TIME
ReadSameThread, time = 1328
ReadAllThread, time = 1859

WORST TIME
ReadSameThread, time = 1671
ReadAllThread, time = 1953

I believe that using an MMap directory only improves the situation since the
OS reads the blocks much more efficiently (faster). Imagine if you were
running Lucene using a VERY SLOW disk subsystem - the synchronized block
would have an even greater negative impact.

Hopefully, this is enough to demonstrate the value of using ThreadLocals to
support simultaneous IO.

I look forward to your thoughts, and others - hopefully someone can run the
test on a multiple CPU machine.

Robert

Attachments:
FieldsReader.java (5K)
SegmentReader.java (19K)
MultiThreadSegmentReaderTest.java (2K)

Re: FieldsReader synchronized access vs. ThreadLocal ?

Yonik Seeley
On 5/17/06, Robert Engels <[hidden email]> wrote:
> Since reading a document is a relatively expensive operation

Expensive relative to searching operations that cover 1 million
documents (i.e. you don't want to call doc() a million times)

Solr has a document cache, and I've found that it doesn't help max
throughput that much (I just needed more concurrent searchers to reach
the max), but it does help latency of individual requests.

>[...] Since the
> isDeleted() method uses the same synchronized lock as document(), all query
> scorers that filter out deleted documents will also be impacted, as they
> will block while the document is being read.

Interesting observation... that lock need not be shared.
Using a different lock would mean acquiring two locks per doc() call,
but it may be worth it to unblock the scorers waiting on isDeleted().
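
A rough sketch of the split-lock idea, assuming a BitSet for the deleted-docs vector and a plain array in place of the stored-fields file. SplitLockReaderSketch and its lock names are hypothetical; this is not SegmentReader's code.

```java
import java.util.BitSet;

/** Hypothetical reader with separate locks for deletions and stored fields. */
final class SplitLockReaderSketch {
    private final Object deletedLock = new Object(); // guards deletedDocs only
    private final Object fieldsLock = new Object();  // guards the (slow) document read only
    private final BitSet deletedDocs = new BitSet();
    private final String[] stored;

    SplitLockReaderSketch(String[] stored) {
        this.stored = stored.clone();
    }

    /** Scorers call this; it never waits behind a document read. */
    boolean isDeleted(int n) {
        synchronized (deletedLock) {
            return deletedDocs.get(n);
        }
    }

    void delete(int n) {
        synchronized (deletedLock) {
            deletedDocs.set(n);
        }
    }

    /** Takes two locks in sequence: a short one for the check, then one for the read. */
    String document(int n) {
        if (isDeleted(n)) {
            throw new IllegalArgumentException("attempt to access a deleted document");
        }
        synchronized (fieldsLock) {
            return stored[n];
        }
    }
}
```

Because the two locks are never held at the same time, there is no lock-ordering concern, but a document deleted between the check and the read can still be returned. Whether that window matters depends on the caller.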

It might also be worth it to make a ReadOnlyIndexReader that didn't
have to deal with issues of synchronizing access to the deleted docs
vector.

If only Sun had done their APIs correctly, we could easily make
non-synchronizing implementations of IndexInput and friends w/o
resorting to ThreadLocals... Ah well.

-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search server

RE: FieldsReader synchronized access vs. ThreadLocal ?

Robert Engels
If you run concurrent searches over a million documents, each returning only
its 500 matching documents, you will still encounter significant blocking if
the result sets have few documents in common. The blocking is even
worse due to the isDeleted() synchronization, since query performance degrades
the longer each document() call takes.

Using the ThreadLocal demonstrates excellent performance improvements for
multi-threaded queries. I am not sure why it should not be used, especially since
it also fixes the synchronization problem between isDeleted() and
document().


Re: FieldsReader synchronized access vs. ThreadLocal ?

Yonik Seeley
On 5/17/06, Robert Engels <[hidden email]> wrote:
> If you run a concurrent searches over a million documents, returning only
> the matching 500 of each.

There's the difference... we pretty much never retrieve 500 documents.
 We retrieve exactly the number needed to display a page of search
results (typically 10 to 25).

-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search server

Re: FieldsReader synchronized access vs. ThreadLocal ?

Grant Ingersoll
Somewhat related: the lazy field loading patch uses a ThreadLocal on the
FieldsReader to handle this. It is issue 545 in Jira.

--

Grant Ingersoll
Sr. Software Engineer
Center for Natural Language Processing
Syracuse University
School of Information Studies
335 Hinds Hall
Syracuse, NY 13244

http://www.cnlp.org 
Voice:  315-443-5484
Fax: 315-443-6886


RE: FieldsReader synchronized access vs. ThreadLocal ?

Robert Engels
Does it use a ThreadLocal for the FieldsReader itself? If so, that is somewhat
less efficient than using a ThreadLocal on the streams in the FieldsReader (as
the modified code I supplied does).

In either case it is better than the synchronization on the document() call.
That synchronization is simply not needed.

Re: FieldsReader synchronized access vs. ThreadLocal ?

Peter Keegan
In reply to this post by Robert Engels
Robert,

Sorry I missed your questions.

> The test results seem hard to believe. Doubling the CPUs only increased
> through put by 20%??? Seems rather low for primarily a "read only" test.

I think this refers to the test I did on a 16-CPU (32 hyperthreaded) server.
This system was actually two 8-CPU systems cabled together on their
backplanes. I suspect that some tradeoffs were made in its design that
allowed for this flexibility, which resulted in the minimal improvement in
the tests.

> Peter did not seem to answer many of the follow-up questions (at least I
> could not find the answers) regarding whether or not the CPU usage was
> 100%.

On the 16-CPU system I noticed that load was not distributed very evenly -
some CPUs were near 100%, others were less than 10%. On the AMD Opteron servers,
the distribution was quite even and between 75-100%.

> I look forward to your thoughts, and others - hopefully someone can run the
> test on a multiple CPU machine.

I built Lucene with your mods and ran my test on the 8-CPU AMD Linux
server, but noticed no difference in query throughput. It would seem that
ThreadLocal could improve performance, but I think my bottlenecks are
elsewhere, like IndexInput.readVInt and inserting results in priority
queues.

Peter

RE: FieldsReader synchronized access vs. ThreadLocal ?

Robert Engels
As someone else pointed out, the proposed mods will only affect queries that
return a lot of Documents. If your test is only set up to return a few
documents (or possibly none at all), then you will see no difference.

The fact that some of the CPUs were far less than 100%, and others were at
100% may be a good sign. How many query threads were you testing with?

Re: FieldsReader synchronized access vs. ThreadLocal ?

Peter Keegan
I'm returning 20 results (about .5Kb each). In fact, I had to reduce that
from 50 because the network was becoming the bottleneck.

On the 16-cpu server, I ran tests using 8, 16 and 32 query threads, but
there was no improvement with more threads. I still believe the hardware was
to blame.

Peter

RE: FieldsReader synchronized access vs. ThreadLocal ?

Robert Engels
Given that Lucene is generally VERY CPU bound, having stalled processors
implies that those threads (and more) are blocked, either by IO, or by a
synchronized block - as long as you have more threads than processors.

If the machine has a POOR disk subsystem in comparison to the CPU speed, and
the OS disk cache is too small, you can easily stall the threads.
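
One way to tell the two kinds of stall apart from inside the JVM is to look at thread states. The sketch below uses only the standard Thread API; ThreadStateSnapshot is a hypothetical helper name. Threads waiting to enter a synchronized block report BLOCKED, whereas threads stuck in a blocking native read typically still report RUNNABLE, so a high BLOCKED count points at lock contention rather than IO.

```java
import java.util.EnumMap;
import java.util.Map;

/** Hypothetical helper: counts live JVM threads by their current state. */
final class ThreadStateSnapshot {
    static Map<Thread.State, Integer> countByState() {
        Map<Thread.State, Integer> counts = new EnumMap<>(Thread.State.class);
        // getAllStackTraces() returns one entry per live thread.
        for (Thread t : Thread.getAllStackTraces().keySet()) {
            counts.merge(t.getState(), 1, Integer::sum);
        }
        return counts;
    }
}
```

Sampling this a few times under load (or reading a thread dump) gives a rough picture; it is a snapshot, not a profiler.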

RE: FieldsReader synchronized access vs. ThreadLocal ?

Robert Engels
In reply to this post by Peter Keegan
As an aside,

On my VERY crappy 1.2 GHz single CPU P4, using an index of 300k documents, I
can perform 50 searches per second (returning 20 document matches each).
This includes the time to serialize and send the results to the client
(although the client is on the same machine, so it also competes for CPU
with the search server).

Based on some informal viewing of the CPU usage, the client consumes 50-70%
of the CPU, so I would assume that moving the client off the server should
double the queries per second (although there would be additional delay due
to network transmission). So even a single CPU P4 could easily do 100
queries per second.

Even though we are comparing apples and oranges, unless you are performing
some really expensive queries, I would expect your configuration to be MUCH
faster.

Re: FieldsReader synchronized access vs. ThreadLocal ?

Peter Keegan
The queries are mostly boolean (all AND'd terms); the number of terms varies
anywhere from a few to 25 or more, with 1 or 2 sort fields.

My tests are designed to measure total query throughput, not just raw search
speed. The client test program blasts queries from >50 threads over a socket
and runs on a separate server from Lucene. I can get much higher rates by
just blasting from a single thread in the client, but this doesn't simulate
the real use model.

Peter

RE: FieldsReader synchronized access vs. ThreadLocal ?

Robert Engels
If you can get much higher throughput from a single-threaded client, then
you should queue the requests in the server and process them from a single
thread (or a small pool of threads).

Getting much higher throughput from a single-threaded client also seems to
imply that Lucene (or at least your packaging) is NOT very concurrent
(since more threads actually reduce the efficiency).
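
A minimal sketch of the queueing idea above, assuming plain java.util.concurrent rather than anything in Lucene. The class name QueuedSearchServer, the pool size of 2, and the queue capacity are hypothetical, and the lambda is only a stand-in for running a real query:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.Callable;
import java.util.concurrent.Future;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

/** Hypothetical front end: many client connections, few search threads. */
public class QueuedSearchServer {
    private final ThreadPoolExecutor workers;

    public QueuedSearchServer(int searchThreads, int queueCapacity) {
        // core == max, so the pool never grows; extra requests wait in the
        // bounded queue instead of adding more runnable threads.
        this.workers = new ThreadPoolExecutor(
                searchThreads, searchThreads,
                0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<Runnable>(queueCapacity));
    }

    /** Enqueues a search; the executor rejects it if the queue is full. */
    public <T> Future<T> submit(Callable<T> search) {
        return workers.submit(search);
    }

    public void shutdown() throws InterruptedException {
        workers.shutdown();
        workers.awaitTermination(10, TimeUnit.SECONDS);
    }

    public static void main(String[] args) throws Exception {
        QueuedSearchServer server = new QueuedSearchServer(2, 100);
        List<Future<Integer>> pending = new ArrayList<>();
        for (int i = 0; i < 50; i++) {
            final int queryId = i;
            // Stand-in for executing a query against the index.
            pending.add(server.submit(() -> queryId * 2));
        }
        int done = 0;
        for (Future<Integer> f : pending) {
            f.get(); // wait for each queued search to finish
            done++;
        }
        System.out.println("completed " + done + " queries on 2 search threads");
        server.shutdown();
    }
}
```

The bounded queue is a deliberate choice in this sketch: under sustained overload the server rejects work instead of accumulating an unbounded backlog.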


Re: FieldsReader synchronized access vs. ThreadLocal ?

Yonik Seeley
In reply to this post by Peter Keegan
On 5/19/06, Peter Keegan <[hidden email]> wrote:
> The client test program blasts queries from >50 threads over a socket
> and runs on a separate server from Lucene. I can get much higher rates by
> just blasting from a single thread in the client, but this doesn't simulate
> the real use model.

Wow... what is the real use model?  Do you mean 50 threads each making
requests as fast as they can (sending a new request as soon as they
get a response from the previous)?

Normally, if you have 50 outstanding requests at a time, your server
is clearly overloaded and you need more servers...

Do you get acceptable latency with 50 clients at a time?


-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search server


Re: FieldsReader synchronized access vs. ThreadLocal ?

Peter Keegan
The use model is to be able to handle bursts of heavy query load with
acceptable latency. The test program's threads send requests continuously
with a 50 msec delay in between. A separate thread reads all the results.
This is actually much harsher than the expected load and the latency is
high, but it helps in measuring the limits of the system.
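
A rough sketch of a load generator shaped like the one described above, not the actual test program. The names BurstLoadTest and run are hypothetical, and the backend function replaces the socket connection to the search server. As a simplification, each sender waits for its own response before handing it to the single reader thread:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

public class BurstLoadTest {
    /**
     * Starts senderCount threads that each issue requestsPerSender queries,
     * sleeping delayMillis after each one, while a single reader thread
     * drains the responses. Returns how many responses the reader saw.
     */
    public static int run(int senderCount, int requestsPerSender, long delayMillis,
                          Function<String, String> backend) throws InterruptedException {
        BlockingQueue<String> responses = new LinkedBlockingQueue<>();
        int expected = senderCount * requestsPerSender;
        AtomicInteger received = new AtomicInteger();

        // One thread reads all the results, as in the described test.
        Thread reader = new Thread(() -> {
            try {
                for (int i = 0; i < expected; i++) {
                    responses.take();
                    received.incrementAndGet();
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        reader.start();

        Thread[] senders = new Thread[senderCount];
        for (int s = 0; s < senderCount; s++) {
            final int id = s;
            senders[s] = new Thread(() -> {
                try {
                    for (int i = 0; i < requestsPerSender; i++) {
                        // Stand-in for writing a query to the server socket.
                        responses.put(backend.apply("query-" + id + "-" + i));
                        Thread.sleep(delayMillis);
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            senders[s].start();
        }
        for (Thread t : senders) {
            t.join();
        }
        reader.join();
        return received.get();
    }

    public static void main(String[] args) throws InterruptedException {
        long start = System.nanoTime();
        int n = run(50, 5, 50, q -> "result for " + q);
        double secs = (System.nanoTime() - start) / 1e9;
        System.out.printf("%d responses in %.2f s (%.0f queries/sec)%n", n, secs, n / secs);
    }
}
```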

Peter


RE: FieldsReader synchronized access vs. ThreadLocal ?

Robert Engels
In reply to this post by Robert Engels
FYI, when the OS disk cache is not primed, I get better performance from
more threads (since when one thread is blocked on IO, another may find its
data available and continue its query), but as the cache becomes "ready",
the more threads I use, the worse the performance (due to the overhead of
context switching). I will be able to test very soon on some
multi-processor Sun Opteron machines, so I'll update when the information is
available.

It seems the best solution might be to use an adaptive pool, where the number
of processing threads is adjusted based on the average query response time.
Not sure if it would work reliably (since it would have to ignore complex
queries that might bias the assessment).
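
One way such an adaptive pool could look, as a sketch only. The class AdaptiveSearchPool, the hill-climbing rule (keep stepping the thread count in the same direction while the average improves, reverse when it gets worse), and the fixed cutoff for ignoring slow queries are all assumptions for illustration. It measures execution time inside the worker, not time spent waiting in the queue:

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class AdaptiveSearchPool {
    private final ThreadPoolExecutor pool;
    private final int minThreads;
    private final int maxThreads;
    private final long slowQueryCutoffNanos;

    private long windowTotalNanos;                 // sum of timings since last adjust()
    private int windowCount;                       // number of timings since last adjust()
    private double previousAvgNanos = Double.NaN;  // average of the previous window
    private int direction = 1;                     // +1 grows the pool, -1 shrinks it

    public AdaptiveSearchPool(int minThreads, int maxThreads, long slowQueryCutoffMillis) {
        this.minThreads = minThreads;
        this.maxThreads = maxThreads;
        this.slowQueryCutoffNanos = TimeUnit.MILLISECONDS.toNanos(slowQueryCutoffMillis);
        this.pool = new ThreadPoolExecutor(minThreads, minThreads,
                0L, TimeUnit.MILLISECONDS, new LinkedBlockingQueue<Runnable>());
    }

    /** Runs the query on the pool and records how long it took. */
    public void submit(Runnable query) {
        pool.execute(() -> {
            long start = System.nanoTime();
            try {
                query.run();
            } finally {
                record(System.nanoTime() - start);
            }
        });
    }

    /** Adds one timing to the current window, skipping unusually slow queries. */
    public synchronized void record(long elapsedNanos) {
        if (elapsedNanos > slowQueryCutoffNanos) {
            return; // complex query: ignore so it does not bias the average
        }
        windowTotalNanos += elapsedNanos;
        windowCount++;
    }

    /** Call periodically (e.g. from a timer); returns the pool size now in effect. */
    public synchronized int adjust() {
        int current = pool.getCorePoolSize();
        if (windowCount == 0) {
            return current; // nothing measured, leave the pool alone
        }
        double avg = (double) windowTotalNanos / windowCount;
        windowTotalNanos = 0;
        windowCount = 0;
        if (!Double.isNaN(previousAvgNanos) && avg > previousAvgNanos) {
            direction = -direction; // the last step made things worse: reverse
        }
        previousAvgNanos = avg;
        int target = Math.max(minThreads, Math.min(maxThreads, current + direction));
        if (target > current) {
            pool.setMaximumPoolSize(target); // raise max first so core <= max holds
            pool.setCorePoolSize(target);
        } else if (target < current) {
            pool.setCorePoolSize(target);    // lower core first for the same reason
            pool.setMaximumPoolSize(target);
        }
        return target;
    }

    public void shutdown() {
        pool.shutdown();
    }
}
```

A fixed cutoff is the crudest possible way to ignore complex queries; whether the feedback loop settles reliably on real workloads is exactly the open question raised above.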

Our "search server" runs in a separate process and uses its own pool of
worker threads, so it would be quite easy for us to test something like
this.

I guess this is another reason that a 'core' set of performance tests, with
index corpora, would be ideal for evaluating how different modifications
perform on different hardware/OS combinations.
