[jira] Created: (LUCENE-1195) Performance improvement for TermInfosReader

[jira] Created: (LUCENE-1195) Performance improvement for TermInfosReader

Michael Gibney (Jira)
Performance improvement for TermInfosReader
-------------------------------------------

                 Key: LUCENE-1195
                 URL: https://issues.apache.org/jira/browse/LUCENE-1195
             Project: Lucene - Java
          Issue Type: Improvement
          Components: Index
            Reporter: Michael Busch
            Assignee: Michael Busch
            Priority: Minor
             Fix For: 2.4


Currently we have a bottleneck for multi-term queries: the dictionary lookup is done
twice for each term. The first time is in Similarity.idf(), where searcher.docFreq() is called;
the second is when the posting list is opened (TermDocs or TermPositions).

The dictionary lookup is not cheap, which is why a significant performance improvement is
possible here if we avoid the second lookup. An easy way to do this is to add a small LRU
cache to TermInfosReader.
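Such a cache can be sketched in a few lines with java.util.LinkedHashMap in access order; this is an illustrative sketch, not the patch itself, and SimpleLruCache is a hypothetical name:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal LRU cache sketch (illustrative; not the actual patch code).
// With accessOrder=true, LinkedHashMap moves an entry to the tail on get(),
// and removeEldestEntry() evicts the head once capacity is exceeded.
class SimpleLruCache<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    SimpleLruCache(int capacity) {
        super(16, 0.75f, true); // true = access order, which gives the LRU property
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > capacity;
    }
}
```

With such a cache, the docFreq() lookup would populate an entry and the subsequent posting-list open would hit it instead of searching the dictionary again.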

I ran some performance experiments with an LRU cache size of 20 and a mid-size index of
500,000 documents from Wikipedia. Here are some test results:

50,000 AND queries with 3 terms each:
old:                  152 secs
new (with LRU cache): 112 secs (26% faster)

50,000 OR queries with 3 terms each:
old:                  175 secs
new (with LRU cache): 133 secs (24% faster)

For bigger indexes this patch will probably have less impact; for smaller ones, more.

I will attach a patch soon.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


Re: [jira] Created: (LUCENE-1195) Performance improvement for TermInfosReader

Mike Klaas

On 26-Feb-08, at 3:00 PM, Michael Busch (JIRA) wrote:

>
> 50,000 AND queries with 3 terms each:
> old:                  152 secs
> new (with LRU cache): 112 secs (26% faster)
>
> 50,000 OR queries with 3 terms each:
> old:                  175 secs
> new (with LRU cache): 133 secs (24% faster)
>
> For bigger indexes this patch will probably have less impact, for
> smaller ones more.

Wow -- term lookup occupied 50% of the query execution time?!  How  
many documents matched these queries?

-Mike


Re: [jira] Created: (LUCENE-1195) Performance improvement for TermInfosReader

Robert Engels
In reply to this post by Michael Gibney (Jira)
This can only possibly work if all 50,000 terms are held in memory;
otherwise the cache management overhead is going to matter.

What is the number of terms in the index? What is the distribution
of the terms used in the test case (do they differ for each query)?

Performance tests like this without the detailed data are not very
accurate (or useful).


Re: [jira] Created: (LUCENE-1195) Performance improvement for TermInfosReader

Robert Engels
In reply to this post by Mike Klaas
That's what I was pointing out. The test is VERY broken....


[jira] Commented: (LUCENE-1195) Performance improvement for TermInfosReader

Michael Gibney (Jira)
In reply to this post by Michael Gibney (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-1195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12572747#action_12572747 ]

Yonik Seeley commented on LUCENE-1195:
--------------------------------------

Nice results!
I assume this is on a non-optimized index?
One of the unexpected things I've seen in Solr is that enumerating over all the terms of a non-optimized index is a very significant component of the total time (including iterating over termdocs when necessary).  This was with terms with a relatively low df, though.



[jira] Commented: (LUCENE-1195) Performance improvement for TermInfosReader

Michael Gibney (Jira)
In reply to this post by Michael Gibney (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-1195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12572750#action_12572750 ]

Michael Busch commented on LUCENE-1195:
---------------------------------------

Test details:
The index has 500,000 docs and 3,191,625 unique terms. To construct the queries
I used terms with 1000 < df < 3000; the index has 3,880 of them. I combined the
terms randomly. Each query has at least one hit. The AND queries have 25 hits
on average, the OR queries 5,641.

The LRU cache was pretty small with a size of just 20.

The index is unoptimized with 11 segments.


[jira] Issue Comment Edited: (LUCENE-1195) Performance improvement for TermInfosReader

Michael Gibney (Jira)
In reply to this post by Michael Gibney (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-1195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12572750#action_12572750 ]

michaelbusch edited comment on LUCENE-1195 at 2/26/08 5:04 PM:
----------------------------------------------------------------

Test details:
The index has 500,000 docs and 3,191,625 unique terms. To construct the queries
I used terms with 1000 < df < 3000; the index has 3,880 of them. I combined the
terms randomly. Each query has at least one hit. The AND queries have 25 hits
on average, the OR queries 5,641.

The LRU cache was pretty small with a size of just 20.

The index is unoptimized with 11 segments.

The searcher was warmed for the tests, thus benefiting from FS caching, which
should be a common scenario for indexes of such a medium size.



[jira] Commented: (LUCENE-1195) Performance improvement for TermInfosReader

Michael Gibney (Jira)
In reply to this post by Michael Gibney (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-1195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12572753#action_12572753 ]

Yonik Seeley commented on LUCENE-1195:
--------------------------------------

Thinking about a common use in Solr: doing a query and faceting that query by a field... would that blow out the cache (due to iterating over all the terms in a single field) if it's a global cache? Is there a good way to prevent that from happening (perhaps just change Lucene's existing single-entry thread-local cache to a multi-entry thread-local cache)?


[jira] Commented: (LUCENE-1195) Performance improvement for TermInfosReader

Michael Gibney (Jira)
In reply to this post by Michael Gibney (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-1195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12572768#action_12572768 ]

Eric commented on LUCENE-1195:
------------------------------

When faceting (iterating over all terms), will each term be looked up twice in the dictionary?


[jira] Commented: (LUCENE-1195) Performance improvement for TermInfosReader

Michael Gibney (Jira)
In reply to this post by Michael Gibney (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-1195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12572913#action_12572913 ]

Yonik Seeley commented on LUCENE-1195:
--------------------------------------

{quote}When faceting (iterating over all terms), will each term be looked up twice in the dictionary?{quote}

No, but consider the case of one thread iterating over all the terms in a field (due to faceting, a range query, a prefix query, etc.). That would tend to destroy the cache for other threads doing simple queries, unless a way could be found to bypass the cache during enumeration somehow, or to make each cache thread-specific.
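One way to make each cache thread-specific, as suggested above, is to hang a small LRU map off a ThreadLocal; a hypothetical sketch (PerThreadTermCache is not Lucene API, and ThreadLocal.withInitial is a modern-Java convenience):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical per-thread cache: each thread gets its own small LRU map,
// so a thread enumerating all terms (faceting, range queries) cannot
// evict entries used by threads running ordinary queries.
class PerThreadTermCache<K, V> {
    private final int capacity;
    private final ThreadLocal<Map<K, V>> local =
            ThreadLocal.withInitial(this::newMap);

    PerThreadTermCache(int capacity) { this.capacity = capacity; }

    private Map<K, V> newMap() {
        // Access-ordered LinkedHashMap with bounded size = LRU eviction.
        return new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > capacity;
            }
        };
    }

    V get(K key) { return local.get().get(key); }
    void put(K key, V value) { local.get().put(key, value); }
}
```

Because each map is confined to one thread, no synchronization is needed, at the cost of duplicating hot entries across threads.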


[jira] Updated: (LUCENE-1195) Performance improvement for TermInfosReader

Michael Gibney (Jira)
In reply to this post by Michael Gibney (Jira)

     [ https://issues.apache.org/jira/browse/LUCENE-1195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Busch updated LUCENE-1195:
----------------------------------

    Attachment: lucene-1195.patch

Here is the simple patch. The cache is only used in TermInfosReader.get(Term).

So if, for example, a RangeQuery gets a TermEnum from the IndexReader, then
enumerating the terms with that TermEnum will not replace the terms in the
cache.
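The behavior described, consulting the cache only in get(Term), follows the usual cache-aside pattern; below is a self-contained illustration with stand-in names (CachedDictionary, a plain Map as the "dictionary"), not the actual Lucene code:

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative cache-aside lookup: check the cache first, and only on a
// miss perform the expensive dictionary seek. All names are stand-ins,
// not actual TermInfosReader code.
class CachedDictionary {
    private final Map<String, Integer> dictionary; // term -> docFreq stand-in
    private final Map<String, Integer> cache = new HashMap<>();
    int expensiveLookups = 0; // counts how often the "dictionary" is consulted

    CachedDictionary(Map<String, Integer> dictionary) {
        this.dictionary = dictionary;
    }

    Integer get(String term) {
        Integer info = cache.get(term);   // cache hit skips the scan
        if (info == null) {
            expensiveLookups++;           // the costly dictionary seek
            info = dictionary.get(term);
            if (info != null) {
                cache.put(term, info);
            }
        }
        return info;
    }
}
```

Calling get() twice for the same term, as docFreq() followed by opening TermDocs does, then costs only one dictionary seek.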

The LRUCache itself is not synchronized. It might happen that multiple
threads look up the same term at the same time; then we might get a cache
miss. But I think such a situation should be very rare, and it's therefore
better to avoid the synchronization overhead?

I set the default cache size to 1024. A cache entry is a (Term, TermInfo)
tuple. TermInfo needs 24 bytes, and I think a Term is approx. 20-30 bytes, so
the cache would need about 1024 * ~50 bytes = ~50 KB, plus a bit of overhead
from the LinkedHashMap. This is the memory requirement per index segment,
so a non-optimized index with 20 segments would need about 1 MB more memory
with this cache. I think this is acceptable? Otherwise we can also decrease
the cache size.

All core & contrib tests pass.


[jira] Commented: (LUCENE-1195) Performance improvement for TermInfosReader

Michael Gibney (Jira)
In reply to this post by Michael Gibney (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-1195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12573045#action_12573045 ]

Yonik Seeley commented on LUCENE-1195:
--------------------------------------

{quote}The LRUCache itself is not synchronized. {quote}

Unfortunately, it needs to be... no getting around it.



[jira] Commented: (LUCENE-1195) Performance improvement for TermInfosReader

Michael Gibney (Jira)
In reply to this post by Michael Gibney (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-1195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12573049#action_12573049 ]

Yonik Seeley commented on LUCENE-1195:
--------------------------------------

{quote}
Here is the simple patch. The cache is only used in TermInfosReader.get(Term).

So if for example a RangeQuery gets a TermEnum from the IndexReader, then
enumerating the terms using the TermEnum will not replace the terms in the
cache.
{quote}

But for each term, TermDocs.seek() will be called, and that will do a TermInfosReader.get(Term), replacing items in the cache.
For SegmentReader, seek(TermEnum) actually doesn't call TermInfosReader.get(Term), but any kind of multi-reader does.



[jira] Commented: (LUCENE-1195) Performance improvement for TermInfosReader

Michael Gibney (Jira)
In reply to this post by Michael Gibney (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-1195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12573065#action_12573065 ]

Michael Busch commented on LUCENE-1195:
---------------------------------------

{quote}
Unfortunately, it needs to be... no getting around it.
{quote}

You're right, and I'm stupid :)
Actually what I meant was that the get() and put() methods don't need to
be synchronized if the underlying data structure, i.e. the LinkedHashMap
that I'm using, is thread-safe; otherwise it might return inconsistent
data. But LinkedHashMap is not, unless I decorate it with
Collections.synchronizedMap(). Do you know what's faster: using the
synchronized map, or making get() and put() synchronized? Probably
there's not really a difference, because the decorator that
Collections.synchronizedMap() returns essentially does the same thing?

> Performance improvement for TermInfosReader
> -------------------------------------------
>
>                 Key: LUCENE-1195
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1195
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Michael Busch
>            Assignee: Michael Busch
>            Priority: Minor
>             Fix For: 2.4
>
>         Attachments: lucene-1195.patch
>
>
> Currently we have a bottleneck for multi-term queries: the dictionary lookup is being done
> twice for each term. The first time in Similarity.idf(), where searcher.docFreq() is called.
> The second time when the posting list is opened (TermDocs or TermPositions).
> The dictionary lookup is not cheap, that's why a significant performance improvement is
> possible here if we avoid the second lookup. An easy way to do this is to add a small LRU
> cache to TermInfosReader.
> I ran some performance experiments with an LRU cache size of 20, and a mid-size index of
> 500,000 documents from wikipedia. Here are some test results:
> 50,000 AND queries with 3 terms each:
> old:                  152 secs
> new (with LRU cache): 112 secs (26% faster)
> 50,000 OR queries with 3 terms each:
> old:                  175 secs
> new (with LRU cache): 133 secs (24% faster)
> For bigger indexes this patch will probably have less impact, for smaller ones more.
> I will attach a patch soon.



[jira] Commented: (LUCENE-1195) Performance improvement for TermInfosReader

Michael Gibney (Jira)
In reply to this post by Michael Gibney (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-1195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12573073#action_12573073 ]

Yonik Seeley commented on LUCENE-1195:
--------------------------------------

There's higher-level synchronization too (ensuring that two different threads don't generate the same cache entry at the same time), and I agree that it should not be done in this case.

Just use Collections.synchronizedMap(), it will be the same speed, more readable, and can be easily replaced later anyway.




[jira] Updated: (LUCENE-1195) Performance improvement for TermInfosReader

Michael Gibney (Jira)
In reply to this post by Michael Gibney (Jira)

     [ https://issues.apache.org/jira/browse/LUCENE-1195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Busch updated LUCENE-1195:
----------------------------------

    Attachment: lucene-1195.patch

Changes in the patch:
- the cache is now thread-safe
- added a ThreadLocal to TermInfosReader, so that each thread now has its own cache of size 1024
- SegmentTermEnum.scanTo() now returns the number of invocations of next(). TermInfosReader only
  puts TermInfo objects into the cache if scanTo() has called next() more than once. Thus, if e.g.
  a WildcardQuery or RangeQuery iterates over terms in order, only the first term will be put into
  the cache. This complements the ThreadLocal by preventing a thread from wiping out its
  own cache with such a query.
- added a new package org/apache/lucene/util/cache containing a SimpleMapCache (taken from LUCENE-831)
  and the SimpleLRUCache that was part of the previous patch here. I decided to put the caches in
  a separate package because we can reuse them for different things, like LUCENE-831, or e.g. as an
  LRU cache for recently loaded stored documents after deprecating Hits.
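The per-thread caching policy described in the list above can be sketched roughly like this. All names here are illustrative assumptions (TermInfoCacheSketch, maybeCache, lookup are not Lucene's actual classes or methods); the point is the two ideas: one cache per thread via ThreadLocal, and caching only terms whose lookup needed a real scan (more than one next() call), so sequential enumeration leaves the cache untouched.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of the patch's per-thread caching policy
// (illustrative names; not the actual Lucene code).
class TermInfoCacheSketch {
    static final int CACHE_SIZE = 1024;

    // Each thread gets its own LRU cache, so one thread's wildcard scan
    // cannot evict entries that another thread is relying on.
    private final ThreadLocal<Map<String, Object>> cache =
        ThreadLocal.withInitial(() ->
            new LinkedHashMap<String, Object>(16, 0.75f, true) {
                @Override
                protected boolean removeEldestEntry(Map.Entry<String, Object> e) {
                    return size() > CACHE_SIZE;
                }
            });

    // scanned = number of next() calls scanTo() needed to reach the term.
    // Cache only terms that required a real scan (> 1 step): a WildcardQuery
    // or RangeQuery walking terms in order advances one step at a time and
    // therefore never floods the cache.
    void maybeCache(String term, Object info, int scanned) {
        if (scanned > 1) {
            cache.get().put(term, info);
        }
    }

    Object lookup(String term) {
        return cache.get().get(term);
    }
}
```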
 
I reran the same performance experiments, and it turns out that the speedup is still the same and
the overhead of the ThreadLocal is in the noise. So I think this should be a good approach now?

I also ran similar performance tests on a bigger index with about 4.3 million documents. The
speedup with 50k AND queries was, as expected, not as significant anymore. However, the speedup
was still about 7%. I haven't run the OR queries on the bigger index yet, but most likely the
speedup will not be very significant anymore.

All unit tests pass.




[jira] Updated: (LUCENE-1195) Performance improvement for TermInfosReader

Michael Gibney (Jira)
In reply to this post by Michael Gibney (Jira)

     [ https://issues.apache.org/jira/browse/LUCENE-1195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Busch updated LUCENE-1195:
----------------------------------

    Attachment: lucene-1195.patch

The previous patch had a silly thread-safety problem, which I have now fixed.
Some threads in the TestIndexReaderReopen test occasionally hit
errors (I fixed the test case so that it now fails whenever an error is hit).

I made some other changes to TermInfosReader. Instead of using
two ThreadLocals for the SegmentTermEnum and the cache, I
added a small inner class called ThreadResources which holds
references to those two objects. I also minimized the number of
ThreadLocal.get() calls by passing the enumerator around.

Furthermore, I got rid of the private scanEnum() method and inlined
it into the get() method to fix the above-mentioned thread-safety
problem. I also realized that the cache itself does not have to
be thread-safe, because it lives in a ThreadLocal.
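A rough sketch of the ThreadResources idea, under stated assumptions: ReaderSketch, termEnum, and the scan() placeholder are illustrative inventions, not the actual TermInfosReader code. The point is that bundling all thread-private state in one holder means each lookup pays for only one ThreadLocal.get(), and the per-thread cache needs no locking at all.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the ThreadResources approach described above
// (illustrative names; not the actual Lucene classes).
class ReaderSketch {
    // One holder per thread bundles everything thread-private.
    private static final class ThreadResources {
        Object termEnum;   // stands in for the per-thread SegmentTermEnum
        final Map<String, Object> cache = new HashMap<>();  // thread-private: no locking needed
    }

    private final ThreadLocal<ThreadResources> resources =
        ThreadLocal.withInitial(ThreadResources::new);

    Object get(String term) {
        ThreadResources r = resources.get();   // the single ThreadLocal.get() per lookup
        Object info = r.cache.get(term);
        if (info == null) {
            info = scan(r, term);              // pass the resources around instead of re-fetching
            r.cache.put(term, info);
        }
        return info;
    }

    private Object scan(ThreadResources r, String term) {
        // Placeholder for the real dictionary scan via r.termEnum.
        return "info-for-" + term;
    }
}
```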

I reran the same performance test that I ran for the first patch, and
this version seems to be even faster: 107 secs vs. 112 secs with
the first patch (~30% improvement compared to trunk's 152 secs).

All tests pass, including the improved
TestIndexReaderReopen.testThreadSafety(), which I ran multiple
times.

OK, I think this patch is ready now; I'm planning to commit it in a
day or so.




[jira] Commented: (LUCENE-1195) Performance improvement for TermInfosReader

Michael Gibney (Jira)
In reply to this post by Michael Gibney (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-1195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12599297#action_12599297 ]

Michael Busch commented on LUCENE-1195:
---------------------------------------

{quote}
Using the deprecated method would have the advantage that it (the whole wrapper class in fact) would _have_ to be removed in 3.0.
{quote}

Thanks for reviewing! You're right, I will change it to use the deprecated method and also deprecate the wrapper class itself.





[jira] Updated: (LUCENE-1195) Performance improvement for TermInfosReader

Michael Gibney (Jira)
In reply to this post by Michael Gibney (Jira)

     [ https://issues.apache.org/jira/browse/LUCENE-1195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Busch updated LUCENE-1195:
----------------------------------

    Comment: was deleted




Re: [jira] Commented: (LUCENE-1195) Performance improvement for TermInfosReader

Michael Busch
In reply to this post by Michael Gibney (Jira)
Oops, I added this comment to the wrong issue... too many open browser
tabs... :)

Michael Busch (JIRA) wrote:

>     [ https://issues.apache.org/jira/browse/LUCENE-1195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12599297#action_12599297 ]
>
> Michael Busch commented on LUCENE-1195:
> ---------------------------------------
>
> {quote}
> Using the deprecated method would have the advantage that it (the whole wrapper class in fact) would _have_ to be removed in 3.0.
> {quote}
>
> Thanks for reviewing! You're right, I will change it to use the deprecated method and also deprecate the wrapper class itself.
>
>


