[jira] Created: (LUCENE-2690) Do MultiTermQuery boolean rewrites per segment

classic Classic list List threaded Threaded
47 messages Options
123
Reply | Threaded
Open this post in threaded view
|

[jira] Created: (LUCENE-2690) Do MultiTermQuery boolean rewrites per segment

Sebastian Nagel (Jira)
Do MultiTermQuery boolean rewrites per segment
----------------------------------------------

                 Key: LUCENE-2690
                 URL: https://issues.apache.org/jira/browse/LUCENE-2690
             Project: Lucene - Java
          Issue Type: Improvement
    Affects Versions: 4.0
            Reporter: Uwe Schindler
            Assignee: Uwe Schindler
             Fix For: 4.0
         Attachments: LUCENE-2690.patch

MultiTermQuery currently rewrites FuzzyQuery (using TopTermsBooleanQueryRewrite), the auto constant rewrite method and the ScoringBQ rewrite methods using a MultiFields wrapper on the top-level reader. This is inefficient.

This patch changes the rewrite modes to do the rewrites per segment and uses some additional datastructures (hashed sets/maps) to exclude duplicate terms. All tests currently pass, but FuzzyQuery's tests should not, because it depends for the minimum score handling, that the terms are collected in order..

Robert will fix FuzzyQuery in this issue, too. This patch is just a start.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Updated: (LUCENE-2690) Do MultiTermQuery boolean rewrites per segment

Sebastian Nagel (Jira)

     [ https://issues.apache.org/jira/browse/LUCENE-2690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Uwe Schindler updated LUCENE-2690:
----------------------------------

    Attachment: LUCENE-2690.patch

> Do MultiTermQuery boolean rewrites per segment
> ----------------------------------------------
>
>                 Key: LUCENE-2690
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2690
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: 4.0
>            Reporter: Uwe Schindler
>            Assignee: Uwe Schindler
>             Fix For: 4.0
>
>         Attachments: LUCENE-2690.patch
>
>
> MultiTermQuery currently rewrites FuzzyQuery (using TopTermsBooleanQueryRewrite), the auto constant rewrite method and the ScoringBQ rewrite methods using a MultiFields wrapper on the top-level reader. This is inefficient.
> This patch changes the rewrite modes to do the rewrites per segment and uses some additional datastructures (hashed sets/maps) to exclude duplicate terms. All tests currently pass, but FuzzyQuery's tests should not, because it depends for the minimum score handling, that the terms are collected in order..
> Robert will fix FuzzyQuery in this issue, too. This patch is just a start.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Updated: (LUCENE-2690) Do MultiTermQuery boolean rewrites per segment

Sebastian Nagel (Jira)
In reply to this post by Sebastian Nagel (Jira)

     [ https://issues.apache.org/jira/browse/LUCENE-2690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Uwe Schindler updated LUCENE-2690:
----------------------------------

    Attachment: LUCENE-2690.patch

Updated patch, that also checks for duplicate terms in the fuzzy rewrite. This should be fine now, but we need to fix the FuzzyQuery tests to checks for multiple segments with the same terms that should fail with this patch.

Maybe we need a separate MTQ tests that creates two IndexWriters which add documents with an overlapping term set to both indexes. Queries are then ran using MzultiReader, so we can control merging and make sure the term appears really in two "segments". I will work on a test for that.

> Do MultiTermQuery boolean rewrites per segment
> ----------------------------------------------
>
>                 Key: LUCENE-2690
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2690
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: 4.0
>            Reporter: Uwe Schindler
>            Assignee: Uwe Schindler
>             Fix For: 4.0
>
>         Attachments: LUCENE-2690.patch, LUCENE-2690.patch
>
>
> MultiTermQuery currently rewrites FuzzyQuery (using TopTermsBooleanQueryRewrite), the auto constant rewrite method and the ScoringBQ rewrite methods using a MultiFields wrapper on the top-level reader. This is inefficient.
> This patch changes the rewrite modes to do the rewrites per segment and uses some additional datastructures (hashed sets/maps) to exclude duplicate terms. All tests currently pass, but FuzzyQuery's tests should not, because it depends for the minimum score handling, that the terms are collected in order..
> Robert will fix FuzzyQuery in this issue, too. This patch is just a start.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (LUCENE-2690) Do MultiTermQuery boolean rewrites per segment

Sebastian Nagel (Jira)
In reply to this post by Sebastian Nagel (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-2690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12919496#action_12919496 ]

Michael McCandless commented on LUCENE-2690:
--------------------------------------------

It'd be nice somehow to have MTQ.getTotalNumberOfTerms return the *unique* term count instead of the total number of terms visited across all segments...

> Do MultiTermQuery boolean rewrites per segment
> ----------------------------------------------
>
>                 Key: LUCENE-2690
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2690
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: 4.0
>            Reporter: Uwe Schindler
>            Assignee: Uwe Schindler
>             Fix For: 4.0
>
>         Attachments: LUCENE-2690.patch, LUCENE-2690.patch
>
>
> MultiTermQuery currently rewrites FuzzyQuery (using TopTermsBooleanQueryRewrite), the auto constant rewrite method and the ScoringBQ rewrite methods using a MultiFields wrapper on the top-level reader. This is inefficient.
> This patch changes the rewrite modes to do the rewrites per segment and uses some additional datastructures (hashed sets/maps) to exclude duplicate terms. All tests currently pass, but FuzzyQuery's tests should not, because it depends for the minimum score handling, that the terms are collected in order..
> Robert will fix FuzzyQuery in this issue, too. This patch is just a start.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (LUCENE-2690) Do MultiTermQuery boolean rewrites per segment

Sebastian Nagel (Jira)
In reply to this post by Sebastian Nagel (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-2690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12919528#action_12919528 ]

Michael McCandless commented on LUCENE-2690:
--------------------------------------------

We have to sort the terms coming out of the BytesRefHash, else we get bad seek performance because the within-block seek opto will otherwise often fail to apply...

So I used a TreeMap instead of HashMap.

Then ran a quick perf test on 10 M Wikipedia index:

||Query||QPS clean||QPS mtqseg||Pct diff||||
|unit*|11.83|11.80|{color:red}-0.3%{color}|
|un*d|13.64|16.95|{color:green}24.3%{color}|
|u*d|2.67|3.77|{color:green}41.1%{color}|
|un*ed|34.85|74.94|{color:green}115.0%{color}|
|uni*ed|183.37|437.13|{color:green}138.4%{color}|

So these are good gains!  I can't run FuzzyQuery until we fix the tie-break problem...

I'm really not sure why the prefix query sees no gain yet the others do (I would have actually expected the reverse, because PrefixTermsEnum's accept method is so simple).

> Do MultiTermQuery boolean rewrites per segment
> ----------------------------------------------
>
>                 Key: LUCENE-2690
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2690
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: 4.0
>            Reporter: Uwe Schindler
>            Assignee: Uwe Schindler
>             Fix For: 4.0
>
>         Attachments: LUCENE-2690.patch, LUCENE-2690.patch
>
>
> MultiTermQuery currently rewrites FuzzyQuery (using TopTermsBooleanQueryRewrite), the auto constant rewrite method and the ScoringBQ rewrite methods using a MultiFields wrapper on the top-level reader. This is inefficient.
> This patch changes the rewrite modes to do the rewrites per segment and uses some additional datastructures (hashed sets/maps) to exclude duplicate terms. All tests currently pass, but FuzzyQuery's tests should not, because it depends for the minimum score handling, that the terms are collected in order..
> Robert will fix FuzzyQuery in this issue, too. This patch is just a start.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Updated: (LUCENE-2690) Do MultiTermQuery boolean rewrites per segment

Sebastian Nagel (Jira)
In reply to this post by Sebastian Nagel (Jira)

     [ https://issues.apache.org/jira/browse/LUCENE-2690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-2690:
--------------------------------

    Attachment: LUCENE-2690.patch

we fixed some bugs in the patch, it is not ready for committing, but the tests now pass.

> Do MultiTermQuery boolean rewrites per segment
> ----------------------------------------------
>
>                 Key: LUCENE-2690
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2690
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: 4.0
>            Reporter: Uwe Schindler
>            Assignee: Uwe Schindler
>             Fix For: 4.0
>
>         Attachments: LUCENE-2690.patch, LUCENE-2690.patch, LUCENE-2690.patch
>
>
> MultiTermQuery currently rewrites FuzzyQuery (using TopTermsBooleanQueryRewrite), the auto constant rewrite method and the ScoringBQ rewrite methods using a MultiFields wrapper on the top-level reader. This is inefficient.
> This patch changes the rewrite modes to do the rewrites per segment and uses some additional datastructures (hashed sets/maps) to exclude duplicate terms. All tests currently pass, but FuzzyQuery's tests should not, because it depends for the minimum score handling, that the terms are collected in order..
> Robert will fix FuzzyQuery in this issue, too. This patch is just a start.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Updated: (LUCENE-2690) Do MultiTermQuery boolean rewrites per segment

Sebastian Nagel (Jira)
In reply to this post by Sebastian Nagel (Jira)

     [ https://issues.apache.org/jira/browse/LUCENE-2690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-2690:
---------------------------------------

    Attachment: LUCENE-2690.patch

Another iteration... same perf as before but more failures!

> Do MultiTermQuery boolean rewrites per segment
> ----------------------------------------------
>
>                 Key: LUCENE-2690
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2690
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: 4.0
>            Reporter: Uwe Schindler
>            Assignee: Uwe Schindler
>             Fix For: 4.0
>
>         Attachments: LUCENE-2690.patch, LUCENE-2690.patch, LUCENE-2690.patch, LUCENE-2690.patch
>
>
> MultiTermQuery currently rewrites FuzzyQuery (using TopTermsBooleanQueryRewrite), the auto constant rewrite method and the ScoringBQ rewrite methods using a MultiFields wrapper on the top-level reader. This is inefficient.
> This patch changes the rewrite modes to do the rewrites per segment and uses some additional datastructures (hashed sets/maps) to exclude duplicate terms. All tests currently pass, but FuzzyQuery's tests should not, because it depends for the minimum score handling, that the terms are collected in order..
> Robert will fix FuzzyQuery in this issue, too. This patch is just a start.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Updated: (LUCENE-2690) Do MultiTermQuery boolean rewrites per segment

Sebastian Nagel (Jira)
In reply to this post by Sebastian Nagel (Jira)

     [ https://issues.apache.org/jira/browse/LUCENE-2690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Uwe Schindler updated LUCENE-2690:
----------------------------------

    Attachment: LUCENE-2690.patch

Same as Mike's patch, only with some nocommits removed (max clause count increased) and added the missing FloatUtils file.

> Do MultiTermQuery boolean rewrites per segment
> ----------------------------------------------
>
>                 Key: LUCENE-2690
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2690
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: 4.0
>            Reporter: Uwe Schindler
>            Assignee: Uwe Schindler
>             Fix For: 4.0
>
>         Attachments: LUCENE-2690.patch, LUCENE-2690.patch, LUCENE-2690.patch, LUCENE-2690.patch, LUCENE-2690.patch
>
>
> MultiTermQuery currently rewrites FuzzyQuery (using TopTermsBooleanQueryRewrite), the auto constant rewrite method and the ScoringBQ rewrite methods using a MultiFields wrapper on the top-level reader. This is inefficient.
> This patch changes the rewrite modes to do the rewrites per segment and uses some additional datastructures (hashed sets/maps) to exclude duplicate terms. All tests currently pass, but FuzzyQuery's tests should not, because it depends for the minimum score handling, that the terms are collected in order..
> Robert will fix FuzzyQuery in this issue, too. This patch is just a start.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (LUCENE-2690) Do MultiTermQuery boolean rewrites per segment

Sebastian Nagel (Jira)
In reply to this post by Sebastian Nagel (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-2690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12919561#action_12919561 ]

Uwe Schindler commented on LUCENE-2690:
---------------------------------------

by the way: no more failures - only speed improvements (mostly)! Seems that TestFuzzyQuery2 failed because of the incorrectly increased max clause count!

The last thing to do is fixing the attribute stuff to separate the two attribute parts.

> Do MultiTermQuery boolean rewrites per segment
> ----------------------------------------------
>
>                 Key: LUCENE-2690
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2690
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: 4.0
>            Reporter: Uwe Schindler
>            Assignee: Uwe Schindler
>             Fix For: 4.0
>
>         Attachments: LUCENE-2690.patch, LUCENE-2690.patch, LUCENE-2690.patch, LUCENE-2690.patch, LUCENE-2690.patch
>
>
> MultiTermQuery currently rewrites FuzzyQuery (using TopTermsBooleanQueryRewrite), the auto constant rewrite method and the ScoringBQ rewrite methods using a MultiFields wrapper on the top-level reader. This is inefficient.
> This patch changes the rewrite modes to do the rewrites per segment and uses some additional datastructures (hashed sets/maps) to exclude duplicate terms. All tests currently pass, but FuzzyQuery's tests should not, because it depends for the minimum score handling, that the terms are collected in order..
> Robert will fix FuzzyQuery in this issue, too. This patch is just a start.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (LUCENE-2690) Do MultiTermQuery boolean rewrites per segment

Sebastian Nagel (Jira)
In reply to this post by Sebastian Nagel (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-2690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12919573#action_12919573 ]

Simon Willnauer commented on LUCENE-2690:
-----------------------------------------

Guys, awesome improvements!! Here are some comments...

* In CutOffTermCollector:
{code} final BytesRefHash pendingTerms = new BytesRefHash(new ByteBlockPool(new RecyclingByteBlockAllocator()));{code}
Sice we do not reuse the allocator we don't need to use the synced one here. There is no reset call anywhere to free the allocated blocks too. We should just use new BytesRefHash() here.


* BooleanQueryRewrite#rewrite uses a HashMap to keep track of BytesRef and TermFreqBoost. I wonder if we should make use of the ParallelArray technique we us in the indexing chain together with a BytesRefHash which could safe us lots of object creation and GC cost would be lower to once MTQ gets under load. Those MTQ can create a very large amount of objects though and this seems to be a hot spot. I currently have use-cases for direct support of something like a ParallelArray base class in LUCENE-2186 and it seems we can use it here too.

* In FloatsUtil#nextAfter I wonder if we need the following lines:  {code}
return new Float(direction)
...
return Double.valueOf(direction).floatValue();
{code} since those methods do nothing else than a (float) direction case really.

> Do MultiTermQuery boolean rewrites per segment
> ----------------------------------------------
>
>                 Key: LUCENE-2690
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2690
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: 4.0
>            Reporter: Uwe Schindler
>            Assignee: Uwe Schindler
>             Fix For: 4.0
>
>         Attachments: LUCENE-2690.patch, LUCENE-2690.patch, LUCENE-2690.patch, LUCENE-2690.patch, LUCENE-2690.patch
>
>
> MultiTermQuery currently rewrites FuzzyQuery (using TopTermsBooleanQueryRewrite), the auto constant rewrite method and the ScoringBQ rewrite methods using a MultiFields wrapper on the top-level reader. This is inefficient.
> This patch changes the rewrite modes to do the rewrites per segment and uses some additional datastructures (hashed sets/maps) to exclude duplicate terms. All tests currently pass, but FuzzyQuery's tests should not, because it depends for the minimum score handling, that the terms are collected in order..
> Robert will fix FuzzyQuery in this issue, too. This patch is just a start.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Updated: (LUCENE-2690) Do MultiTermQuery boolean rewrites per segment

Sebastian Nagel (Jira)
In reply to this post by Sebastian Nagel (Jira)

     [ https://issues.apache.org/jira/browse/LUCENE-2690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-2690:
---------------------------------------

    Attachment: LUCENE-2690-hack.patch


I attached a hacked patch... nowhere near committable, various tests
fail, etc... yet I think once we clean it up, the approach is viable.

I started from the patch like 2 iterations ago, and then fixed how the
MTQ BQ rewrite works so that instead of the two passes (first to
gather matching terms, second to create weight/scorers & run the BQ),
it now makes a single pass.

In that single pass it records which terms matched which segments, and
creates TermScorer for each.

After the single pass, once we've summed up the top level docFreq for
all terms, I go back and reset the weights for all the TermScorers,
sumSQ them, normalize, etc., and then create a FakeQuery object whose
only purpose is to remember the per-segment scorers and provide them
once .scorer(...) is called on each segment.

The big gain with this approach is you don't waste effort trying to
seek to non-existent terms in the sub readers.  Normally the terms
cache would save you here, but, we never cache a miss and so when we
try to look that up again it's always a real (costly) seek.

With this approach we can disable using the terms cache entirely from
MTQ.rewrite, which is great.

I believe the patch works correctly, at least for this test, because
on my 10M wikipedia index it gets identical top N results as clean
trunk.  Here're the perf gains:

||Query||QPS clean||QPS mtqseg||Pct diff||||
|state|37.49|37.40|{color:red}-0.2%{color}|
|unit*|11.86|20.23|{color:green}70.5%{color}|
|un*d|13.58|30.85|{color:green}127.2%{color}|
|uni*ed|173.22|535.27|{color:green}209.0%{color}|
|u*d|2.61|9.05|{color:green}247.3%{color}|
|un*ed|33.59|120.32|{color:green}258.1%{color}|

Note that these gains already include the sizable gains from the
original patch, but the single pass approach makes further great
gains, especially eg on the prefix query.

I don't think we should couple this new patch w/ this issue... this
issue already has awesome gains with a fairly minor change...
I'll open a new issue.


> Do MultiTermQuery boolean rewrites per segment
> ----------------------------------------------
>
>                 Key: LUCENE-2690
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2690
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: 4.0
>            Reporter: Uwe Schindler
>            Assignee: Uwe Schindler
>             Fix For: 4.0
>
>         Attachments: LUCENE-2690-hack.patch, LUCENE-2690.patch, LUCENE-2690.patch, LUCENE-2690.patch, LUCENE-2690.patch, LUCENE-2690.patch
>
>
> MultiTermQuery currently rewrites FuzzyQuery (using TopTermsBooleanQueryRewrite), the auto constant rewrite method and the ScoringBQ rewrite methods using a MultiFields wrapper on the top-level reader. This is inefficient.
> This patch changes the rewrite modes to do the rewrites per segment and uses some additional datastructures (hashed sets/maps) to exclude duplicate terms. All tests currently pass, but FuzzyQuery's tests should not, because it depends for the minimum score handling, that the terms are collected in order..
> Robert will fix FuzzyQuery in this issue, too. This patch is just a start.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (LUCENE-2690) Do MultiTermQuery boolean rewrites per segment

Sebastian Nagel (Jira)
In reply to this post by Sebastian Nagel (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-2690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12919590#action_12919590 ]

Michael McCandless commented on LUCENE-2690:
--------------------------------------------

OK I opened LUCENE-2694.

> Do MultiTermQuery boolean rewrites per segment
> ----------------------------------------------
>
>                 Key: LUCENE-2690
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2690
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: 4.0
>            Reporter: Uwe Schindler
>            Assignee: Uwe Schindler
>             Fix For: 4.0
>
>         Attachments: LUCENE-2690-hack.patch, LUCENE-2690.patch, LUCENE-2690.patch, LUCENE-2690.patch, LUCENE-2690.patch, LUCENE-2690.patch
>
>
> MultiTermQuery currently rewrites FuzzyQuery (using TopTermsBooleanQueryRewrite), the auto constant rewrite method and the ScoringBQ rewrite methods using a MultiFields wrapper on the top-level reader. This is inefficient.
> This patch changes the rewrite modes to do the rewrites per segment and uses some additional datastructures (hashed sets/maps) to exclude duplicate terms. All tests currently pass, but FuzzyQuery's tests should not, because it depends for the minimum score handling, that the terms are collected in order..
> Robert will fix FuzzyQuery in this issue, too. This patch is just a start.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (LUCENE-2690) Do MultiTermQuery boolean rewrites per segment

Sebastian Nagel (Jira)
In reply to this post by Sebastian Nagel (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-2690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12919594#action_12919594 ]

Robert Muir commented on LUCENE-2690:
-------------------------------------

{quote}
The big gain with this approach is you don't waste effort trying to
seek to non-existent terms in the sub readers. Normally the terms
cache would save you here, but, we never cache a miss and so when we
try to look that up again it's always a real (costly) seek.

With this approach we can disable using the terms cache entirely from
MTQ.rewrite, which is great.
{quote}

This is the way to go because its horrible for the MTQ to touch the terms cache at all,
and depending on it for good performance is even worse.

I think if you somehow changed the benchmark to use multiple threads and had different
queries executing at the same time, you would see these guys fighting each other
over huge amounts of terms with df=1 and slowing each other down... but we wouldnt
have this problem with them rewriting to FakeQuery



> Do MultiTermQuery boolean rewrites per segment
> ----------------------------------------------
>
>                 Key: LUCENE-2690
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2690
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: 4.0
>            Reporter: Uwe Schindler
>            Assignee: Uwe Schindler
>             Fix For: 4.0
>
>         Attachments: LUCENE-2690-hack.patch, LUCENE-2690.patch, LUCENE-2690.patch, LUCENE-2690.patch, LUCENE-2690.patch, LUCENE-2690.patch
>
>
> MultiTermQuery currently rewrites FuzzyQuery (using TopTermsBooleanQueryRewrite), the auto constant rewrite method and the ScoringBQ rewrite methods using a MultiFields wrapper on the top-level reader. This is inefficient.
> This patch changes the rewrite modes to do the rewrites per segment and uses some additional datastructures (hashed sets/maps) to exclude duplicate terms. All tests currently pass, but FuzzyQuery's tests should not, because it depends for the minimum score handling, that the terms are collected in order..
> Robert will fix FuzzyQuery in this issue, too. This patch is just a start.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (LUCENE-2690) Do MultiTermQuery boolean rewrites per segment

Sebastian Nagel (Jira)
In reply to this post by Sebastian Nagel (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-2690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12919597#action_12919597 ]

Robert Muir commented on LUCENE-2690:
-------------------------------------

bq. In FloatsUtil#nextAfter I wonder if we need the following lines:

Simon, this is a good point. I poached this method from harmony's [StrictMath.nextAfter|http://svn.apache.org/repos/asf/harmony/enhanced/java/branches/java6/classlib/modules/luni/src/main/java/java/lang/StrictMath.java]
Its interesting to take a look also at their [Math.nextAfter|http://svn.apache.org/repos/asf/harmony/enhanced/java/branches/java6/classlib/modules/luni/src/main/java/java/lang/Math.java]

The difference is this Double promotion, I don't understand if this affects us at all or what it would change.
In both cases I do not understand why the boxing is necessary!

> Do MultiTermQuery boolean rewrites per segment
> ----------------------------------------------
>
>                 Key: LUCENE-2690
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2690
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: 4.0
>            Reporter: Uwe Schindler
>            Assignee: Uwe Schindler
>             Fix For: 4.0
>
>         Attachments: LUCENE-2690-hack.patch, LUCENE-2690.patch, LUCENE-2690.patch, LUCENE-2690.patch, LUCENE-2690.patch, LUCENE-2690.patch
>
>
> MultiTermQuery currently rewrites FuzzyQuery (using TopTermsBooleanQueryRewrite), the auto constant rewrite method and the ScoringBQ rewrite methods using a MultiFields wrapper on the top-level reader. This is inefficient.
> This patch changes the rewrite modes to do the rewrites per segment and uses some additional datastructures (hashed sets/maps) to exclude duplicate terms. All tests currently pass, but FuzzyQuery's tests should not, because it depends for the minimum score handling, that the terms are collected in order..
> Robert will fix FuzzyQuery in this issue, too. This patch is just a start.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (LUCENE-2690) Do MultiTermQuery boolean rewrites per segment

Sebastian Nagel (Jira)
In reply to this post by Sebastian Nagel (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-2690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12920511#action_12920511 ]

Uwe Schindler commented on LUCENE-2690:
---------------------------------------

Simon:

bq. Sice we do not reuse the allocator we don't need to use the synced one here. There is no reset call anywhere to free the allocated blocks too. We should just use new BytesRefHash() here.

There is no such ctor in trunk. The only available allocator is the used one.

> Do MultiTermQuery boolean rewrites per segment
> ----------------------------------------------
>
>                 Key: LUCENE-2690
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2690
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: 4.0
>            Reporter: Uwe Schindler
>            Assignee: Uwe Schindler
>             Fix For: 4.0
>
>         Attachments: LUCENE-2690-hack.patch, LUCENE-2690.patch, LUCENE-2690.patch, LUCENE-2690.patch, LUCENE-2690.patch, LUCENE-2690.patch
>
>
> MultiTermQuery currently rewrites FuzzyQuery (using TopTermsBooleanQueryRewrite), the auto constant rewrite method and the ScoringBQ rewrite methods using a MultiFields wrapper on the top-level reader. This is inefficient.
> This patch changes the rewrite modes to do the rewrites per segment and uses some additional datastructures (hashed sets/maps) to exclude duplicate terms. All tests currently pass, but FuzzyQuery's tests should not, because it depends for the minimum score handling, that the terms are collected in order..
> Robert will fix FuzzyQuery in this issue, too. This patch is just a start.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (LUCENE-2690) Do MultiTermQuery boolean rewrites per segment

Sebastian Nagel (Jira)
In reply to this post by Sebastian Nagel (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-2690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12920589#action_12920589 ]

Simon Willnauer commented on LUCENE-2690:
-----------------------------------------

bq. There is no such ctor in trunk. The only available allocator is the used one.
good point there is one in mine :D - I'm going to upload a patch for this later / tomorrow...

> Do MultiTermQuery boolean rewrites per segment
> ----------------------------------------------
>
>                 Key: LUCENE-2690
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2690
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: 4.0
>            Reporter: Uwe Schindler
>            Assignee: Uwe Schindler
>             Fix For: 4.0
>
>         Attachments: LUCENE-2690-hack.patch, LUCENE-2690.patch, LUCENE-2690.patch, LUCENE-2690.patch, LUCENE-2690.patch, LUCENE-2690.patch
>
>
> MultiTermQuery currently rewrites FuzzyQuery (using TopTermsBooleanQueryRewrite), the auto constant rewrite method and the ScoringBQ rewrite methods using a MultiFields wrapper on the top-level reader. This is inefficient.
> This patch changes the rewrite modes to do the rewrites per segment and uses some additional datastructures (hashed sets/maps) to exclude duplicate terms. All tests currently pass, but FuzzyQuery's tests should not, because it depends for the minimum score handling, that the terms are collected in order..
> Robert will fix FuzzyQuery in this issue, too. This patch is just a start.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Updated: (LUCENE-2690) Do MultiTermQuery boolean rewrites per segment

Sebastian Nagel (Jira)
In reply to this post by Sebastian Nagel (Jira)

     [ https://issues.apache.org/jira/browse/LUCENE-2690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Simon Willnauer updated LUCENE-2690:
------------------------------------

    Attachment: LUCENE-2690.patch

Current patch makes use of a DirectAllocator without recycling etc. I remove the unnecessary boxing in FlaotsUtil and replaced the terms HashMap with a BytesRefHash. I skipped the latest patch since mike marked it as a hack and opened a new issue for that. This one is based on uwes latest one. All tests pass for me though.

> Do MultiTermQuery boolean rewrites per segment
> ----------------------------------------------
>
>                 Key: LUCENE-2690
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2690
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: 4.0
>            Reporter: Uwe Schindler
>            Assignee: Uwe Schindler
>             Fix For: 4.0
>
>         Attachments: LUCENE-2690-hack.patch, LUCENE-2690.patch, LUCENE-2690.patch, LUCENE-2690.patch, LUCENE-2690.patch, LUCENE-2690.patch, LUCENE-2690.patch
>
>
> MultiTermQuery currently rewrites FuzzyQuery (using TopTermsBooleanQueryRewrite), the auto constant rewrite method and the ScoringBQ rewrite methods using a MultiFields wrapper on the top-level reader. This is inefficient.
> This patch changes the rewrite modes to do the rewrites per segment and uses some additional datastructures (hashed sets/maps) to exclude duplicate terms. All tests currently pass, but FuzzyQuery's tests should not, because it depends for the minimum score handling, that the terms are collected in order..
> Robert will fix FuzzyQuery in this issue, too. This patch is just a start.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Updated: (LUCENE-2690) Do MultiTermQuery boolean rewrites per segment

Sebastian Nagel (Jira)
In reply to this post by Sebastian Nagel (Jira)

     [ https://issues.apache.org/jira/browse/LUCENE-2690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Uwe Schindler updated LUCENE-2690:
----------------------------------

    Attachment: LUCENE-2690.patch

Thanks for the improvements, some comments and changes I did locally:

- The code in BooleanQueryRewrite uses += for the boost and docFreq in the case of (>=0, no entry in BytesRefHash), but this should only be an assignment. The update and comparison in the assert should be done only when an entry is already in the hash. Boosts should never be sumed up.
- The parts for update with LUCENE-2702 are marked, they wrap currently with new BytesRef(#get(i)) and should be replaced with code like it was before using PagedBytes
- The work for creating the BytesStartArray is much to do, maybe we can unfinal the DirectBytesStartArray and reuse the code. This would make it easier to extend it and simply add more parallel arrays. Client code should not need to replcate the code (this is maybe another issue).
- But there is also a problem with the current code in TermFreqBoostByteStart: The arrays may not use the exact same size as expected (depending how oversize/grow works). As they are parallel arrays, all should be equal size, so we should only use grow/oversize only for the base array and resize the others to same size. Do we have an ArrayUtil method for that? Currently it (may) be broken. Any comments?

> Do MultiTermQuery boolean rewrites per segment
> ----------------------------------------------
>
>                 Key: LUCENE-2690
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2690
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: 4.0
>            Reporter: Uwe Schindler
>            Assignee: Uwe Schindler
>             Fix For: 4.0
>
>         Attachments: LUCENE-2690-hack.patch, LUCENE-2690.patch, LUCENE-2690.patch, LUCENE-2690.patch, LUCENE-2690.patch, LUCENE-2690.patch, LUCENE-2690.patch, LUCENE-2690.patch
>
>
> MultiTermQuery currently rewrites FuzzyQuery (using TopTermsBooleanQueryRewrite), the auto constant rewrite method and the ScoringBQ rewrite methods using a MultiFields wrapper on the top-level reader. This is inefficient.
> This patch changes the rewrite modes to do the rewrites per segment and uses some additional datastructures (hashed sets/maps) to exclude duplicate terms. All tests currently pass, but FuzzyQuery's tests should not, because it depends for the minimum score handling, that the terms are collected in order..
> Robert will fix FuzzyQuery in this issue, too. This patch is just a start.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Updated: (LUCENE-2690) Do MultiTermQuery boolean rewrites per segment

Sebastian Nagel (Jira)
In reply to this post by Sebastian Nagel (Jira)

     [ https://issues.apache.org/jira/browse/LUCENE-2690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Uwe Schindler updated LUCENE-2690:
----------------------------------

    Attachment: LUCENE-2690.patch

Here a patch with the allocation problems resolved. Also the DirectBytesStartArray is public.

> Do MultiTermQuery boolean rewrites per segment
> ----------------------------------------------
>
>                 Key: LUCENE-2690
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2690
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: 4.0
>            Reporter: Uwe Schindler
>            Assignee: Uwe Schindler
>             Fix For: 4.0
>
>         Attachments: LUCENE-2690-hack.patch, LUCENE-2690.patch, LUCENE-2690.patch, LUCENE-2690.patch, LUCENE-2690.patch, LUCENE-2690.patch, LUCENE-2690.patch, LUCENE-2690.patch, LUCENE-2690.patch
>
>
> MultiTermQuery currently rewrites FuzzyQuery (using TopTermsBooleanQueryRewrite), the auto constant rewrite method and the ScoringBQ rewrite methods using a MultiFields wrapper on the top-level reader. This is inefficient.
> This patch changes the rewrite modes to do the rewrites per segment and uses some additional datastructures (hashed sets/maps) to exclude duplicate terms. All tests currently pass, but FuzzyQuery's tests should not, because it depends for the minimum score handling, that the terms are collected in order..
> Robert will fix FuzzyQuery in this issue, too. This patch is just a start.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (LUCENE-2690) Do MultiTermQuery boolean rewrites per segment

Sebastian Nagel (Jira)
In reply to this post by Sebastian Nagel (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-2690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12920909#action_12920909 ]

Simon Willnauer commented on LUCENE-2690:
-----------------------------------------

bq. The code in BooleanQueryRewrite uses += for the boost and docFreq in the case of (>=0, no entry in BytesRefHash), but this should only be an assignment. The update and comparison in the assert should be done only when an entry is already in the hash. Boosts should never be sumed up.
ah yeah - true for sure! it did not break since that only happens once when it is initially added. but you are right for sure that this should only be an assignment

{quote}
But there is also a problem with the current code in TermFreqBoostByteStart: The arrays may not use the exact same size as expected (depending how oversize/grow works). As they are parallel arrays, all should be equal size, so we should only use grow/oversize only for the base array and resize the others to same size. Do we have an ArrayUtil method for that? Currently it (may) be broken. Any comments?
{quote}

good catch man! this won't happen here but its cleaner to use the exact same size. The bigger problem is that I missed to add the right constant to the grow method though. I can fix in a minute

> Do MultiTermQuery boolean rewrites per segment
> ----------------------------------------------
>
>                 Key: LUCENE-2690
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2690
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: 4.0
>            Reporter: Uwe Schindler
>            Assignee: Uwe Schindler
>             Fix For: 4.0
>
>         Attachments: LUCENE-2690-hack.patch, LUCENE-2690.patch, LUCENE-2690.patch, LUCENE-2690.patch, LUCENE-2690.patch, LUCENE-2690.patch, LUCENE-2690.patch, LUCENE-2690.patch, LUCENE-2690.patch
>
>
> MultiTermQuery currently rewrites FuzzyQuery (using TopTermsBooleanQueryRewrite), the auto constant rewrite method and the ScoringBQ rewrite methods using a MultiFields wrapper on the top-level reader. This is inefficient.
> This patch changes the rewrite modes to do the rewrites per segment and uses some additional datastructures (hashed sets/maps) to exclude duplicate terms. All tests currently pass, but FuzzyQuery's tests should not, because it depends for the minimum score handling, that the terms are collected in order..
> Robert will fix FuzzyQuery in this issue, too. This patch is just a start.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

123