Performance between Filter and HitCollector?

classic Classic list List threaded Threaded
8 messages Options
adb
Reply | Threaded
Open this post in threaded view
|

Performance between Filter and HitCollector?

adb
There are (at least) two ways to generate a BitSet which can be used for filtering.

Filter.bits()

   BitSet bits = new BitSet(reader.maxDoc());
   TermDocs td = reader.termDocs(new Term("field", "text");
   while (td.next())
   {
       bits.set(td.doc());
   }
   return bits;

and HitCollector.collect(), as suggested in Javadocs

    final BitSet bits = new BitSet(indexReader.maxDoc());
    searcher.search(query, new HitCollector() {
        public void collect(int doc, float score) {
          bits.set(doc);
        }
      });

SOLR seems to use DocSetHitCollector in places which allows the DocSet interface
to be used rather then plain old BitSet which allows small sets to be optimised,
but does anyone know the performance implications of using HitCollector, if
score is not required, against using Filter and then generating a DocSet?

Antony






---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: Performance between Filter and HitCollector?

Chris Hostetter-3

it's kind of an Apples/Oranges comparison .. in the examples you gave
below, one is executing an arbitrary query (which oculd be anything) the
other is doing a simple TermEnumeration.

Asuming that Query is a TermQuery, the Filter is theoreticaly going to be
faster becuase it does't have to compute any Scores ... generally speaking
a a Filter will alwyas be a little faster then a functionally equivilent
Query for the purposes of building up a simple BitSet of matching
documents because teh Query involves the score calcuations ... but the
Query is generally more usable.

The Query can also be more efficient in other ways, because the
HitCollector doesn't *have* to build a BitSet, it can deal with the
results in whatever way it wants (where as a Filter allways generates a
BitSet).

Solr goes the HitCollector route for a few reasons:
  1) allows us to use hte DocSet abstraction which allows other
     performance benefits over straight BitSets
  2) allows us to have simpler code that builds DocSets and DocLists
     (DocLists know about scores, sorting, and pagination) in a single
     pass when scores or sorting are requested.



-Hoss


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: Performance between Filter and HitCollector?

eks dev
In reply to this post by adb
just to complete this fine answer,
there is also Matcher patch (https://issues.apache.org/jira/browse/LUCENE-584)  that could bring the best of both worlds via e.g. ConstantScoringQuery or another abstraction that enables disabling Scoring (where appropriate)

----- Original Message ----
From: Chris Hostetter <[hidden email]>
To: [hidden email]
Sent: Wednesday, 14 March, 2007 7:15:06 PM
Subject: Re: Performance between Filter and HitCollector?


it's kind of an Apples/Oranges comparison .. in the examples you gave
below, one is executing an arbitrary query (which oculd be anything) the
other is doing a simple TermEnumeration.

Asuming that Query is a TermQuery, the Filter is theoreticaly going to be
faster becuase it does't have to compute any Scores ... generally speaking
a a Filter will alwyas be a little faster then a functionally equivilent
Query for the purposes of building up a simple BitSet of matching
documents because teh Query involves the score calcuations ... but the
Query is generally more usable.

The Query can also be more efficient in other ways, because the
HitCollector doesn't *have* to build a BitSet, it can deal with the
results in whatever way it wants (where as a Filter allways generates a
BitSet).

Solr goes the HitCollector route for a few reasons:
  1) allows us to use hte DocSet abstraction which allows other
     performance benefits over straight BitSets
  2) allows us to have simpler code that builds DocSets and DocLists
     (DocLists know about scores, sorting, and pagination) in a single
     pass when scores or sorting are requested.



-Hoss


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]






               
___________________________________________________________
All New Yahoo! Mail – Tired of unwanted email come-ons? Let our SpamGuard protect you. http://uk.docs.yahoo.com/nowyoucan.html

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: Performance between Filter and HitCollector?

Otis Gospodnetic-2
In reply to this post by adb
eks dev and others - have you tried using the code from LUCENE-584?  Noticed any performance increase when you disabled scoring?  I'd like to look at that patch soon and commit it if everything is in place and makes sense, so I'm curious if you or anyone else already tried this patch...

Thanks,
Otis
 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Simpy -- http://www.simpy.com/  -  Tag  -  Search  -  Share

----- Original Message ----
From: eks dev <[hidden email]>
To: [hidden email]
Sent: Wednesday, March 14, 2007 3:59:25 PM
Subject: Re: Performance between Filter and HitCollector?

just to complete this fine answer,
there is also Matcher patch (https://issues.apache.org/jira/browse/LUCENE-584)  that could bring the best of both worlds via e.g. ConstantScoringQuery or another abstraction that enables disabling Scoring (where appropriate)

----- Original Message ----
From: Chris Hostetter <[hidden email]>
To: [hidden email]
Sent: Wednesday, 14 March, 2007 7:15:06 PM
Subject: Re: Performance between Filter and HitCollector?


it's kind of an Apples/Oranges comparison .. in the examples you gave
below, one is executing an arbitrary query (which oculd be anything) the
other is doing a simple TermEnumeration.

Asuming that Query is a TermQuery, the Filter is theoreticaly going to be
faster becuase it does't have to compute any Scores ... generally speaking
a a Filter will alwyas be a little faster then a functionally equivilent
Query for the purposes of building up a simple BitSet of matching
documents because teh Query involves the score calcuations ... but the
Query is generally more usable.

The Query can also be more efficient in other ways, because the
HitCollector doesn't *have* to build a BitSet, it can deal with the
results in whatever way it wants (where as a Filter allways generates a
BitSet).

Solr goes the HitCollector route for a few reasons:
  1) allows us to use hte DocSet abstraction which allows other
     performance benefits over straight BitSets
  2) allows us to have simpler code that builds DocSets and DocLists
     (DocLists know about scores, sorting, and pagination) in a single
     pass when scores or sorting are requested.



-Hoss


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]






       
___________________________________________________________
All New Yahoo! Mail – Tired of unwanted email come-ons? Let our SpamGuard protect you. http://uk.docs.yahoo.com/nowyoucan.html

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]





---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

adb
Reply | Threaded
Open this post in threaded view
|

Re: Performance between Filter and HitCollector?

adb
In reply to this post by Chris Hostetter-3
Thanks for the detailed reponse Hoss.  That's the sort of in depth golden nugget
I'd like to see in a copy of LIA 2 when it becomes available...

I've wanted to use Filter to cache certain of my Term Queries, as it looked
faster for straight Term Query searches, but Solr's DocSet interface abstraction
is more useful.  HashDocSet will probably satisfy 90% of my cache.

Index DBs will typically be in the 1-3 million  documents range, but for mail
which is spread over 1-6K user, so caching lots of BitSets for that number of
users in not practical!

I ended up creating a DocSetFilter and creating DocSets (a la Solr) from BitSet
which is then cached.  I then convert it back during Filter.bits().  Not the
best solution, but the typical hit size is small, so the iteration is fast.

Thanks eks dev for the info about Lucene-584 - that looks like an interesting
set of patches.

Antony

Chris Hostetter wrote:

> it's kind of an Apples/Oranges comparison .. in the examples you gave
> below, one is executing an arbitrary query (which oculd be anything) the
> other is doing a simple TermEnumeration.
>
> Asuming that Query is a TermQuery, the Filter is theoreticaly going to be
> faster becuase it does't have to compute any Scores ... generally speaking
> a a Filter will alwyas be a little faster then a functionally equivilent
> Query for the purposes of building up a simple BitSet of matching
> documents because teh Query involves the score calcuations ... but the
> Query is generally more usable.
>
> The Query can also be more efficient in other ways, because the
> HitCollector doesn't *have* to build a BitSet, it can deal with the
> results in whatever way it wants (where as a Filter allways generates a
> BitSet).
>
> Solr goes the HitCollector route for a few reasons:
>   1) allows us to use hte DocSet abstraction which allows other
>      performance benefits over straight BitSets
>   2) allows us to have simpler code that builds DocSets and DocLists
>      (DocLists know about scores, sorting, and pagination) in a single
>      pass when scores or sorting are requested.



---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: Performance between Filter and HitCollector?

Karl Wettin
In reply to this post by Otis Gospodnetic-2

15 mar 2007 kl. 04.09 skrev Otis Gospodnetic:

> eks dev and others - have you tried using the code from  
> LUCENE-584?  Noticed any performance increase when you disabled  
> scoring?  I'd like to look at that patch soon and commit it if  
> everything is in place and makes sense, so I'm curious if you or  
> anyone else already tried this patch...

I was trying out Matcher some month ago when fooling around with ways  
of improving speed in the "active search cache" of LUCENE-550. It  
worked just fine for me. I made no futher investigations, nor do I  
have any performance details. I plan to implement it in there for  
real any year now.

So +1 for commit.


--
karl

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: Performance between Filter and HitCollector?

eks dev
In reply to this post by adb
great!

a few words to refresh my memory, it's been a while...

- This patch lays only groundwork and should not cause any performance changes in existing code per se, that is what we have tested extensively some months ago (compatibility), it applied cleanly,  passed all Junit tests and our internal regression test noticed no difference in speed or behavior, benchmarking code was not existent at that time.

- it removes Filter dependency on BitSet and provides a few BitSet alternatives, which is in itself quantum leap in Filter usage as it makes endless possibilities (e.g. solr OpenBitSet would bring immediate speed-up, much better Filter cache  efficiency...). We made a few experiments where we replaced BitSet with VInt and OpenBitSet implementations and significant (up to 40%) speed-ups, but it was a long time ago.

- What we did not attempt is to disable Scoring systematically (that would be the next step after having this patch committed, practically using Matcher, I expect quite a few patches to follow, e.g. ConstantScoringQuery).

I am not very familiar with Scoring part of the code, but I am somehow convinced it would be now relatively easy  to support "pure boolean clauses" in Query, without using Filter. So the end user could specify new WhateverQuery.setScored(false) or some so, .... ah, whatever, let us have Matcher committed first as the benefit of getting rid of memory hungry BitSet in Filter justifies it completely.

many thanks for taking this forward! it will help us to get rid of some really ugly code we now maintain (just to get rid of BitSet)....





----- Original Message ----
From: Otis Gospodnetic <[hidden email]>
To: [hidden email]
Sent: Thursday, 15 March, 2007 4:09:44 AM
Subject: Re: Performance between Filter and HitCollector?

eks dev and others - have you tried using the code from LUCENE-584?  Noticed any performance increase when you disabled scoring?  I'd like to look at that patch soon and commit it if everything is in place and makes sense, so I'm curious if you or anyone else already tried this patch...

Thanks,
Otis
 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Simpy -- http://www.simpy.com/  -  Tag  -  Search  -  Share

----- Original Message ----
From: eks dev <[hidden email]>
To: [hidden email]
Sent: Wednesday, March 14, 2007 3:59:25 PM
Subject: Re: Performance between Filter and HitCollector?

just to complete this fine answer,
there is also Matcher patch (https://issues.apache.org/jira/browse/LUCENE-584)  that could bring the best of both worlds via e.g. ConstantScoringQuery or another abstraction that enables disabling Scoring (where appropriate)

----- Original Message ----
From: Chris Hostetter <[hidden email]>
To: [hidden email]
Sent: Wednesday, 14 March, 2007 7:15:06 PM
Subject: Re: Performance between Filter and HitCollector?


it's kind of an Apples/Oranges comparison .. in the examples you gave
below, one is executing an arbitrary query (which oculd be anything) the
other is doing a simple TermEnumeration.

Asuming that Query is a TermQuery, the Filter is theoreticaly going to be
faster becuase it does't have to compute any Scores ... generally speaking
a a Filter will alwyas be a little faster then a functionally equivilent
Query for the purposes of building up a simple BitSet of matching
documents because teh Query involves the score calcuations ... but the
Query is generally more usable.

The Query can also be more efficient in other ways, because the
HitCollector doesn't *have* to build a BitSet, it can deal with the
results in whatever way it wants (where as a Filter allways generates a
BitSet).

Solr goes the HitCollector route for a few reasons:
  1) allows us to use hte DocSet abstraction which allows other
     performance benefits over straight BitSets
  2) allows us to have simpler code that builds DocSets and DocLists
     (DocLists know about scores, sorting, and pagination) in a single
     pass when scores or sorting are requested.



-Hoss


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]






       
___________________________________________________________
All New Yahoo! Mail – Tired of unwanted email come-ons? Let our SpamGuard protect you. http://uk.docs.yahoo.com/nowyoucan.html

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]





---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]






               
___________________________________________________________
What kind of emailer are you? Find out today - get a free analysis of your email personality. Take the quiz at the Yahoo! Mail Championship.
http://uk.rd.yahoo.com/evt=44106/*http://mail.yahoo.net/uk

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: Performance between Filter and HitCollector?

Erik Hatcher
In reply to this post by adb

On Mar 15, 2007, at 12:27 AM, Antony Bowesman wrote:
> Thanks for the detailed reponse Hoss.  That's the sort of in depth  
> golden nugget I'd like to see in a copy of LIA 2 when it becomes  
> available...

NOTED!   :)

        Erik


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]