constant scoring queries

classic Classic list List threaded Threaded
13 messages Options
Reply | Threaded
Open this post in threaded view
|

constant scoring queries

Doug Cutting
Background: In http://issues.apache.org/bugzilla/show_bug.cgi?id=34673,
Yonik Seely proposes a ConstantScoreQuery, based on a Filter.  And in
http://www.mail-archive.com/lucene-dev@.../msg08007.html 
I proposed a mechanism to promote the use of Filters.  Through all of
this, Paul Elshot has hinted that there might be a better way.

Here's another proposal, tackling many of the same issues:

1. Add two methods to Query.java:

   public boolean constantScoring();
   public void constantScoring(boolean);

   When constantScoring(), the boost() is the score for matches.

2. Add two methods to Searcher.java:

   public BitSet cachedBitSet(Query) { return null; }
   public void cacheBitSet(Query, BitSet) {}

   IndexSearcher overrides these to maintain an LRU cache of bitsets.

3. Modify BooleanQuery so that, when constantScoring(), TooManyClauses
is not thrown.

4. Modify BooleanScorer to, if constantScoring(),
   - check Searcher for a cached bitset
   - failing that, create a bitset
   - evaluate clauses serially, saving results in bitset
   - cache the bitset
   - use the bitset to handle doc(), next() and skipTo();

5. TermQuery and PhraseQuery could be similarly modified, so that, when
constant scoring, bitsets are cached for very common terms (e.g., >5% of
documents).

With these changes, WildcardQuery, PrefixQuery, RangeQuery etc., when
declared to be constant scoring, will operate much faster and never
throw TooManyClauses.  We can add an option (the default?) to
QueryParser to make these constant scoring.

Also, instead of BitSet we could use an interface:

   public interface DocIdSet {
     void add(int docId);
     boolean contains(int docId);
     int next(int docId);
   }

to permit sparse representations.

Thoughts?

Doug


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

RE: constant scoring queries

Robert Engels
I did the nearly the exact same thing in my "derived" Lucene. But in order
to limit modifications to the Lucene core, I created a QueryCache class, and
have derived versions of Prefix and Range query consult the class, passing
in the IndexReader and query to see if there is a cached result. I also
calls QueryCache.clear(IndexReader), when the IndexReader goes out of scope.

Will there be a problem with associating the cache with the IndexSearcher
instances, since it seems that common Lucene code uses code similar to

IndexSearcher searcher = new IndexSearcher(reader);

every time they need to perform a search?

It is REALLY efficient for automatic caching of common range queries and
prefix queries, as I think many users of Lucene pass use a range query to
look for documents modified in the "last n days". The ONLY overhead is extra
memory usage (since without the cache the query needs to be executed as is),
but the size of the LRU cache can be controlled via a property.

-----Original Message-----
From: Doug Cutting [mailto:[hidden email]]
Sent: Tuesday, May 10, 2005 3:40 PM
To: [hidden email]
Subject: constant scoring queries


Background: In http://issues.apache.org/bugzilla/show_bug.cgi?id=34673,
Yonik Seely proposes a ConstantScoreQuery, based on a Filter.  And in
http://www.mail-archive.com/lucene-dev@.../msg08007.html
I proposed a mechanism to promote the use of Filters.  Through all of
this, Paul Elshot has hinted that there might be a better way.

Here's another proposal, tackling many of the same issues:

1. Add two methods to Query.java:

   public boolean constantScoring();
   public void constantScoring(boolean);

   When constantScoring(), the boost() is the score for matches.

2. Add two methods to Searcher.java:

   public BitSet cachedBitSet(Query) { return null; }
   public void cacheBitSet(Query, BitSet) {}

   IndexSearcher overrides these to maintain an LRU cache of bitsets.

3. Modify BooleanQuery so that, when constantScoring(), TooManyClauses
is not thrown.

4. Modify BooleanScorer to, if constantScoring(),
   - check Searcher for a cached bitset
   - failing that, create a bitset
   - evaluate clauses serially, saving results in bitset
   - cache the bitset
   - use the bitset to handle doc(), next() and skipTo();

5. TermQuery and PhraseQuery could be similarly modified, so that, when
constant scoring, bitsets are cached for very common terms (e.g., >5% of
documents).

With these changes, WildcardQuery, PrefixQuery, RangeQuery etc., when
declared to be constant scoring, will operate much faster and never
throw TooManyClauses.  We can add an option (the default?) to
QueryParser to make these constant scoring.

Also, instead of BitSet we could use an interface:

   public interface DocIdSet {
     void add(int docId);
     boolean contains(int docId);
     int next(int docId);
   }

to permit sparse representations.

Thoughts?

Doug


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: constant scoring queries

Yonik Seeley
In reply to this post by Doug Cutting
Hey now... you're going to obsolete all my in-house code and put me
out of a job ;-)

Could you elaborate on the advantage of having say a TermQuery that
could be either normal-scoring or constant-scoring vs two different
Query classes for doing this?  They seem roughly equivalent.


> 1. Add two methods to Query.java:
>
>    public boolean constantScoring();
>    public void constantScoring(boolean);
>
>    When constantScoring(), the boost() is the score for matches.

That seems fine.

> 2. Add two methods to Searcher.java:
>
>    public BitSet cachedBitSet(Query) { return null; }
>    public void cacheBitSet(Query, BitSet) {}
>
>    IndexSearcher overrides these to maintain an LRU cache of bitsets.

Yup, that's what I have.
Things should be extensible and use a caching interface - the default
implementation being an LRU cache, but users could use their own
implementations to get LFU behavior or whatever.
 
> 3. Modify BooleanQuery so that, when constantScoring(), TooManyClauses
> is not thrown.

This is good, but not sufficient for RangeQuery.  If
RangeQuery.constantScoring(), then it should not rewrite to a
BooleanQuery at all.  Depending on the RangeQuery, just the creation
of a BooleanQuery that matches it is too heavyweight.
 
> Also, instead of BitSet we could use an interface:
>
>    public interface DocIdSet {
>      void add(int docId);
>      boolean contains(int docId);
>      int next(int docId);
>    }
>
> to permit sparse representations.

Definitely a DocIdSet.  It's called DocSet in my code and has a bitset
implementation and a compact implementation that's an int hash set
(unordered cause I just use it as a filter now).  Here is the basic
interface:

public interface DocSet {
  public int size();
  public boolean exists(int docid);
  public DocIterator iterator();
  public BitSet getBits();
  public long memSize();
  public DocSet intersection(DocSet other);
  public int intersectionSize(DocSet other);
  public DocSet union(DocSet other);
  public int unionSize(DocSet other);
}


I would separate out int next(int docId) into an iterator.  It may be
more efficient to iterate over certain structures if you can maintain
state about where you are (and this may even be true of a BitSet).

-Yonik

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: constant scoring queries

Maik Schreiber
In reply to this post by Doug Cutting
> 1. Add two methods to Query.java:
>
>   public boolean constantScoring();
>   public void constantScoring(boolean);
>
>   When constantScoring(), the boost() is the score for matches.
>
> 2. Add two methods to Searcher.java:
>
>   public BitSet cachedBitSet(Query) { return null; }
>   public void cacheBitSet(Query, BitSet) {}

I'd like to vote for a slight variation here, that is, going in line with the
beans theme. Namely, the method signatures should look like:

public boolean isConstantScoring();
public void setConstantScoring(boolean);

public BitSet getCachedBitSet(Query);
public void setCachedBitSet(Query, BitSet);

--
Maik Schreiber   *   http://www.blizzy.de

GPG public key: http://pgp.mit.edu:11371/pks/lookup?op=get&search=0x1F11D713
Key fingerprint: CF19 AFCE 6E3D 5443 9599 18B5 5640 1F11 D713

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: constant scoring queries

Doug Cutting
In reply to this post by Yonik Seeley
Yonik Seeley wrote:
> Could you elaborate on the advantage of having say a TermQuery that
> could be either normal-scoring or constant-scoring vs two different
> Query classes for doing this?  They seem roughly equivalent.

You could code it that way too.  It would require exposing TermWeight
and TermScorer as public classes that would also need to be subclassed.
  My guess is that you'd have to write more code to implement it that
way than to have a boolean flag.

Doug

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

RE: constant scoring queries

Robert Engels
In reply to this post by Yonik Seeley
Would there be anyway to "rewrite" the cached queries as documents are
added?

By this I mean, if a user runs an "expensive" range query that gets cached,
then another user adds a document that should be included in the cached
query, the addDocument() method would "update" the cached query. I think
this is useful when using the "rolling index" method of performing
incremental updates to the index, so cached queries could remain valid after
a roll.

I think a callback interface similar to

updateQuery(Term t,BitSet newdocs);

Then the query if free to ignore the update if the term does not fall inside
the query. This check is obviously query specific.


-----Original Message-----
From: Yonik Seeley [mailto:[hidden email]]
Sent: Tuesday, May 10, 2005 9:42 PM
To: [hidden email]
Subject: Re: constant scoring queries


Hey now... you're going to obsolete all my in-house code and put me
out of a job ;-)

Could you elaborate on the advantage of having say a TermQuery that
could be either normal-scoring or constant-scoring vs two different
Query classes for doing this?  They seem roughly equivalent.


> 1. Add two methods to Query.java:
>
>    public boolean constantScoring();
>    public void constantScoring(boolean);
>
>    When constantScoring(), the boost() is the score for matches.

That seems fine.

> 2. Add two methods to Searcher.java:
>
>    public BitSet cachedBitSet(Query) { return null; }
>    public void cacheBitSet(Query, BitSet) {}
>
>    IndexSearcher overrides these to maintain an LRU cache of bitsets.

Yup, that's what I have.
Things should be extensible and use a caching interface - the default
implementation being an LRU cache, but users could use their own
implementations to get LFU behavior or whatever.

> 3. Modify BooleanQuery so that, when constantScoring(), TooManyClauses
> is not thrown.

This is good, but not sufficient for RangeQuery.  If
RangeQuery.constantScoring(), then it should not rewrite to a
BooleanQuery at all.  Depending on the RangeQuery, just the creation
of a BooleanQuery that matches it is too heavyweight.

> Also, instead of BitSet we could use an interface:
>
>    public interface DocIdSet {
>      void add(int docId);
>      boolean contains(int docId);
>      int next(int docId);
>    }
>
> to permit sparse representations.

Definitely a DocIdSet.  It's called DocSet in my code and has a bitset
implementation and a compact implementation that's an int hash set
(unordered cause I just use it as a filter now).  Here is the basic
interface:

public interface DocSet {
  public int size();
  public boolean exists(int docid);
  public DocIterator iterator();
  public BitSet getBits();
  public long memSize();
  public DocSet intersection(DocSet other);
  public int intersectionSize(DocSet other);
  public DocSet union(DocSet other);
  public int unionSize(DocSet other);
}


I would separate out int next(int docId) into an iterator.  It may be
more efficient to iterate over certain structures if you can maintain
state about where you are (and this may even be true of a BitSet).

-Yonik

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: constant scoring queries

Paul Elschot
In reply to this post by Doug Cutting
On Tuesday 10 May 2005 22:39, Doug Cutting wrote:
> Background: In http://issues.apache.org/bugzilla/show_bug.cgi?id=34673,
> Yonik Seely proposes a ConstantScoreQuery, based on a Filter.  And in
> http://www.mail-archive.com/lucene-dev@.../msg08007.html 
> I proposed a mechanism to promote the use of Filters.  Through all of
> this, Paul Elshot has hinted that there might be a better way.

Well, I did not think of combining constant scores with
evaluating clauses one by one, but it makes for a nice fit
in mixing pure boolean as constant scoring and normally
scored queries.

See below for some more comments.
 
> Here's another proposal, tackling many of the same issues:
>
> 1. Add two methods to Query.java:
>
>    public boolean constantScoring();
>    public void constantScoring(boolean);
>
>    When constantScoring(), the boost() is the score for matches.

Since many types of queries can become constant scoring, and attribute
like this is better than a separate query class for constant scoring queries.

> 2. Add two methods to Searcher.java:
>
>    public BitSet cachedBitSet(Query) { return null; }
>    public void cacheBitSet(Query, BitSet) {}
>
>    IndexSearcher overrides these to maintain an LRU cache of bitsets.

A cache of filters might end up having different implementations
of filters to save memory, see also below.
 
> 3. Modify BooleanQuery so that, when constantScoring(), TooManyClauses
> is not thrown.

TooManyClauses in BooleanQuery works nicely, but
I prefer not to have a direct association between BooleanQuery and
TooManyClauses. Instead I'd rather have a factory for the term scorers
that use buffer space (TermScorer and SpanTermScorer) and have
this factory throw TooManyClauses. The factory would be passed to
the rewrite() method.
This is the approach taken in the Surround query language, which
has two different factories, but they could be easily combined.
 
> 4. Modify BooleanScorer to, if constantScoring(),
>    - check Searcher for a cached bitset
>    - failing that, create a bitset
>    - evaluate clauses serially, saving results in bitset

That would only be possible for optional clauses, ie. the ones
passed to a DisjunctionScorer in the trunk. For this case, the
DisjunctionScorer would need to be extended. Btw. this would
be even faster than any available boolean scorer since there is
no need to compare doc id's in any way during scoring.

>    - cache the bitset

Or a compact version of it as in
http://issues.apache.org/bugzilla/show_bug.cgi?id=32921
?

>    - use the bitset to handle doc(), next() and skipTo();

and implement boolean AND, OR and NOT when combining filters.

>
> 5. TermQuery and PhraseQuery could be similarly modified, so that, when
> constant scoring, bitsets are cached for very common terms (e.g., >5% of
> documents).
>
> With these changes, WildcardQuery, PrefixQuery, RangeQuery etc., when
> declared to be constant scoring, will operate much faster and never
> throw TooManyClauses.  We can add an option (the default?) to
> QueryParser to make these constant scoring.
>
> Also, instead of BitSet we could use an interface:
>
>    public interface DocIdSet {
>      void add(int docId);
>      boolean contains(int docId);
>      int next(int docId);
>    }
>
> to permit sparse representations.

The compact sparse representation referred to above stores
the differences between doc id's as VInt's, so it can only provide
an iterator over document numbers and not a random access as
in contains(int) above.

It needs only be used when it uses less memory than a bitset, which is the
case when it allows rougly less than 1/8 of the docs of the reader.
 
> Thoughts?

Yes, thanks for combining the constant score with the one by one approach :)

Regards,
Paul Elschot.
 


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: constant scoring queries

Paul Elschot
On Thursday 12 May 2005 17:11, Paul Elschot wrote:
> On Tuesday 10 May 2005 22:39, Doug Cutting wrote:
...
>  
> > Thoughts?
>
> Yes, thanks for combining the constant score with the one by one approach :)

That should have been:

Thanks for combining the constant score with the filters.

Regards,
Paul Elschot


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: constant scoring queries

Paul Elschot
In reply to this post by Yonik Seeley
On Wednesday 11 May 2005 04:42, Yonik Seeley wrote:
> Hey now... you're going to obsolete all my in-house code and put me
> out of a job ;-)

We all like cherry picking.

> Could you elaborate on the advantage of having say a TermQuery that
> could be either normal-scoring or constant-scoring vs two different
> Query classes for doing this?  They seem roughly equivalent.
>
>
> > 1. Add two methods to Query.java:
> >
> >    public boolean constantScoring();
> >    public void constantScoring(boolean);
> >
> >    When constantScoring(), the boost() is the score for matches.
>
> That seems fine.

.. with the is.. prefix as suggested by Mark Schreiber.
 

> > 2. Add two methods to Searcher.java:
> >
> >    public BitSet cachedBitSet(Query) { return null; }
> >    public void cacheBitSet(Query, BitSet) {}
> >
> >    IndexSearcher overrides these to maintain an LRU cache of bitsets.
>
> Yup, that's what I have.
> Things should be extensible and use a caching interface - the default
> implementation being an LRU cache, but users could use their own
> implementations to get LFU behavior or whatever.

To have the actual implementation of java.util.BitSet
in the interface is not really nice. The FilteredQuery here:
http://issues.apache.org/bugzilla/show_bug.cgi?id=32965
has two constructors, one for a Filter that provides a BitSet,
and one for SkipFilter that provides a DocNrSkipper, which
is an iterator that can skip (see below).
Making a DocNrSkipper out of a BitSet is no problem,
the underlying compact SortedVIntList has a constructor from a BitSet
(see the useSkipFilter method of this FilteredQuery).
 
> > 3. Modify BooleanQuery so that, when constantScoring(), TooManyClauses
> > is not thrown.
 
> This is good, but not sufficient for RangeQuery.  If
> RangeQuery.constantScoring(), then it should not rewrite to a
> BooleanQuery at all.  Depending on the RangeQuery, just the creation
> of a BooleanQuery that matches it is too heavyweight.

By itself, a BooleanQuery is no more than a collection of clauses.
For the case of a RangeQuery (and others with optional clauses),
the heavy weight can be avoided by iterating over the clauses one by
one. I think the idea is to have this iteration over clauses implemented in
one place, in BooleanQuery.
That means that a constant scoring query might also provide an iterator
over its clauses, ie. the ones that are to be OR'ed, ie subject to
disjunction.
 

> > Also, instead of BitSet we could use an interface:
> >
> >    public interface DocIdSet {
> >      void add(int docId);
> >      boolean contains(int docId);
> >      int next(int docId);
> >    }
> >
> > to permit sparse representations.
>
> Definitely a DocIdSet.  It's called DocSet in my code and has a bitset
> implementation and a compact implementation that's an int hash set
> (unordered cause I just use it as a filter now).  Here is the basic
> interface:
>
> public interface DocSet {
>   public int size();
>   public boolean exists(int docid);
>   public DocIterator iterator();
>   public BitSet getBits();
>   public long memSize();
>   public DocSet intersection(DocSet other);
>   public int intersectionSize(DocSet other);
>   public DocSet union(DocSet other);
>   public int unionSize(DocSet other);
> }

contains(docid) and exists(docid) cannot be efficiently implemented
on a VInt based compact filter, so I'd prefer to leave these out.
BitSet getBits() is also problematic, as it would need to create a BitSet.

The rest of this interface looks ok to me, and a negative intersection
for AND NOT may have to be added for excluded clauses.
 
> I would separate out int next(int docId) into an iterator.  It may be
> more efficient to iterate over certain structures if you can maintain
> state about where you are (and this may even be true of a BitSet).

Is the int next(int docId) is the same as the DocNrSkipper here:
http://issues.apache.org/bugzilla/show_bug.cgi?id=32921
ie. skipping over document number returned by the DocIterator above?
In DocNrSkipper the skipping can be suppressed by using currentDoc + 1
as argument, so these two might be combined into a single one, but
I don't know whether a hash based implementation would allow such
ordered access.

When I wrote the SkipFilter, I considered a hash table based implementation,
but I did not implement it because I thought it would use too much memory.
However, I did not check the memory usage of a java implementation of
a hashed document id set. Would you have some figures for memSize()
results of the DocSet above?

Anyway, the things above are converging nicely.
There are some further points for an implementation of cached query results:
- should the use of BitSet in public interfaces be deprecated in favour of
  a (skipping) document id iterator?
- the boolean (and,or,not) logic over the document id sets.
- the cache in IndexSearcher.

The boolean logic over doc id sets could be done in the corresponding
scorers (ConjunctionScorer for and, DisjunctionScorer for or, and
ReqExclScorer for not). Like Query, Scorer may have to be extended
with an isConstantScoring() method for this. The implementation of
DisjunctionScorer will probably need to be changed to use the iterator
over the clauses.

For the cache in IndexSearcher a LinkedHashMap seems good.
It can evict the cached least recently used filter(s) when the total memory
used by the cached filters exceeds a maximum.

This could also provide for changes in the underlying reader.
A callback interface (as mentioned by Robert Engels) similar to
updateQuery(Term t,BitSet newdocs);
could be used in for added documents, but when docs are deleted
this would not work well on the level of a reader. It might be
better to do this on Segments because these are immutable.
Associating a Query with a Segment is a far fetch at the moment,
though.


Regards,
Paul Elschot


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: constant scoring queries

Yonik Seeley
> To have the actual implementation of java.util.BitSet
> in the interface is not really nice.

Totally agree.

> The FilteredQuery here:
> http://issues.apache.org/bugzilla/show_bug.cgi?id=32965
> has two constructors, one for a Filter that provides a BitSet,
> and one for SkipFilter that provides a DocNrSkipper, which
> is an iterator that can skip (see below).
> Making a DocNrSkipper out of a BitSet is no problem,
> the underlying compact SortedVIntList has a constructor from a BitSet
> (see the useSkipFilter method of this FilteredQuery).
>
> > > 3. Modify BooleanQuery so that, when constantScoring(), TooManyClauses
> > > is not thrown.
>
> > This is good, but not sufficient for RangeQuery.  If
> > RangeQuery.constantScoring(), then it should not rewrite to a
> > BooleanQuery at all.  Depending on the RangeQuery, just the creation
> > of a BooleanQuery that matches it is too heavyweight.
>
> By itself, a BooleanQuery is no more than a collection of clauses.
> For the case of a RangeQuery (and others with optional clauses),
> the heavy weight can be avoided by iterating over the clauses one by
> one. I think the idea is to have this iteration over clauses implemented in
> one place, in BooleanQuery.

But for certain range queries you want to avoid even materializing
that collection of clauses (currently done in RangeQuery.rewrite())
since it can number in the millions.

> That means that a constant scoring query might also provide an iterator
> over its clauses, ie. the ones that are to be OR'ed, ie subject to
> disjunction.

Ahhh, I see...

> > > Also, instead of BitSet we could use an interface:
> > >
> > >    public interface DocIdSet {
> > >      void add(int docId);
> > >      boolean contains(int docId);
> > >      int next(int docId);
> > >    }
> > >
> > > to permit sparse representations.
> >
> > Definitely a DocIdSet.  It's called DocSet in my code and has a bitset
> > implementation and a compact implementation that's an int hash set
> > (unordered cause I just use it as a filter now).  Here is the basic
> > interface:
> >
> > public interface DocSet {
> >   public int size();
> >   public boolean exists(int docid);
> >   public DocIterator iterator();
> >   public BitSet getBits();
> >   public long memSize();
> >   public DocSet intersection(DocSet other);
> >   public int intersectionSize(DocSet other);
> >   public DocSet union(DocSet other);
> >   public int unionSize(DocSet other);
> > }
>
> contains(docid) and exists(docid) cannot be efficiently implemented
> on a VInt based compact filter, so I'd prefer to leave these out.

exists() on a BitSet is much faster than next() though...

> BitSet getBits() is also problematic, as it would need to create a BitSet.

Agree.  Any method like that should be specific to a subclass of
DocIdSet (the one that implements it as a BitSet).

> When I wrote the SkipFilter, I considered a hash table based implementation,
> but I did not implement it because I thought it would use too much memory.
> However, I did not check the memory usage of a java implementation of
> a hashed document id set. Would you have some figures for memSize()
> results of the DocSet above?

I use a power-of-two hash table with a load factor of .75.  So putting
500 docs in my hashset would take up 1024 slots at 4 bytes per slot
(4k).

Hashes also have big speed advantage over BitSets for iterating over
all docs or taking intersection sizes.  The hash also has fast random
access that docNrSkipper doesn't have.

If lucene goes to iterating over filters in order  (via docNrSkipper &
your new FilteredQuery), a hash implementation no longer makes sense.

> Anyway, the things above are converging nicely.
> There are some further points for an implementation of cached query results:
> - should the use of BitSet in public interfaces be deprecated in favour of
>   a (skipping) document id iterator?

BitSet as the only way to access a filter should be depracated.
However, *If* a filter implementation internally uses a BitSet, it
might be nice to be able to get access to it.

> - the boolean (and,or,not) logic over the document id sets.

Absolutely.  And functions that provide counts too (can be more
efficient than functions that have to materialize the resulting set).

> - the cache in IndexSearcher.
Yes!  But user extensible please...

> The boolean logic over doc id sets could be done in the corresponding
> scorers (ConjunctionScorer for and, DisjunctionScorer for or, and
> ReqExclScorer for not). Like Query, Scorer may have to be extended
> with an isConstantScoring() method for this. The implementation of
> DisjunctionScorer will probably need to be changed to use the iterator
> over the clauses.
>
> For the cache in IndexSearcher a LinkedHashMap seems good.
> It can evict the cached least recently used filter(s) when the total memory
> used by the cached filters exceeds a maximum.

That's what I use.

Will setConstantScore(true) on a query delegate that call to all
subqueries?  It would seem very useful to be able to easily switch an
entire query to constant scoring (for instance when a non-score sort
is specified and the user doesn't care about the score).

-Yonik

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: constant scoring queries

Paul Elschot
On Tuesday 17 May 2005 19:35, Yonik Seeley wrote:

> > To have the actual implementation of java.util.BitSet
> > in the interface is not really nice.
>
> Totally agree.
>
> > The FilteredQuery here:
> > http://issues.apache.org/bugzilla/show_bug.cgi?id=32965
> > has two constructors, one for a Filter that provides a BitSet,
> > and one for SkipFilter that provides a DocNrSkipper, which
> > is an iterator that can skip (see below).
> > Making a DocNrSkipper out of a BitSet is no problem,
> > the underlying compact SortedVIntList has a constructor from a BitSet
> > (see the useSkipFilter method of this FilteredQuery).
> >
> > > > 3. Modify BooleanQuery so that, when constantScoring(), TooManyClauses
> > > > is not thrown.
> >
> > > This is good, but not sufficient for RangeQuery.  If
> > > RangeQuery.constantScoring(), then it should not rewrite to a
> > > BooleanQuery at all.  Depending on the RangeQuery, just the creation
> > > of a BooleanQuery that matches it is too heavyweight.
> >
> > By itself, a BooleanQuery is no more than a collection of clauses.
> > For the case of a RangeQuery (and others with optional clauses),
> > the heavy weight can be avoided by iterating over the clauses one by
> > one. I think the idea is to have this iteration over clauses implemented
in

> > one place, in BooleanQuery.
>
> But for certain range queries you want to avoid even materializing
> that collection of clauses (currently done in RangeQuery.rewrite())
> since it can number in the millions.
>
> > That means that a constant scoring query might also provide an iterator
> > over its clauses, ie. the ones that are to be OR'ed, ie subject to
> > disjunction.
>
> Ahhh, I see...

I realize just now that this also means that a simple flag on a Query
will not be enough.

> > > > Also, instead of BitSet we could use an interface:
> > > >
> > > >    public interface DocIdSet {
> > > >      void add(int docId);
> > > >      boolean contains(int docId);
> > > >      int next(int docId);
> > > >    }
> > > >
> > > > to permit sparse representations.
> > >
> > > Definitely a DocIdSet.  It's called DocSet in my code and has a bitset
> > > implementation and a compact implementation that's an int hash set
> > > (unordered cause I just use it as a filter now).  Here is the basic
> > > interface:
> > >
> > > public interface DocSet {
> > >   public int size();
> > >   public boolean exists(int docid);
> > >   public DocIterator iterator();
> > >   public BitSet getBits();
> > >   public long memSize();
> > >   public DocSet intersection(DocSet other);
> > >   public int intersectionSize(DocSet other);
> > >   public DocSet union(DocSet other);
> > >   public int unionSize(DocSet other);
> > > }
> >
> > contains(docid) and exists(docid) cannot be efficiently implemented
> > on a VInt based compact filter, so I'd prefer to leave these out.
>
> exists() on a BitSet is much faster than next() though...

Yes, but the point is to iterate to the next document based in information
from RAM and to be able to skipTo() on the index instead of reading it
sequentially.
 
> > BitSet getBits() is also problematic, as it would need to create a BitSet.
>
> Agree.  Any method like that should be specific to a subclass of
> DocIdSet (the one that implements it as a BitSet).
>
> > When I wrote the SkipFilter, I considered a hash table based
implementation,
> > but I did not implement it because I thought it would use too much memory.
> > However, I did not check the memory usage of a java implementation of
> > a hashed document id set. Would you have some figures for memSize()
> > results of the DocSet above?
>
> I use a power-of-two hash table with a load factor of .75.  So putting
> 500 docs in my hashset would take up 1024 slots at 4 bytes per slot
> (4k).

So about 8 bytes per doc. A SortedVIntList normally has 1 byte per doc,
and never more than 4 bytes per doc (as long as doc numbers are int).
 
> Hashes also have big speed advantage over BitSets for iterating over
> all docs or taking intersection sizes.  The hash also has fast random
> access that docNrSkipper doesn't have.

Can it take determine the intersection size faster than iterating over
both sets with an intersecting merge?
 
> If lucene goes to iterating over filters in order  (via docNrSkipper &
> your new FilteredQuery), a hash implementation no longer makes sense.

The new BooleanScorer is based on iterating over documents in order.
There may be a performance advantage somewhere for
filters with hash implementations and I'd rather not miss that.
 
> > Anyway, the things above are converging nicely.
> > There are some further points for an implementation of cached query
results:
> > - should the use of BitSet in public interfaces be deprecated in favour of
> >   a (skipping) document id iterator?
>
> BitSet as the only way to access a filter should be depracated.
> However, *If* a filter implementation internally uses a BitSet, it
> might be nice to be able to get access to it.

An instanceof and a cast should be able to provide that.

> > - the boolean (and,or,not) logic over the document id sets.
>
> Absolutely.  And functions that provide counts too (can be more
> efficient than functions that have to materialize the resulting set).
>
> > - the cache in IndexSearcher.
> Yes!  But user extensible please...

I'd prefer to start with a simple implementation, deferring subclassing or
implementing an interface for later.
 
> > The boolean logic over doc id sets could be done in the corresponding
> > scorers (ConjunctionScorer for and, DisjunctionScorer for or, and
> > ReqExclScorer for not). Like Query, Scorer may have to be extended
> > with an isConstantScoring() method for this. The implementation of
> > DisjunctionScorer will probably need to be changed to use the iterator
> > over the clauses.

In case the filters are wrapped in a Scorer all the boolean logic is already
implemented in these scorers. DisjunctionScorer should probably target a
BitSet when constant scoring.
 
> > For the cache in IndexSearcher a LinkedHashMap seems good.
> > It can evict the cached least recently used filter(s) when the total
memory
> > used by the cached filters exceeds a maximum.
>
> That's what I use.
>
> Will setConstantScore(true) on a query delegate that call to all
> subqueries?  It would seem very useful to be able to easily switch an

As a minimum, setConstantScore would ensure that Scorer.score()
will not be called for the scorers of the subqueries.

> entire query to constant scoring (for instance when a non-score sort
> is specified and the user doesn't care about the score).

Sounds perfect to me, but I've never looked at the sorting code in depth.
Do the non-score sort methods use HitCollector?
That might be wasteful because it provides the score value for each doc.

As for implementing all of this constant scoring and filter caching,
I would like to make sure that the new BooleanScorer works close
to perfection before continuing.
In particular there are some patches for which I'd like to know
whether they are acceptable:
a patch for DisjunctionSumScorer:
http://issues.apache.org/bugzilla/show_bug.cgi?id=34193
and a split off of the coordination:
http://issues.apache.org/bugzilla/show_bug.cgi?id=34154

In fact, since this builds on the new scorer, I'd prefer to have that
used widely before continuing.
Also, I have no idea about the order in which order this constantScoring
and filter caching could be easily implemented. Any ideas?

Regards,
Paul Elschot


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: constant scoring queries

Yonik Seeley
> > > contains(docid) and exists(docid) cannot be efficiently implemented
> > > on a VInt based compact filter, so I'd prefer to leave these out.
> >
> > exists() on a BitSet is much faster than next() though...
>
> Yes, but the point is to iterate to the next document based in information
> from RAM and to be able to skipTo() on the index instead of reading it
> sequentially.

Well, yes, that's one point to filters (and probably the main use).
Another that we are using is to enable fast intersection of two
filters you already have in memory.


> > I use a power-of-two hash table with a load factor of .75.  So putting
> > 500 docs in my hashset would take up 1024 slots at 4 bytes per slot
> > (4k).
>
> So about 8 bytes per doc. A SortedVIntList normally has 1 byte per doc,
> and never more than 4 bytes per doc (as long as doc numbers are int).

It depends on your platform and what tradeoffs you want too... our
production boxes all have 16G RAM.

> > Hashes also have big speed advantage over BitSets for iterating over
> > all docs or taking intersection sizes.  The hash also has fast random
> > access that docNrSkipper doesn't have.
>
> Can it take determine the intersection size faster than iterating over
> both sets with an intersecting merge?

In many cases, yes.
Consider the example of taking the intersection since of a small set
with a large.  You need fast random access (or a fast skip) on the
large set.

> > entire query to constant scoring (for instance when a non-score sort
> > is specified and the user doesn't care about the score).
>
> Sounds perfect to me, but I've never looked at the sorting code in depth.
> Do the non-score sort methods use HitCollector?
> That might be wasteful because it provides the score value for each doc.

Yes, but sometimes the score may still be desired even when sorting by
another field... so it should be configurable on a per-search basis or
per-query basis somehow.

-Yonik

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: constant scoring queries

Paul Elschot
On Wednesday 18 May 2005 23:06, Yonik Seeley wrote:

> > > > contains(docid) and exists(docid) cannot be efficiently implemented
> > > > on a VInt based compact filter, so I'd prefer to leave these out.
> > >
> > > exists() on a BitSet is much faster than next() though...
> >
> > Yes, but the point is to iterate to the next document based in information
> > from RAM and to be able to skipTo() on the index instead of reading it
> > sequentially.
>
> Well, yes, that's one point to filters (and probably the main use).
> Another that we are using is to enable fast intersection of two
> filters you already have in memory.
>
>
> > > I use a power-of-two hash table with a load factor of .75.  So putting
> > > 500 docs in my hashset would take up 1024 slots at 4 bytes per slot
> > > (4k).
> >
> > So about 8 bytes per doc. A SortedVIntList normally has 1 byte per doc,
> > and never more than 4 bytes per doc (as long as doc numbers are int).
>
> It depends on your platform and what tradeoffs you want too... our
> production boxes all have 16G RAM.
>
> > > Hashes also have big speed advantage over BitSets for iterating over
> > > all docs or taking intersection sizes.  The hash also has fast random
> > > access that docNrSkipper doesn't have.
> >
> > Can it take determine the intersection size faster than iterating over
> > both sets with an intersecting merge?
>
> In many cases, yes.
> Consider the example of taking the intersection since of a small set
> with a large.  You need fast random access (or a fast skip) on the
> large set.

Well, a weak spot in the SortVIntList is that it does not have a fast skip.
This forces the skipTo method in the posted filtered query to use
a linear search.
More inspiration from the Lucene index data structures could be used
to build in the forwarding information to allow a faster skipTo
on the compact filter itself. The linear search is here:
http://issues.apache.org/bugzilla/show_bug.cgi?id=32921
in the nextDocNr(docNr) method of the DocNrSkipper returned
from SortedVIntList.getDocNrSkipper() .

Regards,
Paul Elschot.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]