Re: OpenBitSet

Re: OpenBitSet

Paul Elschot
With a copy to java-dev, I suppose none of you mind...

On Friday 12 May 2006 19:40, Yonik Seeley wrote:

> On 5/12/06, Doug Cutting <[hidden email]> wrote:
> > Yonik Seeley wrote:
> > > So the first step is to decide if we should migrate to this, and if
> > > so, where it belongs.
> > > - lucene.util?  BitSet is hard-coded into Lucene in enough places that
> > > I don't know if it would be useful to people there or not.
> > > - solr.util?
> > >
> > > The next step would be to actually use it... replacing BitSet with
> > > OpenBitSet in BitDocSet (an alternative would be to create another
> > > DocSet type, but that gets more complicated).
> >
> > Shouldn't we really replace BitSet in Lucene with an interface that
> > OpenBitSet & others implement?
>
> It depends on what the goal is and what the interface would cover.
> It would be useful to have very restrictive small interfaces that do
> specific things, and implementations of these interfaces can wrap
> different underlying data structures.
>
> For example, there's DocNrSkipper for filtering a query:
> http://issues.apache.org/jira/browse/LUCENE-328
> BitSetSortedIntList wraps a BitSet and implements DocNrSkipper.

Is there also a nextSetBit(bitNr) somewhere on http://www.hackersdelight.org ?
This method is essential for filtering a query search.

>
> We could also have an interface for the creation side...
> SequentialIntListCreator where ids must be added in order,
> or RandomAccessIntListCreator where ids may be added in any order.
>
> But I don't see OpenBitSet implementing any of these interfaces
> directly, but instead being used as an underlying store for certain
> implementations.
>
> Did you have something else in mind?
>
> > This has been raised many times, that
> > Filters should return something that implements an interface, not a
> > BitSet.
>
> +1 on that...
>
> > Doing this back-compatibly will be a bit of a pain, but I think
> > the effort is warranted.

A simple way would be to deprecate the methods in IndexSearcher that
take a Filter, recommend using FilteredQuery instead, and add
a constructor to FilteredQuery that takes a SkipFilter.
The full version of FilteredQuery posted here has that:
http://issues.apache.org/jira/browse/LUCENE-330

>
> Disallowing the non-skipping BooleanScorer would allow use of SkipFilters.

I think the non-skipping BooleanScorer is only useful for top-level
query search on disjunctions, without any filtering.
Then non-skipping and filtering never occur together, and a SkipFilter
could always be used instead of a Filter.
Also, an implementation of search(HitCollector) for BooleanScorer2
using the BooleanScorer non-skipping implementation would fit nicely,
but that is not straightforward at the moment.

Regards,
Paul Elschot


Re: OpenBitSet

Yonik Seeley
In reply to this post by Paul Elschot
> Is there also a nextSetBit(bitNr) somewhere on http://www.hackersdelight.org ?
> This method is essential for filtering a query search.

They have some algorithms for ntz (number of trailing zeros) on a
single int value.  That's the harder part.  Using ntz to implement
nextSetBit over an int or an array of ints is easier and thus not covered.

For OpenBitSet.nextSetBit(bitNr) I ended up coming up with my own
ntz() using a combination of table lookup and single-level binary
search followed by linear search, which turned out to be fastest for
implementing nextSetBit().
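
A minimal sketch of the overall shape of such a method, with the JDK's
Long.numberOfTrailingZeros standing in for the hand-tuned ntz() described
above (the actual OpenBitSet code is attached to SOLR-15):

// Sketch only: nextSetBit over a long[] backing array.
public static int nextSetBit(long[] bits, int index) {
    int word = index >> 6;              // which 64-bit word holds 'index'
    if (word >= bits.length) return -1;
    long w = bits[word] >>> index;      // drop bits below 'index' (shift count is mod 64)
    if (w != 0) return index + Long.numberOfTrailingZeros(w);
    while (++word < bits.length) {      // scan forward for the next non-zero word
        w = bits[word];
        if (w != 0) return (word << 6) + Long.numberOfTrailingZeros(w);
    }
    return -1;                          // no set bit at or after 'index'
}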


-Yonik


Re: OpenBitSet

Yonik Seeley
Oh, and nextSetBit() was 46% faster (at least on my box at home,
which I developed on, and hence what this stuff is tuned for).

-Yonik

On 5/12/06, Yonik Seeley <[hidden email]> wrote:

> > Is there also a nextSetBit(bitNr) somewhere on http://www.hackersdelight.org ?
> > This method is essential for filtering a query search.
>
> They have some algorithms for ntz (number of trailing zeros) for a
> single int value.  That's the harder part.  Using ntz to implement
> nextSetBit in an int or array of ints is easier and thus not covered.
>
> For OpenBitSet.nextSetBit(bitNr) I ended up coming up with my own
> ntz() using a combination of table lookup and single level binary
> search followed by linear search, which turned out to be fastest for
> implementing nextSetBit()


Re: OpenBitSet

Yonik Seeley
Code is here for those interested:
http://issues.apache.org/jira/browse/SOLR-15

-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search server


Re: OpenBitSet

Yonik Seeley
In reply to this post by Yonik Seeley
ntz8 or ntz8a could possibly be faster than what I have now for low
density bit sets:
http://www.hackersdelight.org/HDcode/ntz.cc

I don't know how to expand those to 64 bit, but they could always be
used on the two 32 bit chunks I guess.  Anyway, for higher density bit
sets, my current implementation should be faster.
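
A sketch of that "two 32-bit chunks" fallback, with the JDK's
Integer.numberOfTrailingZeros standing in for ntz8/ntz8a:

static int ntz64(long x) {
    int lo = (int) x;                        // low 32 bits
    if (lo != 0) return Integer.numberOfTrailingZeros(lo);
    int hi = (int) (x >>> 32);               // high 32 bits
    if (hi != 0) return 32 + Integer.numberOfTrailingZeros(hi);
    return 64;                               // x == 0
}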

-Yonik

On 5/12/06, Yonik Seeley <[hidden email]> wrote:

> > Is there also a nextSetBit(bitNr) somewhere on http://www.hackersdelight.org ?
> > This method is essential for filtering a query search.
>
> They have some algorithms for ntz (number of trailing zeros) for a
> single int value.  That's the harder part.  Using ntz to implement
> nextSetBit in an int or array of ints is easier and thus not covered.
>
> For OpenBitSet.nextSetBit(bitNr) I ended up coming up with my own
> ntz() using a combination of table lookup and single level binary
> search followed by linear search, which turned out to be fastest for
> implementing nextSetBit()


Re: OpenBitSet

eks dev
In reply to this post by Yonik Seeley
It is faster than BitSet, even against Mustang. The numbers are a bit less than on Yonik's HW, but quite convincing. I did a small test on my XP notebook (Pentium M 1.6GHz). Only the "union" test is some 20% slower at 8 million bits with 80k set. I did not dig deeper.

For what it is worth, my proposal would be to hide all usages of BitSet (there is also one BitVector around?) inside Lucene behind appropriate interfaces, in order to leverage pearls like this one. Actually, it should not be too complicated; the only ugly back-compatibility issue I could identify is in Filter, and for that Yonik, Hoss and Paul have already made rather acceptable extend/deprecate plans.

Maybe a separate package for the various BitSet / IntegerSet implementations would not be such a bad idea, as there is no single best implementation. Just to remind everyone what we have around:

- BitSet (OpenBitSet as a faster variant of more or less the same)
- HashIntegerSet
- IntArraySortedIntList
- VIntSortedIntList
- TreeBasedIntSet (I am trying to make my own implementation inspired by http://www.iis.uni-stuttgart.de/intset/ )
- More?

I could easily imagine putting all of these together behind simple interfaces and providing some factories and a few utility methods with a bit of intelligence. For example, a CheekyBitSet that collects document ids in some small int[] and keeps them there if they fit, and if not, switches automatically to some compact representation (a sketch follows below)...

The implementations above cover practically all important cases for caching/filtering and bring some extra speed/memory for other uses:

- OpenBitSet (the reference, "general workhorse"): fast random access, does not grow, high memory demands for sparse bit sets, reasonably fast iterator over set bits.
- HashIntegerSet: slower random access, very fast iterator possible, low memory demands on very sparse bit sets.
- IntArraySortedIntList: slow random access, extremely fast iterator over set bits, low memory demands on sparse bit sets, fast to populate when elements are added in sorted order...
- VIntSortedIntList: almost the same as IntArraySortedIntList, but random access is almost impossible; memory demands are very low, very useful for moderately sparse bit sets.
- TreeBasedIntSet: the best general performance for sparse bit sets, grows dynamically, exploits runs of clear/set bits to reduce memory demands. Extremely fast iterator over set bits for low- to medium-density bit sets, acceptable random access.

Considering the Zipf distribution curse in almost all cases I have ever seen (except for "pure" filtering fields like language, categories...), doing new BitSet(MAX_DOCUMENT) looks like pure luxury (e.g. in ChainedFilter). As you all know, the big majority of bit vectors met in practice are very to modestly sparse due to Zipf...

Did I say it before: this code Yonik wrote is way above my head. I just tried to summarize a few thoughts from a "library user" perspective, as it seems to me that caching and this bit twiddling are the next major performance/scalability milestone for Lucene.
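
A minimal sketch of that CheekyBitSet idea (names and the threshold are
illustrative, not a real API):

import java.util.BitSet;

// Collect doc ids in a small int[] and only upgrade to a real bit set
// when the array overflows.
class CheekyDocSet {
    private static final int THRESHOLD = 32;
    private int[] small = new int[THRESHOLD];
    private int count = 0;
    private BitSet big = null;          // non-null once upgraded

    void add(int doc) {
        if (big != null) { big.set(doc); return; }
        if (count < THRESHOLD) { small[count++] = doc; return; }
        big = new BitSet();             // overflow: switch representations
        for (int i = 0; i < count; i++) big.set(small[i]);
        big.set(doc);
        small = null;
    }

    boolean contains(int doc) {
        if (big != null) return big.get(doc);
        for (int i = 0; i < count; i++) if (small[i] == doc) return true;
        return false;
    }
}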
 

----- Original Message ----
From: Yonik Seeley <[hidden email]>
To: [hidden email]
Cc: [hidden email]
Sent: Friday, 12 May, 2006 10:29:24 PM
Subject: Re: OpenBitSet

Code is here for those interested:
http://issues.apache.org/jira/browse/SOLR-15

-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search server


Re: OpenBitSet

Yonik Seeley
On 5/14/06, eks dev <[hidden email]> wrote:
>   It is faster than BitSet, even against Mustang. The numbers are a bit less than on Yonik's HW, but quite convincing.

The level of outperformance isn't quite as high on my work box, I
think because my home machine has higher memory bandwidth (both are
P4s, but mine has dual-channel PC3200).

> I did a small test on my XP notebook (Pentium M 1.6GHz). Only the "union" test is some 20% slower at 8 million bits with 80k set. I did not dig deeper.

Weird... I'm not sure how that could be.  Are you sure you didn't get
the numbers reversed?
I just tried 1.6, and bitset/openbitset = 1.26 for me.
Are any memory controllers optimized for forward streaming more than
reverse?  My union loop counts down to zero, which is often faster
since the register status flags are already set as the result of the
decrement operation (hence avoiding an additional compare instruction
on most processor architectures, including x86).
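
Schematically, the count-down union loop looks like this (a sketch, not
the actual OpenBitSet code):

static void union(long[] a, long[] b) {
    // OR b into a; assumes a.length >= b.length
    for (int i = b.length - 1; i >= 0; i--) {
        a[i] |= b[i];
    }
}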

$ c:/opt/jdk16/bin/java -version
java version "1.6.0-beta2"
Java(TM) SE Runtime Environment (build 1.6.0-beta2-b83)
Java HotSpot(TM) Client VM (build 1.6.0-beta2-b83, mixed mode, sharing)

Yonik@spidey /cygdrive/f/code/solr/classes
$ c:/opt/jdk16/bin/java -server -Xbatch org.apache.solr.util.BitSetPerf 1000000
 50 10000 union 3000 bit
ret=0
TIME=13203

Yonik@spidey /cygdrive/f/code/solr/classes
$ c:/opt/jdk16/bin/java -server -Xbatch org.apache.solr.util.BitSetPerf 1000000
 50 10000 union 3000 open
ret=0
TIME=10406

> As you all know, the big majority of bit vectors met in practice are
> very to modestly sparse due to Zipf...

Yeah, that's why I concentrated on performance of the dense bit
sets... when the set is sparse, bit sets take up too much room and
something else should be used anyway.

-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search server


Re: OpenBitSet

eks dev

> Weird... I'm not sure how that could be.  Are you sure you didn't get
> the numbers reversed?

That is exactly what happened, sorry for the wrong numbers. Now it looks as it should:

java -version
Java(TM) SE Runtime Environment (build 1.6.0-beta2-b83)
Java HotSpot(TM) Client VM (build 1.6.0-beta2-b83, mixed mode, sharing)

java -server -Xbatch org.apache.solr.util.BitSetPerf 1000000 50 10000 union 3000 bit
ret=0
TIME=21966

java -server -Xbatch org.apache.solr.util.BitSetPerf 1000000 50 10000 union 3000 open
ret=0
TIME=19832

I also measured different densities, and it looks about the same. When I find a few spare minutes I will make one PerfTest that generates gnuplot diagrams. It would be interesting to see how all key methods behave as a function of density/size.

Re: OpenBitSet

Chris Hostetter-3

: I also measured different densities, and it looks about the same.
: When I find a few spare minutes I will make one PerfTest that generates
: gnuplot diagrams. It would be interesting to see how all key methods behave
: as a function of density/size.

I was thinking the same thing ... i just haven't had time to play with it.

It might also be useful to check how the distribution of the set bits
affects things -- i suspect that for some "Filters" there is some amount of
clustering, as many people index their documents in a particular order, and
then filter on ranges of that order (ie: index documents as they are
created, and then filter on create date) ... using
Random.nextGaussian() to pick which bits to set might be interesting.
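
A quick sketch of that idea (the cluster center and spread are arbitrary):

import java.util.BitSet;
import java.util.Random;

// Set 'count' bits clustered around the middle of the set, to mimic
// filters over ranges of insertion order. Assumes count is small
// relative to the cluster's reachable range.
static BitSet gaussianBits(int size, int count, long seed) {
    Random r = new Random(seed);
    BitSet bs = new BitSet(size);
    double center = size * 0.5, spread = size * 0.1;
    int set = 0;
    while (set < count) {
        int bit = (int) (center + r.nextGaussian() * spread);
        if (bit >= 0 && bit < size && !bs.get(bit)) {
            bs.set(bit);
            set++;
        }
    }
    return bs;
}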



-Hoss



Re: OpenBitSet

eks dev
Yeah, good hint. We actually made such measurements on the TreeIntegerSet implementation, and it is totally astonishing what you get as a result (I remember 6MB against 2k memory consumption for "predominantly sorted bit vectors" like zip codes, and conjunction/disjunction speed an order of magnitude faster, as it walks a shallow tree in that case). If you have any possibility to sort your indexes, do so; even the Lucene on-disk representation appreciates this, I guess (skips are faster, bit vectors on disk better compressed/decompressed?).

We even made one small visualizer of bit vectors that generates an image of the HitCollector results for any specified query (a gray image where every pixel represents 8-32 successive bits from the bit vector; higher density => darker color). I like to see the enemy first.

While we are already in this area, just a curiosity: a friend of mine has one head-spinning idea, to utilize graphics card HW to do super fast bit vector operations. These thingies today are really optimized for basic bit operations. I am just curious to see what he comes up with.

I hope I will have some time next week or so to polish some tests for OpenBitSet a bit and drop them somewhere on Jira if anybody has interest to play with them.

A bit off topic: is there anybody who is doing a ChainedFilter version that uses DocNrSkipper? As I recall, you wrote the BitSet version :)
 
----- Original Message ----
From: Chris Hostetter <[hidden email]>
To: [hidden email]; eks dev <[hidden email]>
Sent: Tuesday, 16 May, 2006 8:13:53 PM
Subject: Re: OpenBitSet


: I also measured different densities, and it looks about the same.
: When I find a few spare minutes I will make one PerfTest that generates
: gnuplot diagrams. It would be interesting to see how all key methods behave
: as a function of density/size.

I was thinking the same thing ... i just haven't had time to play with it.

It might also be useful to check how the distribution of the set bits
affects things -- i suspect that for some "Filters" there is some amount of
clustering, as many people index their documents in a particular order, and
then filter on ranges of that order (ie: index documents as they are
created, and then filter on create date) ... using
Random.nextGaussian() to pick which bits to set might be interesting.



-Hoss



trivial util to Visualize BitSets (Query results actually)

eks dev
In reply to this post by eks dev
Maybe there are some more people that like to see bits. Feel free to do whatever you like with it.

The idea is simple: map 8 bits from the HitCollector to one pixel by changing gray levels. The implementation is quick 'n dirty, but it does the job.

/**
 * Copyright 2004 The Apache Software Foundation
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

import javax.imageio.ImageIO;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.HitCollector;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Searcher;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

import java.awt.image.*;
import java.io.File;
import java.io.IOException;
import java.util.Arrays;
import java.util.BitSet;

/*
 * Generates a .png image of the search result. Maps bits from the result
 * sequentially from left to right until the specified image width is filled,
 * then continues from the beginning of the next line.
 *
 * Maps 8 bits to one pixel by changing levels of gray, e.g.
 * WHITE PIXEL: 0 of eight bits set
 * GRAY PIXEL: some bits set (more set => darker)
 * BLACK PIXEL: all 8 bits set
 *
 * main takes 3 params: index path, query, and optional imageWidth
 */
public class VisualizeBitVector {

    // Parse the arguments, run the query, and write the image to queryBits.png.
    public static void main(String[] args) throws IOException, ParseException {
        Img fImg;
        Directory dir;
        IndexReader reader;

        fImg = new Img();
        if (args.length < 2) {
            System.out.println("VisualizeBitVector index_path query [imageWidth]");
            System.exit(-1);
        }

        String path = args[0];
        String query = args[1];
       
        int imageWidth = 800;
        if(args.length > 2){
            imageWidth = Integer.parseInt(args[2]);
        }

        System.getProperties().setProperty(
                "org.apache.lucene.FSDirectory.class",
                "org.apache.lucene.store.MMapDirectory");
        dir = FSDirectory.getDirectory(path, false);
        reader = IndexReader.open(dir);

        BufferedImage img;

        img = fImg.makeImage(reader, query, imageWidth);
        File file = new File("queryBits.png");
        ImageIO.write(img, "png", file);

    }

}


class Img {
    BitSet bs;

    BufferedImage fBufferedImage = null;

    int fWidth = 0, fHeight = 0;

    byte[] fPixels = null;

    /** Build a BufferedImage from a pixel array. *
     * @throws ParseException */

    public BufferedImage makeImage(IndexReader reader, String query, int imageWidth)
            throws IOException, ParseException {

        int maxDocs = reader.maxDoc();

        fWidth = imageWidth;
        fHeight = (maxDocs / 8 / fWidth) + 1;
        fPixels = new byte[fWidth * fHeight];

        // start all white: no bits set yet
        Arrays.fill(fPixels, (byte) 0xFF);

        Searcher s = new IndexSearcher(reader);
        bs = new BitSet(maxDocs);

        HitCollector hc = new HitCollector() {
            public void collect(int doc, float score) {
                bs.set(doc);
            }
        };

        Analyzer analyzer = new StandardAnalyzer();
        QueryParser parser = new QueryParser("", analyzer);
        Query q = parser.parse(query);

        s.search(q, null, hc);

        // The interesting part: darken the pixel of each set bit by halving its gray level
        for (int i = bs.nextSetBit(0); i >= 0; i = bs.nextSetBit(i + 1)) {
            int pos = i >> 3;
            fPixels[pos] = (byte) ((fPixels[pos] & 0xFF) >>> 1);
        }


        s.close();
        reader.close();

        // Create a BufferedImage of the gray values in bytes.
        fBufferedImage = new BufferedImage(fWidth, fHeight,
                BufferedImage.TYPE_BYTE_GRAY);

        // Get the writable raster so that data can be changed.
        WritableRaster wr = fBufferedImage.getRaster();

        // Now write the byte data to the raster
        wr.setDataElements(0, 0, fWidth, fHeight, fPixels);
        return fBufferedImage;
    }

}

Re: trivial util to Visualize BitSets (Query results actually)

mark harwood
I added something similar to Luke but without the
colour intensity - I may add your code in to do this.
Another Luke plugin I have visualizes "vocabulary
growth" for a field as a chart over time. This is
useful to see if a field has "matured" or is still
accumulating new terms.
A Zipf term distribution would be another good
visualization.

--- eks dev <[hidden email]> wrote:

> Maybe there are some more people that like to see
> bits. Feel free to do whatever you like with it.
>
> The idea is simple: map 8 bits from the HitCollector to one
> pixel by changing gray levels.
>
> [VisualizeBitVector code quoted in full; snipped, see the previous message]

Re: trivial util to Visualize BitSets (Query results actually)

eks dev
Right, Luke rocks. Such things are best placed in Luke indeed...

The image gets just too big without the intensity trick; for really large bit vectors I had to use 32 bits per pixel.

How can I see these plugins? Is it somewhere on http://www.getopt.org/luke/ ?

----- Original Message ----
From: mark harwood <[hidden email]>
To: [hidden email]; eks dev <[hidden email]>
Sent: Wednesday, 31 May, 2006 5:12:06 PM
Subject: Re: trivial util to Visualize BitSets (Query results actually)

I added something similar to Luke but without the
colour intensity - I may add your code in to do this.
Another Luke plugin I have visualizes "vocabulary
growth" for a field as a chart over time. This is
useful to see if a field is "matured" or is still
accumulating new terms.
A Zipf term distribution would be another good
visualization.

--- eks dev <[hidden email]> wrote:

> Maybe there are some more people that like to see
> bits. Feel free to do whatever you like with it.
>
> [VisualizeBitVector code quoted in full; snipped, see the earlier message]

Lexicon access questions

eks dev

We have faced the following use case:

In order to optimize performance and, more importantly, the quality of search results, we are forced to attach more attributes to particular words (Terms). Generic attributes like TF and IDF are useful for modelling our "similarity" only up to some level.

Examples:
1. Is a Term a first or last name? (We have a comprehensive list of such words.) This enables us to make smarter (faster and better) queries in case someone has multiple first names, and it influences ranking...
2. The agreement weight and disagreement weight of some words are modelled differently.
3. Semantic classes of words influence ranking (whether something is a verb or a noun changes search strategy and ranking radically).

On top of that, we can afford to load all terms into memory, in order to allow fast string-distance calculations and some limited pattern matching using some strange Tries.

Today we solve these things by implementing totally redundant data structures that keep some kind of map Term->ValuesObject, which is redundant with Lucene's lexicon storage. Instead of "one access gets all", we access terms twice using two different access paths: once through our dictionary and a second time implicitly via the Query or so. So we introduce performance/memory penalties. (Please do not forget, we need to access a copy of the analyzed document in order to attach the "additional info" to Terms.)

I guess we are not the only ones to face such a case, as an increase in precision above TF/IDF can only be achieved by introducing some "domain semantics" where available. For this, "attaching" domain-specific info to a Term would be the perfect solution. Also, enabling pluggable implementations of lexicon access would give us some flexibility (e.g. the implementation in mg4j goes in that direction).

Could somebody imagine a 2.x version of Lucene having some interface with a clear contract that would let us plug in our own implementation for accessing the lexicon?

Or even better, some hints on how I can do it today :)
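
To make the ask concrete, here is a purely hypothetical sketch of such a
contract; none of these types exist in Lucene:

interface Lexicon {
    // one lookup returns both what Lucene needs and the application's payload
    Entry lookup(String field, String termText);   // null if the term is absent

    interface Entry {
        int docFreq();            // for scoring
        long postingsPointer();   // where Lucene finds the postings
        Object userData();        // domain-specific attributes (semantic class, weights, ...)
    }
}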





Re: Lexicon access questions

Chuck Williams-2
This approach comes to mind. You could model your semantic tags as
tokens and index them at the same positions as the words or phrases to
which they apply. This is particularly easy if you can integrate your
taggers with your Analyzer.

You would probably want to create one or
more new Query subclasses to facilitate certain types of matching,
making it easy to associate terms/phrases with different tags (e.g.,
OverlappingQuery). This approach would support generation of queries
that are tag-dependent, but would not directly help with using tags in a
ranking algorithm for tag-independent queries.

As an off-hand thought,
you might be able to extend the idea to support this by naming your tags
something like TERM_TAG, where TERM is the term they apply to (best if
the character used for '_' cannot occur in any term). Then something
like a TaggedTermQuery could easily find the tags relevant to a term in
the query and iterate their docs/positions in parallel with those of the
term (roughly equivalent to OverlappingQuery(term, PrefixQuery(term_*))).
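
A rough sketch of the same-position tagging idea (TagInjectingFilter and
lookupTag are made-up names; the Lucene 1.9-era TokenStream API is assumed):

import java.io.IOException;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

class TagInjectingFilter extends TokenFilter {
    private Token pendingTag = null;            // tag queued for the next call

    TagInjectingFilter(TokenStream in) { super(in); }

    public Token next() throws IOException {
        if (pendingTag != null) {               // emit the queued tag first
            Token t = pendingTag;
            pendingTag = null;
            return t;
        }
        Token t = input.next();
        if (t == null) return null;
        String tag = lookupTag(t.termText());   // hypothetical dictionary lookup
        if (tag != null) {
            pendingTag = new Token(tag, t.startOffset(), t.endOffset());
            pendingTag.setPositionIncrement(0); // same position as the word
        }
        return t;
    }

    private String lookupTag(String term) {
        // stand-in for a real tagger; e.g. mark known first names
        return "john".equals(term) ? "FIRSTNAME" : null;
    }
}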

Top-of-mind thoughts,

Chuck


eks dev wrote on 06/01/2006 12:10 AM:

> We have faced the following use case:
> [rest of the message quoted in full; snipped, see above]

Re: Lexicon access questions

eks dev
In reply to this post by eks dev
Thanks Chuck,

I have to try it with an example (or two).

Use case one:

Documents:
D1 == "John Doe"
D2 == "sky scraper"
D3 == "blue sky LTD"

Imagine the name "John" is ultra frequent => low IDF weight,
and "sky" is super low frequency => very high weight.

So the query
Q: "sky john"

will give the order:
D2, D3, D1

Also imagine I know (external knowledge) that "John" is a personal name and its "importance" in the Similarity calculus should be corrected by some boost due to this fact.

So, what I do today is to look this token up in some dictionary Map where I have a boost attached to it (reformulating the query to "sky john^250").
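
That query-side workaround looks roughly like this (a sketch; 'boosts'
plays the role of the external dictionary):

import java.util.Map;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;

// Look up the externally known boost for each word and reformulate
// the query with it.
static BooleanQuery boostedQuery(String field, String[] words,
                                 Map<String, Float> boosts) {
    BooleanQuery q = new BooleanQuery();
    for (String word : words) {
        TermQuery tq = new TermQuery(new Term(field, word));
        Float boost = boosts.get(word);     // external knowledge, e.g. "john" -> 250f
        if (boost != null) tq.setBoost(boost.floatValue());
        q.add(tq, BooleanClause.Occur.SHOULD);
    }
    return q;
}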

What I was proposing is to be able to attach this boost (practically an IDF correction for some tokens) during indexing. With that, I could spare one lookup in the memory-hungry dictionary as well as the reformulation of the query.
This example is just an introduction to the idea; it is over-simplified and can be solved by indexing the same token many times at the same position. With that possible, things like SweetSpotSimilarity could be done as an optional offline task (adjusting the IDF curve).

The second "problem", storing semantic TAGS per token, looks definitely doable with your proposal, but I am having problems comprehending all the tricky details (performance impact and expressive power), as I have never tried those parts of Lucene.
The question: when we access a Term from the lexicon anyhow for searching purposes (postings offset, freq), would it not be faster to have this TAG info attached to the Term?

The third issue I briefly mentioned. The use case where the lexicon can be loaded completely into memory (not an unusual case these days) gives us some room to play with FuzzyQuery and make it really useful in terms of speed. I guess there could also be other implementations that work on disk as well.

We currently deal with a collection of ca. 50 million docs (short documents), and all terms fit nicely in memory in a TernarySearchTree that allows us to issue term lookups like "give me all Terms that have at most N edits"; then we run our hand-tuned Needleman-Wunsch (with different costs for substitutions, as in "hitec" vs "hitek")... I would say a nice feature for people with reasonably sized collections. A better way of doing it would be the possibility for our dictionary implementation to implement a Lucene "Lexicon" interface which would provide Lucene with the postings offset or whatever else it needs when you search for a Term.

Lucene today is great; this here is just "could we do better", not "can someone scratch my itch".
----- Original Message ----
From: Chuck Williams <[hidden email]>
To: [hidden email]
Sent: Thursday, 1 June, 2006 7:05:27 PM
Subject: Re: Lexicon access questions

This approach comes to mind. You could model your semantic tags as
tokens and index them at the same positions as the words or phrases to
which they apply. [rest of Chuck's reply quoted in full; snipped, see above]

Re: OpenBitSet

Yonik Seeley
In reply to this post by eks dev
FYI, I updated the bug with Opteron performance numbers.
More in line with what I originally expected - the intersection count
functions are the true standouts, and what you care about for faceted
browsing. Anything else is gravy.

http://issues.apache.org/jira/browse/SOLR-15
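
Schematically, this is what makes an intersection count so cheap (a sketch,
with Long.bitCount standing in for a hand-rolled popcount):

static long intersectionCount(long[] a, long[] b) {
    int n = Math.min(a.length, b.length);
    long count = 0;
    for (int i = n - 1; i >= 0; i--) {
        count += Long.bitCount(a[i] & b[i]);  // AND the words and count bits; never materialize the result set
    }
    return count;
}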

-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search server
