Bitwise Operations on Integer Fields in Lucene and Solr Index

Israel Ekpo

Hello Lucene and Solr Community

I have a custom org.apache.lucene.search.Filter that I would like to contribute to the Lucene and Solr projects.

So I would need some direction as to how to create an ISSUE or submit a patch.

It looks like there have been changes to the way this is done since the latest merge of the two projects (Lucene and Solr).

Recently, some Solr users have been looking for a way to perform bitwise operations between an integer value and some fields in the index.

So, I wrote a Solr QParser plugin to do this using a custom Lucene Filter.

This package makes it possible to filter results returned from a query based on the results of a bitwise operation on an integer field in the documents returned from the pre-constructed query.

You can perform three basic types of operations on these integer fields

    * BitwiseOperation.BITWISE_AND (bitwise AND)
    * BitwiseOperation.BITWISE_OR (bitwise inclusive OR)
    * BitwiseOperation.BITWISE_XOR (bitwise exclusive OR)

You can also negate the results of these operations.

For example, imagine there is an integer field in the index named "flags" with a value of 8 (1000 in binary). The following results would be expected:

   1. A source value of 8 will match during a BitwiseOperation.BITWISE_AND operation, with negate set to false.
   2. A source value of 4 will match during a BitwiseOperation.BITWISE_AND operation, with negate set to true.
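The two cases above can be checked with a small standalone sketch. It is not the BitwiseFilter source: it assumes that a document "matches" when the bitwise operation yields a non-zero result and that negate simply inverts that outcome. The class BitwiseMatchSketch and its nested Op enum are hypothetical stand-ins for the real BitwiseOperation values.

```java
import java.util.function.IntBinaryOperator;

final class BitwiseMatchSketch {

    /** Hypothetical stand-in for the BitwiseOperation values listed above. */
    enum Op {
        AND((a, b) -> a & b),
        OR((a, b) -> a | b),
        XOR((a, b) -> a ^ b);

        private final IntBinaryOperator fn;

        Op(IntBinaryOperator fn) {
            this.fn = fn;
        }

        int apply(int fieldValue, int source) {
            return fn.applyAsInt(fieldValue, source);
        }
    }

    // Assumption: a match means the operation result is non-zero;
    // negate flips the outcome.
    static boolean matches(int fieldValue, int source, Op op, boolean negate) {
        boolean nonZero = op.apply(fieldValue, source) != 0;
        return negate != nonZero;
    }

    public static void main(String[] args) {
        int flags = 8; // 1000 in binary

        // Case 1: 8 & 8 = 8 (non-zero), negate = false
        System.out.println(matches(flags, 8, Op.AND, false));

        // Case 2: 8 & 4 = 0, negate = true
        System.out.println(matches(flags, 4, Op.AND, true));
    }
}
```

Under that assumption both calls report a match, which agrees with the two expected results listed above.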

The BitwiseFilter constructor accepts the following values:

    * The name of the integer field (a string)
    * The BitwiseOperation object, for example BitwiseOperation.BITWISE_XOR
    * The source value (an integer)
    * A boolean value indicating whether or not to negate the results of the operation
    * A pre-constructed org.apache.lucene.search.Query

Here is an example of how you would use it with Solr

http://localhost:8983/solr/bitwise/select/?q={!bitwise field=user_permissions op=AND source=3 negate=true}state:FL

http://localhost:8983/solr/bitwise/select/?q={!bitwise field=user_permissions op=AND source=3}state:FL
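If the request is sent from code rather than typed into a browser, the local-params prefix contains characters (braces, "!", "=", spaces) that generally need to be URL-encoded. Here is a minimal sketch using only java.net.URLEncoder. The class and method names are hypothetical; the host, core, and parameter names are copied from the URLs above.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

final class BitwiseUrlSketch {

    // Base URL taken from the example above; adjust for your own installation.
    private static final String SELECT_URL = "http://localhost:8983/solr/bitwise/select/?q=";

    // Builds the {!bitwise ...} local-params query and URL-encodes it.
    static String buildQueryUrl(String field, String op, int source,
                                boolean negate, String mainQuery)
            throws UnsupportedEncodingException {
        StringBuilder q = new StringBuilder("{!bitwise");
        q.append(" field=").append(field);
        q.append(" op=").append(op);
        q.append(" source=").append(source);
        if (negate) {
            q.append(" negate=true");
        }
        q.append('}').append(mainQuery);
        return SELECT_URL + URLEncoder.encode(q.toString(), "UTF-8");
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        System.out.println(buildQueryUrl("user_permissions", "AND", 3, true, "state:FL"));
        System.out.println(buildQueryUrl("user_permissions", "AND", 3, false, "state:FL"));
    }
}
```

The two calls in main correspond to the two example requests: the first with negate=true, the second with negate omitted.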

Here is an example of how you would use it with Lucene
   
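import java.io.IOException;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.Term;
// Assumption: ParseException is the Lucene query parser exception thrown by the base class setup.
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.search.FilteredQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;

// BitwiseFilter, BitwiseOperation, BitwiseResultFilter and BitwiseTestBase
// come from the contributed package and are assumed to be on the classpath.
public class BitwiseTestSearch extends BitwiseTestBase {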

    public BitwiseTestSearch()
    {
       
    }
   
    public void search() throws IOException, ParseException
    {
        setupSearch();
       
        // term
        Term t = new Term(COUNTRY_KEY, "us");
       
        // term query
        Query q = new TermQuery(t);
       
        // maximum number of documents to display
        int limit = 1000;
       
        int sourceValue = 0 ;
       
        boolean negate = false;
       
        BitwiseFilter bitwiseFilter = new BitwiseFilter(USER_PERMS_KEY, BitwiseOperation.BITWISE_XOR, sourceValue, negate, q);
       
        Query fq = new FilteredQuery(q, bitwiseFilter);
       
        ScoreDoc[] hits = isearcher.search(fq, null, limit).scoreDocs;
       
        BitwiseResultFilter resultFilter = bitwiseFilter.getResultFilter();
       
        for (int i = 0; i < hits.length; i++) {
           
            Document hitDoc = isearcher.doc(hits[i].doc);
           
            System.out.println(FIRST_NAME_KEY + " field has a value of " + hitDoc.get(FIRST_NAME_KEY));
            System.out.println(LAST_NAME_KEY + " field has a value of " + hitDoc.get(LAST_NAME_KEY));
            System.out.println(ACTIVE_KEY + " field has a value of " + hitDoc.get(ACTIVE_KEY));
           
            System.out.println(USER_PERMS_KEY + " field has a value of " + hitDoc.get(USER_PERMS_KEY));

            System.out.println("doc ID --> " + hits[i].doc);           
           
            System.out.println("...............................................................");
        }
       
        System.out.println("sourceValue = " + sourceValue + ", operation = " + resultFilter.getOperation().getOperationName() + ", negate = " + negate);
       
        System.out.println("A total of " + hits.length + " documents were found from the search\n");
       
        shutdown();
    }
   
    public static void main(String args[]) throws IOException, ParseException
    {
        BitwiseTestSearch search = new BitwiseTestSearch();
       
        search.search();
    }
}

Any guidance would be highly appreciated.

Thanks.


--
"Good Enough" is not good enough.
To give anything less than your best is to sacrifice the gift.
Quality First. Measure Twice. Cut Once.
http://www.israelekpo.com/

Re: Bitwise Operations on Integer Fields in Lucene and Solr Index

Andrzej Białecki-2
On 2010-05-13 23:27, Israel Ekpo wrote:

> Hello Lucene and Solr Community
>
> I have a custom org.apache.lucene.search.Filter that I would like to
> contribute to the Lucene and Solr projects.
>
> So I would need some direction as to how to create an ISSUE or submit a
> patch.
>
> It looks like there have been changes to the way this is done since the
> latest merge of the two projects (Lucene and Solr).
>
> Recently, some Solr users have been looking for a way to perform bitwise
> operations between an integer value and some fields in the index.
>
> So, I wrote a Solr QParser plugin to do this using a custom Lucene Filter.
>
> This package makes it possible to filter results returned from a query based
> on the results of a bitwise operation on an integer field in the documents
> returned from the pre-constructed query.

Hi,

What a coincidence! :) I'm working on something very similar, only the
use case that I need to support is slightly different - I want to
support a ranked search based on a bitwise overlap of query value and
field value. That is, the number of differing bits would reduce the
score. This scenario occurs e.g. during near-duplicate detection that
uses fuzzy signatures, on document- or sentence levels.
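To make the idea of "differing bits reduce the score" concrete, here is a small sketch. It is not the code referred to in this message: the class name and the normalization to the range 0..1 are assumptions chosen purely for illustration. The number of differing bits is the Hamming distance, which Integer.bitCount gives directly on the XOR of the two values.

```java
final class BitOverlapSketch {

    // Number of bit positions where the query value and the field value differ.
    static int differingBits(int queryValue, int fieldValue) {
        return Integer.bitCount(queryValue ^ fieldValue);
    }

    // Hypothetical score: 1.0 for identical values, 0.0 when all 32 bits differ.
    static float score(int queryValue, int fieldValue) {
        return (32 - differingBits(queryValue, fieldValue)) / 32.0f;
    }

    public static void main(String[] args) {
        int signature = 0b1011_0110;
        int nearDuplicate = 0b1011_0100; // differs from signature in one bit
        System.out.println(differingBits(signature, nearDuplicate));
        System.out.println(score(signature, nearDuplicate));
    }
}
```

A real Scorer would combine this with Lucene's own scoring factors; the sketch only shows the overlap measure itself.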

I'm going to submit my code early next week, it still needs some
polishing. I have two ways to execute this query, neither of which uses
filters at the moment:

* method 1: during indexing the bits in the fields are turned into
on/off terms on the same field, and during search a BooleanQuery is
formed from the int value with the same terms. Scoring is courtesy of
BooleanScorer. This method supports only a single int value per field.

* method 2, incomplete yet - during indexing the bits are turned into
terms as before, but this method supports multiple int values per field:
terms that correspond to bitmasks on the same value are put at the same
positions. Then a specialized Query / Scorer traverses all 32 posting
lists in parallel, moving through all matching docs and scoring
according to how many terms matched at the same position.
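The bit-to-term encoding that both methods start from can be sketched without any Lucene classes. The term naming scheme ("b3_1" for bit 3 set, "b3_0" for bit 3 clear) is an assumption made for this example, not the format used in the code described above. Counting the terms two values share is the plain-Java analogue of how many clauses of such a BooleanQuery would match a document.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class BitTermsSketch {

    // Turns each of the 32 bits into an on/off term, e.g. "b3_1" or "b3_0".
    static List<String> toBitTerms(int value) {
        List<String> terms = new ArrayList<>(32);
        for (int bit = 0; bit < 32; bit++) {
            int state = (value >>> bit) & 1;
            terms.add("b" + bit + "_" + state);
        }
        return terms;
    }

    // Counts how many bit terms the query value shares with the indexed value.
    static int sharedTerms(int queryValue, int fieldValue) {
        Set<String> indexed = new HashSet<>(toBitTerms(fieldValue));
        int shared = 0;
        for (String term : toBitTerms(queryValue)) {
            if (indexed.contains(term)) {
                shared++;
            }
        }
        return shared;
    }

    public static void main(String[] args) {
        System.out.println(toBitTerms(8).subList(0, 4)); // terms for the four lowest bits
        System.out.println(sharedTerms(8, 12));          // values differ only in bit 2
    }
}
```

With this encoding the shared-term count always equals 32 minus the number of differing bits, so a query with one optional clause per term ranks documents by bit overlap.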

I wrapped this in a Solr FieldType, and instead of using a custom
QParser plugin I simply implemented FieldType.getFieldQuery().

It would be great to work out a convenient user-level API for this
feature, both the scoring and the non-scoring case.

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Bitwise Operations on Integer Fields in Lucene and Solr Index

Israel Ekpo
I have created two ISSUES as new features

https://issues.apache.org/jira/browse/LUCENE-1560

https://issues.apache.org/jira/browse/SOLR-1913

The first one is for the Lucene Filter.

The second one is for the Solr QParserPlugin

The source code and jar files are attached and the Solr plugin is available for use immediately.




Re: Bitwise Operations on Integer Fields in Lucene and Solr Index

Israel Ekpo
Correction,

I meant to list

https://issues.apache.org/jira/browse/LUCENE-2460
https://issues.apache.org/jira/browse/SOLR-1913


