Sorting on a field that can have null values

Sorting on a field that can have null values

Peter Keegan
I'm copying this reply from a topic with the same title from the defunct
'lucene-user' list. My comments follow it.

: I thought of putting empty strings instead of null values but I think
: empty strings are put first in the list while sorting which is the
: reverse of what anyone would want.

instead of adding a field with a null value, or value of an empty string,
why not just leave the field out for that/those doc(s)?

there's no requirement that every doc in your index has to have the exact
same set of fields.

If i remember correctly (you'll have to test this) sorting on a field
which doesn't exist for every doc does what you would want (docs with
values are listed before docs without)



-Hoss



The actual behavior is different than described above. I modified
TestSort.java:

    // test sorts where the type of field is specified
    public void testTypedSort() throws Exception {
        assertMatches (full, queryF, sort, "JIZ");
    }

The actual order of the results is: "ZJI". I believe this happens because
the field string cache 'order' array contains 0's for all the documents that
don't contain the field and thus sort first.

Suppose I want to exclude documents from being collected if they don't
contain the sort field. One way to do this is to index a unique
'empty_value' value for those documents and add a MUST_NOT boolean clause to
the query, for example: "<query terms> -field:empty_value". But this seems
inefficient. Is there a better way?

Thanks,
Peter

Re: Sorting on a field that can have null values

Chris Hostetter-3

: If i remember correctly (you'll have to test this) sorting on a field
: which doesn't exist for every doc does what you would want (docs with
: values are listed before docs without)

: The actual behavior is different than described above. I modified
: TestSort.java:

: The actual order of the results is: "ZJI". I believe this happens because
: the field string cache 'order' array contains 0's for all the documents that
: don't contain the field and thus sort first.

i guess i wasn't precise enough in that old thread, what i meant was that not
having a value results in the docs sorting the same as if they had a value
lower than the lowest existing value -- so they sort at the end of the
list if you are doing a descending sort, and at the beginning of the list
if you do an ascending sort.  If you want to always have them come "last"
regardless of order, there is a SortComparator for that purpose in Solr...

https://issues.apache.org/jira/browse/LUCENE-406
http://svn.apache.org/viewvc/lucene/solr/trunk/src/java/org/apache/solr/search/MissingStringLastComparatorSource.java?view=log
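
The two orderings described above can be sketched in plain Java. This is only an analogy built on java.util.Comparator, not Lucene's or Solr's sort code; the class and method names are hypothetical, and null stands in for "document has no value for the sort field".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class MissingValueSortSketch {

    // Default behavior as described in the thread: a missing value acts as if
    // it were lower than every real value, so reversing the sort moves it too.
    static List<String> sortDefault(List<String> values, boolean descending) {
        Comparator<String> c = Comparator.nullsFirst(Comparator.<String>naturalOrder());
        if (descending) {
            c = c.reversed();
        }
        List<String> out = new ArrayList<>(values);
        out.sort(c);
        return out;
    }

    // "Missing last" behavior: only the order of real values is reversed;
    // missing values stay at the end in both directions.
    static List<String> sortMissingLast(List<String> values, boolean descending) {
        Comparator<String> order = Comparator.naturalOrder();
        if (descending) {
            order = order.reversed();
        }
        List<String> out = new ArrayList<>(values);
        out.sort(Comparator.nullsLast(order));
        return out;
    }

    public static void main(String[] args) {
        List<String> values = Arrays.asList("b", null, "a", null, "c");
        System.out.println(sortDefault(values, false));     // [null, null, a, b, c]
        System.out.println(sortDefault(values, true));      // [c, b, a, null, null]
        System.out.println(sortMissingLast(values, false)); // [a, b, c, null, null]
        System.out.println(sortMissingLast(values, true));  // [c, b, a, null, null]
    }
}
```

The difference only shows up on an ascending sort: the default puts the missing values first, while the "missing last" variant keeps them at the end.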

: Suppose I want to exclude documents from being collected if they don't
: contain the sort field. One way to do this is to index a unique
: 'empty_value' value for those documents and add a MUST_NOT boolean clause to
: the query, for example: "<query terms> -field:empty_value". But this seems
: inefficient. Is there a better way?

excluding them completely is a slightly different task, you don't need to
index a special marker value, you can just use a
RangeFilter (or ConstantScoreRangeQuery) to ensure you only get docs with
a value for that field (ie: field:[* TO *])
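
The intended effect of that open-ended range can be sketched without Lucene: keep every document that has any value for the field and drop the rest, with no marker value involved. The sketch below models documents as plain maps; the class, the helper methods and the "price" field are all hypothetical, and none of this is the RangeFilter implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class HasFieldFilterSketch {

    // Keep only the documents that carry some value for the given field,
    // which is what matching field:[* TO *] is meant to achieve.
    static List<Map<String, String>> withField(List<Map<String, String>> docs, String field) {
        List<Map<String, String>> kept = new ArrayList<>();
        for (Map<String, String> doc : docs) {
            if (doc.containsKey(field)) {
                kept.add(doc);
            }
        }
        return kept;
    }

    // Hypothetical helper: builds a document from alternating names and values.
    static Map<String, String> doc(String... namesAndValues) {
        Map<String, String> d = new HashMap<>();
        for (int i = 0; i + 1 < namesAndValues.length; i += 2) {
            d.put(namesAndValues[i], namesAndValues[i + 1]);
        }
        return d;
    }

    public static void main(String[] args) {
        List<Map<String, String>> docs = Arrays.asList(
                doc("id", "1", "price", "10"),
                doc("id", "2"),                 // no "price" field at all
                doc("id", "3", "price", "7"));
        // Only documents 1 and 3 survive the filter.
        System.out.println(withField(docs, "price").size()); // 2
    }
}
```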



-Hoss


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


Re: Sorting on a field that can have null values

Peter Keegan
> excluding them completely is a slightly different task, you don't need to
> index a special marker value, you can just use a
> RangeFilter (or ConstantScoreRangeQuery) to ensure you only get docs with
> a value for that field (ie: field:[* TO *])

Excellent, this is a much better solution. BTW, adding a
ConstantScoreRangeQuery clause to the query works fine, but building the
RangeFilter from the query string "field:[* TO *]" doesn't work. The reason
is that the terms expanded from the lowerTerm wildcard are compared to
'upperTerm', which is literally '*', and that is incorrect. This appears to
be a bug in QueryParser, as it ought to set lowerTerm = upperTerm = null in
this case.
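
A simplified model of the problem, assuming a null bound means "unbounded" (the class and method names are hypothetical, and this is not QueryParser or RangeFilter code): when the upper bound is the literal string "*", ordinary terms compare greater than it and fall outside the range.

```java
final class OpenRangeSketch {

    // Inclusive range check on term text; a null bound is treated as open.
    static boolean inRange(String term, String lower, String upper) {
        boolean aboveLower = lower == null || term.compareTo(lower) >= 0;
        boolean belowUpper = upper == null || term.compareTo(upper) <= 0;
        return aboveLower && belowUpper;
    }

    public static void main(String[] args) {
        // Upper bound taken literally as "*": 'a' sorts after '*', so the term is rejected.
        System.out.println(inRange("apple", null, "*"));  // false
        // Both bounds open, as the post suggests they should be: every term matches.
        System.out.println(inRange("apple", null, null)); // true
    }
}
```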

Thanks,
Peter



Re: Sorting on a field that can have null values

Theodan
In reply to this post by Chris Hostetter-3
Chris Hostetter wrote:
> i guess i wasn't precise enough in that old thread, what i meant was that not
> having a value results in the docs sorting the same as if they had a value
> lower than the lowest existing value -- so they sort at the end of the
> list if you are doing a descending sort, and at the beginning of the list
> if you do an ascending sort.  If you want to always have them come "last"
> regardless of order, there is a SortComparator for that purpose in Solr...
>
> https://issues.apache.org/jira/browse/LUCENE-406
> http://svn.apache.org/viewvc/lucene/solr/trunk/src/java/org/apache/solr/search/MissingStringLastComparatorSource.java?view=log

But how can you use both the MissingStringLastComparatorSource and also your own custom SortComparator (i.e. having a custom getComparable() method)?

I have tried the obvious, which was to make my custom SortComparator extend MissingStringLastComparatorSource instead of SortComparator.  But then it seems that my custom getComparable() method is ignored.  The sorting framework doesn't seem to use the Comparables returned from my getComparable() method to sort the results; instead, it seems to use the ScoreDocComparator returned from the newComparator() method of MissingStringLastComparatorSource.

FYI, my end goal is to be able to sort on a field called "AssetType".  Some of the docs in the index may be missing this field (and I'd like those docs to be sorted at the end of the results).  Furthermore, I need a custom sorting order on the values in this "AssetType" field (first videos, then articles, then images, etc.).

Here is my custom comparator, after changing it to extend MissingStringLastComparatorSource (all that I changed was the "extends" clause; the body remained the same):

======================================================================
private static class AssetTypeSortComparator extends MissingStringLastComparatorSource /*SortComparator*/ {

        private static final Map ASSET_TYPE_ORDER_MAP = new HashMap();
        static {
                ASSET_TYPE_ORDER_MAP.put("Video", new Integer(0));
                ASSET_TYPE_ORDER_MAP.put("Article", new Integer(1));
                ASSET_TYPE_ORDER_MAP.put("Image", new Integer(2));
        }

        private static final Integer DEFAULT_ORDER = new Integer(3);

        protected Comparable getComparable(String termtext) {
                if (ASSET_TYPE_ORDER_MAP.containsKey(termtext)) {
                        return (Integer)ASSET_TYPE_ORDER_MAP.get(termtext);
                }
                else {
                        return DEFAULT_ORDER;
                }
        }

}
======================================================================

-Theo

Re: Sorting on a field that can have null values

Chris Hostetter-3
: But how can you use both the MissingStringLastComparatorSource and also your
: own custom SortComparator (i.e. having a custom getComparable() method)?
:
: I have tried the obvious, which was to make my custom SortComparator extend
: MissingStringLastComparatorSource instead of SortComparator.  But then it
: seems that my custom getComparable() method is ignored.  The sorting

well, yes that's true.  getComparable is an abstract method in
SortComparator so that subclasses of SortComparator can define how the
newComparator method will behave ... if you don't subclass SortComparator,
getComparable has no meaning.

(SortComparator is an implementation of SortComparatorSource, but that
doesn't mean other implementations of SortComparatorSource -- like
MissingStringLastComparatorSource -- have any notion of APIs introduced in
SortComparator)

It didn't occur to me before that you were writing a custom SortComparator
... skimming the code for SortComparator and FieldCache briefly there
seems to be a limitation/feature (depending on how you look at it)
such that getComparable(null) is never called to decide what to do with
the null values -- it just assumes you want a null Comparable as well.

: FYI, my end goal is to be able to sort on a field called "AssetType".  Some
: of the docs in the index may be missing this field (and I'd like those docs
: to be sorted at the end of the results).  Furthermore, I need a custom
: sorting order on the values in this "AssetType" field (first videos, then
: articles, then images, etc.).

Frankly, my advice to you is to "encode" the AssetType field into some new
"SortableAssetType" field when you index your docs -- the code to do that
is not only easier to understand than writing a custom
SortComparator or SortComparatorSource but it's also going to be a lot
faster ... you do a little more work when indexing, and you make
searching/sorting a lot more efficient.
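
A minimal sketch of such an encoding, assuming the sort order from the question (videos, then articles, then images, then anything else). The key strings and the class name are hypothetical. Giving docs without an AssetType an explicit high key is one possible choice, made here because a doc with no value at all would sort first on an ascending sort.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class SortableAssetTypeSketch {

    // Hypothetical sort keys; a plain ascending string sort on them gives the
    // desired order. Single characters are enough for up to ten categories.
    private static final Map<String, String> KEYS = new HashMap<>();
    static {
        KEYS.put("Video", "0");
        KEYS.put("Article", "1");
        KEYS.put("Image", "2");
    }
    static final String OTHER_KEY = "3";   // any asset type not listed above
    static final String MISSING_KEY = "9"; // doc has no AssetType at all

    // Value to store in the "SortableAssetType" field at index time.
    static String encode(String assetType) {
        if (assetType == null) {
            return MISSING_KEY;
        }
        return KEYS.getOrDefault(assetType, OTHER_KEY);
    }

    public static void main(String[] args) {
        // "Podcast" is a made-up type standing in for "anything else".
        List<String> types = new ArrayList<>(
                Arrays.asList("Image", null, "Video", "Podcast", "Article"));
        Comparator<String> byKey = Comparator.comparing(SortableAssetTypeSketch::encode);
        types.sort(byKey);
        System.out.println(types); // [Video, Article, Image, Podcast, null]
    }
}
```

With the key stored in its own field, an ordinary ascending string sort on that field is enough at search time, and no custom comparator is needed.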


-Hoss

