Lucene sorting case-sensitive by default?

classic Classic list List threaded Threaded
11 messages Options
Reply | Threaded
Open this post in threaded view
|

Lucene sorting case-sensitive by default?

Alex Wang-4
Hi All,

 

I was searching my index with sorting on a field called "Label" which is
not tokenized, here is what came back:

 

Extended Sites Catalog Asset Store

Extended Sites Catalog Asset Store SALES

Print Catalog 2

Print catalog test

Test Print Catalog

Test refresh catalog

print test  3

test catalog 1

 

Looks like Lucene is separating upper case and lower case while sorting.
Can someone shed some light on this as to while this is happening and
how to fix it?

 

Thanks in advance for your help!

 

Alex

 

 

Reply | Threaded
Open this post in threaded view
|

Re: Lucene sorting case-sensitive by default?

Tom Emerson-3
String fields are sorted using natural (lexicographic) order. For characters
in ASCII range this means uppercase letters will sort before lowercase
letters (e.g., 'A' U+0041 sorts before 'a' U+0061). This behaviour is
documented on in the JavaDocs for org.apache.lucene.search.Sort.

    -tree


On Jan 11, 2008 11:40 AM, Alex Wang <[hidden email]> wrote:

> Hi All,
>
>
>
> I was searching my index with sorting on a field called "Label" which is
> not tokenized, here is what came back:
>
>
>
> Extended Sites Catalog Asset Store
>
> Extended Sites Catalog Asset Store SALES
>
> Print Catalog 2
>
> Print catalog test
>
> Test Print Catalog
>
> Test refresh catalog
>
> print test  3
>
> test catalog 1
>
>
>
> Looks like Lucene is separating upper case and lower case while sorting.
> Can someone shed some light on this as to while this is happening and
> how to fix it?
>
>
>
> Thanks in advance for your help!
>
>
>
> Alex
>
>
>
>
>
>


--
Tom Emerson
[hidden email]
http://www.dreamersrealm.net/~tree <http://www.dreamersrealm.net/%7Etree>
Reply | Threaded
Open this post in threaded view
|

Re: Lucene sorting case-sensitive by default?

Erick Erickson
In reply to this post by Alex Wang-4
I've often stored a special sort field that's lower-cased.


On Jan 11, 2008 11:40 AM, Alex Wang <[hidden email]> wrote:

> Hi All,
>
>
>
> I was searching my index with sorting on a field called "Label" which is
> not tokenized, here is what came back:
>
>
>
> Extended Sites Catalog Asset Store
>
> Extended Sites Catalog Asset Store SALES
>
> Print Catalog 2
>
> Print catalog test
>
> Test Print Catalog
>
> Test refresh catalog
>
> print test  3
>
> test catalog 1
>
>
>
> Looks like Lucene is separating upper case and lower case while sorting.
> Can someone shed some light on this as to while this is happening and
> how to fix it?
>
>
>
> Thanks in advance for your help!
>
>
>
> Alex
>
>
>
>
>
>
Reply | Threaded
Open this post in threaded view
|

Re: Lucene sorting case-sensitive by default?

Toke Eskildsen
In reply to this post by Alex Wang-4
On Fri, 2008-01-11 at 11:40 -0500, Alex Wang wrote:
> Looks like Lucene is separating upper case and lower case while sorting.

As Tom points out, default sorting uses natural order. It's worth noting
that this implies that default sorting does not produce usable results
as soon as you use non-ASCII characters. Using a Collator works but does
take a fair amount of memory.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

RE: Lucene sorting case-sensitive by default?

Alex Wang-4
Thanks everyone for your replies! Guess I did not fully understand the
meaning of "natural order" in the Lucene Java doc.

To add another all-lower-case field for each sortable field in my index
is a little too much, since the app requires sorting on pretty much all
fields (over 100).

Toke, you mentioned "Using a Collator works but does take a fair amount
of memory", can you please elaborate a little more on that. Thanks.

Alex

-----Original Message-----
From: Toke Eskildsen [mailto:[hidden email]]
Sent: Monday, January 14, 2008 3:13 AM
To: [hidden email]
Subject: Re: Lucene sorting case-sensitive by default?

On Fri, 2008-01-11 at 11:40 -0500, Alex Wang wrote:
> Looks like Lucene is separating upper case and lower case while
sorting.

As Tom points out, default sorting uses natural order. It's worth noting
that this implies that default sorting does not produce usable results
as soon as you use non-ASCII characters. Using a Collator works but does
take a fair amount of memory.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: Lucene sorting case-sensitive by default?

Erick Erickson
Several things:

1> do you need to display all the fields? Would just storing them
lower-case work? The only time I've needed to store fields case-
sensitive is when I'm showing them to the user. If the user is just
searching on them, I can store them any way I want and she'll never
know.

2> You might very well be surprised at how little extra it takes to
index (but not store) the lower-case version. How big is your index
anyway? And be warned that the size increase is not linear, so
just comparing the index sizes for, say, 10 document is misleading.
If your index is 10M, there's no reason at all not to store twice. If it's
10G........

3> You could store (but not index) the cased version. You could
index (but not store) the lower-case version. The total size of
your index is (I believe) about the same as indexing AND storing
the fields. That gives you a way to search caselessly and display
case-sensitively.

Best
Erick

On Jan 14, 2008 10:58 AM, Alex Wang <[hidden email]> wrote:

> Thanks everyone for your replies! Guess I did not fully understand the
> meaning of "natural order" in the Lucene Java doc.
>
> To add another all-lower-case field for each sortable field in my index
> is a little too much, since the app requires sorting on pretty much all
> fields (over 100).
>
> Toke, you mentioned "Using a Collator works but does take a fair amount
> of memory", can you please elaborate a little more on that. Thanks.
>
> Alex
>
> -----Original Message-----
> From: Toke Eskildsen [mailto:[hidden email]]
> Sent: Monday, January 14, 2008 3:13 AM
> To: [hidden email]
> Subject: Re: Lucene sorting case-sensitive by default?
>
> On Fri, 2008-01-11 at 11:40 -0500, Alex Wang wrote:
> > Looks like Lucene is separating upper case and lower case while
> sorting.
>
> As Tom points out, default sorting uses natural order. It's worth noting
> that this implies that default sorting does not produce usable results
> as soon as you use non-ASCII characters. Using a Collator works but does
> take a fair amount of memory.
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]
>
>
Reply | Threaded
Open this post in threaded view
|

RE: Lucene sorting case-sensitive by default?

Alex Wang-4
Thanks a lot Erik for the great tip! I do need to display all the fields
and allow the users to sort by each field as they wish. My index is
currently about 200 mb.

Your suggestion about storing (but not index) the cased version, and
indexing (but not store) the lower-case version is an excellent solution
for me.

Is it possible to do it in the same field or do I have to do it in 2
separate fields? If I do it in one field, what are the Lucene
class/methods I need to overwrite?

Thanks again for your help!

Alex
 

This message may contain confidential and/or privileged information. If
you are not the addressee or authorized to receive this for the
addressee, you must not use, copy, disclose, or take any action based on
this message or any information herein. If you have received this
message in error, please advise the sender immediately by reply e-mail
and delete this message. Thank you for your cooperation.


-----Original Message-----
From: Erick Erickson [mailto:[hidden email]]
Sent: Monday, January 14, 2008 11:24 AM
To: [hidden email]
Subject: Re: Lucene sorting case-sensitive by default?

Several things:

1> do you need to display all the fields? Would just storing them
lower-case work? The only time I've needed to store fields case-
sensitive is when I'm showing them to the user. If the user is just
searching on them, I can store them any way I want and she'll never
know.

2> You might very well be surprised at how little extra it takes to
index (but not store) the lower-case version. How big is your index
anyway? And be warned that the size increase is not linear, so
just comparing the index sizes for, say, 10 document is misleading.
If your index is 10M, there's no reason at all not to store twice. If
it's
10G........

3> You could store (but not index) the cased version. You could
index (but not store) the lower-case version. The total size of
your index is (I believe) about the same as indexing AND storing
the fields. That gives you a way to search caselessly and display
case-sensitively.

Best
Erick

On Jan 14, 2008 10:58 AM, Alex Wang <[hidden email]> wrote:

> Thanks everyone for your replies! Guess I did not fully understand the
> meaning of "natural order" in the Lucene Java doc.
>
> To add another all-lower-case field for each sortable field in my
index
> is a little too much, since the app requires sorting on pretty much
all
> fields (over 100).
>
> Toke, you mentioned "Using a Collator works but does take a fair
amount

> of memory", can you please elaborate a little more on that. Thanks.
>
> Alex
>
> -----Original Message-----
> From: Toke Eskildsen [mailto:[hidden email]]
> Sent: Monday, January 14, 2008 3:13 AM
> To: [hidden email]
> Subject: Re: Lucene sorting case-sensitive by default?
>
> On Fri, 2008-01-11 at 11:40 -0500, Alex Wang wrote:
> > Looks like Lucene is separating upper case and lower case while
> sorting.
>
> As Tom points out, default sorting uses natural order. It's worth
noting
> that this implies that default sorting does not produce usable results
> as soon as you use non-ASCII characters. Using a Collator works but
does

> take a fair amount of memory.
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: Lucene sorting case-sensitive by default?

Erick Erickson
Sorry, I was confused about this for the longest time (and it shows!). You
don't actually have to store two separate fields. Field.Store.YES stores
the input exactly as is, without passing it through anything. So you
really only have to store your field. I still think of it conceptually as
two
entirely different things, but it's not.

This code:
   public static void main(String[] args) throws Exception
    {
        try {
            RAMDirectory dir = new RAMDirectory();
            IndexWriter iw = new IndexWriter(
                        dir,
                        new StandardAnalyzer(Collections.emptySet()),
                        true);

            Document doc = new Document();

            doc.add(
                    new Field(
                            "f",
                            "This is Some Mixed, case Junk($*%& With Ugly
SYmbols",
                            Field.Store.YES,
                            Field.Index.TOKENIZED));
            iw.addDocument(doc);
            iw.close();
            IndexReader ir =  IndexReader.open(dir);
            Document d = ir.document(0);
            System.out.println(d.get("f"));
        } catch (Exception e) {
            e.printStackTrace();
        }

        System.out.println("done");
    }

prints "This is Some Mixed, case Junk($*%& With Ugly SYmbols"
yet still finds the document with a search for "junk" using
StandardAnalyzer.

Sorry for the confusion!
Erick

On Jan 14, 2008 11:48 AM, Alex Wang <[hidden email]> wrote:

> Thanks a lot Erik for the great tip! I do need to display all the fields
> and allow the users to sort by each field as they wish. My index is
> currently about 200 mb.
>
> Your suggestion about storing (but not index) the cased version, and
> indexing (but not store) the lower-case version is an excellent solution
> for me.
>
> Is it possible to do it in the same field or do I have to do it in 2
> separate fields? If I do it in one field, what are the Lucene
> class/methods I need to overwrite?
>
> Thanks again for your help!
>
> Alex
>
>
> This message may contain confidential and/or privileged information. If
> you are not the addressee or authorized to receive this for the
> addressee, you must not use, copy, disclose, or take any action based on
> this message or any information herein. If you have received this
> message in error, please advise the sender immediately by reply e-mail
> and delete this message. Thank you for your cooperation.
>
>
> -----Original Message-----
> From: Erick Erickson [mailto:[hidden email]]
> Sent: Monday, January 14, 2008 11:24 AM
> To: [hidden email]
> Subject: Re: Lucene sorting case-sensitive by default?
>
> Several things:
>
> 1> do you need to display all the fields? Would just storing them
> lower-case work? The only time I've needed to store fields case-
> sensitive is when I'm showing them to the user. If the user is just
> searching on them, I can store them any way I want and she'll never
> know.
>
> 2> You might very well be surprised at how little extra it takes to
> index (but not store) the lower-case version. How big is your index
> anyway? And be warned that the size increase is not linear, so
> just comparing the index sizes for, say, 10 document is misleading.
> If your index is 10M, there's no reason at all not to store twice. If
> it's
> 10G........
>
> 3> You could store (but not index) the cased version. You could
> index (but not store) the lower-case version. The total size of
> your index is (I believe) about the same as indexing AND storing
> the fields. That gives you a way to search caselessly and display
> case-sensitively.
>
> Best
> Erick
>
> On Jan 14, 2008 10:58 AM, Alex Wang <[hidden email]> wrote:
>
> > Thanks everyone for your replies! Guess I did not fully understand the
> > meaning of "natural order" in the Lucene Java doc.
> >
> > To add another all-lower-case field for each sortable field in my
> index
> > is a little too much, since the app requires sorting on pretty much
> all
> > fields (over 100).
> >
> > Toke, you mentioned "Using a Collator works but does take a fair
> amount
> > of memory", can you please elaborate a little more on that. Thanks.
> >
> > Alex
> >
> > -----Original Message-----
> > From: Toke Eskildsen [mailto:[hidden email]]
> > Sent: Monday, January 14, 2008 3:13 AM
> > To: [hidden email]
> > Subject: Re: Lucene sorting case-sensitive by default?
> >
> > On Fri, 2008-01-11 at 11:40 -0500, Alex Wang wrote:
> > > Looks like Lucene is separating upper case and lower case while
> > sorting.
> >
> > As Tom points out, default sorting uses natural order. It's worth
> noting
> > that this implies that default sorting does not produce usable results
> > as soon as you use non-ASCII characters. Using a Collator works but
> does
> > take a fair amount of memory.
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: [hidden email]
> > For additional commands, e-mail: [hidden email]
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: [hidden email]
> > For additional commands, e-mail: [hidden email]
> >
> >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]
>
>
Reply | Threaded
Open this post in threaded view
|

RE: Lucene sorting case-sensitive by default?

Alex Wang-4
No problem Erick. Thanks for clarifying it.

Alex

-----Original Message-----
From: Erick Erickson [mailto:[hidden email]]
Sent: Monday, January 14, 2008 12:35 PM
To: [hidden email]
Subject: Re: Lucene sorting case-sensitive by default?

Sorry, I was confused about this for the longest time (and it shows!).
You
don't actually have to store two separate fields. Field.Store.YES stores
the input exactly as is, without passing it through anything. So you
really only have to store your field. I still think of it conceptually
as
two
entirely different things, but it's not.

This code:
   public static void main(String[] args) throws Exception
    {
        try {
            RAMDirectory dir = new RAMDirectory();
            IndexWriter iw = new IndexWriter(
                        dir,
                        new StandardAnalyzer(Collections.emptySet()),
                        true);

            Document doc = new Document();

            doc.add(
                    new Field(
                            "f",
                            "This is Some Mixed, case Junk($*%& With
Ugly
SYmbols",
                            Field.Store.YES,
                            Field.Index.TOKENIZED));
            iw.addDocument(doc);
            iw.close();
            IndexReader ir =  IndexReader.open(dir);
            Document d = ir.document(0);
            System.out.println(d.get("f"));
        } catch (Exception e) {
            e.printStackTrace();
        }

        System.out.println("done");
    }

prints "This is Some Mixed, case Junk($*%& With Ugly SYmbols"
yet still finds the document with a search for "junk" using
StandardAnalyzer.

Sorry for the confusion!
Erick

On Jan 14, 2008 11:48 AM, Alex Wang <[hidden email]> wrote:

> Thanks a lot Erik for the great tip! I do need to display all the
fields
> and allow the users to sort by each field as they wish. My index is
> currently about 200 mb.
>
> Your suggestion about storing (but not index) the cased version, and
> indexing (but not store) the lower-case version is an excellent
solution

> for me.
>
> Is it possible to do it in the same field or do I have to do it in 2
> separate fields? If I do it in one field, what are the Lucene
> class/methods I need to overwrite?
>
> Thanks again for your help!
>
> Alex
>
>
> This message may contain confidential and/or privileged information.
If
> you are not the addressee or authorized to receive this for the
> addressee, you must not use, copy, disclose, or take any action based
on

> this message or any information herein. If you have received this
> message in error, please advise the sender immediately by reply e-mail
> and delete this message. Thank you for your cooperation.
>
>
> -----Original Message-----
> From: Erick Erickson [mailto:[hidden email]]
> Sent: Monday, January 14, 2008 11:24 AM
> To: [hidden email]
> Subject: Re: Lucene sorting case-sensitive by default?
>
> Several things:
>
> 1> do you need to display all the fields? Would just storing them
> lower-case work? The only time I've needed to store fields case-
> sensitive is when I'm showing them to the user. If the user is just
> searching on them, I can store them any way I want and she'll never
> know.
>
> 2> You might very well be surprised at how little extra it takes to
> index (but not store) the lower-case version. How big is your index
> anyway? And be warned that the size increase is not linear, so
> just comparing the index sizes for, say, 10 document is misleading.
> If your index is 10M, there's no reason at all not to store twice. If
> it's
> 10G........
>
> 3> You could store (but not index) the cased version. You could
> index (but not store) the lower-case version. The total size of
> your index is (I believe) about the same as indexing AND storing
> the fields. That gives you a way to search caselessly and display
> case-sensitively.
>
> Best
> Erick
>
> On Jan 14, 2008 10:58 AM, Alex Wang <[hidden email]> wrote:
>
> > Thanks everyone for your replies! Guess I did not fully understand
the

> > meaning of "natural order" in the Lucene Java doc.
> >
> > To add another all-lower-case field for each sortable field in my
> index
> > is a little too much, since the app requires sorting on pretty much
> all
> > fields (over 100).
> >
> > Toke, you mentioned "Using a Collator works but does take a fair
> amount
> > of memory", can you please elaborate a little more on that. Thanks.
> >
> > Alex
> >
> > -----Original Message-----
> > From: Toke Eskildsen [mailto:[hidden email]]
> > Sent: Monday, January 14, 2008 3:13 AM
> > To: [hidden email]
> > Subject: Re: Lucene sorting case-sensitive by default?
> >
> > On Fri, 2008-01-11 at 11:40 -0500, Alex Wang wrote:
> > > Looks like Lucene is separating upper case and lower case while
> > sorting.
> >
> > As Tom points out, default sorting uses natural order. It's worth
> noting
> > that this implies that default sorting does not produce usable
results
> > as soon as you use non-ASCII characters. Using a Collator works but
> does
> > take a fair amount of memory.
> >
> >
> >
---------------------------------------------------------------------
> > To unsubscribe, e-mail: [hidden email]
> > For additional commands, e-mail: [hidden email]
> >
> >
> >
---------------------------------------------------------------------

> > To unsubscribe, e-mail: [hidden email]
> > For additional commands, e-mail: [hidden email]
> >
> >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

RE: Lucene sorting case-sensitive by default?

Toke Eskildsen
In reply to this post by Alex Wang-4
On Mon, 2008-01-14 at 10:58 -0500, Alex Wang wrote:
> Toke, you mentioned "Using a Collator works but does take a fair amount
> of memory", can you please elaborate a little more on that. Thanks.

We have an index with 10 million records that takes up 37GB. Practically
all records have a title, which can be used for sorting. The size of the
titles is about 450MB in total. Using a Collator the usual way means
that a CollatorKey is stored for all of those titles, among with the
titles themselves.

As for using a Collator, there's a neat tutorial at
http://wiki.epc.ub.uu.se/display/~ronnie/International+Sorting


Unfortunately for us, the memory requirements for this was too much for
our machine. Instead of using the build-in caching mechanism, we used
the fact that our index is currently static and calculated a single
integer for every document, stating its position relative to all other
documents in local sort order. That gives us fast sort with low memory
overhead. Of course we'll have to discard that idea when we move to a
non-static index.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

adb
Reply | Threaded
Open this post in threaded view
|

Re: Lucene sorting case-sensitive by default?

adb
In reply to this post by Erick Erickson
Erick Erickson wrote:

>             doc.add(
>                     new Field(
>                             "f",
>                             "This is Some Mixed, case Junk($*%& With Ugly
> SYmbols",
>                             Field.Store.YES,
>                             Field.Index.TOKENIZED));

<snip>

> prints "This is Some Mixed, case Junk($*%& With Ugly SYmbols"
> yet still finds the document with a search for "junk" using
> StandardAnalyzer.

Don't forget you can't sort on that as the field's tokenized, so although it's
stored in the original and indexed as lower case into multiple tokens, you will
get the RuntimeException from FieldCache.

Antony



---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]