CapitilizationFilterFactory


CapitilizationFilterFactory

Grant Ingersoll-2
I have started on SOLR-330 and the first one to tackle is the  
CapitilizationFilterFactory (just starting at the top of the analysis  
package).

At any rate, there are some optimizations to be made here, but one thing the file does not explicitly state is that the "keep" word list is case-insensitive. This is the current, undocumented behavior. I am fine with documenting it and making it official going forward. However, if we instead make it case-sensitive, we can use a CharArraySet (from Lucene) to do quick lookups against the term buffer char array. The reason this comes up is that Token.termText() is now deprecated and I am switching over to the Token.termBuffer() char array. The filter can then operate directly on the char array, which should be a lot faster.

Any opinion on this?

-Grant

Re: CapitilizationFilterFactory

Grant Ingersoll-2
Scratch that.  CharArraySet has an ignoreCase option that I missed.

-Grant


Re: CapitilizationFilterFactory

Yonik Seeley-2
In reply to this post by Grant Ingersoll-2
On Jan 31, 2008 12:42 PM, Grant Ingersoll <[hidden email]> wrote:

> I have started on SOLR-330 and the first one to tackle is the
> CapitilizationFilterFactory (just starting at the top of the analysis
> package).
>
> At any rate, there are some optimizations to be made here, but one
> thing in the file that is not explicitly stated is that the "keep"
> word list is case-insensitive.  This is the current, undocumented,
> behavior.  I am fine with documenting and making it so going forward.
> However, if, instead, we make it case-sensitive, we can then use a
> CharArraySet (from Lucene) to do quick look ups of the term buffer
> char array.

CharArraySet can be either case-sensitive or case-insensitive, but you
must specify which when creating it.
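
For illustration, a minimal sketch of the two modes; the variable names and initial size are made up, but the constructor and calls follow org.apache.lucene.analysis.CharArraySet as of Lucene 2.3:

  CharArraySet exact  = new CharArraySet(16, false);  // case-sensitive lookups
  CharArraySet folded = new CharArraySet(16, true);   // ignoreCase lookups
  exact.add("and");
  folded.add("and");
  exact.contains("AnD".toCharArray(), 0, 3);   // false: case must match exactly
  folded.contains("AnD".toCharArray(), 0, 3);  // true: compared lowercased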

-Yonik

Re: CapitilizationFilterFactory

Grant Ingersoll-2
In reply to this post by Grant Ingersoll-2
I would add, though, that the semantics of "keep" here are a bit
confusing. The current functionality does not "keep" the original
term; it "keeps" whatever value is in the keep map, regardless of
case. The test case spells this out:

keep set = and

Test token: AnD

Assertion is that the result is And (forceFirstLetter = true)

This is not consistent with my notion of keep. I would think keep
should preserve the original token exactly as it came in, and the keep
list should be case-sensitive. Consider the word PhD: it is a prime
candidate, IMO, for the keep list, yet the current functionality would
return Phd.
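
To put the two readings side by side (the filter handles and assertion style here are hypothetical, just restating the example above in test form):

  // keep list contains "PhD", forceFirstLetter = true
  assertEquals("Phd", currentFilter.filter("PhD"));   // today: keep matched case-insensitively, token rewritten
  assertEquals("PhD", proposedFilter.filter("PhD"));  // proposed: case-sensitive keep, token preserved verbatim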

Should I open a bug for this, such that it can be tracked separately?

-Grant


Re: CapitilizationFilterFactory

Ryan McKinley

>
> keep set = and
>
> Test token: AnD
>
> Assertion is that the result is And (forceFirstLetter = true)
>
> This is not consistent with my notion of keep.  I would think keep
> should preserve the original token exactly as it came in and the keep
> list should be case sensitive.  Consider the word PhD.  This is a prime
> token, IMO, for a word that belongs in the keep list, yet, the current
> functionality would return Phd.
>
> Should I open a bug for this, such that it can be tracked separately?
>

Yes, I think you should open a JIRA issue.  CapitilizationFilterFactory
is new in 1.3, so we should make sure it has semantics we all think are
consistent before too long.

ryan



Re: CapitilizationFilterFactory

jjlarrea
In reply to this post by Grant Ingersoll-2
Beware... I just looked at CharArraySet at -rHEAD and it *modifies the input token* if ignoreCase is set:

  /** Add this char[] directly to the set.
   * If ignoreCase is true for this Set, the text array will be directly modified.
   * The user should never modify this text array after calling this method.
   */
  public boolean add(char[] text) {
    if (ignoreCase)
      for(int i=0;i<text.length;i++)
        text[i] = Character.toLowerCase(text[i]);
    int slot = getSlot(text, 0, text.length);

I'm not sure whether that affects your use for SOLR-468.
 
I wonder whether this design tradeoff was worth it; getHashCode(...) can already lowercase nondestructively while computing the hash code, so if the line in both equals(...) methods:
        if (Character.toLowerCase(text1[off+i]) != text2[i])
were modified to lowercase text2 as well, the destructive one-time lowercasing could be avoided. Sadly, there is no Character.equalsIgnoreCase to avoid the second method call.
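
A sketch of the nondestructive variant being suggested here, paraphrased rather than copied from the Lucene source (only one of the two equals(...) overloads is shown):

  private boolean equals(char[] text1, int off, int len, char[] text2) {
    if (len != text2.length)
      return false;
    for (int i = 0; i < len; i++) {
      // lowercase both sides of the comparison, so add() no longer has to
      // lowercase (and thereby modify) the stored array up front
      if (Character.toLowerCase(text1[off + i]) != Character.toLowerCase(text2[i]))
        return false;
    }
    return true;
  }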

- J.J.

At 12:45 PM -0500 1/31/08, Grant Ingersoll wrote:

>Scratch that.  CharArraySet has an ignoreCase option that I missed.
>
>-Grant


Re: CapitilizationFilterFactory

Yonik Seeley-2
In reply to this post by Grant Ingersoll-2
On Jan 31, 2008 2:03 PM, J.J. Larrea <[hidden email]> wrote:
> Beware... I just looked at CharArraySet at -rHEAD and it *modifies the input token* if ignoreCase is set:

This is only when *building* the set (which is just done once when
initializing the schema).  That code should just use add(String)
anyway...
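
For completeness, a sketch of that pattern; keepWords and ignoreCase stand in for whatever the factory pulls from its init args:

  // built once at schema-init time; add(String) copies the text, so no
  // caller-owned char[] is ever lowercased in place
  CharArraySet keepSet = new CharArraySet(keepWords.size(), ignoreCase);
  for (String w : keepWords) {
    keepSet.add(w);
  }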

-Yonik