Letter-number transitions - can this be turned off

F Knudson
Is there a flag to disable the letter-number transition in the solr.WordDelimiterFilterFactory?  We are indexing category codes and thesaurus codes, for which this letter-number transition makes no sense.  It is bloating the index (which is already large).

Thanks
F Knudson

Re: Letter-number transitions - can this be turned off

Mike Klaas
On 30-Sep-07, at 12:47 PM, F Knudson wrote:

> Is there a flag to disable the letter-number transition in the
> solr.WordDelimiterFilterFactory?  We are indexing category codes and
> thesaurus codes, for which this letter-number transition makes no
> sense.  It is bloating the index (which is already large).

Have you considered using a different analyzer?

If you want to continue using WDF, you could make a quick change
around line 320:

             if (splitOnCaseChange == 0 &&
                 (lastType & ALPHA) != 0 && (type & ALPHA) != 0) {
               // ALPHA->ALPHA: always ignore if case isn't considered.

             } else if ((lastType & UPPER)!=0 && (type & LOWER)!=0) {
               // UPPER->LOWER: Don't split
             } else {

            ...

by adding a clause that catches ALPHA -> NUMERIC (and vice versa) and  
ignores it.
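
To make the idea concrete, here is a minimal standalone sketch of such a clause. It is not the Solr source: the class name TransitionSketch, the splitOnLetterNumber flag, the DIGIT constant and the simplified character typing are all assumptions made for illustration, and the real filter does more (catenation, position handling) than this shows.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified model of the type-transition check.
// None of these names are taken from WordDelimiterFilter itself.
public class TransitionSketch {
    static final int LOWER = 1, UPPER = 2, DIGIT = 4;
    static final int ALPHA = LOWER | UPPER;

    // Returns 0 for anything that is neither a letter nor a digit (a delimiter).
    static int charType(char c) {
        if (Character.isLowerCase(c)) return LOWER;
        if (Character.isUpperCase(c)) return UPPER;
        if (Character.isDigit(c)) return DIGIT;
        return 0;
    }

    static List<String> split(String s, boolean splitOnLetterNumber) {
        List<String> out = new ArrayList<>();
        StringBuilder cur = new StringBuilder();
        int lastType = 0;
        for (char c : s.toCharArray()) {
            int type = charType(c);
            if (type == 0) {
                // Delimiter: close the current part and start fresh.
                if (cur.length() > 0) { out.add(cur.toString()); cur.setLength(0); }
                lastType = 0;
                continue;
            }
            boolean boundary;
            if (lastType == 0 || lastType == type) {
                boundary = false;               // start of a part, or same type
            } else if ((lastType & UPPER) != 0 && (type & LOWER) != 0) {
                boundary = false;               // UPPER->LOWER: don't split
            } else if (!splitOnLetterNumber
                    && (((lastType & ALPHA) != 0 && (type & DIGIT) != 0)
                     || ((lastType & DIGIT) != 0 && (type & ALPHA) != 0))) {
                boundary = false;               // the added clause: ALPHA<->DIGIT ignored
            } else {
                boundary = true;
            }
            if (boundary) { out.add(cur.toString()); cur.setLength(0); }
            cur.append(c);
            lastType = type;
        }
        if (cur.length() > 0) out.add(cur.toString());
        return out;
    }
}
```

With the flag off, a code such as "AB12-cd3" stays as the two parts "AB12" and "cd3"; with it on, every letter-digit boundary starts a new part.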

Another approach that I am using locally is to maintain the  
transitions, but force tokens to be a minimum size (so r2d2 doesn't  
tokenize to four tokens but arrr2222deee2222 does).
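
As a rough illustration of the minimum-size idea, the sketch below keeps a token whole unless every letter or digit run reaches a threshold. This is only one possible reading of the approach and is not the logic of the SOLR-293 patch; the class MinSizeSketch, the method splitWithMinSize and the minPartLength parameter are hypothetical, and the input is assumed to contain only letters and digits.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch: split on letter/digit transitions only when
// every resulting part is at least minPartLength characters long.
public class MinSizeSketch {
    static List<String> splitWithMinSize(String token, int minPartLength) {
        List<String> parts = new ArrayList<>();
        int start = 0;
        for (int i = 1; i <= token.length(); i++) {
            boolean atEnd = i == token.length();
            // A run ends at the end of the token or where digit-ness changes.
            if (atEnd || Character.isDigit(token.charAt(i))
                      != Character.isDigit(token.charAt(i - 1))) {
                parts.add(token.substring(start, i));
                start = i;
            }
        }
        for (String p : parts) {
            if (p.length() < minPartLength) {
                // A part is too short: keep the original token unsplit.
                return Collections.singletonList(token);
            }
        }
        return parts;
    }
}
```

With a threshold of 2, "r2d2" comes back as a single token, while "arrr2222deee2222" is split into its four runs.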

There is a patch here: http://issues.apache.org/jira/browse/SOLR-293

If you vote for it, I promise to get it in for 1.3 <g>

-Mike

Re: Letter-number transitions - can this be turned off

F Knudson
Thanks for your helpful suggestions.

I have considered other analyzers but WDF has great strengths.  I will experiment with maintaining transitions and then consider modifying the code.

F. Knudson
