WordDelimiterFilterFactory and StandardTokenizer

WordDelimiterFilterFactory and StandardTokenizer

Bob Laferriere

 
I am seeing odd behavior from WordDelimiterFilterFactory  (WDFF) when used in conjunction with StandardTokenizerFactory (STF).
 
If I use the following configuration:
 
<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="0" catenateWords="1" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt" enablePositionIncrements="true"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms_index.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="0" catenateWords="1" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt" enablePositionIncrements="true"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms_query.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>
 
 
I see the following results for the document “wi-fi”:

Index: “wi”, “fi”
Query: “wi”, “fi”, “wifi”

The documentation seems to indicate that I should see the same results in either case, as the WDFF handles the generation of word parts. But the catenation of words does not seem to work with a StandardTokenizer? If I switch to the WhitespaceTokenizerFactory in the index analyzer, I get the following:

Index: “wi”, “fi”, “wifi”

I checked all the documentation and did not find any indication of a conflict between using WDFF with STF versus WDFF with the WhitespaceTokenizer. I assume it is because STF splits on the hyphen first, before passing tokens to the filter chain?
 
_______________________________________
Robert J. Laferriere 
Director of Software Technology, Corporate Information Services
Chief Software Architect

Direct Supply . 6767 N Industrial Rd  Milwaukee, WI  53223
office 414-760-5833 . mobile 414-721-1092 . fax 877-282-5285
[hidden email] . www.directsupply.com


Re: WordDelimiterFilterFactory and StandardTokenizer

Shawn Heisey-4
On 4/16/2014 8:37 PM, Bob Laferriere wrote:
>> I am seeing odd behavior from WordDelimiterFilterFactory  (WDFF) when
>> used in conjunction with StandardTokenizerFactory (STF).

<snip>

>> I see the following results for the document “wi-fi”:
>>
>> Index: “wi”, “fi”
>> Query: “wi”, “fi”, “wifi”
>>
>> The documentation seems to indicate that I should see the same results
>> in either case, as the WDFF handles the generation of word parts.
>> But the catenation of words does not seem to work with a
>> StandardTokenizer?

The standard tokenizer breaks things up by punctuation, so when it hits
WDFF, there's nothing for it to do.  The following page links to a
Unicode document that explains how it all works:

http://lucene.apache.org/core/4_7_0/analyzers-common/org/apache/lucene/analysis/standard/StandardTokenizer.html

If you use the Analysis page in the Solr admin UI, you can see how the
analysis works at each step.

https://cwiki.apache.org/confluence/display/solr/Analysis+Screen

Thanks,
Shawn


Re: WordDelimiterFilterFactory and StandardTokenizer

Jack Krupansky-2
Typically the white space tokenizer is the best choice when the word
delimiter filter will be used.

-- Jack Krupansky
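A minimal sketch of that suggestion applied to the index analyzer from the original message; the word delimiter settings below are copied from that config, and the rest of the filter chain is omitted here:

<analyzer type="index">
  <!-- the whitespace tokenizer keeps "wi-fi" as a single token -->
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <!-- WDFF then splits it into "wi"/"fi", and catenateWords adds "wifi" -->
  <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="0" catenateWords="1" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
</analyzer>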

-----Original Message-----
From: Shawn Heisey
Sent: Wednesday, April 16, 2014 11:03 PM
To: [hidden email]
Subject: Re: WordDelimiterFilterFactory and StandardTokenizer

<snip>


Re: WordDelimiterFilterFactory and StandardTokenizer

aiguofer
Jack Krupansky-2 wrote
Typically the white space tokenizer is the best choice when the word
delimiter filter will be used.

-- Jack Krupansky
If we wanted to keep the StandardTokenizer (because we make use of the token types) but wanted to use the WDFF to get combinations of words that are split with certain characters (mainly - and /, but possibly others as well), what is the suggested way of accomplishing this? Would we just have to extend the JFlex file for the tokenizer and re-compile it?

Re: WordDelimiterFilterFactory and StandardTokenizer

iorixxx
Hi Aiguofer,

You mean ClassicTokenizer? Because StandardTokenizer does not set token types (e-mail, URL, etc.).


I wouldn't go with the JFlex edit, mainly because of maintenance costs. It will be a burden to maintain a custom tokenizer.

MappingCharFilters could be used to manipulate tokenizer behavior.

For example, if you don't want your tokenizer to break on hyphens, replace the hyphen with a character that your tokenizer does not break on, such as an underscore:

"-" => "_"

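A minimal sketch of that approach, assuming a mapping file placed next to schema.xml (the file name mapping-hyphen.txt is just an example); char filters run before the tokenizer, so StandardTokenizer never sees the hyphen:

# mapping-hyphen.txt (hypothetical file name)
"-" => "_"

<analyzer type="index">
  <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-hyphen.txt"/>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <!-- "wi-fi" reaches the tokenizer as "wi_fi" and stays a single token; -->
  <!-- WDFF still treats "_" as a delimiter, so it emits "wi", "fi", and catenateWords adds "wifi" -->
  <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" catenateWords="1" splitOnCaseChange="1"/>
</analyzer>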


Plus, WDF can be customized too. Please see the types attribute:

http://svn.apache.org/repos/asf/lucene/dev/trunk/solr/core/src/test-files/solr/collection1/conf/wdftypes.txt
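For reference, wiring up the types attribute might look like the following; the file name and the particular mappings are only illustrative (the allowable type names are documented in the linked file):

# wdftypes.txt (illustrative)
# reclassify "-" and "/" as ALPHA so WDF keeps them inside a token instead of splitting on them
- => ALPHA
/ => ALPHA

<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" catenateWords="1" splitOnCaseChange="1" types="wdftypes.txt"/>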

 
Ahmet


On Friday, May 16, 2014 6:24 PM, aiguofer <[hidden email]> wrote:
<snip>


Re: WordDelimiterFilterFactory and StandardTokenizer

Shawn Heisey-4
In reply to this post by aiguofer
On 5/16/2014 9:24 AM, aiguofer wrote:

> Jack Krupansky-2 wrote
>> Typically the white space tokenizer is the best choice when the word
>> delimiter filter will be used.
>>
>> -- Jack Krupansky
>
> If we wanted to keep the StandardTokenizer (because we make use of the token
> types) but wanted to use the WDFF to get combinations of words that are
> split with certain characters (mainly - and /, but possibly others as well),
> what is the suggested way of accomplishing this? Would we just have to
> extend the JFlex file for the tokenizer and re-compile it?

You can use the ICUTokenizer instead, and pass it a special rulefile
that makes it only break Latin characters on whitespace instead of all
the usual places.  This is exactly what I do in my index.

In the Solr source code, you can find this special rulefile at the
following path:

lucene/analysis/icu/src/test/org/apache/lucene/analysis/icu/segmentation/Latin-break-only-on-whitespace.rbbi

You would place the rule file in the same location as schema.xml, and
then use this in your fieldType:

<tokenizer class="solr.ICUTokenizerFactory"
rulefiles="Latn:Latin-break-only-on-whitespace.rbbi"/>

Note that the ICUTokenizer requires that you add contrib jars to your
Solr install -- the required jars and a README outlining which files you
need are included in the Solr download in solr/contrib/analysis-extras.

http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.ICUTokenizerFactory
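For reference, a sketch of a complete fieldType along those lines; the fieldType name and the WDFF settings are carried over from the config earlier in the thread, and the remaining filters are omitted:

<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- ICU tokenizer with the custom rule file: Latin text breaks only on whitespace -->
    <tokenizer class="solr.ICUTokenizerFactory" rulefiles="Latn:Latin-break-only-on-whitespace.rbbi"/>
    <!-- "wi-fi" now reaches WDFF intact, which splits it and catenateWords adds "wifi" -->
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="0" catenateWords="1" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>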

Thanks,
Shawn


Re: WordDelimiterFilterFactory and StandardTokenizer

aiguofer
In reply to this post by iorixxx
Great, thanks for the information! Right now we're using the StandardTokenizer types to filter out CJK characters with a custom filter. I'll test using MappingCharFilters, although I'm a little concerned about possible adverse scenarios.

Diego Fernandez - 爱国
Software Engineer
US GSS Supportability - Diagnostics


----- Original Message -----

<snip>

Re: WordDelimiterFilterFactory and StandardTokenizer

iorixxx
Hi Diego,

Did you miss Shawn's response? His ICUTokenizerFactory solution is better than mine. 

By the way, what Solr version are you using? Does StandardTokenizer set the type attribute for CJK words?

To filter out given types, you do not need a custom filter. The TypeTokenFilter serves exactly that purpose:
https://cwiki.apache.org/confluence/display/solr/Filter+Descriptions#FilterDescriptions-TypeTokenFilter
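A minimal sketch of that, assuming a types file next to schema.xml (the file name stoptypes.txt and its contents are only illustrative):

# stoptypes.txt (illustrative) -- StandardTokenizer token types to remove
<IDEOGRAPHIC>
<HIRAGANA>
<KATAKANA>
<HANGUL>
<SOUTHEAST_ASIAN>

<!-- by default the listed types are removed; set useWhitelist="true" to keep only those types instead -->
<filter class="solr.TypeTokenFilterFactory" types="stoptypes.txt"/>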



On Tuesday, May 20, 2014 5:50 PM, Diego Fernandez <[hidden email]> wrote:
<snip>

Re: WordDelimiterFilterFactory and StandardTokenizer

aiguofer
Hey Ahmet,

Yeah, I had missed Shawn's response; I'll have to give that a try as well. As for the version, we're using 4.4. StandardTokenizer sets the type for HANGUL, HIRAGANA, IDEOGRAPHIC, KATAKANA, and SOUTHEAST_ASIAN, and you're right, we're using the TypeTokenFilter to remove those.

Diego Fernandez - 爱国
Software Engineer
US GSS Supportability - Diagnostics


----- Original Message -----

<snip>