Question on Tokenizing email address

Question on Tokenizing email address

Abhishek Srivastava
Hello Everyone,

I have a field in my Solr schema that stores email addresses. I want the
addresses to be tokenized as follows:
if the email address is [hidden email],
the user should be able to search on

1. [hidden email]  (whole address)
2. abc
3. def
4. alpha-xyz

Which tokenizer should I use?

Also, is there a feature like "Must Match" in Solr? My schema has a
field called "from", which contains the email address of the person who sent
an email. I don't want any tokenization for this field. When the user
issues a search, the user's email ID must exactly match the "from" field
value for that document/record to be returned.
How can I do this?

Regards,
Abhishek

Re: Question on Tokenizing email address

Jan Høydahl / Cominvent
Hi,

To match cases 1, 2, 3 and 4 in your list, you could use a fieldType based on TextField with just a simple WordDelimiterFilterFactory. However, this would also match abc-def, def.alpha, xyz-com and abc@def, because all punctuation is treated the same. To avoid this, you could do some custom handling of "-", "." and "@":

    <!-- An unstemmed text field optimized for emails -->
    <fieldType name="email" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <!-- Lower-case first, so the upper-case DOT and AT markers added below stay distinct from user text -->
        <filter class="solr.LowerCaseFilterFactory"/>
        <!-- Replace "." and "@" with explicit marker words -->
        <filter class="solr.PatternReplaceFilterFactory" pattern="\." replacement=" DOT " replace="all" />
        <filter class="solr.PatternReplaceFilterFactory" pattern="@" replacement=" AT " replace="all" />
        <!-- Split into word and number parts; no catenation, no splitting on case change -->
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0"/>
      </analyzer>
    </fieldType>

You will see that this splits "[hidden email]" into "foo DOT bar AT apache DOT org" on both the index side and the query side, and thus avoids the false matches listed above.

To support the "must match" case, you could use the "lowercase" fieldtype, which will give a case insensitive match for the whole content of the field only.
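
For reference, the "lowercase" type in the example schema.xml that ships with Solr looks roughly like the sketch below: KeywordTokenizerFactory keeps the whole field value as a single token, and LowerCaseFilterFactory makes the match case-insensitive. The declaration of the "from" field and its indexed/stored attributes are assumptions for illustration; adjust them to your own schema.

```xml
<!-- Whole-value, case-insensitive matching: the entire field becomes one lower-cased token -->
<fieldType name="lowercase" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Assumed declaration of the sender field from the question -->
<field name="from" type="lowercase" indexed="true" stored="true"/>
```

With this type, a query on "from" only matches documents whose complete field value equals the query value, ignoring case. If a case-sensitive exact match is wanted instead, the plain "string" type is the usual alternative.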

--
Jan Høydahl  - search architect
Cominvent AS - www.cominvent.com

In reply to Abhishek Srivastava's message of 9 Feb 2010, 18:13.


Re: Question on Tokenizing email address

Abhishek Srivastava
Thank you! It works very well.

I think the field type you suggested will also index words like DOT, AT and com.

To prevent these words from being indexed, I have changed the field type to:

<fieldType name="email" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="\." replacement=" DOT " replace="all" />
    <filter class="solr.PatternReplaceFilterFactory" pattern="@" replacement=" AT " replace="all" />
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
  </analyzer>
</fieldType>

I have added the words "dot" and "com" to the stopword file ("at" was already there).

Is this correct?

Re: Question on Tokenizing email address

Jan Høydahl / Cominvent
My point is that I WANT the AT and DOT tokens to be indexed, so that these two are not treated the same: [hidden email] and foo-bar.brown.fox
By placing the LowerCaseFilterFactory before the replacements, you ensure that a search for email:at will not give a match: the query term is lower-cased to "at", while the inserted marker is indexed as the upper-case term "AT". For the same reason I would not add the special tokens to the stopword list either, as you DO want them in the index.

--
Jan Høydahl  - search architect
Cominvent AS - www.cominvent.com

In reply to abhishes's message of 10 Feb 2010, 08:34.