Tokenizing and searching named character entity references


F Knudson
Greetings:

I am working with many different data sources - some sources employ "entity references"; others do not.  My goal is to make searching across sources as consistent as possible.

Example text -

Source1:   weakening H&delta; absorption
Source1:   zero-field gap &omega;

Source2:  weakening H delta absorption
Source2:  zero-field gap omega

When I use the tokenizer solr.HTMLStripWhitespaceTokenizerFactory for Source1, each entity reference is replaced with the character it names.

This works great.  

But I want the search tokens to be identical for each source.  I need to capture &delta; as a token.


<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.HTMLStripWhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.ISOLatin1AccentFilterFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
 
Is this possible with the tokenizers supplied with Solr?  I experimented with different combinations and orders but was not successful.

Is this possible using synonyms?  I also experimented with this route but again was not successful.

Do I need to create a custom tokenizer?

Thanks
Frances

RE: Tokenizing and searching named character entity references

steve_rowe
Hi Frances,

HTMLStripWhitespaceTokenizerFactory wraps a WhitespaceTokenizer around an HTMLStripReader.

You could extend HTMLStripReader so that it does not decode named character entities. For example, override HTMLStripReader.read() to call an alternative readEntity() that leaves named entity references as-is instead of converting them to characters, something like:

public class MyHTMLStripReader extends HTMLStripReader {

  ///// override read() to call myReadEntity(), but no other changes
  public int read() throws IOException {
    ...
    switch (ch) {
      case '&':
        saveState();
        ch = myReadEntity(); ///// Change this line to call new method
        if (ch>=0) return ch;
        if (ch==MISMATCH) {
          restoreState();
          return '&';
        }
        break;
      ...
    }
  }

  private int myReadEntity() throws IOException {
    int ch = next();
    if (ch=='#') return readNumericEntity();
    return MISMATCH;  ///// Always a mismatch, except for numeric entities
  }
}

Then you could create a new Factory, something like:

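import java.io.Reader;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;
import org.apache.solr.analysis.BaseTokenizerFactory;

public class MyHTMLStripWhitespaceTokenizerFactory extends BaseTokenizerFactory {
  // Same as the stock factory, but reads through the entity-preserving reader
  public TokenStream create(Reader input) {
    return new WhitespaceTokenizer(new MyHTMLStripReader(input));
  }
}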
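
To see why leaving the named entity references in place could help, here is a minimal plain-Java sketch. It does not use any Solr or Lucene classes. The class and method names (EntityTokenSketch, roughTokens) are made up for illustration, and the splitting rule is only a rough stand-in for a whitespace tokenizer followed by word-part splitting and lowercasing. It assumes the later filters split on non-alphanumeric characters such as '&' and ';', and it does not emit the catenated tokens that catenateWords="1" would add.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Illustration only: a rough approximation, not Solr's actual analysis chain.
class EntityTokenSketch {

    // Split on whitespace, then split each word on runs of characters that
    // are neither letters nor digits, and lowercase every resulting part.
    static List<String> roughTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (String word : text.trim().split("\\s+")) {
            for (String part : word.split("[^\\p{L}\\p{N}]+")) {
                if (!part.isEmpty()) {            // skip empty leading pieces
                    tokens.add(part.toLowerCase(Locale.ROOT));
                }
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Source1 style, with the named entity reference left undecoded
        System.out.println(roughTokens("weakening H&delta; absorption"));
        // prints [weakening, h, delta, absorption]

        // Source2 style, with the entity name written out as a word
        System.out.println(roughTokens("weakening H delta absorption"));
        // prints [weakening, h, delta, absorption]
    }
}
```

Under those assumptions both source styles end up with the same word parts, which is the point of not decoding the entity in the reader.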

Steve

On 07/24/2008 at 9:53 AM, F Knudson wrote:

>
> Greetings:
>
> I am working with many different data sources - some source
> employ "entity references" ; others do not.  My goal is to
> make the searching across sources as consistent as possible.
>
> Example text -
>
> Source1:   weakening H&delta; absorption
> Source1:   zero-field gap &omega;
>
> Source2:  weakening H delta absorption
> Source2:  zero-field gap omega
>
> Using the tokenizer solr.HTMLStripWhitespaceTokenizerFactory
> for Source1 - the entity is replaced with the "named character
> entity" - This works great.
>
> But I want the searching tokens to be identical for each
> source.  I need to capture &delta;  as a token.
>
> <fieldType name="text" class="solr.TextField"
> positionIncrementGap="100">
>       <analyzer type="index">
>        <tokenizer class="solr.HTMLStripWhitespaceTokenizerFactory"/>
>        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
> ignoreCase="true" expand="true"/>
>        <filter class="solr.ISOLatin1AccentFilterFactory"/>
>        <filter class="solr.WordDelimiterFilterFactory"
> generateWordParts="1" generateNumberParts="1" catenateWords="1"
> catenateNumbers="1" catenateAll="0"/>
>         <filter class="solr.LowerCaseFilterFactory"/>
>         <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>       </analyzer>
> </fieldType>
>
> Is this possible with the SOLR supplied tokenizers?  I
> experimented with different combinations and orders and was
> not successful.
>
> Is this possible using synonyms?  I also experimented with
> this route but again was not successful.
>
> Do I need to create a custom tokenizer?
>
> Thanks
> Frances

RE: Tokenizing and searching named character entity references

hossman

: You could extend HTMLStripReader to not decode named character entities,
: e.g. by overriding HTMLStripReader.read() so that it calls an
: alternative readEntity(), which instead of converting entity references
: to characters would just leave the entity references as-is, something
: like:

Alternatively: use SynonymFilterFactory to map the entity "names" to the
real Unicode characters, so that your "Source2" style docs get "omega"
replaced with the same character the HTMLStrip*TokenizerFactories generate
when they encounter the HTML entities.

Generating the list of synonyms from the comment at the end of
HTMLStripReader.java should be easy.
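
A minimal sketch of that generation step, in plain Java. The class name (EntitySynonymSketch) and the two-entry table are made up for illustration. A real list would be built from the full entity table, which is not reproduced here. The output uses the "name => replacement" explicit-mapping form of a Solr synonyms file.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustration only: emits synonyms.txt lines from an entity-name table.
class EntitySynonymSketch {

    // Hypothetical hand-picked subset; not the full entity list.
    static Map<String, String> sampleEntities() {
        Map<String, String> entities = new LinkedHashMap<>();
        entities.put("delta", "\u03B4"); // GREEK SMALL LETTER DELTA
        entities.put("omega", "\u03C9"); // GREEK SMALL LETTER OMEGA
        return entities;
    }

    // One explicit mapping line: the written-out name is replaced by the character.
    static String synonymLine(String name, String character) {
        return name + " => " + character;
    }

    // All mapping lines, one per entity, in insertion order.
    static String synonymsFile(Map<String, String> entities) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : entities.entrySet()) {
            sb.append(synonymLine(e.getKey(), e.getValue())).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Redirect this output into synonyms.txt
        System.out.print(synonymsFile(sampleEntities()));
    }
}
```

One thing to watch: some entity names differ only by case (for example delta and Delta name the small and the capital letter), so with ignoreCase="true" on the synonym filter those entries would likely collide.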


: > Source1:   weakening H&delta; absorption
: > Source1:   zero-field gap &omega;
: >
: > Source2:  weakening H delta absorption
: > Source2:  zero-field gap omega



-Hoss