Dear list,

I am trying to use the Term Highlighter in my webapp, but I have a problem. I
want to highlight the terms in a text without extracting the most relevant
fragments. The highlighting works, but the last characters are trimmed!

Here is a portion of my code:

  Analyzer analyzer = new StandardAnalyzer();
  Query query = null;
  try {
   query = QueryParser.parse(queryStr, "scientificName", analyzer);
   query = query.rewrite(IndexReader.open("E:/specimenset-index"));
  } catch (ParseException e) {
   // TODO Auto-generated catch block
  } catch (IOException e) {
   // TODO Auto-generated catch block
  }

  QueryScorer scorer = new QueryScorer(query);
  SimpleHTMLFormatter formatter = new SimpleHTMLFormatter(
    "<span class=\"highlight\">", "</span>");

  Highlighter highlighter = new Highlighter(formatter, scorer);

  TokenStream tokenStream = analyzer.tokenStream("scientificName",
    new StringReader(text));

  String highlightedText = null;

  try {
   highlightedText = highlighter.getBestFragment(
    tokenStream, text);
  } catch (IOException e1) {
   // TODO Auto-generated catch block
  }
  return highlightedText;

An example value for the text variable is:
    <a href='taxoninfo.html?id=112'><span class='genus-species'>Capparimyia
savastani</span> (Martelli)</a>

The corresponding value of the highlightedText variable is:
    <a href='taxoninfo.html?id=112'><span class='genus-species'><span
class="highlight">Capparimyia</span> savastani</span> (Martelli

The trailing ")</a>" is trimmed for some mysterious reason! I tried playing
with the Encoder and Fragmenter classes, but without success.

Any help would be appreciated.

Best regards,


Johan Duflost
Belgian Biodiversity Information Facility (BeBIF)
Universite Libre de Bruxelles

mark harwood
Hi Johan,
To avoid selecting fragments see here:

Be aware, though, that the highlighter is really
designed to decorate plain text by adding highlight
tags. If your text already includes HTML mark-up,
it becomes hard to insert highlighter mark-up
in a way that guarantees a legal
document. To work reliably, the highlighter would need
to understand the structure of the existing tags, e.g. to
understand that the "height" attribute in an image tag
should not be marked up if the user searched for
"height". You may also have to concern yourself with
handling badly marked-up content (i.e. most web pages)
where tags are not always closed. This level of
functionality is beyond the scope of the highlighter.
If you want to preserve existing mark-up, someone made
reference to a custom solution for handling this here:
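
Separately from that solution, here is a minimal sketch of the idea in
plain Java, with no Lucene involved: it decorates matches only in the text
between tags and copies the tags themselves through untouched. The class and
method names (TagAwareHighlight, highlight) are made up for illustration. It
assumes every tag is closed with '>' and that no attribute value contains a
'>', so it would not cope with the badly marked-up pages described above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lucene: highlights a term in text nodes only.
final class TagAwareHighlight {

    // Assumption: a tag starts with '<', ends at the next '>', and has no '>' inside.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    /** Wraps whole-word, case-insensitive matches of term found outside tags in pre/post. */
    static String highlight(String html, String term, String pre, String post) {
        Pattern termPattern = Pattern.compile(
                "\\b" + Pattern.quote(term) + "\\b", Pattern.CASE_INSENSITIVE);
        StringBuilder out = new StringBuilder();
        Matcher tags = TAG.matcher(html);
        int pos = 0;
        while (tags.find()) {
            // Decorate the text that precedes this tag, then copy the tag verbatim.
            out.append(decorate(html.substring(pos, tags.start()), termPattern, pre, post));
            out.append(tags.group());
            pos = tags.end();
        }
        // Decorate whatever text follows the last tag.
        out.append(decorate(html.substring(pos), termPattern, pre, post));
        return out.toString();
    }

    // Wraps every match in one tag-free text segment; all other characters are kept as-is.
    private static String decorate(String text, Pattern termPattern, String pre, String post) {
        Matcher m = termPattern.matcher(text);
        StringBuilder sb = new StringBuilder();
        int pos = 0;
        while (m.find()) {
            sb.append(text, pos, m.start()).append(pre).append(m.group()).append(post);
            pos = m.end();
        }
        return sb.append(text, pos, text.length()).toString();
    }
}
```

Called with the text value from the question and the term "capparimyia",
this should keep the trailing ")</a>", since everything outside a match is
copied verbatim. A search for "height" would leave an image tag's height
attribute alone for the same reason.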


