hithighlighter bug

Jason Eacott-2
Hi all,
        I have come across what I think is a curious but insidious bug in the
Java Lucene hit highlighter. I first found this problem in Lucene v1.4, so
I updated to the latest version of Lucene and the highlighter;
unfortunately it's still there in v2.0.0.

I am indexing XML documents and am also using the hit highlighter for
search results. This works perfectly in almost every case except for one.

In my code I have this:

public class LuceneSearch implements org.apache.lucene.search.highlight.Formatter
{
...
        // Wrap each matching term in <em>...</em>; leave non-matching text as is.
        public String highlightTerm(String originalText, TokenGroup group)
        {
                if (group.getTotalScore() <= 0)
                {
                        return originalText;
                }
                return "<em>" + originalText + "</em>";
        }
}

When I search for -> Acquisition Plan <-
my search results contain:
<summary>(ancilliary stuff deleted)....
attached to the <em>Acquisition</em>
< em>Plan</em>and signed</summary>

Notice the space between the < and the e in the second < em>.
This only occurs for these search terms and this document (as far as
I know), but because it's part of a much larger XML document it breaks the
whole thing.

The original XML is unremarkable, with no strange characters surrounding
these terms. Here is a snippet of the paragraph from which the
highlighted terms come:

-> attached to the Acquisition Plan and signed off<-

Has anyone seen anything like this before? Is this a genuine new bug or
something of which the Lucene folk (or at least whoever wrote the
highlighter) are aware? Can anyone think of a way to fix this without
scanning every element in my result text for rogue spaces?
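
Until the cause is found, one possible stopgap is a single regex pass over
each highlighted fragment that repairs only the formatter's own start tag.
This is a minimal sketch, not a fix for the underlying problem: the class
name EmTagRepair and the method repair are hypothetical, it uses only
java.util.regex, and it assumes a literal "<" followed by whitespace and
"em>" never occurs legitimately in the fragment text.

```java
import java.util.regex.Pattern;

// Hypothetical helper (not part of Lucene): repairs "< em>" produced in a
// highlighted fragment by removing the whitespace after '<'.
class EmTagRepair {
    // '<', then one or more whitespace characters, then "em>".
    private static final Pattern BROKEN_EM_START = Pattern.compile("<\\s+em>");

    static String repair(String fragment) {
        return BROKEN_EM_START.matcher(fragment).replaceAll("<em>");
    }

    public static void main(String[] args) {
        String broken = "attached to the <em>Acquisition</em>\n< em>Plan</em>and signed";
        System.out.println(repair(broken));
    }
}
```

The pass would be applied to the fragment string returned by the
highlighter, before it is embedded in the larger XML document.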

Thanks in advance
Jason.






---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Re: hithighlighter bug

steve_rowe
Jason wrote:
> Hi all,
>     I have come across what I think is a curious but insidious bug with
> the java lucene hit highlighter.
[...]
> when I search for -> Acquisition Plan <-
> in my search results I get:
> <summary>(ancilliary stuff deleted)....
> attached to the <em>Acquisition</em>
> < em>Plan</em>and signed</summary>
>
> notice the space between the < and e in the second < em>

Sorry, Jason, I don't have a solution for you, but in case there's any
question about whether "< em>" is well-formed XML/XHTML/HTML:

1. It is not well-formed XML (and thus cannot be well-formed XHTML) -
from <http://www.w3.org/TR/xml/#sec-starttags>:

  [40] STag ::= '<' Name (S Attribute)* S? '>'
   [5] Name ::= (Letter | '_' | ':') (NameChar)*

("Letter" & "NameChar" declarations omitted - suffice to say whitespace
is excluded.)
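
The STag production can be checked empirically by handing both forms to a
conforming XML parser. The following is a minimal sketch using the JDK's
built-in JAXP parser; the class name WellFormedCheck and the method
isWellFormed are made up for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: reports whether a string parses as well-formed XML.
class WellFormedCheck {
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            // Any well-formedness violation surfaces as a SAXException.
            return false;
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String good = "<summary>the <em>Acquisition</em> <em>Plan</em>and signed</summary>";
        String bad  = "<summary>the <em>Acquisition</em> < em>Plan</em>and signed</summary>";
        System.out.println("good: " + isWellFormed(good));
        System.out.println("bad:  " + isWellFormed(bad));
    }
}
```

A check like this could also be run on each highlighted fragment before it
is merged into a larger document, so that one bad fragment does not break
the whole result.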


2. AFAICT (IANASG), SGML (and hence the [pre-XHTML] HTML profiles of it)
disallows space chars between the '<' and the element name (a.k.a.
"generic identifier") - from
<http://www.oasis-open.org/cover/sgmlsyn/sgmlsyn.htm#C7.4>:

 [14] start-tag =
        ( stago ,                                   (delimiter "<")
          document type specification [28] ,
          generic identifier specification [29] ,
          attribute specification list [31] ,
          s [5] *,
          tagc ) |                                  (delimiter ">")
          minimized start-tag [15]
 [29] generic identifier specification =
          generic identifier [30] | rank stem [120]
 [30] generic identifier = name [55]
[120] rank stem = name [55]
 [55] name = name start character [53] , name character [52] *

(Note 1: "name start character" & "name character" declarations omitted -
suffice to say whitespace is excluded.)

(Note 2: "document type specification" declaration omitted, because all
HTML profiles include the "CONCUR NO" option, thus excluding this syntax.)

(Note 3: "minimized start-tag" declaration omitted, because although all
HTML profiles include the "SHORTTAG YES" option, the
element-minimization aspects of this option [as distinct from attribute
minimization, e.g. omitted and unquoted attribute values] are not
supported by mainstream browsers; in any case, whitespace is disallowed
prior to generic identifiers in all of the minimized start tag forms.)


3. Firefox 2.0.0.1 and IE 7.0 on WinXP both render "< em>...</em>" as
literal "< em>..." - the (malformed) start tag is rendered as non-markup
plain text, and the close tag is not displayed.


Steve
