Highlighting problems with HTML tagged fields

Highlighting problems with HTML tagged fields

Andrew May
Hi,

I'm indexing some content that contains HTML markup, and this seems to throw off the
highlighting somehow.

Example title field:

<SUP>40</SUP>Ar/<SUP>39</SUP>Ar laserprobe dating of mylonitic fabrics in a polyorogenic
terrane of NW Iberia

If I search for title:fabrics and turn highlighting on, the highlighted version has the
<em> tags in the wrong place: 22 characters to the left of where they should be (i.e. the
sum of the lengths of the tags).
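
The 22-character figure can be checked with a rough sketch in plain Java. It uses no Solr or Lucene classes; the class name, the method names and the naive tag regex are all made up for illustration and are not how HTMLStripReader works internally. It only shows that an offset computed on the stripped title lands too far left when applied to the original title, by exactly the total length of the removed tags.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of Solr or Lucene.
class OffsetShiftDemo {
    // Naive tag pattern (assumption): good enough for this title, not a real HTML parser.
    static final Pattern TAG = Pattern.compile("<[^>]+>");

    /** Removes every tag, so all later characters move left. */
    static String stripTags(String html) {
        return TAG.matcher(html).replaceAll("");
    }

    /** Sums the lengths of all tags found in the input. */
    static int totalTagLength(String html) {
        int total = 0;
        Matcher m = TAG.matcher(html);
        while (m.find()) {
            total += m.end() - m.start();
        }
        return total;
    }

    /** Difference between a term's offset in the original and in the stripped text. */
    static int shiftFor(String html, String term) {
        return html.indexOf(term) - stripTags(html).indexOf(term);
    }

    public static void main(String[] args) {
        String title = "<SUP>40</SUP>Ar/<SUP>39</SUP>Ar laserprobe dating of mylonitic fabrics"
                + " in a polyorogenic terrane of NW Iberia";
        int strippedStart = stripTags(title).indexOf("fabrics");

        System.out.println("total tag length: " + totalTagLength(title));
        System.out.println("offset shift:     " + shiftFor(title, "fabrics"));
        // Applying the stripped-text offset to the original text picks the wrong span.
        System.out.println("wrongly selected: '"
                + title.substring(strippedStart, strippedStart + "fabrics".length()) + "'");
    }
}
```

Both printed numbers are 22 for this title: two <SUP> tags at 5 characters each plus two </SUP> tags at 6 characters each.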

Because I don't want the tags indexed I'm using a modified version of the "text" field
type that uses the HTMLStripWhitespaceTokenizerFactory instead of the normal
WhitespaceTokenizerFactory. I've tried using this tokenizer just when indexing, or both
when indexing and querying, but both do the same thing.

There's no problem if I use the normal WhitespaceTokenizerFactory, but then it's possible
to search the tags and find matches, which isn't ideal.

The closest related thread I can find on the Lucene mailing list is the one below, and it
seems to suggest that this ought to work:

http://www.gossamer-threads.com/lists/lucene/java-user/14981?search_string=HTML%20strip;#14981

Thanks,

Andrew

Re: Highlighting problems with HTML tagged fields

Yonik Seeley-2
On 7/28/06, Andrew May <[hidden email]> wrote:
> Because I don't want the tags indexed I'm using a modified version of the "text" field
> type that uses the HTMLStripWhitespaceTokenizerFactory instead of the normal
> WhitespaceTokenizerFactory.

HTMLStripWhitespaceTokenizerFactory works in two phases...
HTMLStripReader removes the HTML and passes the result to
WhitespaceTokenizer... at that point, Tokens are generated, but the
offsets will correspond to the text after HTML removal, not before.

I did it this way so that HTMLStripReader could go before any
tokenizer (like StandardTokenizer).

Can you open a JIRA bug for this?  The fix would be a special version
of HTMLStripReader integrated with a WhitespaceTokenizer to keep
offsets correct.
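
One way to picture "keeping offsets correct" is a stripper that remembers where each surviving character came from. The following is only a sketch under stated assumptions: it is plain Java, it is not the SOLR-42 fix or the real HTMLStripReader, every name in it (OffsetMappingStripper, originalStart, originalEnd, highlight) is hypothetical, tag detection is a naive '<' ... '>' scan, and String.indexOf stands in for the tokenizer.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; not the Solr/Lucene implementation.
class OffsetMappingStripper {
    private final StringBuilder stripped = new StringBuilder();
    // originalIndex.get(i) is the position in the original input of stripped character i.
    private final List<Integer> originalIndex = new ArrayList<>();
    private final int originalLength;

    OffsetMappingStripper(String html) {
        originalLength = html.length();
        boolean inTag = false;
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (c == '<') {
                inTag = true;              // naive assumption: every '<' opens a tag
            } else if (c == '>' && inTag) {
                inTag = false;             // tag closed, nothing is emitted for it
            } else if (!inTag) {
                stripped.append(c);        // keep the character...
                originalIndex.add(i);      // ...and remember where it came from
            }
        }
    }

    String strippedText() {
        return stripped.toString();
    }

    /** Maps a token start offset in the stripped text to the original text. */
    int originalStart(int strippedOffset) {
        return strippedOffset < originalIndex.size()
                ? originalIndex.get(strippedOffset)
                : originalLength;
    }

    /** Maps an exclusive token end offset, stopping right after the token's last character. */
    int originalEnd(int strippedOffset) {
        return strippedOffset == 0 ? 0 : originalIndex.get(strippedOffset - 1) + 1;
    }

    /** Wraps the first occurrence of term in <em> tags, positioned in the original markup. */
    static String highlight(String html, String term) {
        if (term.isEmpty()) {
            return html;
        }
        OffsetMappingStripper s = new OffsetMappingStripper(html);
        int start = s.strippedText().indexOf(term);
        if (start < 0) {
            return html;
        }
        int oStart = s.originalStart(start);
        int oEnd = s.originalEnd(start + term.length());
        return html.substring(0, oStart) + "<em>" + html.substring(oStart, oEnd) + "</em>"
                + html.substring(oEnd);
    }

    public static void main(String[] args) {
        String title = "<SUP>40</SUP>Ar/<SUP>39</SUP>Ar laserprobe dating of mylonitic fabrics"
                + " in a polyorogenic terrane of NW Iberia";
        System.out.println(highlight(title, "fabrics"));
    }
}
```

Mapping the end offset from the token's last character, rather than from the character that follows it, keeps a closing tag that directly follows a token outside the highlighted span.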

-Yonik

Re: Highlighting problems with HTML tagged fields

nick19701
Yonik Seeley wrote:
> HTMLStripWhitespaceTokenizerFactory works in two phases...
> HTMLStripReader removes the HTML and passes the result to
> WhitespaceTokenizer... at that point, Tokens are generated, but the
> offsets will correspond to the text after HTML removal, not before.
>
> I did it this way so that HTMLStripReader could go before any
> tokenizer (like StandardTokenizer).
>
> Can you open a JIRA bug for this?  The fix would be a special version
> of HTMLStripReader integrated with a WhitespaceTokenizer to keep
> offsets correct.
>
> -Yonik

Is there a fix for this problem?

My Solr build is dated 12/17/2006. HTMLStripWhitespaceTokenizerFactory + highlighting still
doesn't work: the wrong text is highlighted.

Re: Highlighting problems with HTML tagged fields

Chris Hostetter-3

It is tracked in http://issues.apache.org/jira/browse/SOLR-42

...there are currently no patches.


: Date: Tue, 6 Mar 2007 15:04:25 -0800 (PST)
: From: nick19701 <[hidden email]>
: Reply-To: [hidden email]
: To: [hidden email]
: Subject: Re: [2] Highlighting problems with HTML tagged fields
:
: Is there a fix for this problem?
:
: my solr is dated on 12/17/2006. HTMLStripWhitespaceTokenizerFactory +
: highlighting still
: doesn't work. All the wrong items are highlighted.



-Hoss


Re: Highlighting problems with HTML tagged fields

nick19701
Chris Hostetter wrote:
> It is tracked in http://issues.apache.org/jira/browse/SOLR-42
>
> ...there are currently no patches.

The suggested fix from Mirko seems very simple. Hopefully a patch will be applied
very soon. In the meantime, I'll use my backup solution: http://fucoder.com/code/se-hilite/


Re: Highlighting problems with HTML tagged fields

Chris Hostetter-3

: The suggested fix from Mirko seems very simple. Hopefully a patch will be
: applied very soon. In the meantime, I'll use my backup solution:

Patches for issues can't be applied until someone who cares about them
writes them and contributes them for committers to consider/apply :)

-Hoss


Re: Highlighting problems with HTML tagged fields

nick19701
Chris Hostetter wrote:
> Patches for issues can't be applied until someone who cares about them
> writes them and contributes them for committers to consider/apply :)

It seems I'm one of the very few people who care about this feature :)

Unfortunately, my day-to-day languages are C++ and C#, and I only know a little Java. Otherwise I would contribute a patch.