[jira] Created: (SOLR-57) Highlighter does not work with HTML content that's passed through HTMLStrip*Tokenizer

Sebastian Nagel (Jira)
Highlighter does not work with HTML content that's passed through HTMLStrip*Tokenizer
-------------------------------------------------------------------------------------

                 Key: SOLR-57
                 URL: http://issues.apache.org/jira/browse/SOLR-57
             Project: Solr
          Issue Type: Bug
          Components: search
         Environment: Red Hat Linux 9, Tomcat 5.5.20
            Reporter: Ho Yin Au
            Priority: Minor


I have a fieldtype with the following definition:
        <fieldtype name="htmltext"  class="solr.TextField" positionIncrementGap="100">
            <analyzer>
                <tokenizer class="solr.HTMLStripStandardTokenizerFactory"/>
                <filter class="solr.StandardFilterFactory" />
                <filter class="solr.LowerCaseFilterFactory" />
                <filter class="solr.StopFilterFactory" />
                <filter class="solr.EnglishPorterFilterFactory" />
                <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
                <filter class="solr.ISOLatin1AccentFilterFactory" />
            </analyzer>
        </fieldtype>

When fields with that definition are included in the list of fields to be highlighted, the highlighted term is always offset because the highlighter does not take into account the HTML tags before it, so you end up with something like this for the highlighted snippet:

Does your computer meet the <a href="http:/<em>/www.example</em>.com/system_requirements.shtml">minimum system requirements</a>?
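
To make the symptom concrete, here is a minimal Python sketch of the suspected mechanism. It is not Solr code: it assumes that term offsets are computed against the tag-stripped text and then applied unchanged to the stored HTML. The function name and the naive tag regex are made up for illustration.

```python
import re

# Naive tag matcher, for illustration only (not how Solr strips HTML).
TAG = re.compile(r"<[^>]*>")

def highlight_with_stripped_offsets(original, term):
    """Locate `term` in the tag-stripped text, then apply those offsets
    to the original HTML. This mismatch is the assumed cause of the bug."""
    stripped = TAG.sub("", original)
    start = stripped.lower().find(term.lower())
    if start < 0:
        return original
    end = start + len(term)
    # start/end index the stripped text, so they land too early in `original`
    # by the total length of all tags that precede the term.
    return original[:start] + "<em>" + original[start:end] + "</em>" + original[end:]

html = ('Does your computer meet the '
        '<a href="http://www.example.com/system_requirements.shtml">'
        'minimum system requirements</a>?')

print(highlight_with_stripped_offsets(html, "requirements"))
# Does your computer meet the <a href="http:/<em>/www.example</em>.com/system_requirements.shtml">minimum system requirements</a>?
```

Under that assumption, searching for "requirements" puts the <em> tags inside the href attribute, the same place as in the snippet above.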

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

       
[jira] Resolved: (SOLR-57) Highlighter does not work with HTML content that's passed through HTMLStrip*Tokenizer

Sebastian Nagel (Jira)
     [ http://issues.apache.org/jira/browse/SOLR-57?page=all ]

Yonik Seeley resolved SOLR-57.
------------------------------

    Resolution: Duplicate

Known issue.
It probably wouldn't be too hard to fix for the Whitespace* variant, but it could be pretty difficult for the Standard* variant.
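
One possible shape of a fix for the whitespace case, as a rough Python sketch only: replace each tag with spaces of the same length, so every token offset is still valid in the original HTML. A whitespace tokenizer splits only on whitespace, so the padding does not change which tokens come out. This is an illustration under that assumption, not the actual Solr change. All function names are hypothetical, and the naive tag regex ignores entities, comments and scripts.

```python
import re

# Naive tag matcher, for illustration only.
TAG = re.compile(r"<[^>]*>")

def blank_out_tags(html):
    """Replace every tag with spaces of equal length so that character
    offsets in the result line up with the original HTML."""
    return TAG.sub(lambda m: " " * len(m.group(0)), html)

def whitespace_tokens(text):
    """Yield (token, start, end) for each whitespace-delimited token."""
    for m in re.finditer(r"\S+", text):
        yield m.group(0), m.start(), m.end()

def highlight(html, term):
    """Wrap the first token equal to `term` in <em>, using offsets taken
    from the padded text, which has the same length as `html`."""
    for token, start, end in whitespace_tokens(blank_out_tags(html)):
        if token.lower() == term.lower():
            return html[:start] + "<em>" + html[start:end] + "</em>" + html[end:]
    return html

html = ('Does your computer meet the '
        '<a href="http://www.example.com/system_requirements.shtml">'
        'minimum system requirements</a>?')

print(highlight(html, "requirements"))
# Does your computer meet the <a href="http://www.example.com/system_requirements.shtml">minimum system <em>requirements</em></a>?
```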

