[jira] Created: (NUTCH-257) Summary#toString always Entity encodes -- problem for OpenSearchServlet#description field

JIRA jira@apache.org
Summary#toString always Entity encodes -- problem for OpenSearchServlet#description field
-----------------------------------------------------------------------------------------

         Key: NUTCH-257
         URL: http://issues.apache.org/jira/browse/NUTCH-257
     Project: Nutch
        Type: Bug

  Components: searcher  
    Versions: 0.8-dev    
    Reporter: [hidden email]
    Priority: Minor


All search result data we display has to be explicitly Entity.encoded when it is output in search.jsp (title, url, etc.), except Summaries, which are already Entity.encoded.  This is fine when outputting HTML, but it gets in the way when outputting anything else -- XML, for example.  I'd suggest we not make any presumption about how search results are used.

The problem becomes especially acute when the text language is other than English.

Here is an example of what a Czech description field in an OpenSearchServlet hit record looks like:

<description>&lt;span class="ellipsis"&gt; ... &lt;/span&gt;V&amp;#283;deck&amp;aacute; knihovna v Olomouci Bezru&amp;#269;ova 2, Olomouc 9, 779 11, &amp;#268;esk&amp;aacute; republika &amp;nbsp; tel. +420-585223441 &amp;nbsp; fax +420-585225774 http://www.&lt;span class="highlight"&gt;vkol&lt;/span&gt;.cz/ &amp;nbsp;&amp;nbsp; mailto:info@&lt;span class="highlight"&gt;vkol&lt;/span&gt;.cz Otev&amp;#345;eno : &amp;nbsp; po-p&amp;aacute; &amp;nbsp; 8 30 -19 00 &amp;nbsp;&amp;nbsp;&amp;nbsp; so &amp;nbsp; 9 00 -13 00 &amp;nbsp;&amp;nbsp;&amp;nbsp; ne &amp;nbsp; zav&amp;#345;eno V katalogu s &amp;uacute;pln&amp;yacute;m &amp;#269;asov&amp;yacute;m&lt;span class="ellipsis"&gt; ... &lt;/span&gt;03 Organizace 20/12 Odkazy 19/04 Hledej 23/03 &amp;nbsp; 23/03 &amp;nbsp; Po&amp;#269;et p&amp;#345;&amp;iacute;stup&amp;#367; od 1.9.1998. Statistiky . [ ] &amp;nbsp; [ Nahoru ] &lt;span class="highlight"&gt;VKOL&lt;/span&gt;</description>

Here is the same description field with Entity.encoding disabled:

<description>&lt;span class="ellipsis"&gt; ... &lt;/span&gt;tisky statistiky knihovny WWW serveru středověké rukopisy studovny CD-ROM historických fondů hlavní Internet Německé knihovny vázaných novin SVKOL viz &lt;span class="highlight"&gt;VKOL&lt;/span&gt; šatna T telefonní čísla knihovny zaměstnanců U V vazba věcný popis vedení knihovny vedoucí oddělení video &lt;span class="highlight"&gt;VKOL&lt;/span&gt; volný výběr výpůjčka výroční zpráva výstavy W webmaster WWW odkazy X Y Z - Ž zamluvení knihy zahraniční periodika zpracování fondu&lt;span class="highlight"&gt;VKOL&lt;/span&gt; - hledej Hledej [ &lt;span class="highlight"&gt;VKOL&lt;/span&gt; ] [ Novinky ] [ Katalog ] [ Služby ] [ Aktivity ] [ Průvodce ] [ Dokumenty ] [ Regionální fce ] [ Organizace ] [ Odkazy ] [ Hledej ] [     ] [     ] Obsah full-textové vyhledávání, 19/04/2003 rejstřík vybraných&lt;span class="ellipsis"&gt; ... &lt;/span&gt;</description>

Notice how the Czech characters in the first example are all entity encoded, either numerically (i.e. &#NNN;) or by name (e.g. &aacute;).
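
To make the double encoding concrete, here is a minimal Java sketch. Both helpers are assumptions for illustration only: htmlEncode is a hypothetical stand-in for the Entity encoder (it emits numeric entities only, whereas the real output above also uses named ones such as &aacute;), and xmlEscape stands in for whatever escaping the XML writer applies to element text.

```java
public class DoubleEncodingDemo {
    // Hypothetical HTML entity encoder: escapes markup characters and
    // writes every non-ASCII character as a numeric entity.
    public static String htmlEncode(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            i += Character.charCount(cp);
            switch (cp) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '"': out.append("&quot;"); break;
                default:
                    if (cp > 127) out.append("&#").append(cp).append(';');
                    else out.appendCodePoint(cp);
            }
        }
        return out.toString();
    }

    // Minimal XML text escaping, as an XML writer would apply to element content.
    public static String xmlEscape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    public static void main(String[] args) {
        // "Vedecka knihovna" with Czech diacritics, written with Unicode escapes.
        String raw = "V\u011bdeck\u00e1 knihovna";
        // Summary already entity encoded, then escaped again for XML.
        System.out.println("<description>" + xmlEscape(htmlEncode(raw)) + "</description>");
        // Raw summary, escaped once for XML.
        System.out.println("<description>" + xmlEscape(raw) + "</description>");
    }
}
```

In the first line printed, the ampersand of each entity is itself escaped, so an XML consumer sees the literal entity text rather than the Czech character. In the second, the characters pass through untouched.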

I'd suggest that Summary#toString() become Summary#toEntityEncodedString() and that toString() return the raw aggregation of Fragments.  This would likely require adding methods to the HitSummarizer interface, and to NutchBean, so that callers can ask for either raw or entity-encoded text.  Or, better, I'd suggest that Summarizer never return Entity.encoded text and that the encoding happen in search.jsp (I can make a patch for the latter if it's amenable).

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira

[jira] Commented: (NUTCH-257) Summary#toString always Entity encodes -- problem for OpenSearchServlet#description field

JIRA jira@apache.org
    [ http://issues.apache.org/jira/browse/NUTCH-257?page=comments#action_12376989 ]

Doug Cutting commented on NUTCH-257:
------------------------------------

I'd vote to never have Summary#toString() perform entity encoding, to fix search.jsp to encode things itself, and *not* to add a new Summary#toEntityEncodedString() method.
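
A rough Java sketch of what encoding at the output site could look like. Everything here is hypothetical: the html helper stands in for whatever escaper search.jsp uses, and renderHit is an invented name, not part of Nutch.

```java
public class EncodeAtOutput {
    // Hypothetical HTML escaper standing in for the one search.jsp would call.
    public static String html(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    // Every field, the summary included, arrives raw and is encoded here,
    // at the point where HTML is actually written.
    public static String renderHit(String title, String url, String summary) {
        return "<a href=\"" + html(url) + "\">" + html(title) + "</a>"
             + "<p>" + html(summary) + "</p>";
    }

    public static void main(String[] args) {
        System.out.println(renderHit("A&B", "http://example.org/?a=1&b=2", "x < y"));
    }
}
```

With this arrangement the summary is treated like title and url, and a non-HTML consumer such as OpenSearchServlet can apply its own escaping to the same raw string.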

[jira] Commented: (NUTCH-257) Summary#toString always Entity encodes -- problem for OpenSearchServlet#description field

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org
    [ http://issues.apache.org/jira/browse/NUTCH-257?page=comments#action_12376997 ]

[hidden email] commented on NUTCH-257:
-----------------------------------------

I took a closer look.  It turns out Summary is inherently all about rendering HTML (see the different Summary.Fragment subclasses -- one for ellipsis, another for highlight; in each of these, toString wraps the fragment in some HTML 'span' markup).

What about changing HitSummarizer#getSummary to return Summary instead of String or String[]?  If the rendering context requires HTML, ask Summary to compose the HTML to output (Summary#toHtmlString()?).  If XML, get the plain-text version of the summary (Summary#toString()).
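
A minimal Java sketch of that shape, under stated assumptions: SummarySketch, Kind, and add are invented names for illustration, and the real Summary.Fragment subclasses are modelled here as a simple enum. Only toHtmlString() and toString() correspond to the methods floated above.

```java
import java.util.ArrayList;
import java.util.List;

public class SummarySketch {
    // Hypothetical fragment kinds mirroring the ellipsis/highlight split.
    public enum Kind { TEXT, HIGHLIGHT, ELLIPSIS }

    private static final class Fragment {
        final Kind kind;
        final String text;
        Fragment(Kind kind, String text) { this.kind = kind; this.text = text; }
    }

    private final List<Fragment> fragments = new ArrayList<>();

    public SummarySketch add(Kind kind, String text) {
        fragments.add(new Fragment(kind, text));
        return this;
    }

    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    // HTML rendering: escape the text, then wrap special fragments in span markup.
    public String toHtmlString() {
        StringBuilder sb = new StringBuilder();
        for (Fragment f : fragments) {
            switch (f.kind) {
                case HIGHLIGHT:
                    sb.append("<span class=\"highlight\">").append(escape(f.text)).append("</span>");
                    break;
                case ELLIPSIS:
                    sb.append("<span class=\"ellipsis\">").append(escape(f.text)).append("</span>");
                    break;
                default:
                    sb.append(escape(f.text));
            }
        }
        return sb.toString();
    }

    // Plain rendering: raw concatenation with no markup and no entity encoding.
    @Override
    public String toString() {
        StringBuilder sb = new StringBuilder();
        for (Fragment f : fragments) sb.append(f.text);
        return sb.toString();
    }

    public static void main(String[] args) {
        SummarySketch s = new SummarySketch()
            .add(Kind.ELLIPSIS, " ... ")
            .add(Kind.TEXT, "knihovna ")
            .add(Kind.HIGHLIGHT, "VKOL");
        System.out.println(s.toHtmlString()); // for search.jsp
        System.out.println(s);                // for an XML writer to escape itself
    }
}
```

Keeping the fragments structured until the last moment lets the HTML path retain highlighting while the XML path gets text that has not been encoded yet.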

[jira] Resolved: (NUTCH-257) Summary#toString always Entity encodes -- problem for OpenSearchServlet#description field

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org
     [ http://issues.apache.org/jira/browse/NUTCH-257?page=all ]
     
Jerome Charron resolved NUTCH-257:
----------------------------------

    Fix Version: 0.8-dev
     Resolution: Fixed
      Assign To: Jerome Charron

Fixed in revision #405565 - http://svn.apache.org/viewcvs?view=rev&rev=405565 
