[jira] Created: (LUCENE-590) Demo HTML parser gives incorrect summaries when title is repeated as a heading

[jira] Created: (LUCENE-590) Demo HTML parser gives incorrect summaries when title is repeated as a heading

Sebastian Nagel (Jira)
Demo HTML parser gives incorrect summaries when title is repeated as a heading
------------------------------------------------------------------------------

         Key: LUCENE-590
         URL: http://issues.apache.org/jira/browse/LUCENE-590
     Project: Lucene - Java
        Type: Bug

  Components: Examples  
    Versions: 2.0.0    
    Reporter: Curtis d'Entremont



If an HTML document repeats its title as a heading at the top of the document, HTMLParser returns the title as the summary and discards everything else that was added to the summary. Instead, it should keep the rest of the summary and chop the title off the beginning (essentially the opposite). I don't see any benefit to repeating the title in the summary in any case.

In HTMLParser.jj's getSummary():

    String sum = summary.toString().trim();
    String tit = getTitle();
    if (sum.startsWith(tit) || sum.equals(""))
      return tit;
    else
      return sum;

change it to: (* denotes a line that has changed)

    String sum = summary.toString().trim();
    String tit = getTitle();
*    if (sum.startsWith(tit))             // don't repeat title in summary
*      return sum.substring(tit.length()).trim();
    else
      return sum;
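
To make the difference concrete, here is a minimal, standalone sketch of the two variants. It is not the actual HTMLParser.jj code: the class name SummaryDemo, the method names, and the sample strings are made up for illustration, and the raw summary and title are passed in as plain strings instead of coming from the parser's fields.

```java
// Hypothetical standalone demo; only the if/else logic is taken from the report.
class SummaryDemo {

    // Logic as quoted from getSummary(): the title is returned whenever the
    // summary starts with it, or when the summary is empty.
    static String originalSummary(String rawSummary, String title) {
        String sum = rawSummary.trim();
        if (sum.startsWith(title) || sum.equals(""))
            return title;
        else
            return sum;
    }

    // Logic as proposed in the report: strip the leading title and keep the rest.
    static String proposedSummary(String rawSummary, String title) {
        String sum = rawSummary.trim();
        if (sum.startsWith(title))             // don't repeat title in summary
            return sum.substring(title.length()).trim();
        else
            return sum;
    }

    public static void main(String[] args) {
        String title = "Release Notes";
        String text = "Release Notes This page lists the changes.";

        System.out.println(originalSummary(text, title)); // Release Notes
        System.out.println(proposedSummary(text, title)); // This page lists the changes.

        // Empty summary: the original falls back to the title,
        // the proposed version returns an empty string.
        System.out.println("[" + originalSummary("", title) + "]"); // [Release Notes]
        System.out.println("[" + proposedSummary("", title) + "]"); // []
    }
}
```

One thing the sketch makes visible: because the proposed version drops the sum.equals("") check, a document with an empty summary yields an empty string where it previously yielded the title. Whether that is acceptable for the demo is a separate question from the one raised here.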


--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

[jira] Updated: (LUCENE-590) Demo HTML parser gives incorrect summaries when title is repeated as a heading

Sebastian Nagel (Jira)
     [ http://issues.apache.org/jira/browse/LUCENE-590?page=all ]

Daniel Naber updated LUCENE-590:
--------------------------------

    Description:
If an HTML document repeats its title as a heading at the top of the document, HTMLParser returns the title as the summary and discards everything else that was added to the summary. Instead, it should keep the rest of the summary and chop the title off the beginning (essentially the opposite). I don't see any benefit to repeating the title in the summary in any case.

In HTMLParser.jj's getSummary():

    String sum = summary.toString().trim();
    String tit = getTitle();
    if (sum.startsWith(tit) || sum.equals(""))
      return tit;
    else
      return sum;

change it to: (* denotes a line that has changed)

    String sum = summary.toString().trim();
    String tit = getTitle();
*    if (sum.startsWith(tit))             // don't repeat title in summary
*      return sum.substring(tit.length()).trim();
    else
      return sum;


       Priority: Minor  (was: Major)

Decreased priority (affects the demo only).

