[jira] [Commented] (NUTCH-2318) Text extraction in HtmlParser adds too much whitespace.

Jorge Spinsanti (Jira)

    [ https://issues.apache.org/jira/browse/NUTCH-2318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16980167#comment-16980167 ]

Sebastian Nagel commented on NUTCH-2318:

This is still a problem, also in 1.x. [~markus17] - you're right, that's the correct approach. I implemented a similar solution [here|https://github.com/commoncrawl/ia-web-commons/blob/7ce4e8849fd4a8ff31ec56875bf9022f481072c1/src/main/java/org/archive/resource/html/ExtractingParseObserver.java#L47] - it defines three classes of elements: 1. block elements, which indicate a line break; 2. inline elements that usually indicate a space (e.g. <td>); 3. the remaining inline elements, which do not cause a space to be inserted. Of course, we could make these classes configurable.
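
To illustrate the three-class idea, here is a minimal sketch using only the JDK's DOM API. It is not the code from ia-web-commons or Nutch: the class name WhitespaceClassesSketch, the method names, and the two element sets are hypothetical and deliberately incomplete, and the input has to be well-formed XHTML because the JDK XML parser stands in for a real HTML parser.

```java
import java.io.StringReader;
import java.util.Locale;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

final class WhitespaceClassesSketch {
  // Class 1 (assumed, incomplete): block elements that indicate a line break.
  static final Set<String> BLOCK = Set.of("p", "div", "br", "li", "tr", "h1", "h2");
  // Class 2 (assumed, incomplete): elements that usually indicate a space.
  static final Set<String> SPACE = Set.of("td", "th");
  // Class 3: every other element inserts nothing.

  static String extract(String xhtml) {
    try {
      Node root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
          .parse(new InputSource(new StringReader(xhtml))).getDocumentElement();
      StringBuilder sb = new StringBuilder();
      walk(root, sb);
      return sb.toString().trim();
    } catch (Exception e) {
      throw new IllegalArgumentException("input must be well-formed XHTML", e);
    }
  }

  static void walk(Node node, StringBuilder sb) {
    if (node.getNodeType() == Node.TEXT_NODE) {
      sb.append(node.getNodeValue()); // text is copied unchanged
      return;
    }
    String name = node.getNodeName().toLowerCase(Locale.ROOT);
    String sep = BLOCK.contains(name) ? "\n" : SPACE.contains(name) ? " " : "";
    separate(sb, sep); // separator before the element's content
    for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
      walk(c, sb);
    }
    separate(sb, sep); // and after it
  }

  // Appends sep unless an equal or stronger separator already ends the buffer.
  static void separate(StringBuilder sb, String sep) {
    if (sep.isEmpty() || sb.length() == 0) {
      return;
    }
    int last = sb.length() - 1;
    if (sb.charAt(last) == '\n') {
      return;
    }
    if (sb.charAt(last) == ' ') {
      if (sep.equals("\n")) {
        sb.setCharAt(last, '\n'); // upgrade a space to a line break
      }
      return;
    }
    sb.append(sep);
  }
}
```

With these assumed sets, extract("<p>behavi<em>ou</em>r</p>") yields "behaviour", while two adjacent <p> elements are separated by a line break and two adjacent <td> cells by a single space. Making the classes configurable would amount to loading the two sets from configuration instead of hard-coding them.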

> Text extraction in HtmlParser adds too much whitespace.
> -------------------------------------------------------
>                 Key: NUTCH-2318
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2318
>             Project: Nutch
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 2.3.1, 1.15
>            Reporter: Felix Zett
>            Priority: Major
>             Fix For: 1.17
> In parse-html, org.apache.nutch.parse.html.HtmlParser will call DOMContentUtils.getText() to extract the text content. For every text node encountered in the document, the getTextHelper() function will first add a space character to the already extracted text and then the text content itself (stripped of excess whitespace). This means that parsing HTML such as
> {{<p>behavi<em>ou</em>r</p>}}
> will lead to this extracted text:
> {{behavi ou r}}
> I would have expected a parser not to add whitespace to content that visually (and actually) does not contain any in the first place. This applies to all similar semantic tags as well as {{<span>}}.
> My naive approach would be to remove the lines {{text = text.trim()}} and {{sb.append(' ')}}, but I'm aware that this will lead to bad parsing of stuff like {{<p>foo</p><p>bar</p>}}.
> This is not an issue in parse-tika, since tika removes all "unimportant" tags beforehand. However, I'd like to keep using parse-html because I need to keep the document reasonably intact for parse filters applied later.
> I know I could write a parse filter that will re-extract the text content, but this feels like a bug (or at least a shortcoming) in the ParseHtml.
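
The behaviour described in the quoted report can be reproduced with a simplified model. This is a sketch, not the actual DOMContentUtils code: the class TextNodeSpacingModel, its methods, and the naiveFix flag are hypothetical, the final trim() is an assumption made so the output matches the report, and the JDK XML parser is used on well-formed snippets in place of a real HTML parser.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

final class TextNodeSpacingModel {
  // naiveFix = false: model of the reported behaviour (space + trimmed text).
  // naiveFix = true:  model of the proposed removal of trim() and append(' ').
  static String getText(String xhtml, boolean naiveFix) {
    try {
      Node root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
          .parse(new InputSource(new StringReader(xhtml))).getDocumentElement();
      StringBuilder sb = new StringBuilder();
      collect(root, sb, naiveFix);
      return sb.toString().trim(); // assumption: result is trimmed at the end
    } catch (Exception e) {
      throw new IllegalArgumentException("input must be well-formed XHTML", e);
    }
  }

  static void collect(Node node, StringBuilder sb, boolean naiveFix) {
    if (node.getNodeType() == Node.TEXT_NODE) {
      String text = node.getNodeValue();
      if (!naiveFix) {
        text = text.trim(); // strip whitespace around the text node
        sb.append(' ');     // unconditionally separate it from earlier text
      }
      sb.append(text);
    }
    for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
      collect(c, sb, naiveFix);
    }
  }
}
```

In this model, getText("<p>behavi<em>ou</em>r</p>", false) returns "behavi ou r", matching the report, and the naive fix returns "behaviour". The naive fix also turns "<div><p>foo</p><p>bar</p></div>" into "foobar", which is the regression the reporter anticipates.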

This message was sent by Atlassian Jira