Entity &#0;

marcel.schnippe
Hi,

org.apache.nutch.html.Entities.encode has an annoying little bug that
produces invalid markup: the entity it generates, &#0;, is illegal.

see:

> http://www.w3.org/International/questions/qa-controls
> The NUL (Null) control is illegal and cannot be represented by NCR or
> encoded directly in markup languages.

This occurs if "s" contains any NUL characters (e.g. when generating a
description from a PDF file where the text crosses table cells).

Quick fix:
add

encoder[0]=" ";

to the entity initialization list, so that &#0; is replaced with a
blank, which will at least be legal XML 1.1.
For other markup languages, more replacements are needed (see the table
at the link above).
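
To illustrate the quick fix, here is a minimal standalone sketch of a
table-driven encoder. It is not the actual Nutch source: the class name
NulSafeEncoder, the ENCODER table and the handful of entities in it are
assumptions made up for this example. The only point is that putting a
replacement string at index 0 keeps a NUL from ever reaching the output.

```java
// Hypothetical stand-in for a table-driven entity encoder (not Nutch code).
final class NulSafeEncoder {

    // One slot per character below 0x100; null means "no special encoding".
    private static final String[] ENCODER = new String[0x100];

    static {
        ENCODER['<'] = "&lt;";
        ENCODER['>'] = "&gt;";
        ENCODER['&'] = "&amp;";
        ENCODER['"'] = "&quot;";
        // The quick fix: map NUL to a blank instead of emitting it.
        ENCODER[0] = " ";
    }

    static String encode(String s) {
        StringBuilder out = new StringBuilder(s.length() * 2);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < 0x100 && ENCODER[c] != null) {
                out.append(ENCODER[c]);          // table replacement
            } else if (c < 0x80) {
                out.append(c);                   // plain ASCII, copied as-is
            } else {
                out.append("&#").append((int) c).append(';'); // numeric reference
            }
        }
        return out.toString();
    }
}
```

With this table, NulSafeEncoder.encode("a" + (char) 0 + "b") gives "a b".
Other control characters are still copied through unchanged here, so a
real fix would probably need more table entries, as noted above.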

Marcel