Nutch - Dev
org.apache.nutch.html.Entities.encode has an annoying little bug that produces
invalid markup: the generated entity &#0; is illegal.
> The NUL (Null) control is illegal and cannot be represented by NCR or
> encoded directly in markup languages.
This occurs if "s" contains any NUL characters (e.g. when generating a
description from a PDF file where the text crosses table cells).
A fix would be to add a mapping for the NUL character to the entity
initialization list, so that &#0; is replaced with a blank, which at least
is legal in XML 1.1.
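
To illustrate the idea, here is a minimal standalone sketch of replacing NUL with a blank before the text is entity-encoded. The class NulSanitizer and the method replaceNul are hypothetical names made up for this example and are not part of Nutch; the actual patch would go into the entity initialization list instead.

```java
// Hypothetical helper for illustration only; not part of Nutch.
class NulSanitizer {

    // Returns the input with every NUL character (U+0000) replaced by a space.
    // Null input and strings without NUL are returned unchanged.
    static String replaceNul(String s) {
        if (s == null || s.indexOf('\u0000') < 0) {
            return s;
        }
        return s.replace('\u0000', ' ');
    }

    public static void main(String[] args) {
        // Simulates text extracted from a PDF where a NUL separates two table cells.
        String raw = "cell one\u0000cell two";
        System.out.println(replaceNul(raw));
    }
}
```

Running the sanitizer before encoding means the encoder never sees a NUL, so it cannot emit &#0; for it.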
For other markup languages more replacements are needed. (see table on