[Resending with an image instead of the HTML example - previous
attempt was rejected by Apache.org as being spam...weird]
I'm doing a comparison of the Tika HtmlParser with the original Nutch
HTML parsing code.
I've run into some issues, and wanted input before filing any Jira
issues.
As an example of a test document:
1. The handler's startElement() never gets called with the <base> tag.
I'm assuming this is because <base> isn't part of the SAFE_ELEMENTS set.
But without the base tag, you can't correctly resolve relative URLs in
the document.
Seems like <base> should be part of the SAFE_ELEMENTS set.
How was this set of tags derived?
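To make the impact concrete, here is a minimal sketch of how a missing <base> changes link resolution. It uses only java.net.URI, not Tika or Nutch code; the class name BaseHrefDemo, the resolve() helper, and the example.com URLs are all hypothetical and only for illustration.

```java
import java.net.URI;

final class BaseHrefDemo {
    // Resolve a link against the <base href> value when the parser reported
    // one, otherwise against the URL the page was fetched from.
    // baseHref may itself be relative, so it is resolved against the page URL first.
    static String resolve(String pageUrl, String baseHref, String link) {
        URI page = URI.create(pageUrl);
        URI base = (baseHref == null) ? page : page.resolve(baseHref);
        return base.resolve(link).toString();
    }

    public static void main(String[] args) {
        String page = "http://example.com/a/b/page.html"; // hypothetical page URL
        String baseHref = "http://example.com/other/";    // hypothetical <base href>

        // Handler never saw <base>: the link resolves against the page URL.
        System.out.println(resolve(page, null, "img/logo.png"));
        // Handler saw <base>: the same link resolves somewhere else.
        System.out.println(resolve(page, baseHref, "img/logo.png"));
    }
}
```

If the handler never sees <base>, a consumer can only take the first path, so every relative link in such a document ends up pointing at the wrong location.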
2. The handler's characters() method gets called with the following text
The first six calls make sense to me.
The last two calls (with a single \n) happen just before
endElement("body") is called, and this is unexpected.
From the offset in the buffer passed to characters(), these are the
returns that come _after_ the </body> tag. If I put any number of returns
between the </body> and </html>, they all get passed to characters()
before the endElement("body") call. This seems like a bug.
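For comparison, here is a rough sketch of the ordering I would have expected. It records SAX events using the JDK's built-in XML parser on a tiny well-formed document, so it is not the Tika HtmlParser and says nothing about how Tika works internally; the class name SaxEventOrderDemo and the events() helper are made up for this example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

final class SaxEventOrderDemo {
    // Parse the given well-formed markup and return the SAX events in the
    // order the handler received them.
    static List<String> events(String xml) {
        List<String> log = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                log.add("start:" + qName);
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                log.add("end:" + qName);
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // Escape newlines so whitespace-only calls are visible in the log.
                log.add("chars:" + new String(ch, start, length).replace("\n", "\\n"));
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
        return log;
    }

    public static void main(String[] args) {
        // One return between </body> and </html>, as in the case described above.
        events("<html><body>text</body>\n</html>").forEach(System.out::println);
    }
}
```

With a plain XML parser the return after </body> is reported after the end-of-body event, which is the order I was expecting from the HtmlParser as well.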