Html parser questions

kkrugler
Hi all,

[Resending with an image instead of the HTML example - previous  
attempt was rejected by Apache.org as being spam...weird]

I'm doing a comparison of the Tika HtmlParser with the original Nutch  
HTML parsing code.

I've run into some issues, and wanted input before filing any Jira  
requests/bugs.

As an example of a test document:

[image of the HTML test document, not preserved here]

1. The handler's startElement() never gets called with the <base> tag.  
I'm assuming this is because <base> isn't part of the SAFE_ELEMENTS set.

But without the base tag, you can't correctly resolve relative URLs in  
anchor tags.

Seems like <base> should be part of the SAFE_ELEMENTS set.

How was this set of tags derived?
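
To make the relative-URL point concrete, here is a minimal sketch of how a <base> href changes link resolution. It uses only java.net.URI; the class name, method name and example URLs are made up for illustration and are not taken from Tika or Nutch.

```java
import java.net.URI;

class BaseUrlSketch {
    // Resolve an anchor's href against the <base href> value when the handler
    // saw one, and against the document's own URL otherwise.
    // Hypothetical helper, not part of Tika or Nutch.
    static String resolve(String documentUrl, String baseHref, String href) {
        URI base = URI.create(baseHref != null ? baseHref : documentUrl);
        return base.resolve(href).toString();
    }
}
```

If the handler never sees the <base> element, baseHref stays null and the same relative href resolves to a different absolute URL.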

2. The handler's characters() method gets called with the following text:

Untitled
\n\n
link1
\n
link2
\n\n
\n
\n

The first six calls make sense to me.

The last two calls (with a single \n) happen just before  
endElement("body") is called, and this is unexpected.

From the offset in the buffer passed to characters(), these are the  
returns _after_ the </body> tag. If I put any number of returns  
between the </body> and </html>, they all get passed to characters()  
before the endElement("body") call. This seems like a bug.

Has anybody else noticed this?

Thanks,

-- Ken



--------------------------
Ken Krugler
TransPac Software, Inc.
<http://www.transpac.com>
+1 530-210-6378