HTML parser??

Bartosch Warzecha

Hello,

I'm building a search engine for HTML documents, and I've run into an
HTML-parsing problem.

The documents are in German. They contain special characters written in
several different ways, such as "ö", "&ouml;" and "&#246;". Does anybody
know of a parsing engine that handles all of these notations without
problems?

Thanks

b.warzecha

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Re: HTML parser??

Erik Hatcher

On May 3, 2005, at 4:35 AM, Bartosch Warzecha wrote:

>
> Hello,
>
> I'm building a search engine for HTML documents, and I've run into an
> HTML-parsing problem.
>
> The documents are in German. They contain special characters written in
> several different ways, such as "ö", "&ouml;" and "&#246;". Does anybody
> know of a parsing engine that handles all of these notations without
> problems?

Which HTML parser are you using?  Once a parser has resolved those entity
references, your code should never see them.  Try NekoHTML.

     Erik
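
To illustrate the point that a parser resolves entity references before the
application sees the text, here is a minimal sketch. It uses Python's
standard-library html.parser as a stand-in, not NekoHTML, and the names
TextCollector and extract_text are made up for this example. All three
spellings from the question should reach the indexing code as the same
character.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects text content; the parser decodes character references."""

    def __init__(self):
        # convert_charrefs=True makes the parser turn &ouml; and &#246;
        # into the actual character before handle_data() is called.
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def extract_text(markup):
    """Return the text of an HTML fragment with entities resolved."""
    parser = TextCollector()
    parser.feed(markup)
    parser.close()
    return "".join(parser.chunks)


# The three notations from the question (hypothetical sample markup).
samples = ["<p>schön</p>", "<p>sch&ouml;n</p>", "<p>sch&#246;n</p>"]
texts = [extract_text(s) for s in samples]
print(len(set(texts)))  # number of distinct strings after parsing
```

Text extracted this way can be handed to the indexer without any special
handling for entities.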


Re: HTML parser??

Damian Gajda
In reply to this post by Bartosch Warzecha
Hello,

> The documents are in German. They contain special characters written in
> several different ways, such as "ö", "&ouml;" and "&#246;". Does anybody
> know of a parsing engine that handles all of these notations without
> problems?

I've created a component for parsing HTML entities (special characters).
It is part of the ObjectLedge project and lives in the components
subproject. Please feel free to use it. It is licensed under a BSD
(Apache-like) license. You will need to check out the ledge-components
CVS module.

http://objectledge.org/

You are also welcome to use ObjectLedge as a whole :)

Regards,
--
Damian Gajda
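
For readers who only need the entity-decoding step on plain strings, without
a full HTML parser, the idea can be sketched as follows. This is not the
ObjectLedge component or its API; it is an illustration that assumes Python's
standard-library html.unescape, and decode_entities is a hypothetical helper
name.

```python
import html


def decode_entities(text):
    """Replace named and numeric character references with characters."""
    return html.unescape(text)


# Literal character, named entity, decimal and hexadecimal references,
# plus the decimal form without a trailing semicolon as in the question.
variants = ["ö", "&ouml;", "&#246;", "&#xF6;", "&#246"]
decoded = [decode_entities(v) for v in variants]
print(all(d == "ö" for d in decoded))
```

Decoding before indexing, and decoding queries the same way, means every
notation of a character matches the same indexed term.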


