HTML styles and <li> tags are ignored

andrewtr
Hi:

When I parse a PDF or Word document with AutoDetectParser, the <li> and
<ul> tags come out as <p> tags. I need the exact HTML content that
corresponds to the PDF or Word document.

I tried several approaches, shown below:

ToHTMLContentHandler textHandler = new ToHTMLContentHandler();
Metadata metadata = new Metadata();
Parser parser = new AutoDetectParser();
ParseContext context = new ParseContext();
context.set(HtmlMapper.class, new IdentityHtmlMapper());
// in: the InputStream of the PDF or Word document
parser.parse(in, textHandler, metadata, context);

---------------------------------------------------------

SAXTransformerFactory factory =
    (SAXTransformerFactory) SAXTransformerFactory.newInstance();
TransformerHandler handler = factory.newTransformerHandler();
handler.getTransformer().setOutputProperty(OutputKeys.METHOD, "html");
handler.getTransformer().setOutputProperty(OutputKeys.INDENT, "no");
handler.getTransformer().setOutputProperty(OutputKeys.ENCODING, "utf-8");
// writer: the java.io.Writer that receives the serialized HTML
handler.setResult(new StreamResult(writer));
System.out.println(handler.toString());
return handler;
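
As a check on the handler setup above, here is a minimal sketch that uses only the JDK (no Tika) to push hand-written <ul>/<li> SAX events through a TransformerHandler configured the same way. The class name ListEventsDemo and the render helper are made up for illustration. If the list tags appear in the output, the serializer is presumably not what drops them, and the <p> elements would already be present in the events the parser emits.

```java
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.helpers.AttributesImpl;

// Hypothetical demo class: not part of Tika or of the original post.
class ListEventsDemo {

    // Serializes the given strings as one <ul> with one <li> per string.
    static String render(String... items) throws Exception {
        SAXTransformerFactory factory =
            (SAXTransformerFactory) SAXTransformerFactory.newInstance();
        TransformerHandler handler = factory.newTransformerHandler();
        handler.getTransformer().setOutputProperty(OutputKeys.METHOD, "html");
        handler.getTransformer().setOutputProperty(OutputKeys.INDENT, "no");
        handler.getTransformer().setOutputProperty(OutputKeys.ENCODING, "utf-8");
        StringWriter writer = new StringWriter();
        handler.setResult(new StreamResult(writer));

        // Send the SAX events that a list-aware parser would have to emit.
        AttributesImpl noAttributes = new AttributesImpl();
        handler.startDocument();
        handler.startElement("", "ul", "ul", noAttributes);
        for (String item : items) {
            handler.startElement("", "li", "li", noAttributes);
            handler.characters(item.toCharArray(), 0, item.length());
            handler.endElement("", "li", "li");
        }
        handler.endElement("", "ul", "ul");
        handler.endDocument();
        return writer.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(render("first", "second"));
    }
}
```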

However, the <li> tags are still replaced with <p> tags that carry a class
attribute, and the CSS styles do not appear in the parsed HTML output.

Any help is appreciated.

--
View this message in context: http://lucene.472066.n3.nabble.com/HTML-styles-and-li-tags-are-ignored-tp3987550.html
Sent from the Apache Tika - Development mailing list archive at Nabble.com.

Re: HTML styles and <li> tags are ignored

Jukka Zitting
Hi,

On Mon, Jun 4, 2012 at 2:21 PM, andrewtr <[hidden email]> wrote:
> When I parse a PDF or Word document with AutoDetectParser, the <li> and
> <ul> tags come out as <p> tags. I need the exact HTML content that
> corresponds to the PDF or Word document.

<li> and <ul> tags in PDF or Word? I assume you rather mean the native
list formatting of those document types?

The Tika parsers for PDF and Office documents could/should
automatically map such formatting to equivalent XHTML constructs, but
I don't think they currently do. You'll need to look into the source
code to see how to make that happen.
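
Until the parsers do that mapping themselves, one possible workaround is to rewrite the SAX event stream after the fact. The sketch below is not Tika code and uses only the JDK: a SAX filter turns each run of <p> elements carrying a given class into <li> elements wrapped in a single <ul>. The class value "listItem", the names ListRewriteDemo and ListParagraphFilter, and the sample markup are all assumptions for illustration; the class that a given parser actually puts on list paragraphs would have to be read off its real output.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical demo class: not part of Tika or of the original thread.
class ListRewriteDemo {

    // Rewrites <p class="listClass"> runs as <ul><li>...</li></ul>.
    static class ListParagraphFilter extends XMLFilterImpl {
        private final String listClass; // assumed marker class
        private boolean inList;         // a <ul> is currently open
        private int itemDepth;          // 0 = not inside a list item
        private String listUri = "";    // namespace reused for <ul>

        ListParagraphFilter(String listClass) {
            this.listClass = listClass;
        }

        private void closeList() throws SAXException {
            if (inList) {
                super.endElement(listUri, "ul", "ul");
                inList = false;
            }
        }

        @Override
        public void startElement(String uri, String local, String qName,
                                 Attributes atts) throws SAXException {
            if (itemDepth > 0) {
                // Markup nested inside a list item passes through.
                itemDepth++;
                super.startElement(uri, local, qName, atts);
            } else if ("p".equals(local)
                    && listClass.equals(atts.getValue("class"))) {
                if (!inList) {
                    listUri = uri;
                    super.startElement(uri, "ul", "ul", new AttributesImpl());
                    inList = true;
                }
                super.startElement(uri, "li", "li", new AttributesImpl());
                itemDepth = 1;
            } else {
                closeList(); // any other element ends the run
                super.startElement(uri, local, qName, atts);
            }
        }

        @Override
        public void endElement(String uri, String local, String qName)
                throws SAXException {
            if (itemDepth > 0) {
                itemDepth--;
                if (itemDepth == 0) {
                    super.endElement(uri, "li", "li"); // closes the rewritten <p>
                } else {
                    super.endElement(uri, local, qName);
                }
            } else {
                closeList();
                super.endElement(uri, local, qName);
            }
        }

        @Override
        public void endDocument() throws SAXException {
            closeList();
            super.endDocument();
        }
    }

    // Parses well-formed XHTML text and returns it with lists restored.
    static String rewrite(String xhtml, String listClass) throws Exception {
        SAXParserFactory parserFactory = SAXParserFactory.newInstance();
        parserFactory.setNamespaceAware(true);
        ListParagraphFilter filter = new ListParagraphFilter(listClass);
        filter.setParent(parserFactory.newSAXParser().getXMLReader());

        TransformerHandler out = ((SAXTransformerFactory)
            SAXTransformerFactory.newInstance()).newTransformerHandler();
        out.getTransformer().setOutputProperty(OutputKeys.METHOD, "html");
        out.getTransformer().setOutputProperty(OutputKeys.INDENT, "no");
        StringWriter writer = new StringWriter();
        out.setResult(new StreamResult(writer));

        filter.setContentHandler(out);
        filter.parse(new InputSource(new StringReader(xhtml)));
        return writer.toString();
    }

    public static void main(String[] args) throws Exception {
        String sample = "<body><p>Intro</p><p class=\"listItem\">one</p>"
            + "<p class=\"listItem\">two</p><p>Outro</p></body>";
        System.out.println(rewrite(sample, "listItem"));
    }
}
```

Because the filter is an ordinary SAX ContentHandler, it could in principle sit between a parser and the serializing handler instead of re-parsing text as the demo does. It only produces flat, unordered lists; nesting and numbering would need information that a class attribute alone may not carry.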

BR,

Jukka Zitting