[jira] [Updated] (TIKA-2500) Apache Tika do not extract first line of the RTF file, It only extract last three char of first line.

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
1 message Options
Reply | Threaded
Open this post in threaded view
|

[jira] [Updated] (TIKA-2500) Apache Tika do not extract first line of the RTF file, It only extract last three char of first line.

JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/TIKA-2500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rohit Sureshrao Shelhalkar updated TIKA-2500:
---------------------------------------------
    Description:
When I am parsing the RTF file, it only prints last three characters of the first line.Remaining lines are printed properly.

Following is the RTF file text. Please copy the text and save as RTF format.

{\rtf1\ansi\ansicpg1252\uc1\deff0 {\rtf1\ansi\deff0 {\fonttbl{\f0 Arial;}}\f0 {\line {\b Level1} : \par} {\line This is level1 paragraph\line This is level1 paragraph\line This is level1 paragraph\line This is level1 paragraph\line This is level1 paragraph\line This is level1 paragraph\par} {\line {\b Level2} : \par} This is level2 paragraph.\line This is level2 paragraph.\line This is level2 paragraph.\par} } \line \par}

When you save this text as RTF file it will look like this.

[https://i.stack.imgur.com/p1rWe.png]

I have parsed above RTF from the following code.


{code:java}

BodyContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
FileInputStream inputstream = new FileInputStream(new File("level1Missing.rtf"));
ParseContext pcontext = new ParseContext();
RTFParser rt = new RTFParser();
rt.parse(inputstream, handler, metadata, pcontext);
//getting the content of the document
System.out.println("Contents of the PDF :\n\n" + handler.toString());
{code}

The output of the above code is following.

https://i.stack.imgur.com/cJYuQ.png



  was:
When I am parsing the RTF file, it only prints last three characters of the first line.Remaining lines are printed properly.

Following is the RTF file text. Please copy the text and save as RTF format.

{\rtf1\ansi\ansicpg1252\uc1\deff0 {\rtf1\ansi\deff0 {\fonttbl{\f0 Arial;}}\f0 {\line {\b Level1} : \par} {\line This is level1 paragraph\line This is level1 paragraph\line This is level1 paragraph\line This is level1 paragraph\line This is level1 paragraph\line This is level1 paragraph\par} {\line {\b Level2} : \par} This is level2 paragraph.\line This is level2 paragraph.\line This is level2 paragraph.\par} } \line \par}

When you save this text as RTF file it will look like this.

[https://i.stack.imgur.com/p1rWe.png]

I have parsed above RTF from the following code.


{code:java}
// Some comments here
BodyContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
FileInputStream inputstream = new FileInputStream(new File("level1Missing.rtf"));
ParseContext pcontext = new ParseContext();
RTFParser rt = new RTFParser();
rt.parse(inputstream, handler, metadata, pcontext);
//getting the content of the document
System.out.println("Contents of the PDF :\n\n" + handler.toString());
{code}

The output of the above code is following.

https://i.stack.imgur.com/cJYuQ.png




> Apache Tika do not extract first line of the RTF file, It only extract last three char of first line.
> -----------------------------------------------------------------------------------------------------
>
>                 Key: TIKA-2500
>                 URL: https://issues.apache.org/jira/browse/TIKA-2500
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.16
>            Reporter: Rohit Sureshrao Shelhalkar
>              Labels: RTF, RTFParser
>
> When I am parsing the RTF file, it only prints last three characters of the first line.Remaining lines are printed properly.
> Following is the RTF file text. Please copy the text and save as RTF format.
> {\rtf1\ansi\ansicpg1252\uc1\deff0 {\rtf1\ansi\deff0 {\fonttbl{\f0 Arial;}}\f0 {\line {\b Level1} : \par} {\line This is level1 paragraph\line This is level1 paragraph\line This is level1 paragraph\line This is level1 paragraph\line This is level1 paragraph\line This is level1 paragraph\par} {\line {\b Level2} : \par} This is level2 paragraph.\line This is level2 paragraph.\line This is level2 paragraph.\par} } \line \par}
> When you save this text as RTF file it will look like this.
> [https://i.stack.imgur.com/p1rWe.png]
> I have parsed above RTF from the following code.
> {code:java}
> BodyContentHandler handler = new BodyContentHandler();
> Metadata metadata = new Metadata();
> FileInputStream inputstream = new FileInputStream(new File("level1Missing.rtf"));
> ParseContext pcontext = new ParseContext();
> RTFParser rt = new RTFParser();
> rt.parse(inputstream, handler, metadata, pcontext);
> //getting the content of the document
> System.out.println("Contents of the PDF :\n\n" + handler.toString());
> {code}
> The output of the above code is following.
> https://i.stack.imgur.com/cJYuQ.png



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)