[jira] Created: (TIKA-394) Missing spaces on html parsing

classic Classic list List threaded Threaded
1 message Options
Reply | Threaded
Open this post in threaded view
|

[jira] Created: (TIKA-394) Missing spaces on html parsing

JIRA jira@apache.org
Missing spaces on html parsing
------------------------------

                 Key: TIKA-394
                 URL: https://issues.apache.org/jira/browse/TIKA-394
             Project: Tika
          Issue Type: Bug
          Components: parser
    Affects Versions: 0.6
         Environment: Tomcat 6, Windows XP (russian locale)
            Reporter: Andrey Barhatov


On parsing such html code:

text<p>more<br>yet<select><option>city1<option>city2</select>

resulting text is:

textmore
yetcity1city2

But must be:

text
more
yet city1 city2

Code sample:

import java.io.*;
import org.apache.tika.metadata.*;
import org.apache.tika.parser.*;

public class test {

   public static void main(String[] args) throws Exception {
      Metadata metadata = new Metadata();
      metadata.set(Metadata.CONTENT_TYPE, "text/html");
      String content = "text<p>more<br>yet<select><option>city1<option>city2</select>";

      InputStream in = new ByteArrayInputStream(content.getBytes("UTF-8"));
      AutoDetectParser parser = new AutoDetectParser();
      Reader reader = new ParsingReader(parser, in, metadata, new ParseContext());
      char[] buf = new char[10000];
      int len;
      StringBuffer text = new StringBuffer();
      while((len = reader.read(buf)) > 0) {
         text.append(buf, 0, len);
      }
      System.out.print(text);
   }
}


--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.