parsing mime-type text/html with parse-tika

alxsss

Hello,

I am trying to use the nutch-2.x trunk to parse text/html content with Tika, but I get the error "parser for text/html not found".
   
I see that the parse-tika code was changed. These lines

    // get the right parser using the mime type as a clue
    String mimeType = page.getContentType().toString();
    CompositeParser compositeParser = (CompositeParser) tikaConfig.getParser();
    Parser parser = compositeParser.getParsers().get(MediaType.parse(mimeType));
return no parser.

However, if I revert to the older version with

    // get the right parser using the mime type as a clue
    String mimeType = page.getContentType().toString();
    Parser parser = tikaConfig.getParser(mimeType);
   
   
it works.
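
One possible cause worth ruling out, offered only as an assumption and not confirmed for this case: the new code does an exact-key lookup in the map returned by getParsers(). If the content type string carries parameters (for example "text/html; charset=UTF-8"), an exact lookup can miss even though a parser is registered for the bare type. The sketch below is plain Java with no Tika dependency; the class name, the PARSERS map, and the helper methods are hypothetical stand-ins, not Nutch or Tika code.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

class MimeLookupSketch {

    // Hypothetical stand-in for a parser registry keyed by bare media type.
    static final Map<String, String> PARSERS = new HashMap<>();
    static {
        PARSERS.put("text/html", "HtmlParser");
        PARSERS.put("application/pdf", "PDFParser");
    }

    // Drop parameters such as "; charset=UTF-8" and normalize case and whitespace.
    static String baseType(String contentType) {
        int semicolon = contentType.indexOf(';');
        String base = semicolon >= 0 ? contentType.substring(0, semicolon) : contentType;
        return base.trim().toLowerCase(Locale.ROOT);
    }

    // Exact-match lookup: the key must equal the registered type character for character.
    static String exactLookup(String contentType) {
        return PARSERS.get(contentType);
    }

    // Lookup on the base type only, ignoring any parameters.
    static String normalizedLookup(String contentType) {
        return PARSERS.get(baseType(contentType));
    }

    public static void main(String[] args) {
        String raw = "text/html; charset=UTF-8";
        System.out.println(exactLookup(raw));      // prints "null"
        System.out.println(normalizedLookup(raw)); // prints "HtmlParser"
    }
}
```

If this turns out to be the issue, logging the value of mimeType before the lookup would show it, and as far as I know Tika's MediaType has a getBaseType() method that could be applied to the parsed type before calling get().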
   
Has anyone tested the new parse-tika code with text/html content?
   
Thanks.
   
Alex.