parser metadata empty after tika detect


aliosha79
I'm facing a problem with Tika parsing.
In my use case I have to parse different file types, each with the right parser, including .eml files.
My app can receive any kind of file as input. In particular, I have a MyEmail.eml file whose content type is detected as text/html. My goal is to get all of the file's available metadata.
With AutoDetectParser, MyEmail.eml is treated as text/html, which is not good enough, so I have to use RFC822Parser explicitly to get the Message-From and Message-To metadata.
For this purpose I wrote the following few lines of code:
       File f = new File("MyEmail.eml");
       InputStream is = new FileInputStream(f);

       Tika tika = new Tika();
       String mimeType = tika.detect(is);

       // Force the RFC822 parser only for .eml files detected as text/html
       Parser parser;
       if (FileUtils.getExtension("MyEmail.eml").equalsIgnoreCase("eml")
               && mimeType.equalsIgnoreCase("text/html")) {
           parser = new RFC822Parser();
       } else {
           parser = new AutoDetectParser();
       }

       // ch (the ContentHandler) and metadata are declared elsewhere (not shown)
       parser.parse(is, ch, metadata, new ParseContext());
       for (String item : metadata.names()) {
           System.out.println(item + " -- " + metadata.get(item));
       }
In this case the only metadata printed is Content-Type = application/octet-stream.
If I comment out tika.detect(is), the output contains all the metadata I need.
If I open a second input stream on the same file and run detection on that one instead:
       InputStream is2 = new FileInputStream(f);
       Tika tika = new Tika();
       String mimeType = tika.detect(is2);
then parsing the first stream also prints all the metadata I need.
What does tika.detect(InputStream) do to the stream?
Thanks a lot
Re: parser metadata empty after tika detect

Nick Burch-2
On Fri, 16 May 2014, aliosha79 wrote:
> For this purpose i have write these few code lines:
>
>       File f = new File("MyEmail.eml");
>       is= new FileInputStream(f);
>
>       Tika tika = new Tika();
>       String mimeType = tika.detect(is);

This will most likely consume a fair bit (possibly all) of the input
stream, so little or nothing is left for the parser to read afterwards.
You'd be much better off initialising a TikaInputStream from the File
object directly.
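
To make the "used up" effect concrete, here is a minimal sketch using only
the JDK, not Tika itself. The class StreamReuseDemo and its methods sniff,
remainingAfterPlainSniff, remainingAfterMarkedSniff and writeTemp are
hypothetical names made up for this illustration; sniff merely stands in for
a detection step that reads the first bytes of a stream. It compares how
much data is left after sniffing a plain FileInputStream with how much is
left when the stream is wrapped in a BufferedInputStream and rewound with
mark/reset. A TikaInputStream created from the File should give the parser
the same "fresh start", but that part is an assumption here and is not
exercised by the sketch.

```java
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class StreamReuseDemo {

    // Hypothetical stand-in for detection: reads up to 'limit' bytes and
    // does nothing to give them back.
    static void sniff(InputStream in, int limit) throws IOException {
        byte[] buf = new byte[limit];
        int total = 0;
        while (total < limit) {
            int n = in.read(buf, total, limit - total);
            if (n < 0) {
                break; // end of stream reached before the limit
            }
            total += n;
        }
    }

    // Counts the bytes a later reader (e.g. a parser) would still see.
    static int countRemaining(InputStream in) throws IOException {
        byte[] buf = new byte[4096];
        int total = 0;
        int n;
        while ((n = in.read(buf)) != -1) {
            total += n;
        }
        return total;
    }

    // Sniff a bare FileInputStream, then count what is left over.
    public static int remainingAfterPlainSniff(Path file, int limit) {
        try (InputStream in = new FileInputStream(file.toFile())) {
            sniff(in, limit);
            return countRemaining(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Same, but mark the buffered stream before sniffing and reset afterwards.
    public static int remainingAfterMarkedSniff(Path file, int limit) {
        try (InputStream in = new BufferedInputStream(new FileInputStream(file.toFile()))) {
            in.mark(limit);   // remember the current position
            sniff(in, limit); // reads at most 'limit' bytes, so the mark stays valid
            in.reset();       // rewind to the marked position
            return countRemaining(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Writes sample content to a temporary .eml file for the demo.
    public static Path writeTemp(String content) {
        try {
            Path p = Files.createTempFile("demo", ".eml");
            Files.write(p, content.getBytes(StandardCharsets.UTF_8));
            p.toFile().deleteOnExit();
            return p;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path p = writeTemp("From: a@example.com\nTo: b@example.com\n\nhello");
        System.out.println("left after plain sniff:  " + remainingAfterPlainSniff(p, 1024));
        System.out.println("left after marked sniff: " + remainingAfterMarkedSniff(p, 1024));
    }
}
```

With a sample file smaller than the sniff limit, the plain variant leaves
nothing behind, which matches the nearly empty metadata in the original
post, while the mark/reset variant still sees the whole file.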

> As input of my app i can have every kind of file. In particular i have a
> MyEmail.eml file whose content-type is recognized as text/html

I'd suggest you raise a bug, and attach a small file that doesn't detect
properly. We can then look at whether we can improve the detection.

Nick