AutoDetectParser and MS Office formats

AutoDetectParser and MS Office formats

Litrik De Roy
All,

I started working on the Eclipse plug-in that I have mentioned earlier
but I ran into a problem with the AutoDetectParser.

It does not seem to recognize any of the MS Office file formats. They
all return "application/octet-stream" as content type, but no
metadata.
All other file formats work OK.

I tested this with the test files included in the
src\test\resources\test-documents directory.

My source looks like this:

----8<--------8<--------8<--------8<--------8<--------8<----
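import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.xml.sax.helpers.DefaultHandler;
...
private AutoDetectParser parser = new AutoDetectParser();
private Metadata metadata = new Metadata();
...
parser.parse(stream, new DefaultHandler(), metadata);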
----8<--------8<--------8<--------8<--------8<--------8<----

I'm running Java 1.6.0_03 on Windows.

Is there anything special that must be done to get POI to work?

--
Litrik De Roy
Norio ICT Consulting - http://www.norio.be/

Re: AutoDetectParser and MS Office formats

Jukka Zitting
Hi,

On Feb 1, 2008 4:22 PM, Litrik De Roy <[hidden email]> wrote:
> I started working on the Eclipse plug-in that I have mentioned earlier
> but I ran into a problem with the AutoDetectParser.
>
> It does not seem to recognize any of the MS Office file formats. They
> all return "application/octet-stream" as content type, but no
> metadata. All other file formats work OK.
> [...]
> Is there anything special that must be done to get POI to work?

We currently don't have any magic header matchers for Microsoft Office
file formats, so the only thing AutoDetectParser can use to detect the
file type is the file name suffix.
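
To make the behaviour described above concrete, here is a minimal sketch of suffix-only detection. It is not Tika's actual detection code: the class name SuffixDetector, the three-entry suffix table, and the file name "report.doc" are all made up for illustration. It only shows why a missing file name ends up as "application/octet-stream".

```java
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration only; this is not Tika's detection code.
class SuffixDetector {

    // Fallback type when nothing is known about the stream.
    static final String DEFAULT_TYPE = "application/octet-stream";

    // Small hand-picked suffix table (an assumption made for this sketch).
    static final Map<String, String> TYPES = Map.of(
            "doc", "application/msword",
            "xls", "application/vnd.ms-excel",
            "ppt", "application/vnd.ms-powerpoint");

    // Returns the type registered for the file name's suffix, or the
    // default type when the name is missing or the suffix is unknown.
    static String detect(String resourceName) {
        if (resourceName == null) {
            return DEFAULT_TYPE;
        }
        int dot = resourceName.lastIndexOf('.');
        if (dot < 0 || dot == resourceName.length() - 1) {
            return DEFAULT_TYPE;
        }
        String suffix = resourceName.substring(dot + 1).toLowerCase(Locale.ROOT);
        return TYPES.getOrDefault(suffix, DEFAULT_TYPE);
    }

    public static void main(String[] args) {
        System.out.println(detect(null));         // prints application/octet-stream
        System.out.println(detect("report.doc")); // prints application/msword
    }
}
```

With no name supplied there is no suffix to look up, so the sketch falls back to the generic type, which matches the symptom reported in the original post.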

Do you have the file name available to your plugin? You can feed the
file name to AutoDetectParser like this:

    AutoDetectParser parser = new AutoDetectParser();
    InputStream stream = ...;
    ContentHandler handler = ...;
    Metadata metadata = new Metadata();
    // Pass the file name so the type can be detected from its suffix
    metadata.set(Metadata.RESOURCE_NAME_KEY, ...);
    parser.parse(stream, handler, metadata);

BR,

Jukka Zitting

Re: AutoDetectParser and MS Office formats

Litrik De Roy
On Feb 1, 2008 3:34 PM, Jukka Zitting <[hidden email]> wrote:

>
> On Feb 1, 2008 4:22 PM, Litrik De Roy <[hidden email]> wrote:
> > I started working on the Eclipse plug-in that I have mentioned earlier
> > but I ran into a problem with the AutoDetectParser.
> >
> > [...]
> > Is there anything special that must be done to get POI to work?
>
> We currently don't have any magic header matchers for Microsoft Office
> file formats, so the only thing AutoDetectParser can use to detect the
> file type is the file name suffix.
>
> Do you have the file name available to your plugin? You can feed the
> file name to AutoDetectParser like this:
>
>     AutoDetectParser parser = new AutoDetectParser();
>     InputStream stream = ...;
>     ContentHandler handler = ...;
>     Metadata metadata = new Metadata();
>     metadata.set(Metadata.RESOURCE_NAME_KEY, ...);
>     parser.parse(stream, handler, metadata);
>

That does the trick. Thanks.

--
Litrik De Roy
Norio ICT Consulting - http://www.norio.be/
[hidden email] - 0475 873235