Custom parser error

122jxgcn
Hi, I'm continuing my question from this post:
http://lucene.472066.n3.nabble.com/Convert-file-before-Tika-processes-it-td3990629.html

So, I wrote some code and a test, but the test is not passing.

In the test, I did something like


InputStream stream = HWPParserTest.class.getResourceAsStream(
        "/test-documents/testHWP.hwp");
try {
        parser.parse(stream, handler, metadata, context);
} finally {
        stream.close();
}


And my parser looks like


public void parse(
        InputStream stream, ContentHandler handler,
        Metadata metadata, ParseContext context)
        throws IOException, SAXException, TikaException {

  try {
      TikaInputStream tstream = TikaInputStream.cast(stream);
                   
      if (tstream != null && tstream.hasFile()) {
          File f = tstream.getFile();
          Process ps = Runtime.getRuntime().exec("/hwp2xml.bin", null, f);
          new XMLParser().parse(ps.getInputStream(), handler, metadata, context);
      }
  } finally {
      stream.close();
  }

  metadata.set(Metadata.CONTENT_TYPE, HWP_MIME_TYPE);
               
  XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata);
  xhtml.startDocument();
  xhtml.endDocument();
}


Based on my findings, it seems that casting the InputStream to a TikaInputStream is failing,
so the tstream variable becomes null, which results in an error.
I'm not sure what's going wrong here, as I made my parser similar to the PDF one.
Any help, please?

Also, I'm not sure whether I wrote this part correctly:

File f = tstream.getFile();
Process ps = Runtime.getRuntime().exec("/hwp2xml.bin", null, f);
new XMLParser().parse(ps.getInputStream(), handler, metadata, context);

Re: Custom parser error

Nick Burch-2
On Tue, 31 Jul 2012, 122jxgcn wrote:
>  try {
>      TikaInputStream tstream = TikaInputStream.cast(stream);

You probably want TikaInputStream.get rather than cast. cast only casts the
stream if it already is a TikaInputStream (otherwise you get null), while
get wraps it.
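
The difference can be sketched in plain Java. The FileBackedStream class below is a hypothetical stand-in, not the real Tika class; it only mimics the behaviour described above, where "cast" succeeds only for a stream that is already of the wrapper type and "get" wraps any stream.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.InputStream;

// Hypothetical stand-in for a wrapper stream type; NOT the Tika class.
class FileBackedStream extends FilterInputStream {
    private FileBackedStream(InputStream in) {
        super(in);
    }

    // "cast": succeeds only if the stream already is the wrapper type.
    static FileBackedStream cast(InputStream in) {
        return (in instanceof FileBackedStream) ? (FileBackedStream) in : null;
    }

    // "get": reuse an existing wrapper, otherwise wrap the plain stream.
    static FileBackedStream get(InputStream in) {
        FileBackedStream existing = cast(in);
        return existing != null ? existing : new FileBackedStream(in);
    }

    public static void main(String[] args) {
        InputStream plain = new ByteArrayInputStream(new byte[] {1, 2, 3});
        System.out.println(cast(plain) == null); // true: a plain stream cannot be cast
        System.out.println(get(plain) != null);  // true: get always returns a wrapper
    }
}
```

A stream returned by getResourceAsStream is a plain stream in this sense, which would explain why the cast in the parser above came back null.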

Nick

Re: Custom parser error

122jxgcn
Hi Nick,
I tried TikaInputStream.get() and tstream is no longer null.
But it seems that tstream.hasFile() is null.
I'm pretty sure I'm loading the file right, as I did the same thing with the parser for PDF.

Re: Custom parser error

Nick Burch-2
On Tue, 31 Jul 2012, 122jxgcn wrote:
> I tried TikaInputStream.get() and tstream is no longer null.
> But it seems that tstream.hasFile() is null.

If you create a TikaInputStream with an InputStream, then initially
hasFile will be false. If you create it with a file, it'll be true

If your TikaInputStream lacks a file, and getFile is called, one will
automatically be created for you. (That's part of the point!)

Nick

Re: Custom parser error

122jxgcn
Hi Nick, sorry to bother again but I'm not quite sure of what you have said.

Nick Burch-2 wrote
> On Tue, 31 Jul 2012, 122jxgcn wrote:
> If your TikaInputStream lacks a file, and getFile is called, one will
> automatically be created for you. (That's part of the point!)

I believe created file will be empty. Then how can I process the input file without its data?

So basically, my file is converted to InputStream by

InputStream stream = HWPParserTest.class.getResourceAsStream(
                "/test-documents/testHWP.hwp");

After that, the InputStream is passed to parse() of HWPParser,
and it should be converted to a TikaInputStream without losing the input file data.
I'm currently doing

TikaInputStream tstream = TikaInputStream.get(stream);

right now.
I believe tstream.hasFile() should be true right away in order for my parser class to work.

Thanks a lot.

RE: Custom parser error

Uwe Schindler
Hi,

> Hi Nick, sorry to bother again but I'm not quite sure of what you have said.
>
> Nick Burch-2 wrote
> >
> > On Tue, 31 Jul 2012, 122jxgcn wrote:
> > If your TikaInputStream lacks a file, and getFile is called, one will
> > automatically be created for you. (That's part of the point!)
> >
> I believe created file will be empty. Then how can I process the input file
> without its data?

It will not be empty. It seems there is some misunderstanding here. Of
course an InputStream obtained via getResourceAsStream has no backing file
(or the file is not easily reachable). The main idea behind TikaInputStream
is to provide the file on request. If hasFile() returns false,
TikaInputStream will do the following when you call getFile():
- create a temporary file
- copy the whole stream to the temporary file

After that you can process the contents. If the InputStream passed to
TikaInputStream has a backing file available, getFile() will return it
directly, but in most cases it will create a temporary one and copy the
contents into it. Because of this it's always better to make your parser work
on an InputStream and only use a file if the parser cannot (e.g. because it
needs random access).
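
The two steps above can be sketched with the standard library alone. This is only an illustration of the idea, not Tika's actual implementation; the SpoolDemo class and the spoolToTempFile helper are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper that mimics "create a temp file, copy the stream into it".
class SpoolDemo {
    static Path spoolToTempFile(InputStream in) {
        try {
            // Step 1: create the temporary file.
            Path tmp = Files.createTempFile("spool-", ".tmp");
            // Step 2: copy the whole stream into it (the file already exists,
            // so REPLACE_EXISTING is required).
            Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
            return tmp;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        InputStream in = new ByteArrayInputStream("hello".getBytes(StandardCharsets.UTF_8));
        Path tmp = spoolToTempFile(in);
        System.out.println(Files.size(tmp)); // 5: the file holds the stream's data
        System.out.println(in.read());       // -1: the original stream is used up
        Files.delete(tmp);
    }
}
```

The temporary file ends up with the full contents of the stream, so a file produced this way is not empty.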

> So basically, my file is converted to InputStream by
>
> InputStream stream = HWPParserTest.class.getResourceAsStream(
>                 "/test-documents/testHWP.hwp");
>
> After that, InputStream stream is passed to parser() of HWPParser and it
> should be converted to TikaInputStream tstream without the loss of input
> file data.
> I'm currently doing
>
> TikaInputStream tstream = TikaInputStream.get(stream);
>
> right now.
> I believe tstream.hasFile() should true right away in order to my parser
> class to work.

No, hasFile only tells you if the wrapped InputStream has a backing file;
for resource streams this is not the case. If you call getFile() it will
emulate a backing file by copying to a temporary one. After that the stream
is exhausted.
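
Once you have a real file, its path can be handed to an external converter. Note that in Runtime.exec(String, String[], File) the third argument is the working directory of the new process, not an input file. Below is a rough sketch using ProcessBuilder; it assumes (this is not confirmed anywhere in this thread) that the converter takes the input path as a command-line argument and writes XML to standard output. ConverterCommand and forFile are hypothetical names.

```java
import java.io.File;
import java.util.List;

// Hypothetical helper: builds, but does not start, the converter process.
class ConverterCommand {
    static ProcessBuilder forFile(String converter, File input) {
        // Assumption: the converter accepts the input file path as its only argument.
        // The working directory is left unset, so the child inherits the current one.
        return new ProcessBuilder(converter, input.getAbsolutePath());
    }

    public static void main(String[] args) {
        ProcessBuilder pb = forFile("/hwp2xml.bin", new File("testHWP.hwp"));
        List<String> cmd = pb.command();
        System.out.println(cmd.size()); // 2: the executable plus the file path
    }
}
```

Calling start() on the returned builder and reading getInputStream() of the resulting Process would then give the converter's output, which is what the XMLParser call in the original code expects.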

> Thanks a lot.
>
> --
> View this message in context: http://lucene.472066.n3.nabble.com/Custom-parser-error-tp3998302p3998536.html
> Sent from the Apache Tika - Development mailing list archive at Nabble.com.