PDF parser (two more questions)

Stefano Fornari
Hi,
I have two more questions on PDFParser:

1. Is the use of PDF2XHTML necessary? Why is the PDF turned into XHTML?
For the purpose of indexing, wouldn't just the text be enough?
2. I need to limit indexing of content to files whose size is below
a certain threshold; I was wondering if this could be a parser
configuration option and, if so, whether you would accept this change.

Thanks in advance,
Ste

Re: PDF parser (two more questions)

Jukka Zitting
Hi,

On Thu, Mar 27, 2014 at 6:21 PM, Stefano Fornari
<[hidden email]> wrote:
> 1. Is the use of PDF2XHTML necessary? Why is the PDF turned into XHTML?
> For the purpose of indexing, wouldn't just the text be enough?

The XHTML output allows us to annotate the extracted text with
structural information (like "this is a heading", "here's a
hyperlink", etc.) that would be difficult to express with text-only
output. A client that needs just the text content can get it easily
with the BodyContentHandler class.
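
To illustrate the point about text-only clients, here is a minimal sketch
that uses only the JDK's SAX classes. TextOnlyHandler and extractText are
hypothetical names, not Tika's BodyContentHandler: the handler simply keeps
the character events of an XHTML stream and ignores the tags.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class TextOnlyDemo {

    /** Hypothetical text-only handler: keeps character data, ignores markup. */
    static class TextOnlyHandler extends DefaultHandler {
        private final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }

        @Override
        public String toString() {
            return text.toString();
        }
    }

    /** Feeds an XHTML string through a SAX parser and returns only its text. */
    static String extractText(String xhtml) throws Exception {
        TextOnlyHandler handler = new TextOnlyHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.toString();
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<html><body><h1>Title</h1>"
                + "<p>See <a href=\"x\">this link</a>.</p></body></html>";
        // The heading and hyperlink markup is dropped; only the text remains
        System.out.println(extractText(xhtml));
    }
}
```

The structural events are still there for clients that want them; a
text-only client just chooses not to look at them.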

> 2. I need to limit indexing of content to files whose size is below
> a certain threshold; I was wondering if this could be a parser
> configuration option and, if so, whether you would accept this change.

Do you want to entirely exclude too large files, or just index the
first few pages of such files (which is more common in many indexing
use cases)?

The latter use case can be implemented with the writeLimit parameter of
the WriteOutContentHandler class, like this:

    // Extract up to 100k characters from a given document
    WriteOutContentHandler out = new WriteOutContentHandler(100_000);
    try {
        parser.parse(..., new BodyContentHandler(out), ...);
    } catch (SAXException e) {
        // Reaching the write limit is expected; rethrow anything else
        if (!out.isWriteLimitReached(e)) {
            throw e;
        }
    }
    // Holds whatever text was extracted before parsing stopped
    String content = out.toString();
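
For the other option mentioned above, excluding oversized files entirely, a
size check before the file ever reaches the parser may be all that is
needed. A minimal sketch, assuming the documents live on the local file
system; SizeFilter, isIndexable and maxBytes are hypothetical names and not
part of Tika:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

class SizeFilter {

    /** Returns true if the file is a regular file no larger than maxBytes. */
    static boolean isIndexable(Path file, long maxBytes) throws IOException {
        return Files.isRegularFile(file) && Files.size(file) <= maxBytes;
    }

    public static void main(String[] args) throws IOException {
        // Create a 2048-byte temporary file to stand in for a document
        Path tmp = Files.createTempFile("sample", ".pdf");
        try {
            Files.write(tmp, new byte[2048]);
            System.out.println(isIndexable(tmp, 1024)); // false: above the threshold
            System.out.println(isIndexable(tmp, 4096)); // true: below the threshold
        } finally {
            Files.delete(tmp);
        }
    }
}
```

Note that this limits the size of the file on disk, whereas the write limit
above limits the amount of extracted text.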

BR,

Jukka Zitting

Re: PDF parser (two more questions)

Stefano Fornari
Hi Jukka,
thanks a lot for your reply.

On #1, I am still wondering why we need structural information for indexing.
Is there any particular reason? Wouldn't it make more sense to get just the
text by default and only optionally get the structure?

On #2, I expected the code you presented would not work. And in fact the
pattern is quite odd, isn't it? What is the reason for throwing an
exception if limiting the text read is a legitimate use case? (I am asking
just to understand the background.)

Ste


Re: PDF parser (two more questions)

Konstantin Gribov
The exception is rethrown only if the write limit was not reached. So if an
exception occurs within the first 100k characters, it affects the result. If
an exception is thrown after that, it is suppressed.

--
Best regards,
Konstantin Gribov.

Re: PDF parser (two more questions)

Stefano Fornari
Yes, got it. It is a strange use case, though: if I set the limit, first, I
would not expect an exception (which represents an unexpected error
condition); second, I would not expect to rethrow it only under certain
conditions. I understood the trick, but I am trying to understand this is
done in this way (which at first glance does not seem clean).

Re: PDF parser (two more questions)

Stefano Fornari
On Fri, Mar 28, 2014 at 11:26 AM, Stefano Fornari <[hidden email]> wrote:

> I understood the trick, but I am trying to understand this is done in this
> way (which at first glance does not seem clean).

... trying to understand why this is done in this way...

Re: PDF parser (two more questions)

Konstantin Gribov
SAXException is a checked exception, so you have to either catch it or add
it to the method's throws clause (otherwise javac won't compile the code).
Tika usually rethrows exceptions wrapped in a TikaException. In the code
above, the method throws SAXException.

The exception is suppressed so that the parse does not fail after a valuable
amount of data has already been extracted.

--
Best regards,
Konstantin Gribov.

Re: PDF parser (two more questions)

Stefano Fornari
Well, I should look at the code, which I can't do right now, but I guess my
point is that BodyContentHandler should not throw an exception (and most
probably not a SAXException in any case) when the limit is reached. This
means that the limit should not be put on the WriteOutContentHandler, but on
BodyContentHandler.

Ste



Re: PDF parser (two more questions)

Konstantin Gribov
All such handlers are implementations of the org.xml.sax.ContentHandler
interface, so their methods throw SAXException. But in the code above none
of the ContentHandler methods are invoked directly (only inside
parser.parse, to which the content handler is passed).

You can take a look at org.apache.tika.Tika.parseToString(InputStream,
Metadata, int) as a reference. It has code similar to Jukka's code above.


--
Best regards,
Konstantin Gribov.



Re: PDF parser (two more questions)

Jukka Zitting
In reply to this post by Stefano Fornari
Hi,

On Fri, Mar 28, 2014 at 5:32 AM, Stefano Fornari
<[hidden email]> wrote:
> On #1, I am still wondering why we need structural information for indexing.
> Is there any particular reason? Wouldn't it make more sense to get just the
> text by default and only optionally get the structure?

The trouble is that then each parser would need to have code for
producing both text and XHTML. Since the overhead of producing XHTML
instead of just text is pretty low, and since it's very easy for
clients that only care about the text output to just strip out the
markup, it made more sense to design the system to always produce
XHTML.

The same applies to document metadata. All parsers produce as much
metadata as they can, but most clients will just ignore most or all of
the returned metadata fields. However, since the overhead of producing
all the information is lower than that of adding explicit options to
control which metadata needs to be extracted and returned, it makes
sense to just let clients filter out those bits that they don't
care about.
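
As a rough illustration of client-side filtering, the sketch below uses a
plain Map as a stand-in for the returned metadata. MetadataFilterDemo and
keepOnly, as well as the field names and values, are made up for the
example and do not come from Tika.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

class MetadataFilterDemo {

    /** Returns a copy of the metadata containing only the wanted field names. */
    static Map<String, String> keepOnly(Map<String, String> all, Set<String> wanted) {
        Map<String, String> kept = new LinkedHashMap<>(all);
        kept.keySet().retainAll(wanted);
        return kept;
    }

    public static void main(String[] args) {
        // Made-up field names and values, purely for illustration
        Map<String, String> all = new LinkedHashMap<>();
        all.put("title", "Annual report");
        all.put("author", "J. Doe");
        all.put("producer", "Some PDF tool");

        // The client keeps the two fields it cares about and drops the rest
        System.out.println(keepOnly(all, Set.of("title", "author")));
    }
}
```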

> On #2, I expected the code you presented would not work. And in fact the
> pattern is quite odd, isn't it? What is the reason for throwing an
> exception if limiting the text read is a legitimate use case? (I am asking
> just to understand the background.)

Yes, the pattern is a bit awkward and generally shouldn't be
recommended as it uses an exception to control the flow of the
program. However, in this case we considered it worth doing as the
alternative would have been far more complicated.

Basically we wanted to avoid having to modify each parser
implementation (even those implemented outside Tika...) to keep track
of how much content has already been extracted and instead do that
just once in the WriteOutContentHandler class. However, the only way
for the WriteOutContentHandler to signal that parsing should be
stopped is by throwing a SAXException, which is what we're doing here.
By catching the exception and inspecting it with isWriteLimitReached()
the client can determine whether this is what happened.
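
The mechanism described above can be sketched with the JDK's own SAX
classes. This is not Tika's implementation: LimitedTextHandler,
LimitReachedException and extractUpTo are hypothetical names, and the sketch
only shows how a handler can stop a parser it does not control by throwing a
SAXException that the client later recognises.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class WriteLimitDemo {

    /** Hypothetical marker exception: distinguishes "limit hit" from real errors. */
    static class LimitReachedException extends SAXException {
        LimitReachedException(String message) {
            super(message);
        }
    }

    /** Collects character data and aborts parsing once the limit is exceeded. */
    static class LimitedTextHandler extends DefaultHandler {
        private final StringBuilder text = new StringBuilder();
        private final int limit;

        LimitedTextHandler(int limit) {
            this.limit = limit;
        }

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            int room = limit - text.length();
            if (length > room) {
                text.append(ch, start, room); // keep the part that still fits
                // Throwing is the only way a handler can make the parser stop
                throw new LimitReachedException("write limit of " + limit + " reached");
            }
            text.append(ch, start, length);
        }

        /** True if t, or any of its causes, is this handler's stop signal. */
        boolean isLimitReached(Throwable t) {
            for (Throwable c = t; c != null; c = c.getCause()) {
                if (c instanceof LimitReachedException) {
                    return true;
                }
            }
            return false;
        }

        @Override
        public String toString() {
            return text.toString();
        }
    }

    /** Client side: swallow the stop signal, let every other error propagate. */
    static String extractUpTo(String xml, int limit) throws Exception {
        LimitedTextHandler out = new LimitedTextHandler(limit);
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), out);
        } catch (SAXException e) {
            if (!out.isLimitReached(e)) {
                throw e; // a genuine parse error, not the stop signal
            }
        }
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<doc><p>hello</p><p>world</p></doc>";
        System.out.println(extractUpTo(xml, 7));   // truncated text
        System.out.println(extractUpTo(xml, 100)); // full text, limit never hit
    }
}
```

The parser itself knows nothing about the limit, which is the point: the
bookkeeping lives in one handler instead of in every parser.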

BR,

Jukka Zitting