[jira] Created: (TIKA-612) Specify PDFBox options via ParseContext

[jira] Created: (TIKA-612) Specify PDFBox options via ParseContext

Clark Perkins (Jira)
Specify PDFBox options via ParseContext
----------------------------------------

                 Key: TIKA-612
                 URL: https://issues.apache.org/jira/browse/TIKA-612
             Project: Tika
          Issue Type: New Feature
          Components: parser
    Affects Versions: 0.9
            Reporter: Julien Nioche
            Assignee: Julien Nioche
            Priority: Minor


See https://issues.apache.org/jira/browse/TIKA-611. The options used by PDFBox are currently hard-coded in the PDFParser code; we will allow them to be specified via the ParseContext object.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (TIKA-612) Specify PDFBox options via ParseContext

Clark Perkins (Jira)

    [ https://issues.apache.org/jira/browse/TIKA-612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13047143#comment-13047143 ]

Lau Brino commented on TIKA-612:
--------------------------------

Hi. Due to this serious bug in PDFBox (https://issues.apache.org/jira/browse/PDFBOX-956), I would appreciate it if you could implement this.


[jira] [Issue Comment Edited] (TIKA-612) Specify PDFBox options via ParseContext

Clark Perkins (Jira)

    [ https://issues.apache.org/jira/browse/TIKA-612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13047143#comment-13047143 ]

Lau Brino edited comment on TIKA-612 at 6/10/11 11:33 AM:
----------------------------------------------------------

Hi. Due to this serious bug in PDFBox (https://issues.apache.org/jira/browse/PDFBOX-956), I would appreciate it if you could implement this. It would then be possible to turn suppressDuplicateOverlappingText off.

      was (Author: laubrino):
    Hi. Due to this serious bug in PDFBox (https://issues.apache.org/jira/browse/PDFBOX-956), I would appreciate it if you could implement this.
 


[jira] [Updated] (TIKA-612) Specify PDFBox options via ParseContext

Clark Perkins (Jira)

     [ https://issues.apache.org/jira/browse/TIKA-612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche updated TIKA-612:
-------------------------------

    Attachment: Tika-612.patch

Patch that allows the options to be specified via the Context object. WDYT?


[jira] [Updated] (TIKA-612) Specify PDFBox options via ParseContext

Clark Perkins (Jira)

     [ https://issues.apache.org/jira/browse/TIKA-612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated TIKA-612:
------------------------------------

    Attachment: testPDFTwoColumns.pdf
                TIKA-612-testcase.patch

I'm attaching a test case (it passes), showing a PDF with two columns and verifying that the text within a single column is kept contiguous.


[jira] [Commented] (TIKA-612) Specify PDFBox options via ParseContext

Clark Perkins (Jira)

    [ https://issues.apache.org/jira/browse/TIKA-612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13096266#comment-13096266 ]

Jukka Zitting commented on TIKA-612:
------------------------------------

+1 looks good to me.

A possible design improvement could be to make PDFParseOptions an interface like the following:

{code}
public interface PDFParseOptions {
    void apply(PDFTextStripper stripper);
}
{code}

The proposed bean class would implement that interface like this:

{code}
    public void apply(PDFTextStripper stripper) {
        stripper.setForceParsing(getForceParsing());
        stripper.setSortByPosition(getSortByPosition());
    }
{code}

This would make it easy for client applications to also apply other PDF parsing settings that Tika does not currently know about.
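
To illustrate how such a callback-style interface might be used, here is a minimal, self-contained sketch. It is not the code from the attached patch. StubTextStripper is a hypothetical stand-in for PDFBox's PDFTextStripper (so the example compiles without PDFBox), and BeanParseOptions, ParseOptionsDemo and the default values are assumptions made only for the illustration.

```java
// Hypothetical stand-in for PDFBox's PDFTextStripper so the sketch compiles on its own.
class StubTextStripper {
    boolean forceParsing;
    boolean sortByPosition;
    boolean suppressDuplicateOverlappingText = true; // arbitrary default for the sketch

    void setForceParsing(boolean value) { forceParsing = value; }
    void setSortByPosition(boolean value) { sortByPosition = value; }
    void setSuppressDuplicateOverlappingText(boolean value) {
        suppressDuplicateOverlappingText = value;
    }
}

// The callback-style interface from the comment above, written against the stand-in.
interface PDFParseOptions {
    void apply(StubTextStripper stripper);
}

// Bean-style implementation; the class name and defaults are assumptions.
class BeanParseOptions implements PDFParseOptions {
    private boolean forceParsing = true;
    private boolean sortByPosition = false;

    boolean getForceParsing() { return forceParsing; }
    void setForceParsing(boolean value) { forceParsing = value; }
    boolean getSortByPosition() { return sortByPosition; }
    void setSortByPosition(boolean value) { sortByPosition = value; }

    @Override
    public void apply(StubTextStripper stripper) {
        stripper.setForceParsing(getForceParsing());
        stripper.setSortByPosition(getSortByPosition());
    }
}

class ParseOptionsDemo {
    // What a parser could do just before extracting text: create the stripper,
    // then let the options object configure it.
    static StubTextStripper configure(PDFParseOptions options) {
        StubTextStripper stripper = new StubTextStripper();
        options.apply(stripper);
        return stripper;
    }

    public static void main(String[] args) {
        BeanParseOptions bean = new BeanParseOptions();
        bean.setSortByPosition(true);
        StubTextStripper viaBean = configure(bean);
        System.out.println("sortByPosition=" + viaBean.sortByPosition);

        // A client-supplied implementation can change a setting the bean does not model.
        PDFParseOptions custom = s -> s.setSuppressDuplicateOverlappingText(false);
        StubTextStripper viaCustom = configure(custom);
        System.out.println("suppressDuplicates=" + viaCustom.suppressDuplicateOverlappingText);
    }
}
```

The trade-off is that the interface signature names the underlying library's class, so callers are coupled to whatever text stripper the parser uses internally.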


[jira] [Commented] (TIKA-612) Specify PDFBox options via ParseContext

Clark Perkins (Jira)

    [ https://issues.apache.org/jira/browse/TIKA-612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13146397#comment-13146397 ]

Michael McCandless commented on TIKA-612:
-----------------------------------------

bq. This would make it easy for client applications to also apply other PDF parsing settings that Tika does not currently know about.

+1, this seems like it'd be more general.  For example, we could fold in get/setSuppressDuplicateOverlappingText (and move it off of PDFParser), and maybe also get/setEnableAutoSpace.

In general, since there are so many options on PDFTextStripper, and the "right" settings seem to vary from PDF to PDF, it's important that we expose full control...
               


[jira] [Commented] (TIKA-612) Specify PDFBox options via ParseContext

Clark Perkins (Jira)

    [ https://issues.apache.org/jira/browse/TIKA-612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13146751#comment-13146751 ]

Gregory Kanevsky commented on TIKA-612:
---------------------------------------

Just a design comment: is it really appropriate to expose the implementation class PDFTextStripper (from PDFBox) in a general-purpose Tika interface like PDFParseOptions?

Alternatively, one could specify a ParseContext key MULTI_COLUMN_PDF (true or false) that PDFParser would use to call setSortByPosition.
               


[jira] [Updated] (TIKA-612) Specify PDFBox options via ParseContext

Clark Perkins (Jira)

     [ https://issues.apache.org/jira/browse/TIKA-612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gregory Kanevsky updated TIKA-612:
----------------------------------

    Comment: was deleted

(was: Just a design comment: is it really appropriate to expose the implementation class PDFTextStripper (from PDFBox) in a general-purpose Tika interface like PDFParseOptions?

Alternatively, one could specify a ParseContext key MULTI_COLUMN_PDF (true or false) that PDFParser would use to call setSortByPosition.)
   


[jira] [Commented] (TIKA-612) Specify PDFBox options via ParseContext

Clark Perkins (Jira)

    [ https://issues.apache.org/jira/browse/TIKA-612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13148695#comment-13148695 ]

Michael McCandless commented on TIKA-612:
-----------------------------------------

I agree, we probably shouldn't expose PDFTextStripper
directly; it'd be better (less API surface area) if we pick certain
options and expose them ourselves.  Then if PDFTextStripper changes
things, or if we somehow switch to a different PDF lib, we won't break
our users.

Alternatively, can we just expose options on PDFParser directly?  This is
more intuitive and direct (you just use setters on the parser), and we
can name/genericize the options and choose which to expose.  (This is
what I've been doing on the last few PDF issues....)

               


[jira] [Updated] (TIKA-612) Specify PDFBox options via ParseContext

Clark Perkins (Jira)

     [ https://issues.apache.org/jira/browse/TIKA-612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated TIKA-612:
------------------------------------

    Attachment: TIKA-612.patch

Patch, just adding setSortByPosition to PDFParser.  I think this is more straightforward and lets us control what/how we expose...
               


[jira] [Resolved] (TIKA-612) Specify PDFBox options via ParseContext

Clark Perkins (Jira)

     [ https://issues.apache.org/jira/browse/TIKA-612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless resolved TIKA-612.
-------------------------------------

       Resolution: Fixed
    Fix Version/s: 1.1
         Assignee: Michael McCandless  (was: Julien Nioche)

I committed the last patch; let's open separate issues for other options that need exposing...
               


[jira] [Commented] (TIKA-612) Specify PDFBox options via ParseContext

Clark Perkins (Jira)

    [ https://issues.apache.org/jira/browse/TIKA-612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13205370#comment-13205370 ]

Jan Høydahl commented on TIKA-612:
----------------------------------

So how do we set a PDFBox option via ParseContext in practice? Say we want to {{setEnableAutoSpace(false)}}.
The test case attached to this issue calls {{parser.setEnableAutoSpace(false)}} directly on the parser, not via parseContext.
               


[jira] [Commented] (TIKA-612) Specify PDFBox options via ParseContext

Clark Perkins (Jira)

    [ https://issues.apache.org/jira/browse/TIKA-612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13205376#comment-13205376 ]

Nick Burch commented on TIKA-612:
---------------------------------

The conclusion was to expose the options on the PDFParser directly instead. setEnableAutoSpace is already supported by PDFParser.

If you know you have a PDF, create a PDFParser, set the options, then parse.

If you want to use something like AutoDetectParser but with special PDF options, you have two options. One is to fetch the parsers from the AutoDetectParser, possibly recursing, until you find the PDFParser, and set the options on it. The other is to create a new AutoDetectParser on an explicitly created PDFParser, with the DefaultParser as a fallback.
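
The two approaches can be sketched roughly as follows. This is only an illustration built from hypothetical stand-in types (StubParser, StubPdfParser, StubTextParser, StubCompositeParser), not Tika's real classes. In particular, the stand-in composite keeps its children in a plain list, which is an assumption made to keep the example short, so the traversal would need adapting to whatever accessors the real parsers provide.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-ins for the parser types discussed above.
interface StubParser {}

class StubPdfParser implements StubParser {
    private boolean enableAutoSpace = true; // arbitrary default for the sketch
    private boolean sortByPosition = false;

    void setEnableAutoSpace(boolean value) { enableAutoSpace = value; }
    boolean getEnableAutoSpace() { return enableAutoSpace; }
    void setSortByPosition(boolean value) { sortByPosition = value; }
    boolean getSortByPosition() { return sortByPosition; }
}

// Some other leaf parser that the search should skip.
class StubTextParser implements StubParser {}

// Stand-in for a composite/auto-detecting parser that delegates to child parsers.
class StubCompositeParser implements StubParser {
    private final List<StubParser> children;

    StubCompositeParser(StubParser... children) {
        this.children = Arrays.asList(children);
    }

    List<StubParser> getChildren() { return children; }
}

class PdfParserLookupDemo {
    // Option 1: depth-first search through nested composites.
    // Returns null when the tree contains no PDF parser.
    static StubPdfParser findPdfParser(StubParser parser) {
        if (parser instanceof StubPdfParser) {
            return (StubPdfParser) parser;
        }
        if (parser instanceof StubCompositeParser) {
            for (StubParser child : ((StubCompositeParser) parser).getChildren()) {
                StubPdfParser found = findPdfParser(child);
                if (found != null) {
                    return found;
                }
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Option 1: locate the existing PDF parser inside a nested composite and configure it.
        StubParser auto = new StubCompositeParser(
                new StubTextParser(),
                new StubCompositeParser(new StubPdfParser()));
        StubPdfParser existing = findPdfParser(auto);
        existing.setEnableAutoSpace(false);
        System.out.println("autoSpace=" + existing.getEnableAutoSpace());

        // Option 2: configure a PDF parser up front, then build a new composite
        // around it with a default composite as the fallback.
        StubPdfParser configured = new StubPdfParser();
        configured.setSortByPosition(true);
        StubParser fallback = new StubCompositeParser(new StubTextParser());
        StubParser custom = new StubCompositeParser(configured, fallback);
        System.out.println("sortByPosition=" + findPdfParser(custom).getSortByPosition());
    }
}
```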
               


[jira] [Commented] (TIKA-612) Specify PDFBox options via ParseContext

Clark Perkins (Jira)

    [ https://issues.apache.org/jira/browse/TIKA-612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13205394#comment-13205394 ]

Jan Høydahl commented on TIKA-612:
----------------------------------

Hmm, that's kind of awkward to use from e.g. SolrCell. Any chance of considering a PDFParseOptions on the Context as an alternative?
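
For illustration only, a context-based alternative might look roughly like the sketch below. Nothing here is existing Tika code: StubParseContext is a hypothetical stand-in that imitates a class-keyed context, and PdfOptions and ContextAwarePdfParser are invented names. The idea is that the caller puts an options object on the context and the parser falls back to its own defaults when none is present.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a class-keyed parse context.
class StubParseContext {
    private final Map<Class<?>, Object> entries = new HashMap<>();

    <T> void set(Class<T> key, T value) {
        entries.put(key, value);
    }

    // Returns the stored value for the key, or the given default when nothing was set.
    <T> T get(Class<T> key, T defaultValue) {
        Object value = entries.get(key);
        return value != null ? key.cast(value) : defaultValue;
    }
}

// Hypothetical options bean; the option names follow those mentioned in this thread.
class PdfOptions {
    private boolean enableAutoSpace = true; // arbitrary default for the sketch
    private boolean sortByPosition = false;

    boolean getEnableAutoSpace() { return enableAutoSpace; }
    void setEnableAutoSpace(boolean value) { enableAutoSpace = value; }
    boolean getSortByPosition() { return sortByPosition; }
    void setSortByPosition(boolean value) { sortByPosition = value; }
}

// Hypothetical parser that consults the context on every parse call.
class ContextAwarePdfParser {
    private final PdfOptions defaults = new PdfOptions();

    // Options in effect for one parse call: the caller's, if supplied, else the defaults.
    PdfOptions effectiveOptions(StubParseContext context) {
        return context.get(PdfOptions.class, defaults);
    }
}

class ContextOptionsDemo {
    public static void main(String[] args) {
        ContextAwarePdfParser parser = new ContextAwarePdfParser();

        // No options on the context: the parser's defaults apply.
        StubParseContext plain = new StubParseContext();
        System.out.println("autoSpace=" + parser.effectiveOptions(plain).getEnableAutoSpace());

        // The caller overrides one setting without needing a reference to the parser,
        // which is the property that matters when the parser is created by a framework.
        PdfOptions custom = new PdfOptions();
        custom.setEnableAutoSpace(false);
        StubParseContext tuned = new StubParseContext();
        tuned.set(PdfOptions.class, custom);
        System.out.println("autoSpace=" + parser.effectiveOptions(tuned).getEnableAutoSpace());
    }
}
```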
               
