[jira] [Created] (TIKA-861) Parse links in PDF


Radim Rehurek (Jira)
Parse links in PDF
------------------

                 Key: TIKA-861
                 URL: https://issues.apache.org/jira/browse/TIKA-861
             Project: Tika
          Issue Type: New Feature
          Components: parser
    Affects Versions: 1.0
            Reporter: Sasha Goodman
            Priority: Minor
             Fix For: 1.1


Currently the XHTML output doesn't contain links, although PDFBox parses them. I'm new to Tika and haven't written Java in six years, but someone more experienced could probably implement this in a few hours.

The PDF2XHTML class loops through each page's annotations.

See (line 136):
{code:java}
for (Object o : page.getAnnotations()) {
{code}

I found some code for dealing with links in annotations:
http://stackoverflow.com/questions/7174709/pdfbox-not-recognizing-a-link

It involves checking the class of each annotation.
{code:java}
if (annotation instanceof PDAnnotationLink) {
    PDAnnotationLink link = (PDAnnotationLink) annotation;
    // ... read the link target from 'link' here
}
{code}

I hope this helps someone.
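
A minimal sketch of that class check, written with stand-in types so that it runs without PDFBox on the classpath. The names LinkAnnotationSketch, Annotation, LinkAnnotation and collectLinkUris are hypothetical. In PDFBox the check would be against PDAnnotationLink, and the target would presumably be read from the link's action (PDActionURI has a getURI() method); that mapping is an assumption, not something taken from a patch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Stand-in types for illustration only; these are NOT the PDFBox classes.
final class LinkAnnotationSketch {

    /** Stand-in for a generic page annotation. */
    static class Annotation { }

    /** Stand-in for a link annotation that carries a target URI (may be null). */
    static class LinkAnnotation extends Annotation {
        private final String uri;

        LinkAnnotation(String uri) {
            this.uri = uri;
        }

        String getUri() {
            return uri;
        }
    }

    /** Collects the URIs of all link annotations and skips every other annotation type. */
    static List<String> collectLinkUris(List<?> annotations) {
        List<String> uris = new ArrayList<String>();
        for (Object o : annotations) {
            // Same idea as the instanceof check above: only link annotations matter here.
            if (o instanceof LinkAnnotation) {
                LinkAnnotation link = (LinkAnnotation) o;
                // A link may have no URI (for example an in-document jump), so guard against null.
                if (link.getUri() != null) {
                    uris.add(link.getUri());
                }
            }
        }
        return uris;
    }

    public static void main(String[] args) {
        List<Object> annotations = Arrays.<Object>asList(
                new Annotation(),
                new LinkAnnotation("http://tika.apache.org/"),
                new LinkAnnotation(null));
        // prints [http://tika.apache.org/]
        System.out.println(collectLinkUris(annotations));
    }
}
```

The null guard is there because not every link annotation necessarily points at an external URI.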

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

       
[jira] [Updated] (TIKA-861) Parse links in PDF

Radim Rehurek (Jira)

     [ https://issues.apache.org/jira/browse/TIKA-861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann updated TIKA-861:
-----------------------------------

    Fix Version/s:     (was: 1.1)
                   1.2

- Pushed out to 1.2.
               

> Parse links in PDF
> ------------------
>
>                 Key: TIKA-861
>                 URL: https://issues.apache.org/jira/browse/TIKA-861
>             Project: Tika
>          Issue Type: New Feature
>          Components: parser
>    Affects Versions: 1.0
>            Reporter: Sasha Goodman
>            Priority: Minor
>              Labels: links, pdfbox
>             Fix For: 1.2
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h

[jira] [Updated] (TIKA-861) Parse links in PDF

Radim Rehurek (Jira)

     [ https://issues.apache.org/jira/browse/TIKA-861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ryan Quam updated TIKA-861:
---------------------------

    Attachment: TIKA-861.patch

Patch that adds PDF links to the DOM.
               


[jira] [Commented] (TIKA-861) Parse links in PDF

Radim Rehurek (Jira)

    [ https://issues.apache.org/jira/browse/TIKA-861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13260429#comment-13260429 ]

Nick Burch commented on TIKA-861:
---------------------------------

testPDFVarious.pdf in /tika-parsers/src/test/resources/test-documents/ contains a hyperlink on page one, so it would be a good file to use for a unit test.

Is anyone able to work up a unit test for link parsing to go with this patch? (PDFParserTest already has some XHTML-based tests, which could be used as a pattern.)
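
As a rough sketch of what an XHTML-based link check might assert, the helper below pulls every href out of an XHTML string using only JDK classes. The names XhtmlLinkCheckSketch and hrefs are hypothetical, and the sample markup is made up for illustration; a real test would instead obtain the XHTML by running the PDF parser over testPDFVarious.pdf, which is not shown here.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class XhtmlLinkCheckSketch {

    /** Returns the href of every <a> element in the given XHTML string, in document order. */
    static List<String> hrefs(String xhtml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xhtml)));
            // The factory is not namespace-aware by default, so unprefixed <a> matches by tag name.
            NodeList anchors = doc.getElementsByTagName("a");
            List<String> result = new ArrayList<String>();
            for (int i = 0; i < anchors.getLength(); i++) {
                Element a = (Element) anchors.item(i);
                if (a.hasAttribute("href")) {
                    result.add(a.getAttribute("href"));
                }
            }
            return result;
        } catch (Exception e) {
            // Wrap checked parser exceptions so callers (e.g. a test) stay simple.
            throw new IllegalStateException("Could not parse XHTML", e);
        }
    }

    public static void main(String[] args) {
        // Made-up sample output; the real parser output may be structured differently.
        String sample = "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body>"
                + "<div><p>Some text</p><a href=\"http://tika.apache.org/\"></a></div>"
                + "</body></html>";
        // prints [http://tika.apache.org/]
        System.out.println(hrefs(sample));
    }
}
```

A unit test could then assert that the returned list contains the URL expected from page one of the test document.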
               


[jira] [Updated] (TIKA-861) Parse links in PDF

Radim Rehurek (Jira)

     [ https://issues.apache.org/jira/browse/TIKA-861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ryan Quam updated TIKA-861:
---------------------------

    Attachment: TIKA-861-test.patch

Here is a simple unit test for the PDF link parsing.
               


[jira] [Resolved] (TIKA-861) Parse links in PDF

Radim Rehurek (Jira)

     [ https://issues.apache.org/jira/browse/TIKA-861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nick Burch resolved TIKA-861.
-----------------------------

    Resolution: Fixed

Thanks, patches committed in r1331434.

One thing to note is that, for now, links are emitted at the end of each page. Further work may be wanted in future to match them to the text they apply to.
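
To illustrate what "at the end of the page" means for the SAX output, here is a small sketch using only JDK classes. It is not the committed code: the class and method names (EndOfPageLinksSketch, writePage, render) and the exact element layout are assumptions made for the example.

```java
import java.io.StringWriter;
import java.util.Arrays;
import java.util.List;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;

final class EndOfPageLinksSketch {

    /** Writes one page: its text first, then one empty <a href="..."> per link at the end. */
    static void writePage(ContentHandler handler, String text, List<String> linkUris)
            throws SAXException {
        handler.startElement("", "div", "div", new AttributesImpl());
        handler.startElement("", "p", "p", new AttributesImpl());
        handler.characters(text.toCharArray(), 0, text.length());
        handler.endElement("", "p", "p");
        // Links come after all page text, so they are not tied to the words they cover.
        for (String uri : linkUris) {
            AttributesImpl attrs = new AttributesImpl();
            attrs.addAttribute("", "href", "href", "CDATA", uri);
            handler.startElement("", "a", "a", attrs);
            handler.endElement("", "a", "a");
        }
        handler.endElement("", "div", "div");
    }

    /** Serializes a single page to a string so the element order can be inspected. */
    static String render(String text, List<String> linkUris) {
        try {
            SAXTransformerFactory factory =
                    (SAXTransformerFactory) TransformerFactory.newInstance();
            TransformerHandler handler = factory.newTransformerHandler();
            handler.getTransformer().setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            handler.setResult(new StreamResult(out));
            handler.startDocument();
            writePage(handler, text, linkUris);
            handler.endDocument();
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not serialize page", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(render("See the Tika site.", Arrays.asList("http://tika.apache.org/")));
    }
}
```

Matching links to their text would presumably require comparing each annotation's rectangle with the text positions on the page, which this sketch does not attempt.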
               
