[jira] Created: (TIKA-193) PDFParser adds mime-type twice

Tim Allison (Jira)
PDFParser adds mime-type twice
------------------------------

                 Key: TIKA-193
                 URL: https://issues.apache.org/jira/browse/TIKA-193
             Project: Tika
          Issue Type: Bug
    Affects Versions: 0.3
            Reporter: Jonathan Koren


Using AutoDetectParser to call PDFParser causes the mime-type to be added twice.  It should be added exactly once.

Proposed Fix:
parser/pdf/PDFParser.java should be changed from:
metadata.add(Metadata.CONTENT_TYPE, "application/pdf");
to:
metadata.set(Metadata.CONTENT_TYPE, "application/pdf");
as per other Tika bundled parsers.
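
To illustrate why add() yields a duplicate while set() does not, here is a minimal sketch. MetadataSketch is a hypothetical stand-in for a multi-valued metadata store, not Tika's actual Metadata class, and the order of calls (the detecting wrapper recording the type first, then the parser) is an assumption made for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a multi-valued metadata store; not Tika's actual class.
class MetadataSketch {
    private final Map<String, List<String>> values = new HashMap<>();

    // add() appends, so a second call leaves two entries under the same name.
    void add(String name, String value) {
        values.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
    }

    // set() replaces whatever was stored, so repeated calls leave one entry.
    void set(String name, String value) {
        List<String> single = new ArrayList<>();
        single.add(value);
        values.put(name, single);
    }

    List<String> getValues(String name) {
        return values.getOrDefault(name, new ArrayList<>());
    }

    public static void main(String[] args) {
        // Assumed sequence: a wrapper records the type, then the parser calls add().
        MetadataSketch viaAdd = new MetadataSketch();
        viaAdd.set("Content-Type", "application/pdf");
        viaAdd.add("Content-Type", "application/pdf");
        System.out.println(viaAdd.getValues("Content-Type"));

        // Same sequence, but the parser calls set() as proposed in the fix.
        MetadataSketch viaSet = new MetadataSketch();
        viaSet.set("Content-Type", "application/pdf");
        viaSet.set("Content-Type", "application/pdf");
        System.out.println(viaSet.getValues("Content-Type"));
    }
}
```

In this sketch the first list holds the type twice and the second holds it once, which matches the behavior described in the report.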



--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (TIKA-193) PDFParser adds mime-type twice

Tim Allison (Jira)

     [ https://issues.apache.org/jira/browse/TIKA-193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Koren updated TIKA-193:
--------------------------------

    Priority: Minor  (was: Major)

[jira] Commented: (TIKA-193) PDFParser adds mime-type twice

Tim Allison (Jira)

    [ https://issues.apache.org/jira/browse/TIKA-193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12667612#action_12667612 ]

Sami Siren commented on TIKA-193:
---------------------------------

I see that some of the parsers currently set (or add) the content type and some do not. Should we perhaps remove that functionality from the parsers and rely on AutoDetectParser to set it instead?

[jira] Updated: (TIKA-193) PDFParser adds mime-type twice

Tim Allison (Jira)

     [ https://issues.apache.org/jira/browse/TIKA-193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann updated TIKA-193:
-----------------------------------

    Component/s: parser

- set component type to parser

[jira] Updated: (TIKA-193) PDFParser adds mime-type twice

Tim Allison (Jira)

     [ https://issues.apache.org/jira/browse/TIKA-193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Koren updated TIKA-193:
--------------------------------

    Attachment: patch

Patch for PDFParser.java that converts metadata.add() to metadata.set()

[jira] Resolved: (TIKA-193) PDFParser adds mime-type twice

Tim Allison (Jira)

     [ https://issues.apache.org/jira/browse/TIKA-193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting resolved TIKA-193.
--------------------------------

       Resolution: Fixed
    Fix Version/s: 0.4
         Assignee: Jukka Zitting

Patch committed in revision 779269, thanks! Resolving as Fixed.

Re: Setting the type only in AutoDetectParser:
There are cases where the specific parser classes are used directly, and even in those cases it is useful to have the content type metadata set. Also, a specific parser implementation may in some cases have more information than AutoDetectParser and can thus provide a more accurate content type.
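
A rough sketch of that reasoning follows. Every name in it is hypothetical and it does not use Tika's parser API; the MIME types are only chosen to show a container type being refined to a more specific one. It assumes replace semantics for the content type, so the value is present whether the specific parser is called directly or through a detecting wrapper, and the parser's more specific value wins.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// All names here are hypothetical; this is not Tika's parser API.
class DelegationSketch {
    static final String CONTENT_TYPE = "Content-Type";

    // Replace semantics: whatever was stored under the name is discarded.
    static void set(Map<String, List<String>> metadata, String name, String value) {
        List<String> single = new ArrayList<>();
        single.add(value);
        metadata.put(name, single);
    }

    // A format-specific parser that knows its exact type.
    static void parseOpenDocumentText(Map<String, List<String>> metadata) {
        set(metadata, CONTENT_TYPE, "application/vnd.oasis.opendocument.text");
    }

    // A detecting wrapper that only recognizes the container format, then delegates.
    static void parseWithDetection(Map<String, List<String>> metadata) {
        set(metadata, CONTENT_TYPE, "application/zip");
        parseOpenDocumentText(metadata);
    }

    public static void main(String[] args) {
        // Direct use of the specific parser still records the content type.
        Map<String, List<String>> direct = new HashMap<>();
        parseOpenDocumentText(direct);
        System.out.println(direct.get(CONTENT_TYPE));

        // Through the wrapper, the parser's more specific value replaces the detected one.
        Map<String, List<String>> wrapped = new HashMap<>();
        parseWithDetection(wrapped);
        System.out.println(wrapped.get(CONTENT_TYPE));
    }
}
```

Both paths end with a single content type value in this sketch, which is the outcome the set() change aims for.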

[jira] Commented: (TIKA-193) PDFParser adds mime-type twice

Tim Allison (Jira)

    [ https://issues.apache.org/jira/browse/TIKA-193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12753683#action_12753683 ]

Yonik Seeley commented on TIKA-193:
-----------------------------------

Hmmm, I'm testing Solr Cell from the current solr-trunk (which has Tika 0.4), and I'm seeing Content-Type added twice, for PDFs only.

<arr name="attr_Content-Type">
  <str>application/pdf</str>
  <str>application/pdf</str>
</arr>


[jira] Issue Comment Edited: (TIKA-193) PDFParser adds mime-type twice

Tim Allison (Jira)

    [ https://issues.apache.org/jira/browse/TIKA-193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12753683#action_12753683 ]

Yonik Seeley edited comment on TIKA-193 at 9/10/09 9:26 AM:
------------------------------------------------------------

Hmmm, I'm testing Solr Cell from the current solr-trunk (which has Tika 0.4), and I'm seeing Content-Type added twice, for PDFs only.

<arr name="attr_Content-Type">
  <str>application/pdf</str>
  <str>application/pdf</str>
</arr>

EDIT: false alarm - there was an old tika jar in the classpath.
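
For anyone chasing a similar stale-jar problem, one way to check where a class was actually loaded from is to ask its protection domain for the code source. The sketch below uses only standard JDK calls; the class name ClasspathCheck and the UNKNOWN marker are made up for the example, and it is a diagnostic aid rather than anything Tika or Solr provides.

```java
import java.net.URL;
import java.security.CodeSource;

// Hypothetical helper for spotting an unexpected jar on the classpath.
class ClasspathCheck {
    static final String UNKNOWN = "(no code source: bootstrap or in-memory class)";

    // Returns the jar or directory URL a class was loaded from, if the JVM records one.
    static String locationOf(Class<?> type) {
        CodeSource source = type.getProtectionDomain().getCodeSource();
        if (source == null) {
            return UNKNOWN; // typical for classes loaded by the bootstrap loader
        }
        URL location = source.getLocation();
        return location == null ? UNKNOWN : location.toString();
    }

    public static void main(String[] args) throws ClassNotFoundException {
        // Pass a fully qualified class name; defaults to a JDK class.
        String className = args.length > 0 ? args[0] : "java.lang.String";
        System.out.println(className + " loaded from " + locationOf(Class.forName(className)));
    }
}
```

Run it with the same classpath as the application and pass the fully qualified name of the parser class in question; if the printed location points at an older jar than expected, that jar is shadowing the newer one.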
