[jira] [Comment Edited] (TIKA-2680) Email attachments to an email are not extracted


JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/TIKA-2680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16535351#comment-16535351 ]

Yury Kats edited comment on TIKA-2680 at 7/6/18 9:07 PM:
---------------------------------------------------------

Indeed, the first embedded rfc822 part is not an attachment. I believe this is because it's an Exchange journaled email; note the X-MS-Journal-Report header at the very top.
In this case, the original message is wrapped in another message that can carry additional headers, such as Bcc and expanded distribution lists.
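
To illustrate the wrapping, here is a minimal sketch that detects such an envelope and returns the message it wraps. It uses Python's standard email package, not Tika's RFC822Parser, and the function name unwrap_journal_report and the SAMPLE message are made up for this example. It assumes the first message/rfc822 part of the envelope is the original message.

```python
from email import message_from_string

JOURNAL_HEADER = "X-MS-Journal-Report"  # header mentioned in the comment above


def unwrap_journal_report(msg):
    """Return the wrapped original message if msg looks like a journal
    report; otherwise return msg unchanged."""
    # The header is typically empty, so test for presence, not for a value.
    if JOURNAL_HEADER not in msg:
        return msg
    for part in msg.walk():
        if part is not msg and part.get_content_type() == "message/rfc822":
            payload = part.get_payload()
            # A parsed message/rfc822 part holds a one-element list
            # containing the embedded message.
            if isinstance(payload, list) and payload:
                return payload[0]
    return msg


# Hypothetical, hand-written envelope (not the nested.eml from the issue).
SAMPLE = (
    "X-MS-Journal-Report:\n"
    "Subject: Journal envelope\n"
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/mixed; boundary="outer"\n'
    "\n"
    "--outer\n"
    "Content-Type: text/plain\n"
    "\n"
    "Sender: a@example.com\n"
    "--outer\n"
    "Content-Type: message/rfc822\n"
    "\n"
    "Subject: Original message\n"
    "Content-Type: text/plain\n"
    "\n"
    "Hello\n"
    "--outer--\n"
)

envelope = message_from_string(SAMPLE)
original = unwrap_journal_report(envelope)
print(original["Subject"])
```

A message without the header is returned as-is, so the helper can be applied to any input before looking for real attachments.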


was (Author: yurykats):
Indeed, the first embedded rfc822 is not an attachment. I believe this is because it's an Exchange journaled email, see the presence of X-MS-Journal-Report header at the very top.

> Email attachments to an email are not extracted
> -----------------------------------------------
>
>                 Key: TIKA-2680
>                 URL: https://issues.apache.org/jira/browse/TIKA-2680
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.18
>            Reporter: Yury Kats
>            Assignee: Tim Allison
>            Priority: Major
>         Attachments: nested.eml
>
>
> I have a number of email messages that contain other email messages as attachments (with multiple levels of nesting).
> The email attachments are parts with "Content-Type: message/rfc822" but are not being recognized as such.
> Attached is an example email, with the multiple levels of attachments:
>  * Subject: Test email within email
>  ** Subject: Email within email test
>  *** Subject: Stand-up today
>  
> I would like to see 3 separate emails parsed out (top level, 1st level attached email, 2nd level attached email), but I only get 1 email and 1 unnamed text attachment:
> {noformat}
> $ java -jar tika-app-1.18.jar -m -J nested.eml | python -m json.tool
> [
> {
> "Author": "Smith Van der, H (Henry) <[hidden email]>",
> "Content-Length": "16649",
> "Content-Type": "message/rfc822",
> "Creation-Date": "2018-04-25T12:46:41Z",
> "Message-From": "Smith Van der, H (Henry) <[hidden email]>",
> "Message-To": [
> "fm.SAN Management Team <[hidden email]>",
> "Smith Van der, H (Henry) <[hidden email]>"
> ],
> "Message:From-Email": "[hidden email]",
> "Message:From-Name": "Smith Van der, H (Henry)",
> "Message:Raw-Header:Auto-Submitted": "auto-generated",
> "Message:Raw-Header:Content-Transfer-Encoding": "binary",
> "Message:Raw-Header:Keywords": "",
> "Message:Raw-Header:MIME-Version": "1.0",
> "Message:Raw-Header:Message-ID": "<[hidden email]>",
> "Message:Raw-Header:Return-Path": "<>",
> "Message:Raw-Header:Sender": "<[hidden email]>",
> "Message:Raw-Header:X-MS-Exchange-Generated-Message-Source": "Journal Agent",
> "Message:Raw-Header:X-MS-Exchange-Parent-Message-Id": "<[hidden email]>",
> "Message:Raw-Header:X-MS-Journal-Report": "",
> "Multipart-Boundary": "_728aa617-16cf-4d95-8bc2-9f1868397202_",
> "Multipart-Subtype": "mixed",
> "X-Parsed-By": [
> "org.apache.tika.parser.DefaultParser",
> "org.apache.tika.parser.mail.RFC822Parser"
> ],
> "X-TIKA:parse_time_millis": "325",
> "creator": "Smith Van der, H (Henry) <[hidden email]>",
> "dc:creator": "Smith Van der, H (Henry) <[hidden email]>",
> "dc:title": "Test email within email",
> "dcterms:created": "2018-04-25T12:46:41Z",
> "meta:author": "Smith Van der, H (Henry) <[hidden email]>",
> "meta:creation-date": "2018-04-25T12:46:41Z",
> "resourceName": "nested.eml",
> "subject": "Test email within email"
> },
> {
> "Content-Encoding": "US-ASCII",
> "Content-Type": "text/plain; charset=US-ASCII",
> "Multipart-Boundary": "_004_8075737674787666767166806676697476787366657271727266777_",
> "Multipart-Subtype": "mixed",
> "X-Parsed-By": [
> "org.apache.tika.parser.DefaultParser",
> "org.apache.tika.parser.txt.TXTParser"
> ],
> "X-TIKA:embedded_resource_path": "/embedded-1",
> "X-TIKA:parse_time_millis": "5",
> "embeddedResourceType": "ATTACHMENT"
> }
> ]
> {noformat}
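
For comparison, the expected result (three separate emails) can be sketched with Python's standard email package. This is only an illustration of walking message/rfc822 parts recursively, not how Tika does or should implement it. The helper names and the sample message are hypothetical; only the three subjects are taken from the description above.

```python
from email import message_from_string
from email.message import EmailMessage


def iter_nested_emails(msg, depth=0):
    """Yield (depth, subject) for msg and for every email attached to it
    as a message/rfc822 part, at any nesting level."""
    yield depth, msg.get("Subject", "")
    yield from _scan_parts(msg, depth)


def _scan_parts(part, depth):
    if part.get_content_type() == "message/rfc822":
        # The payload is a list holding the embedded message.
        for child in part.get_payload():
            yield from iter_nested_emails(child, depth + 1)
    elif part.is_multipart():
        for sub in part.get_payload():
            yield from _scan_parts(sub, depth)


def make_email(subject, attached=None):
    """Build a small test email, optionally with another email attached."""
    m = EmailMessage()
    m["Subject"] = subject
    m.set_content("body")
    if attached is not None:
        m.add_attachment(attached)  # stored as a message/rfc822 part
    return m


# Rebuild the nesting from the description, then serialize and re-parse
# it the way an .eml file would be read.
leaf = make_email("Stand-up today")
middle = make_email("Email within email test", attached=leaf)
top = make_email("Test email within email", attached=middle)
parsed = message_from_string(top.as_string())

for depth, subject in iter_nested_emails(parsed):
    print("  " * depth + subject)
```

Each nesting level is reported as its own email, with the depth showing where it sits in the attachment chain.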



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)