Indexing information on number of attachments and their names in EML file

Zheng Lin Edwin Yeo
Hi,

I would like to check: is there any way we can detect the number of
attachments and their names when indexing EML files in Solr, and index
that information into Solr?

Currently, Solr is able to use Tika and Tesseract OCR to extract the
contents of the attachments. However, I could not find information
about the number of attachments in the EML file or their filenames.

I am using Solr 7.6.0 in production, and am also trying out the new Solr
8.2.0.

Regards,
Edwin

Re: Indexing information on number of attachments and their names in EML file

Zheng Lin Edwin Yeo
Hi,

Does anyone know if this can be done on the Solr side?
Or does it have to be done on the Tika side?

Regards,
Edwin


Re: Indexing information on number of attachments and their names in EML file

Jan Høydahl / Cominvent
Try the Apache Tika mailing list.

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com


Re: Indexing information on number of attachments and their names in EML file

Tim Allison
I'd strongly recommend rolling your own ingest code.  See Erick's
superb: https://lucidworks.com/post/indexing-with-solrj/

You can easily get attachments via the RecursiveParserWrapper, e.g.
https://github.com/apache/tika/blob/master/tika-parsers/src/test/java/org/apache/tika/parser/RecursiveParserWrapperTest.java#L351

This will return a list of Metadata objects: the first one will be the
main/container document, and every other entry will be an attachment.  Let us
know if you have any questions/surprises.  There are a couple of TODOs for
.eml...
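
To make the "container first, attachments after" layout concrete, here is a minimal sketch in Python. It is not Tika code: it assumes the parser output has already been converted into a list of plain dictionaries in the order described above. The `resourceName` key and the field names `attachment_count` and `attachment_names` are assumptions chosen for illustration, not names confirmed in this thread.

```python
def attachment_fields(metadata_list, name_key="resourceName"):
    """Derive attachment count and names from a container-first metadata list.

    metadata_list: list of dicts; entry 0 is assumed to be the email itself,
    every later entry is assumed to describe one embedded document.
    name_key: hypothetical key under which an entry stores its file name.
    """
    if not metadata_list:
        return {"attachment_count": 0, "attachment_names": []}
    attachments = metadata_list[1:]  # skip the container entry
    names = [m[name_key] for m in attachments if m.get(name_key)]
    return {"attachment_count": len(attachments), "attachment_names": names}


# Hypothetical stand-in for a parsed email with two attachments.
example = [
    {"subject": "Quarterly report"},   # container (the email itself)
    {"resourceName": "file1.pdf"},     # first attachment
    {"resourceName": "file2.png"},     # second attachment
]
fields = attachment_fields(example)
```

The resulting dictionary could then be merged into the document sent to Solr by custom ingest code. If the parser also reports documents nested inside attachments as separate entries, they would be counted here too, so the count may need filtering.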


Re: Indexing information on number of attachments and their names in EML file

Zheng Lin Edwin Yeo
Thanks for the reply; I will find out more about it.

Currently I am able to retrieve the normal metadata of the email, but not
the metadata of the attachments, which are part of the content of the EML
file and look something like this:

--000000000000d8b77b057d59ca19--

--000000000000d8b77e057d59ca1b
Content-Type: application/pdf; name="file1.pdf"
Content-Disposition: attachment; filename="file1.pdf"
Content-Transfer-Encoding: base64
Content-ID: <f_jpurtpnk0>
X-Attachment-Id: f_jpurtpnk0

Regards,
Edwin
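
The `Content-Disposition` headers shown above already carry the attachment file names, so the count and names can be read directly from the EML before it is indexed. The following is a rough sketch using Python's standard `email` package rather than Tika or Solr. The sample message, the addresses, and the field names `attachment_count` and `attachment_names` are made up for illustration.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage


def list_attachments(raw_bytes):
    """Return the file names of the top-level attachments in a raw EML message."""
    msg = message_from_bytes(raw_bytes, policy=policy.default)
    return [part.get_filename() for part in msg.iter_attachments()]


# Build a small sample message so the sketch is self-contained.
sample = EmailMessage()
sample["Subject"] = "Report"
sample["From"] = "sender@example.com"
sample["To"] = "recipient@example.com"
sample.set_content("See attached.")
sample.add_attachment(b"placeholder bytes", maintype="application",
                      subtype="pdf", filename="file1.pdf")
sample.add_attachment(b"placeholder bytes", maintype="application",
                      subtype="pdf", filename="file2.pdf")

names = list_attachments(sample.as_bytes())

# Hypothetical fields that ingest code could add to the Solr document.
solr_fields = {"attachment_count": len(names), "attachment_names": names}
```

This only inspects the top-level parts of the message. Attachments nested inside other multipart sections would need a walk over all parts instead.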


Re: Indexing information on number of attachments and their names in EML file

Zheng Lin Edwin Yeo
Hi Tim,

Regarding the returned list of Metadata objects, is the code
supposed to include information on the number of attachments in a
particular email and/or the names of the attachments?
For example, if there are 3 attachments in the email, we should be able to
see immediately from the Metadata that there are attachments, and that there
are 3 of them.

Thank you.

Regards,
Edwin
