MagicDetector doesn't work for all RFC822 message types.


Kai-Uwe Schmidt
Hello folks,

I am trying to use Tika to extract metadata from EML files created by Novell GroupWise. In doing so, I ran into a problem with the detection of "message/rfc822". The MagicDetector (working with the default tika-mimetypes.xml) compares the "match" values as binary data, so the comparison is case-sensitive. RFC 822 specifies that header field names are case-insensitive (see http://www.ietf.org/rfc/rfc0822.txt, section 3.4.7), so MIME-Version is the same as Mime-Version.
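
To illustrate the mismatch, here is a minimal Java sketch that uses only the standard library. It is not Tika's MagicDetector code: the class name HeaderCaseSketch, both method names, and the sample header line are assumptions made up for this illustration. It only contrasts a byte-for-byte prefix comparison with a case-insensitive one.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper for illustration only; not part of Tika.
final class HeaderCaseSketch {

    // Byte-for-byte prefix comparison, similar in spirit to a binary "match" rule.
    static boolean startsWithExact(byte[] data, String pattern) {
        byte[] expected = pattern.getBytes(StandardCharsets.US_ASCII);
        if (data.length < expected.length) {
            return false;
        }
        return Arrays.equals(Arrays.copyOfRange(data, 0, expected.length), expected);
    }

    // Case-insensitive prefix comparison, as RFC 822 allows for header field names.
    // Assumes an ASCII pattern, so character count equals byte count.
    static boolean startsWithIgnoreCase(byte[] data, String pattern) {
        if (data.length < pattern.length()) {
            return false;
        }
        String prefix = new String(data, 0, pattern.length(), StandardCharsets.US_ASCII);
        return prefix.equalsIgnoreCase(pattern);
    }

    public static void main(String[] args) {
        // Assumed sample: a message whose first header is spelled "Mime-Version".
        byte[] message = "Mime-Version: 1.0\r\nSubject: test\r\n\r\nbody\r\n"
                .getBytes(StandardCharsets.US_ASCII);

        System.out.println(startsWithExact(message, "MIME-Version:"));      // false
        System.out.println(startsWithIgnoreCase(message, "MIME-Version:")); // true
    }
}
```

The exact comparison rejects the message only because of the letter case in the header name, which is the behaviour described above.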

Is there another way to get those EML files detected correctly?

Regards
Kai-Uwe

Re: MagicDetector doesn't work for all RFC822 message types.

Nick Burch-2
On Thu, 11 Jul 2013, Kai-Uwe Schmidt wrote:
> I am trying to use Tika to extract metadata from EML files created by
> Novell GroupWise. In doing so, I ran into a problem with the detection of
> "message/rfc822". The MagicDetector (working with the default
> tika-mimetypes.xml) compares the "match" values as binary data. RFC 822
> specifies that header field names are case-insensitive (see
> http://www.ietf.org/rfc/rfc0822.txt, section 3.4.7), so MIME-Version is
> the same as Mime-Version.

Best bet is to open a bug in jira, and upload a (small!) sample file that
shows the problem. We'll need to tweak the mime rules to include that case
combination too. (IIRC, the mime magic rules don't support case
insensitive matching)
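
A small sample file of the kind described above could be produced with a sketch like the following. Everything in it is an assumption for illustration: the class name SampleEmlSketch, the addresses, and the header layout are invented, and a real GroupWise export may look different. A synthetic file is only a stand-in when a real message cannot be shared.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper for illustration only; not part of Tika.
final class SampleEmlSketch {

    // Writes a tiny message whose first header uses the "Mime-Version" spelling.
    static Path writeSample(Path target) throws IOException {
        String eml = String.join("\r\n",
                "Mime-Version: 1.0",
                "From: sender@example.com",
                "To: recipient@example.com",
                "Subject: mixed-case header sample",
                "Content-Type: text/plain; charset=us-ascii",
                "",              // empty line separates headers from body
                "Test body.",
                "");             // trailing CRLF
        return Files.write(target, eml.getBytes(StandardCharsets.US_ASCII));
    }

    public static void main(String[] args) throws IOException {
        Path out = writeSample(Files.createTempFile("mixed-case-header", ".eml"));
        System.out.println("Wrote " + out + " (" + Files.size(out) + " bytes)");
    }
}
```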

Nick

Re: MagicDetector doesn't work for all RFC822 message types.

Kai-Uwe Schmidt
Where can I read how to provide a path?


Re: MagicDetector doesn't work for all RFC822 message types.

Kai-Uwe Schmidt
Sorry, I meant "patch" :-/


Re: MagicDetector doesn't work for all RFC822 message types.

Nick Burch-2
In reply to this post by Kai-Uwe Schmidt
On Thu, 11 Jul 2013, Kai-Uwe Schmidt wrote:
> Where can I read how to provide a patch?

Hmm. I was going to say:
* Go to the website at ???? and follow the link to download the source
* Look on the website at ??? and see the contribution instructions

However, unless I'm missing something, we don't have either of those
things on our website :(

Anyone fancy fixing that gap?


Kai-Uwe - for now, svn is http://svn.apache.org/repos/asf/tika/trunk .
Edit the tika-mimetypes file to list the additional string, then run "mvn
test" to check it all still works. Then, add in your test file, and put a
new method in the detector test to check your file too. mvn test again to
make sure the new test passes too. Finally, do "svn diff" to produce a
patch, and sling that on the issue you opened earlier.

Nick

RE: MagicDetector doesn't work for all RFC822 message types.

Allison, Timothy B.
In reply to this post by Kai-Uwe Schmidt
I think I may be uniquely qualified to answer this from an Idiot's guide/newish to Tika perspective. :)  Apologies if I'm missing out on more obvious answers!

SVN info:
http://tika.apache.org/source-repository.html 

Generally how to contribute (Lucene has a good description):
http://wiki.apache.org/lucene-java/HowToContribute 

POI does too:
http://poi.apache.org/guidelines.html 

If you're adding binary files, I found POI's patch task to be very useful.  Grab "patch.xml" from POI's svn and run:
ant -f patch.xml


RE: MagicDetector doesn't work for all RFC822 message types.

Allison, Timothy B.
Nick,
  I'm sorry that I missed your response (it wound up in my spam folder).  I'd be happy to draft a section on how to contribute for Tika's website.  How do I contribute that?  Open an issue and submit HTML?  Should I create a separate HTML page or modify the http://tika.apache.org/source-repository.html page?
  Thank you.

         Best,

              Tim


RE: MagicDetector doesn't work for all RFC822 message types.

Nick Burch-2
In reply to this post by Allison, Timothy B.
On Thu, 11 Jul 2013, Allison, Timothy B. wrote:
> I think I may be uniquely qualified to answer this from an Idiot's
> guide/newish to Tika perspective. :)  Apologies if I'm missing out on
> more obvious answers!

Feedback from people like you is exactly what we need! I've been around
too long to look at this with the fresh set of eyes that someone like
Kai-Uwe will bring.

> SVN info:
> http://tika.apache.org/source-repository.html

Ah, that's what I wanted! Where did you find that linked? We might need to
make it more obvious

> Generally how to contribute (Lucene has a good description):
> http://wiki.apache.org/lucene-java/HowToContribute
>
> POI does too:
> http://poi.apache.org/guidelines.html

We may just want to add a small page that lists the Tika rules, then
links to those other resources elsewhere in Apache for more help.

(For example, Tika always requires a JIRA issue for all changes, while POI
sometimes lets small changes sneak through without one if the developer
themselves spots the problem. POI is stricter about adding things to the
changelog; Tika's changelog tends to be just for changes a user will
notice, with the JIRA log available for details. There are other
differences too, that's just what springs straight to mind.)

Nick

RE: MagicDetector doesn't work for all RFC822 message types.

Nick Burch-2
In reply to this post by Allison, Timothy B.
On Thu, 11 Jul 2013, Allison, Timothy B. wrote:
>  I'm sorry that I missed your response (wound up in my spambox).  I'd be
> happy to draft a section on how to contribute for Tika's website.  How
> do I contribute that?  Open an issue and submit html?  Should I create a
> separate html or modify the
> http://tika.apache.org/source-repository.html site?

The site lives in a different bit of svn, it's in
https://svn.apache.org/repos/asf/tika/site . Best bet is to open an issue,
take a best guess at where to put it, add a patch, and await feedback

Nick

Re: MagicDetector doesn't work for all RFC822 message types.

Kai-Uwe Schmidt
Hi,


Is there any chance of getting feedback on the TIKA-1146 patch? Is there anything else I can do?


regards
Kai-Uwe
