Tika Error ?


Tika Error ?

Emmanuel JOKE
Hi Guys,

I've updated my Nutch version to the latest trunk, which includes the new Tika jar.

I ran a crawl and got a lot of errors like this:
2008-02-14 22:02:51,494 INFO  conf.Configuration - found resource tika-mimetypes.xml at file:/data/sengine/search/conf/tika-mimetypes.xml
2008-02-14 22:02:51,499 WARN  mime.MimeTypesReader - Invalid media type alias: text/xml
org.apache.tika.mime.MimeTypeException: Media type alias already exists: text/xml
        at org.apache.tika.mime.MimeTypes.addAlias(MimeTypes.java:312)
        at org.apache.tika.mime.MimeType.addAlias(MimeType.java:238)
        at org.apache.tika.mime.MimeTypesReader.readMimeType(MimeTypesReader.java:168)
        at org.apache.tika.mime.MimeTypesReader.read(MimeTypesReader.java:138)
        at org.apache.tika.mime.MimeTypesReader.read(MimeTypesReader.java:121)
        at org.apache.tika.mime.MimeTypesFactory.create(MimeTypesFactory.java:56)
        at org.apache.nutch.util.MimeUtil.<init>(MimeUtil.java:58)
        at org.apache.nutch.protocol.Content.<init>(Content.java:85)
        at org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:226)
        at org.apache.nutch.fetcher.Fetcher2$FetcherThread.run(Fetcher2.java:523)
2008-02-14 22:02:51,500 WARN  mime.MimeTypesReader - Invalid media type alias: application/x-dosexec;exe
org.apache.tika.mime.MimeTypeException: Invalid media type alias: application/x-dosexec;exe
        at org.apache.tika.mime.MimeType.addAlias(MimeType.java:242)
        at org.apache.tika.mime.MimeTypesReader.readMimeType(MimeTypesReader.java:168)
        at org.apache.tika.mime.MimeTypesReader.read(MimeTypesReader.java:138)
        at org.apache.tika.mime.MimeTypesReader.read(MimeTypesReader.java:121)
        at org.apache.tika.mime.MimeTypesFactory.create(MimeTypesFactory.java:56)
        at org.apache.nutch.util.MimeUtil.<init>(MimeUtil.java:58)
        at org.apache.nutch.protocol.Content.<init>(Content.java:85)
        at org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:226)
        at org.apache.nutch.fetcher.Fetcher2$FetcherThread.run(Fetcher2.java:523)

Is that normal?
Am I missing something?
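
Editor's note: the two warnings above complain about one alias that is already registered (text/xml) and one that is not a plain type/subtype name (application/x-dosexec;exe). A rough way to look for such entries in a mime-types file is sketched below. It assumes the file uses `<mime-type type="...">` elements with nested `<alias type="..."/>` children and no XML namespace; the regular expression is only an approximation of a media type name, not Tika's actual validation rule, and the sample document is made up for illustration rather than taken from the attached file.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter

# Approximate shape of a media type: "type/subtype", no parameters, no spaces.
# Illustrative only; this is not the check Tika itself performs.
MEDIA_TYPE = re.compile(r"^[A-Za-z0-9!#$&^_.+-]+/[A-Za-z0-9!#$&^_.+-]+$")


def check_aliases(xml_text):
    """Return (duplicates, invalid) names declared in a mime-types document."""
    root = ET.fromstring(xml_text)
    names = []
    for mime_type in root.iter("mime-type"):
        # Collect the primary type name and every alias declared under it.
        names.append(mime_type.get("type", ""))
        for alias in mime_type.iter("alias"):
            names.append(alias.get("type", ""))
    counts = Counter(names)
    duplicates = sorted(n for n, c in counts.items() if c > 1)
    invalid = sorted(n for n in counts if not MEDIA_TYPE.match(n))
    return duplicates, invalid


# Hypothetical sample that reproduces both kinds of complaint.
SAMPLE = """<mime-info>
  <mime-type type="application/xml">
    <alias type="text/xml"/>
  </mime-type>
  <mime-type type="text/xml"/>
  <mime-type type="application/x-msdownload">
    <alias type="application/x-dosexec;exe"/>
  </mime-type>
</mime-info>"""

duplicates, invalid = check_aliases(SAMPLE)
print("declared more than once:", duplicates)
print("not a plain type/subtype:", invalid)
```

To check a real file, pass its contents to `check_aliases`, for example the text read from conf/tika-mimetypes.xml.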

Re: Tika Error ?

chrismattmann
Hi Emmanuel,

Could you please post your /data/sengine/search/conf/tika-mimetypes.xml
file?

Thanks,
 Chris




______________________________________________
Chris Mattmann, Ph.D.
[hidden email]
Cognizant Development Engineer
Early Detection Research Network Project
_________________________________________________
Jet Propulsion Laboratory            Pasadena, CA
Office: 171-266B                     Mailstop:  171-246
_______________________________________________________

Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.



Re: Tika Error ?

Emmanuel JOKE
Hi Chris,

FYI, I used the version provided by Nutch without changing it.

Anyway, please find it attached.

Thanks,
E


tika-mimetypes.xml (14K) Download Attachment

Re: Tika Error ?

Tkach
I get the same error (same setup as you: no changes to the defaults shipped
with Nutch).  I doubt it's the right way to do it, but I just found that if
I extract tika-mimetypes.xml from the jar file and copy it over the
one in nutch-trunk/conf, at least I don't see those errors any more.
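
Editor's note: the workaround described above can be sketched as a small script. This is only an illustration under stated assumptions: the jar path and conf directory are placeholders, the location of tika-mimetypes.xml inside the jar is not stated in the thread (so the script searches for it by file name), and the function names `find_entry` and `replace_conf_copy` are hypothetical helpers. It keeps a `.bak` copy of the existing file, since the poster is unsure this is the proper fix.

```python
import shutil
import zipfile
from pathlib import Path


def find_entry(jar_path, filename="tika-mimetypes.xml"):
    """Return the first archive entry whose base name equals filename, or None."""
    with zipfile.ZipFile(jar_path) as jar:
        for name in jar.namelist():
            if name.rsplit("/", 1)[-1] == filename:
                return name
    return None


def replace_conf_copy(jar_path, conf_dir, filename="tika-mimetypes.xml"):
    """Overwrite conf_dir/filename with the copy bundled in the jar.

    An existing file is first copied to filename + ".bak" so the change
    can be undone.
    """
    entry = find_entry(jar_path, filename)
    if entry is None:
        raise FileNotFoundError(f"{filename} not found in {jar_path}")
    target = Path(conf_dir) / filename
    if target.exists():
        shutil.copy2(target, target.with_name(target.name + ".bak"))
    with zipfile.ZipFile(jar_path) as jar:
        target.write_bytes(jar.read(entry))
    return target


# Example call with placeholder paths (adjust to the local checkout):
# replace_conf_copy("path/to/tika.jar", "nutch-trunk/conf")
```

A jar is an ordinary zip archive, so the standard `zipfile` module is enough here; the same thing can be done by hand with any unzip tool.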


--
This email message and any attachments are for the sole use of the intended
recipient(s) and may contain information that is proprietary to Ahold and/or
its subsidiaries ("Ahold") or otherwise confidential or legally privileged.
If you have received this message in error, please notify the sender by
reply, and delete all copies of this message and any attachments.  If you
are the intended recipient you may use the information contained in this
message and any files attached to this message only as authorized by Ahold.
Files attached to this message may only be transmitted using secure systems
and appropriate means of encryption, and must be secured using the same
level of password and security protection with which the file was provided
to you.  Any unauthorized use, dissemination or disclosure of this message
or its attachments is strictly prohibited.

Re: Tika Error ?

Emmanuel JOKE
Thanks, that solved my problem too.

Does this mean we need to update the config file in the trunk?
