Slow Crawl Speed and Tika Error Media type alias already exists: text/xml

Bradford Stephens
Greetings,

I'm running Nutch 0.9 and Hadoop on 5 new, fast servers connected to
multiple T-3 lines. Although it works fine, the fetch portion of the
crawl seems awfully slow. The status message at one point is
"157 pages, 1 errors, 1.7 pages/s, 487 kb/s". Fewer than two pages a
second seems far too slow for the environment I'm in. Is it a
configuration issue? I'm using 200 threads per fetcher. I've also
tried only 10 threads :)

It also only seems to have 2 fetch/map tasks running, even though I
have four slaves and one namenode.

I'm also seeing my hadoop.logs rapidly filled with the error message
mentioned in [NUTCH-618], which states:

2008-03-06 08:07:20,659 WARN org.apache.tika.mime.MimeTypesReader:
Invalid media type alias: text/xml
org.apache.tika.mime.MimeTypeException: Media type alias already
exists: text/xml

Is this impacting the performance? I've tried removing
conf/tika-mimetypes.xml on all my machines, but that doesn't seem to
resolve the error message.

Much thanks in advance :)

Cheers,
Bradford
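
The status line quoted above can be unpacked with a little arithmetic. The Python sketch below is only an illustration: the regular expression is written against the one line quoted in this post, and whether "kb" means kilobits or kilobytes is not stated in the thread, so the per-page figure is left in the status line's own unit.

```python
import re

# Status line exactly as quoted in the post above.
STATUS = "157 pages, 1 errors, 1.7 pages/s, 487 kb/s"

def parse_status(line):
    """Split a fetcher status line into its four numeric fields."""
    m = re.fullmatch(
        r"(\d+) pages, (\d+) errors, ([\d.]+) pages/s, (\d+) kb/s",
        line.strip(),
    )
    if m is None:
        raise ValueError(f"unrecognised status line: {line!r}")
    return {
        "pages": int(m.group(1)),
        "errors": int(m.group(2)),
        "pages_per_s": float(m.group(3)),
        "kb_per_s": int(m.group(4)),
    }

stats = parse_status(STATUS)

# Elapsed time implied by the counters: pages / (pages per second).
elapsed_s = stats["pages"] / stats["pages_per_s"]

# Average transfer per page, in the same "kb" unit the status line uses.
kb_per_page = stats["kb_per_s"] / stats["pages_per_s"]

print(f"elapsed ~{elapsed_s:.0f} s, ~{kb_per_page:.0f} kb per page")
```

Read this way, the counters describe roughly the first minute and a half of the fetch, so the rate is an early average rather than a steady-state figure.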

Re: Slow Crawl Speed and Tika Error Media type alias already exists: text/xml

chrismattmann
Hi Bradford,

> I'm running Nutch 0.9 and Hadoop on 5 new, fast servers connected to
> multiple T-3 lines. Although it works fine, the fetch portion of the
> crawl seems awfully slow. The status message at one point is
> "157 pages, 1 errors, 1.7 pages/s, 487 kb/s". Fewer than two pages a
> second seems far too slow for the environment I'm in. Is it a
> configuration issue? I'm using 200 threads per fetcher. I've also
> tried only 10 threads :)

There are other parameters that control the speed of the fetch. What is your
value for speculative execution? I remember seeing something on the list
saying that this parameter should be turned off to optimize fetch speed.
Give that a try, and let me know how it works out.

> I'm also seeing my hadoop.logs rapidly filled with the error message
> mentioned in [NUTCH-618], which states:
>
> 2008-03-06 08:07:20,659 WARN org.apache.tika.mime.MimeTypesReader:
> Invalid media type alias: text/xml
> org.apache.tika.mime.MimeTypeException: Media type alias already
> exists: text/xml
>
> Is this impacting the performance? I've tried removing
> conf/tika-mimetypes.xml on all my machines, but that doesn't seem to
> resolve the error message.

Though definitely annoying, I am fairly sure it's not directly affecting your
performance, since the message is a simple WARNING that a detected media type
has been added multiple times to the MIME types registry. I certainly
need to address this issue though, so thanks for giving me some motivation.

Let me know what the results of the speculative execution adjustment are.
Also, it may help to share (here on the list) any other configuration
adjustments you have made (or will make).

HTH,
 Chris

______________________________________________
Chris Mattmann, Ph.D.
[hidden email]
Cognizant Development Engineer
Early Detection Research Network Project
_________________________________________________
Jet Propulsion Laboratory            Pasadena, CA
Office: 171-266B                     Mailstop:  171-246
_______________________________________________________

Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.
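
For anyone trying the speculative execution suggestion, here is a minimal Python sketch that writes out the override in the `<configuration>`/`<property>` layout used by Hadoop and Nutch site files. The property names `mapred.speculative.execution` and `fetcher.threads.fetch` are given from memory of the defaults files of that era and should be treated as assumptions; check `hadoop-default.xml` and `nutch-default.xml` in your own release. As far as I understand it, the reason to switch the setting off is that speculative execution can start a duplicate attempt of a slow task, which for a fetch task would mean requesting the same URLs twice.

```python
import xml.etree.ElementTree as ET

# Assumed property names; verify them against hadoop-default.xml and
# nutch-default.xml for your release before relying on them.
OVERRIDES = {
    "mapred.speculative.execution": "false",  # the setting asked about above
    "fetcher.threads.fetch": "200",           # threads per fetcher, from the post
}

def build_site_config(overrides):
    """Return a <configuration> document with one <property> per override."""
    root = ET.Element("configuration")
    for name, value in overrides.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")

xml_text = build_site_config(OVERRIDES)
print(xml_text)
```

The printed text would be pasted into the site override file on every node (in a 0.9-style layout that is presumably `conf/hadoop-site.xml` for the Hadoop setting and `conf/nutch-site.xml` for the fetcher setting, which is also an assumption).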



Re: Slow Crawl Speed and Tika Error Media type alias already exists: text/xml

Otis Gospodnetic-2
In reply to this post by Bradford Stephens
Regarding the Tika error message, I've seen that, too..... if you need motivation, Chris. :)

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
From: Chris Mattmann <[hidden email]>
To: [hidden email]
Sent: Saturday, April 5, 2008 2:58:33 AM
Subject: Re: Slow Crawl Speed and Tika Error Media type alias already exists: text/xml


Re: Slow Crawl Speed and Tika Error Media type alias already exists: text/xml

Bradford Stephens
Greetings again,

Just wanted to let you know that I did increase the threads to 400 per
server, and 3 per host. I was seeing about 15 pages/second. I didn't
get a chance to implement the other suggestions because I'll eat all
of the office's bandwidth and get yelled at :)

Maybe I'll make a "Nutch Speed Improvements" entry in the Wiki.

Cheers,
Bradford Stephens

On Sun, Apr 6, 2008 at 10:06 PM, Otis Gospodnetic
<[hidden email]> wrote:

> Regarding the Tika error message, I've seen that, too..... if you need motivation, Chris. :)

Re: Slow Crawl Speed and Tika Error Media type alias already exists: text/xml

Sebastian Steinmetz
Hi,

That's a great idea. I would really like to know which knobs
you tinkered with.

thanks,
Sebastian Steinmetz

On 07.04.2008 at 18:52, Bradford Stephens wrote:

> Greetings again,
>
> Just wanted to let you know that I did increase the threads to 400 per
> server, and 3 per host. I was seeing about 15 pages/second. I didn't
> get a chance to implement the other suggestions because I'll eat all
> of the office's bandwidth and get yelled at :)
>
> Maybe I'll make a "Nutch Speed Improvements" entry in the Wiki.
>
> Cheers,
> Bradford Stephens


Re: Slow Crawl Speed and Tika Error Media type alias already exists: text/xml

Otis Gospodnetic-2
In reply to this post by Bradford Stephens
Brad, "Nutch Speed Improvements" would be great.

Regarding your changes - by setting "3 threads per host" things should go faster indeed, but aren't you being "impolite"?

How many URLs and how many distinct hosts did you have in your fetchlist?
Did you use Fetcher or Fetcher2?
Did you turn off parsing during fetching?
What was the setting for the delay between subsequent requests to the same server? (ah, probably doesn't matter if you let 3 threads hit the same server concurrently)

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
From: Bradford Stephens <[hidden email]>
To: [hidden email]
Sent: Monday, April 7, 2008 12:52:56 PM
Subject: Re: Slow Crawl Speed and Tika Error Media type alias already exists: text/xml


Re: Slow Crawl Speed and Tika Error Media type alias already exists: text/xml

Bradford Stephens
Thanks for keeping the help coming!

I had about 10 URLs/distinct hosts in my initial list. I *think* I'm
using Fetcher -- it's whatever comes with the 0.9 trunk that I checked
out.  Is Fetcher2 faster?

I did not turn off parsing during fetching explicitly; I used whatever
setting was the default.

I did set 3 threads per host, but we're only running through a few T3s
here, so I didn't think I was overwhelming them. I'm not sure about the
delay; I think that was the default as well.


On Tue, Apr 8, 2008 at 11:11 PM, Otis Gospodnetic
<[hidden email]> wrote:

> Brad, "Nutch Speed Improvements" would be great.
>
>  Regarding your changes - by setting "3 threads per host" things should go faster indeed, but aren't you being "impolite"?
>
>  How many URLs and how many distinct hosts did you have in your fetchlist?
>  Did you use Fetcher or Fetcher2?
>  Did you turn off parsing during fetching?
>  What was the setting for the delay between subsequent requests to the same server? (ah, probably doesn't matter if you let 3 threads hit the same server concurrently)

Re: Slow Crawl Speed and Tika Error Media type alias already exists: text/xml

Otis Gospodnetic-2-2
In reply to this post by Bradford Stephens
Brad,
Regarding 3 threads per host - it is not about your end, it's about their end.  With 3 threads per host you can be hitting some foobar.com server with 3 concurrent requests, and that may be considered impolite (though Google, Yahoo, etc. are super impolite all the time).  Try with 1 thread per host and I think you'll see your fetch rate go down a bit.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
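
The effect of threads per host on fetch rate can be put into a back-of-the-envelope bound. The Python sketch below is a crude model and not a description of how Nutch actually schedules requests: it assumes every thread working on a host downloads a page and then waits out a fixed delay before the next request to that host. The 10 hosts and 200 threads come from this thread; the 5 second delay and 0.5 second download time are assumed values chosen only for illustration.

```python
def max_fetch_rate(hosts, threads_per_host, total_threads, delay_s, fetch_s):
    """Crude upper bound on pages/s under a per-host politeness delay.

    Each thread working on a host is assumed to spend fetch_s seconds
    downloading a page and then delay_s seconds idle before its next
    request to that host.
    """
    # Threads beyond hosts * threads_per_host have nothing they may fetch.
    busy_threads = min(total_threads, hosts * threads_per_host)
    return busy_threads / (delay_s + fetch_s)

# Hypothetical inputs: 10 hosts and 200 threads (from the thread),
# a 5 s delay and 0.5 s per download (both assumed).
polite = max_fetch_rate(hosts=10, threads_per_host=1, total_threads=200,
                        delay_s=5.0, fetch_s=0.5)
eager = max_fetch_rate(hosts=10, threads_per_host=3, total_threads=200,
                       delay_s=5.0, fetch_s=0.5)

print(f"1 thread/host: {polite:.2f} pages/s, 3 threads/host: {eager:.2f} pages/s")
```

Under these assumptions the number of distinct hosts, not the thread count, caps the rate, which would fit the observation that 200 threads and 10 threads behaved much the same. The model says nothing about how the fetchlist grows in later rounds, so it should not be read as a prediction of real crawl speed.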

----- Original Message ----
From: Bradford Stephens <[hidden email]>
To: [hidden email]
Sent: Wednesday, April 9, 2008 7:29:07 PM
Subject: Re: Slow Crawl Speed and Tika Error Media type alias already exists: text/xml
