merge mapred to trunk


merge mapred to trunk

Doug Cutting-2
Currently we have three versions of nutch: trunk, 0.7 and mapred.  This
increases the chances for conflicts.  I would thus like to merge the
mapred branch into trunk soon.  The soonest I could actually start this
is next week.  Are there any objections?

Doug

Re: merge mapred to trunk

Piotr Kosiorowski
Doug Cutting wrote:
> Currently we have three versions of nutch: trunk, 0.7 and mapred.  This
> increases the chances for conflicts.  I would thus like to merge the
> mapred branch into trunk soon.  The soonest I could actually start this
> is next week.  Are there any objections?
>
> Doug
>
+1
P.


Re: merge mapred to trunk

Jérôme Charron
On 8/31/05, Piotr Kosiorowski <[hidden email]> wrote:
>
> Doug Cutting wrote:
> > Currently we have three versions of nutch: trunk, 0.7 and mapred. This
> > increases the chances for conflicts. I would thus like to merge the
> > mapred branch into trunk soon. The soonest I could actually start this
> > is next week. Are there any objections?

+1
I haven't taken a look at the mapred branch yet.
It will be a good surprise to discover it in the trunk... ;-)

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/

Re: merge mapred to trunk

Doug Cutting-2
Jérôme Charron wrote:
> I haven't taken a look at the mapred branch yet.
> It will be a good surprise to discover it in the trunk... ;-)

I will make some effort to document things more before I merge to trunk,
so that folks know what they're getting.  Many things have changed
(e.g., segment format).  Several things have not yet been fully worked
out and/or implemented (e.g., segment merging).  But the basics are all
working (intranet & whole-web crawling, indexing & search), both in
standalone and distributed configurations.  My focus has been stress
testing the distributed infrastructure (NDFS & MapReduce).  We've
discovered and fixed a number of bugs in this over recent weeks, so it
is getting ever more stable.  I'm hoping that others can help fill in
the gaps in tools.

Once the merge is done I'd like to make a few other changes.

These are:

   1. Remove most static references to NutchConf outside of main()
routines.  The MapReduce-based versions of the command line tools have
no such references.  The biggest change here will be to plugins.
Plugins APIs should probably all be modified to use a factory, and the
factory should be constructed from a NutchConf, e.g., something like:
   public static PluginXFactory PluginXFactory.getFactory(NutchConf);
   public PluginX PluginXFactory.getPlugin(...);
This should permit folks to more easily configure things programmatically
(think JMX) and to run multiple configurations in a single JVM.

   2. FetchListEntry has been mostly replaced with a new, simpler
datastructure called a CrawlDatum.  FetchListEntry is used in the
IndexingFilter API to pass the url, fetch date and incoming anchors.
Currently, in the mapred branch, the indexer creates a dummy
FetchListEntry to pass to plugins.  But instead the IndexingFilter API
should probably be altered to pass the CrawlDatum, anchors and url.
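The factory accessor described in (1) could look something like the following, purely illustrative sketch. The class names (NutchConf, ParserFactory, Parser) and the encoding property are stand-ins, not the actual Nutch plugin API; the point is only that selection is driven by a conf object, so two configurations can coexist in one JVM.

```java
import java.util.HashMap;
import java.util.Map;

// Minimal stand-in for a configuration object.
class NutchConf {
  private final Map<String, String> props = new HashMap<>();
  String get(String key, String dflt) { return props.getOrDefault(key, dflt); }
  void set(String key, String value) { props.put(key, value); }
}

interface Parser {
  String parse(String content);
}

class ParserFactory {
  private final NutchConf conf;
  private ParserFactory(NutchConf conf) { this.conf = conf; }

  // The static accessor sketched above: one factory per configuration,
  // no static configuration state.
  static ParserFactory getFactory(NutchConf conf) {
    return new ParserFactory(conf);
  }

  Parser getParser() {
    // Plugin behavior driven by the conf object, not a static field.
    String enc = conf.get("parser.character.encoding.default", "windows-1252");
    return content -> "[" + enc + "] " + content;
  }
}

public class FactoryDemo {
  public static void main(String[] args) {
    NutchConf a = new NutchConf();
    NutchConf b = new NutchConf();
    b.set("parser.character.encoding.default", "iso-8859-2");
    // Two independent configurations in the same JVM:
    System.out.println(ParserFactory.getFactory(a).getParser().parse("x"));
    System.out.println(ParserFactory.getFactory(b).getParser().parse("x"));
  }
}
```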

I have avoided making these changes since they would make it difficult
to merge improvements to plugins into the mapred branch.  But, once we
have moved mapred to trunk, we should make them soon.  Incompatible API
changes are best made early, so that folks have more time to work with them.
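For change (2), the altered IndexingFilter signature might look roughly like this. The types here are simplified stand-ins (a toy CrawlDatum and Document), not the real Nutch classes; the sketch only shows the shape of passing the CrawlDatum, anchors, and url directly instead of a dummy FetchListEntry.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy stand-in for the new, simpler per-url datum.
class CrawlDatum {
  long fetchTime;
  byte status;
}

// Toy stand-in for the document being indexed.
class Document {
  final Map<String, String> fields = new HashMap<>();
  void add(String name, String value) { fields.put(name, value); }
}

// Proposed shape: the filter receives the pieces FetchListEntry used to
// carry (url, fetch date, incoming anchors) as explicit arguments.
interface IndexingFilter {
  Document filter(Document doc, String url, CrawlDatum datum, List<String> anchors);
}

class BasicIndexingFilter implements IndexingFilter {
  public Document filter(Document doc, String url, CrawlDatum datum, List<String> anchors) {
    doc.add("url", url);
    doc.add("anchor", String.join(" ", anchors));
    doc.add("date", Long.toString(datum.fetchTime));
    return doc;
  }
}
```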

Does this all sound reasonable?

Doug


Re: merge mapred to trunk

Andrzej Białecki-2
In reply to this post by Doug Cutting-2
Doug Cutting wrote:
> Currently we have three versions of nutch: trunk, 0.7 and mapred.  This
> increases the chances for conflicts.  I would thus like to merge the
> mapred branch into trunk soon.  The soonest I could actually start this
> is next week.  Are there any objections?

++1 :-)


--
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: merge mapred to trunk

Otis Gospodnetic-2-2
In reply to this post by Doug Cutting-2
> Currently we have three versions of nutch: trunk, 0.7 and mapred.  This
> increases the chances for conflicts.  I would thus like to merge the
> mapred branch into trunk soon.  The soonest I could actually start this
> is next week.  Are there any objections?

I, too, am looking forward to this, but I am wondering what that will
do to Kelvin Tan's recent contribution, especially since I saw that
both MapReduce and Kelvin's code change how FetchListEntry works.  If
merging mapred to trunk means losing Kelvin's changes, then I suggest
one of the Nutch developers evaluates Kelvin's modifications and, if they
are good, commits them to trunk, and then makes the final pre-mapred
release (e.g. release-0.8).

Otis

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Simpy -- http://www.simpy.com/ -- Find it. Tag it. Share it.

Re: merge mapred to trunk

Doug Cutting-2
[hidden email] wrote:
> I, too, am looking forward to this, but I am wondering what that will
> do to Kelvin Tan's recent contribution, especially since I saw that
> both MapReduce and Kelvin's code change how FetchListEntry works.  If
> merging mapred to trunk means losing Kelvin's changes, then I suggest
> one of Nutch developers evaluates Kelvin's modifications and, if they
> are good, commits them to trunk, and then makes the final pre-mapred
> release (e.g. release-0.8).

It won't lose Kelvin's patch: it will still be a patch to 0.7.

What I worry about is the alternate scenario: that Kelvin & others
invest a lot of effort making this work with 0.7, while the mapred-based
code diverges even further.  It would be best if Kelvin's patch is
ported to the mapred branch sooner rather than later, then maintained there.

Doug

Re: merge mapred to trunk

Otis Gospodnetic-2-2
--- Doug Cutting <[hidden email]> wrote:

> [hidden email] wrote:
> > I, too, am looking forward to this, but I am wondering what that will
> > do to Kelvin Tan's recent contribution, especially since I saw that
> > both MapReduce and Kelvin's code change how FetchListEntry works.  If
> > merging mapred to trunk means losing Kelvin's changes, then I suggest
> > one of Nutch developers evaluates Kelvin's modifications and, if they
> > are good, commits them to trunk, and then makes the final pre-mapred
> > release (e.g. release-0.8).
>
> It won't lose Kelvin's patch: it will still be a patch to 0.7.

Ah, right, we could always make a 0.7.* release from release 0.7.

> What I worry about is the alternate scenario: that Kelvin & others
> invest a lot of effort making this work with 0.7, while the mapred-based
> code diverges even further.  It would be best if Kelvin's patch is
> ported to the mapred branch sooner rather than later, then maintained
> there.

I agree.  I'll actually see Kelvin in person tomorrow, so we'll see if
this is something he can do.  It looks like he added some much-needed
functionality in his patch, so it'd be good to keep it.

Otis

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Simpy -- http://www.simpy.com/ -- Find it. Tag it. Share it.

Re: merge mapred to trunk

Kelvin Tan
In reply to this post by Doug Cutting-2


On Wed, 31 Aug 2005 14:37:54 -0700, Doug Cutting wrote:

> [hidden email] wrote:
>> I, too, am looking forward to this, but I am wondering what that
>> will do to Kelvin Tan's recent contribution, especially since I
>> saw that both MapReduce and Kelvin's code change how
>> FetchListEntry works.  If merging mapred to trunk means losing
>> Kelvin's changes, then I suggest one of Nutch developers
>> evaluates Kelvin's modifications and, if they are good, commits
>> them to trunk, and then makes the final pre-mapred release (e.g.
>> release-0.8).
>>
>
> It won't lose Kelvin's patch: it will still be a patch to 0.7.
>
> What I worry about is the alternate scenario: that Kelvin & others
> invest a lot of effort making this work with 0.7, while the mapred-
> based code diverges even further.  It would be best if Kelvin's
> patch is ported to the mapred branch sooner rather than later, then
> maintained there.
>
> Doug

Agreed. I have some time in the coming weeks, and will work full-time to evolve the patch to be more compatible with Nutch, especially mapred.

k


nutch 0.7 bug?

luti
In reply to this post by Doug Cutting-2
Dear Developers!

I tested nutch 0.7 with all the parser plugins, and found the following:

-------------------------------------------------------------------------
The fetch is broken by, e.g., the following:
-------------------------------------------------------------------------
050901 110915 fetch okay, but can't parse
http://www.dienes-eu.sulinet.hu/informatika/2005/tantervek/hpp/9.doc,
reason: failed
(2,200): org.apache.nutch.parse.msword.FastSavedException: Fast-saved
files are unsupported at this time
050901 110915 fetching http://en.mimi.hu/fishing/scad.html
050901 110917 SEVERE error writing output:java.lang.NullPointerException
java.lang.NullPointerException
        at org.apache.nutch.parse.ParseData.write(ParseData.java:109)
        at
org.apache.nutch.io.SequenceFile$Writer.append(SequenceFile.java:137)
        at org.apache.nutch.io.MapFile$Writer.append(MapFile.java:127)
        at org.apache.nutch.io.ArrayFile$Writer.append(ArrayFile.java:39)
        at
org.apache.nutch.fetcher.Fetcher$FetcherThread.outputPage(Fetcher.java:281)
        at
org.apache.nutch.fetcher.Fetcher$FetcherThread.handleFetch(Fetcher.java:261)
        at
org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:148)
050901 110917 SEVERE error writing output:java.io.IOException: key out
of order: 319 after 319
java.io.IOException: key out of order: 319 after 319
        at org.apache.nutch.io.MapFile$Writer.checkKey(MapFile.java:134)
        at org.apache.nutch.io.MapFile$Writer.append(MapFile.java:120)
        at org.apache.nutch.io.ArrayFile$Writer.append(ArrayFile.java:39)
        at
org.apache.nutch.fetcher.Fetcher$FetcherThread.outputPage(Fetcher.java:281)
        at
org.apache.nutch.fetcher.Fetcher$FetcherThread.handleFetch(Fetcher.java:261)
        at
org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:148)
Exception in thread "main" java.lang.RuntimeException: SEVERE error
logged.  Exiting fetcher.
        at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:354)
        at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:488)
050901 110921 SEVERE error writing output:java.io.IOException: key out
of order: 319 after 319
java.io.IOException: key out of order: 319 after 319
        at org.apache.nutch.io.MapFile$Writer.checkKey(MapFile.java:134)
        at org.apache.nutch.io.MapFile$Writer.append(MapFile.java:120)
        at org.apache.nutch.io.ArrayFile$Writer.append(ArrayFile.java:39)
        at
org.apache.nutch.fetcher.Fetcher$FetcherThread.outputPage(Fetcher.java:281)
        at
org.apache.nutch.fetcher.Fetcher$FetcherThread.handleFetch(Fetcher.java:261)
        at
org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:148)
050901 110921 SEVERE error writing output:java.io.IOException: key out
of order: 319 after 319
java.io.IOException: key out of order: 319 after 319
        at org.apache.nutch.io.MapFile$Writer.checkKey(MapFile.java:134)
        at org.apache.nutch.io.MapFile$Writer.append(MapFile.java:120)
        at org.apache.nutch.io.ArrayFile$Writer.append(ArrayFile.java:39)
        at
org.apache.nutch.fetcher.Fetcher$FetcherThread.outputPage(Fetcher.java:281)
        at
org.apache.nutch.fetcher.Fetcher$FetcherThread.handleFetch(Fetcher.java:261)
        at
org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:148)
050901 110921 SEVERE error writing output:java.io.IOException: key out
of order: 319 after 319
java.io.IOException: key out of order: 319 after 319
        at org.apache.nutch.io.MapFile$Writer.checkKey(MapFile.java:134)
        at org.apache.nutch.io.MapFile$Writer.append(MapFile.java:120)
        at org.apache.nutch.io.ArrayFile$Writer.append(ArrayFile.java:39)
etc.

---------------------------------------------------------------------------
These are the differences between nutch-site.xml and nutch-default.xml:
---------------------------------------------------------------------------
 ***** nutch-default.xml
  <name>http.timeout</name>
  <value>10000</value>
  <description>The default network timeout, in milliseconds.</description>
***** NUTCH-SITE.XML
  <name>http.timeout</name>
  <value>30000</value>
  <description>The default network timeout, in milliseconds.</description>
*****

***** nutch-default.xml
  <name>http.max.delays</name>
  <value>3</value>
  <description>The number of times a thread will delay when trying to
***** NUTCH-SITE.XML
  <name>http.max.delays</name>
  <value>6</value>
  <description>The number of times a thread will delay when trying to
*****

***** nutch-default.xml
  <name>http.content.limit</name>
  <value>65536</value>
  <description>The length limit for downloaded content, in bytes.
***** NUTCH-SITE.XML
  <name>http.content.limit</name>
  <value>130000</value>
  <description>The length limit for downloaded content, in bytes.
*****

***** nutch-default.xml
  <name>file.content.limit</name>
  <value>65536</value>
  <description>The length limit for downloaded content, in bytes.
***** NUTCH-SITE.XML
  <name>file.content.limit</name>
  <value>130000</value>
  <description>The length limit for downloaded content, in bytes.
*****

***** nutch-default.xml
  <name>ftp.content.limit</name>
  <value>65536</value>
  <description>The length limit for downloaded content, in bytes.
***** NUTCH-SITE.XML
  <name>ftp.content.limit</name>
  <value>130000</value>
  <description>The length limit for downloaded content, in bytes.
*****

***** nutch-default.xml
  <name>db.max.outlinks.per.page</name>
  <value>100</value>
  <description>The maximum number of outlinks that we'll process for a page.
***** NUTCH-SITE.XML
  <name>db.max.outlinks.per.page</name>
  <value>200</value>
  <description>The maximum number of outlinks that we'll process for a page.
*****

***** nutch-default.xml
  <name>db.fetch.retry.max</name>
  <value>3</value>
  <description>The maximum number of times a url that has encountered
***** NUTCH-SITE.XML
  <name>db.fetch.retry.max</name>
  <value>6</value>
  <description>The maximum number of times a url that has encountered
*****

***** nutch-default.xml
  <name>fetcher.server.delay</name>
  <value>5.0</value>
  <description>The number of seconds the fetcher will delay between
***** NUTCH-SITE.XML
  <name>fetcher.server.delay</name>
  <value>30.0</value>
  <description>The number of seconds the fetcher will delay between
*****

***** nutch-default.xml
  <name>fetcher.threads.fetch</name>
  <value>10</value>
  <description>The number of FetcherThreads the fetcher should use.
***** NUTCH-SITE.XML
  <name>fetcher.threads.fetch</name>
  <value>100</value>
  <description>The number of FetcherThreads the fetcher should use.
*****

***** nutch-default.xml
  <name>fetcher.threads.per.host</name>
  <value>1</value>
  <description>This number is the maximum number of threads that
***** NUTCH-SITE.XML
  <name>fetcher.threads.per.host</name>
  <value>100</value>
  <description>This number is the maximum number of threads that
*****

***** nutch-default.xml
  <name>parser.threads.parse</name>
  <value>10</value>
  <description>Number of ParserThreads ParseSegment should
use.</description>
***** NUTCH-SITE.XML
  <name>parser.threads.parse</name>
  <value>100</value>
  <description>Number of ParserThreads ParseSegment should
use.</description>
*****

***** nutch-default.xml
  <name>indexer.minMergeDocs</name>
  <value>50</value>
  <description>This number determines the minimum number of Lucene
***** NUTCH-SITE.XML
  <name>indexer.minMergeDocs</name>
  <value>10000</value>
  <description>This number determines the minimum number of Lucene
*****

***** nutch-default.xml
  <name>indexer.maxMergeDocs</name>
  <value>50</value>
  <description>This number determines the maximum number of Lucene
***** NUTCH-SITE.XML
  <name>indexer.maxMergeDocs</name>
  <value>10000000</value>
  <description>This number determines the maximum number of Lucene
*****

***** nutch-default.xml
  <name>searcher.dir</name>
  <value>.</value>
  <description>
***** NUTCH-SITE.XML
  <name>searcher.dir</name>
  <value>/srv/db/</value>
  <description>
*****

***** nutch-default.xml
  <name>ipc.client.timeout</name>
  <value>10000</value>
  <description>Defines the timeout for IPC calls in milliseconds.
</description>
***** NUTCH-SITE.XML
  <name>ipc.client.timeout</name>
  <value>20000</value>
  <description>Defines the timeout for IPC calls in milliseconds.
</description>
*****

***** nutch-default.xml
  <name>plugin.includes</name>
 
<value>protocol-httpclient|urlfilter-regex|parse-(text|html|js)|index-basic|query-(basic|site|url)</value>
  <description>Regular expression naming plugin directory names to
***** NUTCH-SITE.XML
  <name>plugin.includes</name>
 
<value>protocol-httpclient|urlfilter-regex|parse-(text|html|js|pdf|rtf|msword|msexcel|mspowerpoint)|index-(basic|more)|query-(basic|more|site|url)</value>
  <description>Regular expression naming plugin directory names to
*****

***** nutch-default.xml
  <name>parser.character.encoding.default</name>
  <value>windows-1252</value>
  <description>The character encoding to fall back to when no other
information
***** NUTCH-SITE.XML
  <name>parser.character.encoding.default</name>
  <value>iso-8859-2</value>
  <description>The character encoding to fall back to when no other
information
*****

Any idea what the source of the problem is?

Best Regards:
    Ferenc

Event queues vs threads

Kelvin Tan
I'm toying around with the idea of implementing the fetcher as a series of event queues (à la SEDA) instead of with threads. This is done by breaking the fetching operation into a series of stages connected by queues, instead of one FetcherThread per task.

The stages I see are:

1. CrawlStarter (url injection)
2. URL filtering and normalizing
3. HttpRequest
4. HttpResponse
5. DB of fetched MD5 hashes
6. DB of fetched URLs
7. Parse and link extraction
8. Output
9. Link/Page Scoring

Each of these stages will be handled in its own thread (except for HTML parsing and scoring, which may actually benefit from having multiple threads). With the introduction of non-blocking IO, I think threads should be used only where parallel computation offers performance advantages.

Breaking up HttpRequest and HttpResponse will also pave the way for a non-blocking HTTP implementation.

A big advantage also arises from a decrease in programmatic complexity (and possibly performance). With most of the stages being guaranteed to be single-threaded, threading/synchronization issues are dramatically reduced. This may not be so evident in the current/map-red fetch code, but because of the completely online nature of nutch-84/OC, this does simplify things considerably.

I'll need to dig a bit more to see how this can be conceptually translated into map-reduce, but I imagine it's doable. Perhaps each stage gets mapped, then reduced?
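The staged design above can be sketched with standard blocking queues: each stage is a single thread that drains its input queue and feeds the next one, so no synchronization beyond the queues themselves is needed. This is an illustrative toy (the stage names and the fake "fetch" are assumptions), not the nutch-84/OC code.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Function;

public class StagePipeline {
  // One stage: a single thread that takes from `in`, applies `work`,
  // and puts the result on `out`. Single-threaded by construction.
  static <I, O> Thread stage(BlockingQueue<I> in, BlockingQueue<O> out,
                             Function<I, O> work) {
    Thread t = new Thread(() -> {
      try {
        while (true) {
          out.put(work.apply(in.take()));
        }
      } catch (InterruptedException e) {
        // interrupt signals shutdown; thread exits
      }
    });
    t.setDaemon(true);
    t.start();
    return t;
  }

  public static void main(String[] args) throws Exception {
    BlockingQueue<String> injected = new LinkedBlockingQueue<>();
    BlockingQueue<String> filtered = new LinkedBlockingQueue<>();
    BlockingQueue<String> fetched  = new LinkedBlockingQueue<>();

    // Stage 2: URL filtering/normalizing (here, just lowercasing).
    stage(injected, filtered, url -> url.toLowerCase());
    // Stages 3/4: a fake fetch standing in for HttpRequest/HttpResponse.
    stage(filtered, fetched, url -> "content-of:" + url);

    injected.put("HTTP://EXAMPLE.COM/");
    System.out.println(fetched.take()); // content-of:http://example.com/
  }
}
```

A stage that does benefit from parallelism (parsing, scoring) could simply start several such threads on the same pair of queues.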

Any thoughts?


Re: Event queues vs threads

Doug Cutting-2
Kelvin Tan wrote:
> Each of these stages will be handled in its own thread (except for HTML parsing and scoring, which may actually benefit from having multiple threads). With the introduction of non-blocking IO, I think threads should be used only where parallel computation offers performance advantages.
>
> Breaking up HttpRequest and HttpResponse, will also pave the way for a non-blocking HTTP implementation.

I have never been able to write an async version of things with Java's
nio that outperforms a threaded version.  In theory it is possible,
since you can avoid thread-switching overheads.  But in practice I have
found it difficult.

Doug

Re: Event queues vs threads

Kelvin Tan


On Thu, 01 Sep 2005 09:58:49 -0700, Doug Cutting wrote:

> Kelvin Tan wrote:
>> Each of these stages will be handled in its own thread (except
>> for HTML parsing and scoring, which may actually benefit from
>> having multiple threads). With the introduction of non-blocking
>> IO, I think threads should be used only where parallel
>> computation offers performance advantages.
>>
>> Breaking up HttpRequest and HttpResponse, will also pave the way
>> for a non-blocking HTTP implementation.
>>
> I have never been able to write an async version of things with
> Java's nio that outperforms a threaded version.  In theory it is
> possible, since you can avoid thread switching overheads.  But in
> practice I have found it difficult.
>
> Doug

Interesting. I haven't tried it myself. Do you have any code/benchmarks for this? Are you aware of others facing the same problem?

k


Re: Event queues vs threads

Doug Cutting-2
Kelvin Tan wrote:
> Interesting. I haven't tried it myself. Do you have any code/benchmarks for this?

I never committed it anywhere.  I initially tried to write Nutch's IPC
mechanism with nio and it was slow and buggy.  One problem was that I
needed to switch streams to non-blocking mode in order to read
arbitrarily large objects, then switch them back to blocking mode in
order to select() on them.  But you can't change this state and remove
them from the selector without going through the scheduler.  So the
benefit of skipping the scheduler wasn't there.  If I was willing to
fragment objects into fixed size chunks then it might have worked, but
that's a lot of work.  It's a strange limitation, since with native
sockets one can select and then perform arbitrary stream i/o, not
limited to a single buffer.
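The constraint Doug describes is easy to demonstrate: a channel must be in non-blocking mode before it can be registered with a Selector, so you cannot temporarily do blocking stream-style reads on a selected channel and then go back to select() without deregistering it. This small sketch (using a Pipe just to have a selectable channel) is illustrative, not Nutch code.

```java
import java.nio.channels.IllegalBlockingModeException;
import java.nio.channels.Pipe;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;

public class BlockingModeDemo {
  public static void main(String[] args) throws Exception {
    Selector selector = Selector.open();
    Pipe pipe = Pipe.open();
    Pipe.SourceChannel ch = pipe.source();

    ch.configureBlocking(true); // blocking mode: easy stream-style reads...
    try {
      ch.register(selector, SelectionKey.OP_READ);
    } catch (IllegalBlockingModeException e) {
      // ...but blocking channels cannot be registered with a selector.
      System.out.println("cannot select() on a blocking channel");
    }

    ch.configureBlocking(false); // non-blocking: selectable, but each read
    ch.register(selector, SelectionKey.OP_READ); // may return a partial buffer

    selector.close();
    pipe.sink().close();
    ch.close();
  }
}
```

Hence the fragmentation Doug mentions: with only non-blocking reads available on selected channels, arbitrarily large objects must be reassembled from fixed-size chunks by hand.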

Also, there's an nio version of Lucene's Directory that's a bit slower
than the non-nio version, but this is not using select() or anything.

> Are you aware of others facing the same problem?

How much non-blocking nio code do you find in real Java code?  I have
not seen a lot.

I did find that Sun has implemented a high-performance HTTP client using
nio.  This is documented at:

http://blogs.sun.com/roller/resources/fp/grizzly.pdf

From what I can tell the primary benefit is in number of simultaneous
clients, not in throughput.  Does a crawler require 1000's of
simultaneous connections?  If so, then it looks like careful use of nio
could offer some real benefits.

Doug

Re: Event queues vs threads

Piotr Kosiorowski
Hi,
I think some old blog entries are quite interesting, if someone wants to
find out some details about nio:
http://jroller.com/page/pyrasun/20040426
Regards,
Piotr



Re: nutch 0.7 bug?

Michael Nebel
In reply to this post by luti
Hi Ferenc,

I see the same errors. Since I saw a running installation yesterday, I
think it's a configuration mistake, but so far I have no idea where. Have
you made any progress?

Regards

        Michael


>  <name>ipc.client.timeout</name>
>  <value>20000</value>
>  <description>Defines the timeout for IPC calls in milliseconds.
> </description>
> *****
>
> ***** nutch-default.xml
>  <name>plugin.includes</name>
>  
> <value>protocol-httpclient|urlfilter-regex|parse-(text|html|js)|index-basic|query-(basic|site|url)</value>
>
>  <description>Regular expression naming plugin directory names to
> ***** NUTCH-SITE.XML
>  <name>plugin.includes</name>
>  
> <value>protocol-httpclient|urlfilter-regex|parse-(text|html|js|pdf|rtf|msword|msexcel|mspowerpoint)|index-(basic|more)|query-
>
> basic|more|site|url)</value>
>  <description>Regular expression naming plugin directory names to
> *****
>
> ***** nutch-default.xml
>  <name>parser.character.encoding.default</name>
>  <value>windows-1252</value>
>  <description>The character encoding to fall back to when no other
> information
> ***** NUTCH-SITE.XML
>  <name>parser.character.encoding.default</name>
>  <value>iso-8859-2</value>
>  <description>The character encoding to fall back to when no other
> information
> *****
>
> Any idea what is the problem source?
>
> Best Regards:
>    Ferenc


--
Michael Nebel
http://www.nebel.de/
http://www.netluchs.de/


Re: nutch 0.7 bug?

luti
Hi Michael,

I am going back to a nightly build.
I think this problem is related to the 'fetcher.threads.per.host' value
when it is bigger than 1.
Other possible sources are fetcher.threads.fetch and parser.threads.parse.

Best Regards,
    Ferenc
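
A quick way to test that theory is to override just the suspect properties in nutch-site.xml, reverting them to their nutch-default.xml values (the property names and default values below are taken from the diff quoted in this thread; this is an illustrative fragment, not a complete config file):

```xml
<!-- Sketch of a nutch-site.xml override to test the threading theory:
     revert the three suspect thread-count properties to the
     nutch-default.xml values shown in the diff above. -->
<property>
  <name>fetcher.threads.per.host</name>
  <value>1</value>
</property>
<property>
  <name>fetcher.threads.fetch</name>
  <value>10</value>
</property>
<property>
  <name>parser.threads.parse</name>
  <value>10</value>
</property>
```

If the "key out of order" errors disappear with these defaults, that would point at the raised thread counts rather than the parser plugins.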

> Hi Ferenc,
>
> I see the same errors. As I've seen a running installation yesterday,
> I think it's a configuration mistake. By now I have no idea where.
> Have you made any progress?
>
> Regards
>
>     Michael
>
>


Re: nutch 0.7 bug?

Michael Nebel
Just for the mail archives: please see also NUTCH-89.

Thread closed?

Michael




--
Michael Nebel
http://www.nebel.de/
http://www.netluchs.de/
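
For the archives: the stack traces in this thread show the "key out of order: 319 after 319" message coming from org.apache.nutch.io.MapFile$Writer.checkKey, which rejects any appended key that is not strictly greater than the previous one. The following is an illustrative sketch of that invariant (not the actual Nutch classes), showing how two fetcher threads emitting the same key would trip the check:

```java
import java.io.IOException;

// Minimal sketch (not Nutch code) of the invariant that
// MapFile$Writer.checkKey enforces: appended keys must be strictly
// increasing. Appending the same key twice (319 after 319), as can
// happen when multiple threads write the same entry, fails the check.
public class SortedWriterSketch {
    private long lastKey = Long.MIN_VALUE;

    public void append(long key) throws IOException {
        if (key <= lastKey) {
            throw new IOException("key out of order: " + key + " after " + lastKey);
        }
        lastKey = key;
    }

    public static void main(String[] args) {
        SortedWriterSketch w = new SortedWriterSketch();
        try {
            w.append(318);
            w.append(319);
            w.append(319); // duplicate key, as when threads collide
            System.out.println("no error");
        } catch (IOException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

This is consistent with the theory above that fetcher.threads.per.host > 1 is involved: a single writer shared by many threads only sees ordered keys if the threads never produce duplicates.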


Re: Re: nutch 0.7 bug?

luti
Dear Michael,

Thanks for your mail, but I think these are two different problems. I
don't use the RSS parser.

Ferenc

Michael Nebel wrote:

> Just for the mail archives: please see also NUTCH-89.
>
> Thread closed?
>
> Michael
>
>
>
> [hidden email] wrote:
>
>> Hi Michael,
>>
>> I going back to a nigthly build.
>> I think this problem is related to 'fetcher.threads.per.host' value,
>> when it is bigger than 1.
>> There is another possible sources: fetcher.threads.fetch or
>> fetcher.threads.per.host or parser.threads.parse.
>>
>> Best Regards,
>>    Ferenc
>>
>>> Hi Ferenc,
>>>
>>> I see the same errors. As I've seen a running installation
>>> yesterday, I think it's a configuration mistake. By now I have no
>>> idea where. Have you made any progress?
>>>
>>> Regards
>>>
>>>     Michael
>>>
>>>
>>> [hidden email] wrote:
>>>
>>>> Dear Developers!
>>>>
>>>> I tested  nutch 0.7 with all the parser plugins, and found the
>>>> followings:
>>>>
>>>> -------------------------------------------------------------------------
>>>>
>>>> The fetch broken by with e.g. followings:
>>>> -------------------------------------------------------------------------
>>>>
>>>> 050901 110915 fetch okay, but can't parse
>>>> http://www.dienes-eu.sulinet.hu/informatika/2005/tantervek/hpp/9.doc,
>>>> reason: failed
>>>> (2,200): org.apache.nutch.parse.msword.FastSavedException:
>>>> Fast-saved files are unsupported at this time
>>>> 050901 110915 fetching http://en.mimi.hu/fishing/scad.html
>>>> 050901 110917 SEVERE error writing
>>>> output:java.lang.NullPointerException
>>>> java.lang.NullPointerException
>>>>        at org.apache.nutch.parse.ParseData.write(ParseData.java:109)
>>>>        at
>>>> org.apache.nutch.io.SequenceFile$Writer.append(SequenceFile.java:137)
>>>>        at org.apache.nutch.io.MapFile$Writer.append(MapFile.java:127)
>>>>        at
>>>> org.apache.nutch.io.ArrayFile$Writer.append(ArrayFile.java:39)
>>>>        at
>>>> org.apache.nutch.fetcher.Fetcher$FetcherThread.outputPage(Fetcher.java:281)
>>>>
>>>>        at
>>>> org.apache.nutch.fetcher.Fetcher$FetcherThread.handleFetch(Fetcher.java:261)
>>>>
>>>>        at
>>>> org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:148)
>>>> 050901 110917 SEVERE error writing output:java.io.IOException: key
>>>> out of order: 319 after 319
>>>> java.io.IOException: key out of order: 319 after 319
>>>>        at
>>>> org.apache.nutch.io.MapFile$Writer.checkKey(MapFile.java:134)
>>>>        at org.apache.nutch.io.MapFile$Writer.append(MapFile.java:120)
>>>>        at
>>>> org.apache.nutch.io.ArrayFile$Writer.append(ArrayFile.java:39)
>>>>        at
>>>> org.apache.nutch.fetcher.Fetcher$FetcherThread.outputPage(Fetcher.java:281)
>>>>
>>>>        at
>>>> org.apache.nutch.fetcher.Fetcher$FetcherThread.handleFetch(Fetcher.java:261)
>>>>
>>>>        at
>>>> org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:148)
>>>> Exception in thread "main" java.lang.RuntimeException: SEVERE error
>>>> logged.  Exiting fetcher.
>>>>        at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:354)
>>>>        at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:488)
>>>> 050901 110921 SEVERE error writing output:java.io.IOException: key
>>>> out of order: 319 after 319
>>>> java.io.IOException: key out of order: 319 after 319
>>>>        at
>>>> org.apache.nutch.io.MapFile$Writer.checkKey(MapFile.java:134)
>>>>        at org.apache.nutch.io.MapFile$Writer.append(MapFile.java:120)
>>>>        at
>>>> org.apache.nutch.io.ArrayFile$Writer.append(ArrayFile.java:39)
>>>>        at
>>>> org.apache.nutch.fetcher.Fetcher$FetcherThread.outputPage(Fetcher.java:281)
>>>>
>>>>        at
>>>> org.apache.nutch.fetcher.Fetcher$FetcherThread.handleFetch(Fetcher.java:261)
>>>>
>>>>        at
>>>> org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:148)
>>>> 050901 110921 SEVERE error writing output:java.io.IOException: key
>>>> out of order: 319 after 319
>>>> java.io.IOException: key out of order: 319 after 319
>>>>        at
>>>> org.apache.nutch.io.MapFile$Writer.checkKey(MapFile.java:134)
>>>>        at org.apache.nutch.io.MapFile$Writer.append(MapFile.java:120)
>>>>        at
>>>> org.apache.nutch.io.ArrayFile$Writer.append(ArrayFile.java:39)
>>>>        at
>>>> org.apache.nutch.fetcher.Fetcher$FetcherThread.outputPage(Fetcher.java:281)
>>>>
>>>>        at
>>>> org.apache.nutch.fetcher.Fetcher$FetcherThread.handleFetch(Fetcher.java:261)
>>>>
>>>>        at
>>>> org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:148)
>>>> 050901 110921 SEVERE error writing output:java.io.IOException: key
>>>> out of order: 319 after 319
>>>> java.io.IOException: key out of order: 319 after 319
>>>>        at
>>>> org.apache.nutch.io.MapFile$Writer.checkKey(MapFile.java:134)
>>>>        at org.apache.nutch.io.MapFile$Writer.append(MapFile.java:120)
>>>>        at
>>>> org.apache.nutch.io.ArrayFile$Writer.append(ArrayFile.java:39)
>>>> etc.
>>>>
>>>> ---------------------------------------------------------------------------
>>>>
>>>> There are the differences between nutch-site.xml and
>>>> nutch-default.xml:
>>>> ---------------------------------------------------------------------------
>>>>
>>>> ***** nutch-default.xml
>>>>  <name>http.timeout</name>
>>>>  <value>10000</value>
>>>>  <description>The default network timeout, in
>>>> milliseconds.</description>
>>>> ***** NUTCH-SITE.XML
>>>>  <name>http.timeout</name>
>>>>  <value>30000</value>
>>>>  <description>The default network timeout, in
>>>> milliseconds.</description>
>>>> *****
>>>>
>>>> ***** nutch-default.xml
>>>>  <name>http.max.delays</name>
>>>>  <value>3</value>
>>>>  <description>The number of times a thread will delay when trying to
>>>> ***** NUTCH-SITE.XML
>>>>  <name>http.max.delays</name>
>>>>  <value>6</value>
>>>>  <description>The number of times a thread will delay when trying to
>>>> *****
>>>>
>>>> ***** nutch-default.xml
>>>>  <name>http.content.limit</name>
>>>>  <value>65536</value>
>>>>  <description>The length limit for downloaded content, in bytes.
>>>> ***** NUTCH-SITE.XML
>>>>  <name>http.content.limit</name>
>>>>  <value>130000</value>
>>>>  <description>The length limit for downloaded content, in bytes.
>>>> *****
>>>>
>>>> ***** nutch-default.xml
>>>>  <name>file.content.limit</name>
>>>>  <value>65536</value>
>>>>  <description>The length limit for downloaded content, in bytes.
>>>> ***** NUTCH-SITE.XML
>>>>  <name>file.content.limit</name>
>>>>  <value>130000</value>
>>>>  <description>The length limit for downloaded content, in bytes.
>>>> *****
>>>>
>>>> ***** nutch-default.xml
>>>>  <name>ftp.content.limit</name>
>>>>  <value>65536</value>
>>>>  <description>The length limit for downloaded content, in bytes.
>>>> ***** NUTCH-SITE.XML
>>>>  <name>ftp.content.limit</name>
>>>>  <value>130000</value>
>>>>  <description>The length limit for downloaded content, in bytes.
>>>> *****
>>>>
>>>> ***** nutch-default.xml
>>>>  <name>db.max.outlinks.per.page</name>
>>>>  <value>100</value>
>>>>  <description>The maximum number of outlinks that we'll process for
>>>> a page.
>>>> ***** NUTCH-SITE.XML
>>>>  <name>db.max.outlinks.per.page</name>
>>>>  <value>200</value>
>>>>  <description>The maximum number of outlinks that we'll process for
>>>> a page.
>>>> *****
>>>>
>>>> ***** nutch-default.xml
>>>>  <name>db.fetch.retry.max</name>
>>>>  <value>3</value>
>>>>  <description>The maximum number of times a url that has encountered
>>>> ***** NUTCH-SITE.XML
>>>>  <name>db.fetch.retry.max</name>
>>>>  <value>6</value>
>>>>  <description>The maximum number of times a url that has encountered
>>>> *****
>>>>
>>>> ***** nutch-default.xml
>>>>  <name>fetcher.server.delay</name>
>>>>  <value>5.0</value>
>>>>  <description>The number of seconds the fetcher will delay between
>>>> ***** NUTCH-SITE.XML
>>>>  <name>fetcher.server.delay</name>
>>>>  <value>30.0</value>
>>>>  <description>The number of seconds the fetcher will delay between
>>>> *****
>>>>
>>>> ***** nutch-default.xml
>>>>  <name>fetcher.threads.fetch</name>
>>>>  <value>10</value>
>>>>  <description>The number of FetcherThreads the fetcher should use.
>>>> ***** NUTCH-SITE.XML
>>>>  <name>fetcher.threads.fetch</name>
>>>>  <value>100</value>
>>>>  <description>The number of FetcherThreads the fetcher should use.
>>>> *****
>>>>
>>>> ***** nutch-default.xml
>>>>  <name>fetcher.threads.per.host</name>
>>>>  <value>1</value>
>>>>  <description>This number is the maximum number of threads that
>>>> ***** NUTCH-SITE.XML
>>>>  <name>fetcher.threads.per.host</name>
>>>>  <value>100</value>
>>>>  <description>This number is the maximum number of threads that
>>>> *****
>>>>
>>>> ***** nutch-default.xml
>>>>  <name>parser.threads.parse</name>
>>>>  <value>10</value>
>>>>  <description>Number of ParserThreads ParseSegment should
>>>> use.</description>
>>>> ***** NUTCH-SITE.XML
>>>>  <name>parser.threads.parse</name>
>>>>  <value>100</value>
>>>>  <description>Number of ParserThreads ParseSegment should
>>>> use.</description>
>>>> *****
>>>>
>>>> ***** nutch-default.xml
>>>>  <name>indexer.minMergeDocs</name>
>>>>  <value>50</value>
>>>>  <description>This number determines the minimum number of Lucene
>>>> ***** NUTCH-SITE.XML
>>>>  <name>indexer.minMergeDocs</name>
>>>>  <value>10000</value>
>>>>  <description>This number determines the minimum number of Lucene
>>>> *****
>>>>
>>>> ***** nutch-default.xml
>>>>  <name>indexer.maxMergeDocs</name>
>>>>  <value>50</value>
>>>>  <description>This number determines the maximum number of Lucene
>>>> ***** NUTCH-SITE.XML
>>>>  <name>indexer.maxMergeDocs</name>
>>>>  <value>10000000</value>
>>>>  <description>This number determines the maximum number of Lucene
>>>> *****
>>>>
>>>> ***** nutch-default.xml
>>>>  <name>searcher.dir</name>
>>>>  <value>.</value>
>>>>  <description>
>>>> ***** NUTCH-SITE.XML
>>>>  <name>searcher.dir</name>
>>>>  <value>/srv/db/</value>
>>>>  <description>
>>>> *****
>>>>
>>>> ***** nutch-default.xml
>>>>  <name>ipc.client.timeout</name>
>>>>  <value>10000</value>
>>>>  <description>Defines the timeout for IPC calls in milliseconds.
>>>> </description>
>>>> ***** NUTCH-SITE.XML
>>>>  <name>ipc.client.timeout</name>
>>>>  <value>20000</value>
>>>>  <description>Defines the timeout for IPC calls in milliseconds.
>>>> </description>
>>>> *****
>>>>
>>>> ***** nutch-default.xml
>>>>  <name>plugin.includes</name>
>>>>  <value>protocol-httpclient|urlfilter-regex|parse-(text|html|js)|index-basic|query-(basic|site|url)</value>
>>>>  <description>Regular expression naming plugin directory names to
>>>> ***** NUTCH-SITE.XML
>>>>  <name>plugin.includes</name>
>>>>  <value>protocol-httpclient|urlfilter-regex|parse-(text|html|js|pdf|rtf|msword|msexcel|mspowerpoint)|index-(basic|more)|query-(basic|more|site|url)</value>
>>>>  <description>Regular expression naming plugin directory names to
>>>> *****
>>>>
>>>> ***** nutch-default.xml
>>>>  <name>parser.character.encoding.default</name>
>>>>  <value>windows-1252</value>
>>>>  <description>The character encoding to fall back to when no other
>>>> information
>>>> ***** NUTCH-SITE.XML
>>>>  <name>parser.character.encoding.default</name>
>>>>  <value>iso-8859-2</value>
>>>>  <description>The character encoding to fall back to when no other
>>>> information
>>>> *****
>>>>
>>>> Any idea what the source of the problem is?
>>>>
>>>> Best Regards:
>>>>    Ferenc

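The property listing above looks like hand-collated diff output. A small script can reproduce it mechanically by parsing both config files and printing only the overridden properties; the `conf/` paths below are assumptions, adjust them to your Nutch installation:

```python
# Sketch: list the properties where nutch-site.xml overrides nutch-default.xml.
# Paths are assumptions -- point them at your own conf/ directory.
import xml.etree.ElementTree as ET

def load_props(path):
    """Return {name: value} for every <property> in a Nutch config file."""
    props = {}
    for prop in ET.parse(path).getroot().iter("property"):
        name = prop.findtext("name")
        if name is not None:
            props[name] = prop.findtext("value")
    return props

def diff_props(default_path, site_path):
    """Return (name, default_value, site_value) for every overridden property."""
    default = load_props(default_path)
    site = load_props(site_path)
    return [(name, default.get(name), value)
            for name, value in site.items()
            if default.get(name) != value]

if __name__ == "__main__":
    for name, old, new in diff_props("conf/nutch-default.xml",
                                     "conf/nutch-site.xml"):
        print("***** %s: %s -> %s" % (name, old, new))
```

Running it against the two files should print one line per overridden property, e.g. `***** http.timeout: 10000 -> 30000`.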

Re: merge mapred to trunk

Doug Cutting-2
In reply to this post by Doug Cutting-2
I will postpone the merge of the mapred branch into trunk until I have a
chance to (a) add some MapReduce documentation; and (b) implement
MapReduce-based dedup.

Doug
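MapReduce-based dedup, as mentioned above, follows a simple shape: map each document to a key derived from its content digest, then keep one survivor per key in the reduce step. The sketch below is illustrative only, not Nutch's actual implementation; the `url`, `digest`, and `score` field names are assumptions:

```python
# Illustrative MapReduce-style dedup sketch (NOT Nutch's implementation).
# Documents sharing a content digest are grouped, and only the
# highest-scoring one in each group survives.
from collections import defaultdict

def map_phase(docs):
    """Emit (digest, doc) pairs, keying each document by its content digest."""
    for doc in docs:
        yield doc["digest"], doc

def reduce_phase(pairs):
    """Group pairs by digest; keep the highest-scoring doc per group."""
    groups = defaultdict(list)
    for digest, doc in pairs:
        groups[digest].append(doc)
    return [max(group, key=lambda d: d["score"])
            for group in groups.values()]

def dedup(docs):
    return reduce_phase(map_phase(docs))
```

In a real MapReduce framework the grouping in `reduce_phase` is done by the shuffle, so the reducer only sees one digest's documents at a time; the in-memory `defaultdict` here just stands in for that.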

Doug Cutting wrote:
> Currently we have three versions of nutch: trunk, 0.7 and mapred.  This
> increases the chances for conflicts.  I would thus like to merge the
> mapred branch into trunk soon.  The soonest I could actually start this
> is next week.  Are there any objections?
>
> Doug