No node available for block <blockID> errors

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
2 messages Options
Reply | Threaded
Open this post in threaded view
|

No node available for block <blockID> errors

Chris Schneider-2
Gang,

We're now running into "No node available for block <blockID>"
errors, which are killing our MapReduce-based crawling jobs. I did
some digging through our logs after one of these events to build a
context for the problem, so I thought I'd share the results with the
rest of you.

Our crawler died during the reduce part of the crawldb update (third
time around fetch loop). The ipc.client.timeout value was set to 1.8M
(30 minutes). The job eventually dies because a map task is unable to
get a block that it needs. However, this block apparently does exist
on the DataNode and is successfully transferred for some reason 13
minutes after the job gets aborted. A number of "Read timed out" and
"EOF reading from Socket" errors do appear in the DataNode log before
the problem first surfaces. However, there's a lot of successful
activity as well.

Afterward, I was able to successfully update the crawldb using the
command-line tool (after a stop-all.sh/start-all.sh sequence for good
measure), and I restarted the crawler. It ran for a couple of
generate/fetch/update loops before it finally died during URL
partitioning, again due to "No node available for block <blockID>"
errors. These errors referred to different blocks on a different
DataNode slave, but the blocks were apparently there on the DataNode,
as evidenced by later log activity. Again, a stop-all.sh/start-all.sh
sequence apparently put the crawler back in business, and it's still
running now.

Any light that could be shed on the underlying problem(s) would be
greatly appreciated.

Thanks,

- Chris

11:45:24pm - The crawl begins:

bin/do_crawl.sh started at 20060204234524...

12:17:17am - A series of  read timeout errors appear in the DataNode
log on s2. No evidence of these problems appears in the NameNode log.

060205 001717 2457 DataXCeiver
java.net.SocketTimeoutException: Read timed out
         at java.net.SocketInputStream.socketRead0(Native Method)
         at java.net.SocketInputStream.read(SocketInputStream.java:129)
         at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
         at java.io.BufferedInputStream.read(BufferedInputStream.java:235)
         at java.io.FilterInputStream.read(FilterInputStream.java:66)
         at org.apache.nutch.ndfs.DataNode$DataXceiver.run(DataNode.java:293)
         at java.lang.Thread.run(Thread.java:595)

...snip...

060205 001803 2523 DataXCeiver
java.net.SocketTimeoutException: Read timed out
         at java.net.SocketInputStream.socketRead0(Native Method)
         at java.net.SocketInputStream.read(SocketInputStream.java:129)
         at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
         at java.io.BufferedInputStream.read(BufferedInputStream.java:235)  
         at java.io.FilterInputStream.read(FilterInputStream.java:66)
         at org.apache.nutch.ndfs.DataNode$DataXceiver.run(DataNode.java:293)
         at java.lang.Thread.run(Thread.java:595)

12:19:56am - An EOF error appears in the DataNode log on s2. No
evidence of this problem appears in the NameNode log:

060205 001956 2709 DataXCeiver
java.io.EOFException: EOF reading from
Socket[addr=/192.168.1.9,port=42878,localport=50010]
         at org.apache.nutch.ndfs.DataNode$DataXceiver.run(DataNode.java:383)
         at java.lang.Thread.run(Thread.java:595)

12:50:57am - Another series of  read timeout errors appear in the
DataNode log on s2.

12:53:40am - Another EOF error appears in the DataNode log on s2.

1:27:21am - Another series of  read timeout errors appear in the
DataNode log on s2. A few EOF errors are sprinkled in as well.

1:29:13am - We see the first mention of blk_1084437997772174917 in
the DataNode log on s2. It appears to have received this block from
192.168.1.5  (s4):

060205 012913 4178 Received block blk_1084437997772174917 from /192.168.1.5

1:29:22am - A series of "Block <blockID> is valid, and cannot be
written to" errors appear in the DataNode log on s2. None of these
refer to blk_1084437997772174917, however:

060205 012922 4205 DataXCeiver
java.io.IOException: Block blk_-4636625784756138174 is valid, and
cannot be written to.
         at org.apache.nutch.ndfs.FSDataset.writeToBlock(FSDataset.java:264)
         at org.apache.nutch.ndfs.DataNode$DataXceiver.run(DataNode.java:329)
         at java.lang.Thread.run(Thread.java:595)

...snip...

060205 012923 4212 DataXCeiver
java.io.EOFException: EOF reading from
Socket[addr=/192.168.1.4,port=44881,localport=50010]
         at org.apache.nutch.ndfs.DataNode$DataXceiver.run(DataNode.java:383)
         at java.lang.Thread.run(Thread.java:595)
060205 012924 4213 Received block blk_-7873577862877309136 from /192.168.1.2
060205 012925 4214 DataXCeiver
java.io.EOFException: EOF reading from
Socket[addr=/192.168.1.5,port=49342,localport=50010]
         at org.apache.nutch.ndfs.DataNode$DataXceiver.run(DataNode.java:383)
         at java.lang.Thread.run(Thread.java:595)
060205 012926 4215 Received block blk_8833881170910103633 from
/192.168.1.9 and mirrored to s5/192.168.1.6:50010
060205 012926 4216 Received block blk_-9065065135965140034 from /192.168.1.1
060205 012926 4217 DataXCeiver
java.io.IOException: Block blk_-803324019234272520 is valid, and
cannot be written to.
         at org.apache.nutch.ndfs.FSDataset.writeToBlock(FSDataset.java:264)
         at org.apache.nutch.ndfs.DataNode$DataXceiver.run(DataNode.java:329)
         at java.lang.Thread.run(Thread.java:595)
060205 012926 4218 Received block blk_1640893643911715360 from
/192.168.1.8 and mirrored to s9/192.168.1.10:50010
060205 012926 4219 Received block blk_8581877778682012452 from /192.168.1.10
060205 012926 4220 Received block blk_5105973706640312551 from /192.168.1.5
060205 012926 4221 DataXCeiver
java.io.IOException: Block blk_8383681341419474724 is valid, and
cannot be written to.
         at org.apache.nutch.ndfs.FSDataset.writeToBlock(FSDataset.java:264)
         at org.apache.nutch.ndfs.DataNode$DataXceiver.run(DataNode.java:329)
         at java.lang.Thread.run(Thread.java:595)

1:30:51am - The crawldb update starts, and it quickly completes the
map portion of the update:

060205 013011 CrawlDb update: starting
060205 013011 CrawlDb update: db: crawl-20060129091444/crawldb
060205 013011 CrawlDb update: segment:
crawl-20060129091444/segments/20060205005908
060205 013011 parsing file:/home/crawler/nutch/conf/nutch-default.xml
060205 013011 parsing file:/home/crawler/nutch/conf/crawl-tool.xml
060205 013011 parsing file:/home/crawler/nutch/conf/mapred-default.xml
060205 013011 parsing file:/home/crawler/nutch/conf/mapred-default.xml
060205 013011 parsing file:/home/crawler/nutch/conf/nutch-site.xml
060205 013011 CrawlDb update: Merging segment data into db.
060205 013051 Running job: job_733lh6
060205 013052  map 0%

...snip...

060205 013228  map 100%

1:31:55am - One of the last map tasks for this job is assigned to
tracker_69547 (on s5):

060205 013155 Adding task 'task_m_1yvznp' to set for tracker 'tracker_69547'

1:32:07am - Tracker_69547 (on s5) starts complaining about not being
able to read blk_1084437997772174917:

060205 013207 task_m_1yvznp  No node available for block
blk_1084437997772174917
060205 013207 task_m_1yvznp  Could not obtain block from any node:
java.io.IOException: No live nodes contain current block
060205 013217 task_m_1yvznp  No node available for block
blk_1084437997772174917
060205 013217 task_m_1yvznp  Could not obtain block from any node:
java.io.IOException: No live nodes contain current block

...snip...

060205 014149 task_m_1yvznp  No node available for block
blk_1084437997772174917
060205 014149 task_m_1yvznp  Could not obtain block from any node:
java.io.IOException: No live nodes contain current block

1:41:55am - Tracker_69547 (on s5) gives up and kills its child process:

060205 014155 Task task_m_1yvznp timed out.  Killing.
060205 014156 Server connection on port 50050 from 127.0.0.1: exiting
060205 014156 task_m_1yvznp Child Error
java.io.IOException: Task process exit with nonzero status.
         at org.apache.nutch.mapred.TaskRunner.runChild(TaskRunner.java:139)
         at org.apache.nutch.mapred.TaskRunner.run(TaskRunner.java:92)
060205 014159 task_m_1yvznp done; removing files.

1:41:59am - The TaskTracker notices that it has lost the map task, so
it reassigns it to tracker_69547 (on s5) three more times, with the
exact same results in all of the logs. After the fourth failure, it
aborts the owning job:

060205 014159 Task 'task_m_1yvznp' has been lost.
060205 014159 Adding task 'task_m_1yvznp' to set for tracker 'tracker_69547'
060205 015205 Task 'task_m_1yvznp' has been lost.
060205 015205 Adding task 'task_m_1yvznp' to set for tracker 'tracker_69547'
060205 020211 Task 'task_m_1yvznp' has been lost.
060205 020212 Adding task 'task_m_1yvznp' to set for tracker 'tracker_69547'
060205 021218 Task 'task_m_1yvznp' has been lost.
060205 021218 Task task_m_1yvznp has failed 4 times.  Aborting owning
job job_733lh6
060205 021218 Server connection on port 8010 from 127.0.0.1: exiting

2:25:31am - We see the first mention of the "missing" block in the
NameNode log. This block was apparently hosted on s2:

060205 022531 Pending transfer (block blk_1084437997772174917) from
s2:50010 to 1 destinations

2:25:31am  - We see evidence of the "missing" block being
successfully transferred in the DataNode log on s2. There is
absolutely nothing in the log between a nondescript entry at
1:32:26am (which is after the time that tracker_69547 started
complaining about the missing block) and the block transfer messages
at 2:25:31am:

060205 013226 4479 Served block blk_4776786656991943447 to /192.168.1.8
060205 022531 11 Starting thread to transfer block
blk_-1870083282504339828 to
[Lorg.apache.nutch.ndfs.DatanodeInfo;@1561437
060205 022531 11 Starting thread to transfer block
blk_1084437997772174917 to
[Lorg.apache.nutch.ndfs.DatanodeInfo;@b6b2a5
060205 022531 4480 Transmitted block blk_-1870083282504339828 to
s9/192.168.1.10:50010
060205 022531 4481 Transmitted block blk_1084437997772174917 to
s9/192.168.1.10:50010


--
------------------------
Chris Schneider
TransPac Software, Inc.
[hidden email]
------------------------
Reply | Threaded
Open this post in threaded view
|

Re: No node available for block <blockID> errors

Stefan Groschupf-2
Chris,
this is may a issue for the developer list.
If you already check that your network is not the problem ( I had  
seen things like that with cheap switches) than you better send this  
request to the developer list or/and open a bug report.
Stefan
Am 06.02.2006 um 16:08 schrieb Chris Schneider:

> Gang,
>
> We're now running into "No node available for block <blockID>"  
> errors, which are killing our MapReduce-based crawling jobs. I did  
> some digging through our logs after one of these events to build a  
> context for the problem, so I thought I'd share the results with  
> the rest of you.
>
> Our crawler died during the reduce part of the crawldb update  
> (third time around fetch loop). The ipc.client.timeout value was  
> set to 1.8M (30 minutes). The job eventually dies because a map  
> task is unable to get a block that it needs. However, this block  
> apparently does exist on the DataNode and is successfully  
> transferred for some reason 13 minutes after the job gets aborted.  
> A number of "Read timed out" and "EOF reading from Socket" errors  
> do appear in the DataNode log before the problem first surfaces.  
> However, there's a lot of successful activity as well.
>
> Afterward, I was able to successfully update the crawldb using the  
> command-line tool (after a stop-all.sh/start-all.sh sequence for  
> good measure), and I restarted the crawler. It ran for a couple of  
> generate/fetch/update loops before it finally died during URL  
> partitioning, again due to "No node available for block <blockID>"  
> errors. These errors referred to different blocks on a different  
> DataNode slave, but the blocks were apparently there on the  
> DataNode, as evidenced by later log activity. Again, a stop-all.sh/
> start-all.sh sequence apparently put the crawler back in business,  
> and it's still running now.
>
> Any light that could be shed on the underlying problem(s) would be  
> greatly appreciated.
>
> Thanks,
>
> - Chris
>
> 11:45:24pm - The crawl begins:
>
> bin/do_crawl.sh started at 20060204234524...
>
> 12:17:17am - A series of  read timeout errors appear in the  
> DataNode log on s2. No evidence of these problems appears in the  
> NameNode log.
>
> 060205 001717 2457 DataXCeiver
> java.net.SocketTimeoutException: Read timed out
>         at java.net.SocketInputStream.socketRead0(Native Method)
>         at java.net.SocketInputStream.read(SocketInputStream.java:129)
>         at java.io.BufferedInputStream.fill
> (BufferedInputStream.java:218)
>         at java.io.BufferedInputStream.read
> (BufferedInputStream.java:235)
>         at java.io.FilterInputStream.read(FilterInputStream.java:66)
>         at org.apache.nutch.ndfs.DataNode$DataXceiver.run
> (DataNode.java:293)
>         at java.lang.Thread.run(Thread.java:595)
>
> ...snip...
>
> 060205 001803 2523 DataXCeiver
> java.net.SocketTimeoutException: Read timed out
>         at java.net.SocketInputStream.socketRead0(Native Method)
>         at java.net.SocketInputStream.read(SocketInputStream.java:129)
>         at java.io.BufferedInputStream.fill
> (BufferedInputStream.java:218)
>         at java.io.BufferedInputStream.read
> (BufferedInputStream.java:235)          at  
> java.io.FilterInputStream.read(FilterInputStream.java:66)
>         at org.apache.nutch.ndfs.DataNode$DataXceiver.run
> (DataNode.java:293)
>         at java.lang.Thread.run(Thread.java:595)
>
> 12:19:56am - An EOF error appears in the DataNode log on s2. No  
> evidence of this problem appears in the NameNode log:
>
> 060205 001956 2709 DataXCeiver
> java.io.EOFException: EOF reading from Socket[addr=/
> 192.168.1.9,port=42878,localport=50010]
>         at org.apache.nutch.ndfs.DataNode$DataXceiver.run
> (DataNode.java:383)
>         at java.lang.Thread.run(Thread.java:595)
>
> 12:50:57am - Another series of  read timeout errors appear in the  
> DataNode log on s2.
>
> 12:53:40am - Another EOF error appears in the DataNode log on s2.
>
> 1:27:21am - Another series of  read timeout errors appear in the  
> DataNode log on s2. A few EOF errors are sprinkled in as well.
>
> 1:29:13am - We see the first mention of blk_1084437997772174917 in  
> the DataNode log on s2. It appears to have received this block from  
> 192.168.1.5  (s4):
>
> 060205 012913 4178 Received block blk_1084437997772174917 from /
> 192.168.1.5
>
> 1:29:22am - A series of "Block <blockID> is valid, and cannot be  
> written to" errors appear in the DataNode log on s2. None of these  
> refer to blk_1084437997772174917, however:
>
> 060205 012922 4205 DataXCeiver
> java.io.IOException: Block blk_-4636625784756138174 is valid, and  
> cannot be written to.
>         at org.apache.nutch.ndfs.FSDataset.writeToBlock
> (FSDataset.java:264)
>         at org.apache.nutch.ndfs.DataNode$DataXceiver.run
> (DataNode.java:329)
>         at java.lang.Thread.run(Thread.java:595)
>
> ...snip...
>
> 060205 012923 4212 DataXCeiver
> java.io.EOFException: EOF reading from Socket[addr=/
> 192.168.1.4,port=44881,localport=50010]
>         at org.apache.nutch.ndfs.DataNode$DataXceiver.run
> (DataNode.java:383)
>         at java.lang.Thread.run(Thread.java:595)
> 060205 012924 4213 Received block blk_-7873577862877309136 from /
> 192.168.1.2
> 060205 012925 4214 DataXCeiver
> java.io.EOFException: EOF reading from Socket[addr=/
> 192.168.1.5,port=49342,localport=50010]
>         at org.apache.nutch.ndfs.DataNode$DataXceiver.run
> (DataNode.java:383)
>         at java.lang.Thread.run(Thread.java:595)
> 060205 012926 4215 Received block blk_8833881170910103633 from /
> 192.168.1.9 and mirrored to s5/192.168.1.6:50010
> 060205 012926 4216 Received block blk_-9065065135965140034 from /
> 192.168.1.1
> 060205 012926 4217 DataXCeiver
> java.io.IOException: Block blk_-803324019234272520 is valid, and  
> cannot be written to.
>         at org.apache.nutch.ndfs.FSDataset.writeToBlock
> (FSDataset.java:264)
>         at org.apache.nutch.ndfs.DataNode$DataXceiver.run
> (DataNode.java:329)
>         at java.lang.Thread.run(Thread.java:595)
> 060205 012926 4218 Received block blk_1640893643911715360 from /
> 192.168.1.8 and mirrored to s9/192.168.1.10:50010
> 060205 012926 4219 Received block blk_8581877778682012452 from /
> 192.168.1.10
> 060205 012926 4220 Received block blk_5105973706640312551 from /
> 192.168.1.5
> 060205 012926 4221 DataXCeiver
> java.io.IOException: Block blk_8383681341419474724 is valid, and  
> cannot be written to.
>         at org.apache.nutch.ndfs.FSDataset.writeToBlock
> (FSDataset.java:264)
>         at org.apache.nutch.ndfs.DataNode$DataXceiver.run
> (DataNode.java:329)
>         at java.lang.Thread.run(Thread.java:595)
>
> 1:30:51am - The crawldb update starts, and it quickly completes the  
> map portion of the update:
>
> 060205 013011 CrawlDb update: starting
> 060205 013011 CrawlDb update: db: crawl-20060129091444/crawldb
> 060205 013011 CrawlDb update: segment: crawl-20060129091444/
> segments/20060205005908
> 060205 013011 parsing file:/home/crawler/nutch/conf/nutch-default.xml
> 060205 013011 parsing file:/home/crawler/nutch/conf/crawl-tool.xml
> 060205 013011 parsing file:/home/crawler/nutch/conf/mapred-default.xml
> 060205 013011 parsing file:/home/crawler/nutch/conf/mapred-default.xml
> 060205 013011 parsing file:/home/crawler/nutch/conf/nutch-site.xml
> 060205 013011 CrawlDb update: Merging segment data into db.
> 060205 013051 Running job: job_733lh6
> 060205 013052  map 0%
>
> ...snip...
>
> 060205 013228  map 100%
>
> 1:31:55am - One of the last map tasks for this job is assigned to  
> tracker_69547 (on s5):
>
> 060205 013155 Adding task 'task_m_1yvznp' to set for tracker  
> 'tracker_69547'
>
> 1:32:07am - Tracker_69547 (on s5) starts complaining about not  
> being able to read blk_1084437997772174917:
>
> 060205 013207 task_m_1yvznp  No node available for block  
> blk_1084437997772174917
> 060205 013207 task_m_1yvznp  Could not obtain block from any node:  
> java.io.IOException: No live nodes contain current block
> 060205 013217 task_m_1yvznp  No node available for block  
> blk_1084437997772174917
> 060205 013217 task_m_1yvznp  Could not obtain block from any node:  
> java.io.IOException: No live nodes contain current block
>
> ...snip...
>
> 060205 014149 task_m_1yvznp  No node available for block  
> blk_1084437997772174917
> 060205 014149 task_m_1yvznp  Could not obtain block from any node:  
> java.io.IOException: No live nodes contain current block
>
> 1:41:55am - Tracker_69547 (on s5) gives up and kills its child  
> process:
>
> 060205 014155 Task task_m_1yvznp timed out.  Killing.
> 060205 014156 Server connection on port 50050 from 127.0.0.1: exiting
> 060205 014156 task_m_1yvznp Child Error
> java.io.IOException: Task process exit with nonzero status.
>         at org.apache.nutch.mapred.TaskRunner.runChild
> (TaskRunner.java:139)
>         at org.apache.nutch.mapred.TaskRunner.run(TaskRunner.java:92)
> 060205 014159 task_m_1yvznp done; removing files.
>
> 1:41:59am - The TaskTracker notices that it has lost the map task,  
> so it reassigns it to tracker_69547 (on s5) three more times, with  
> the exact same results in all of the logs. After the fourth  
> failure, it aborts the owning job:
>
> 060205 014159 Task 'task_m_1yvznp' has been lost.
> 060205 014159 Adding task 'task_m_1yvznp' to set for tracker  
> 'tracker_69547'
> 060205 015205 Task 'task_m_1yvznp' has been lost.
> 060205 015205 Adding task 'task_m_1yvznp' to set for tracker  
> 'tracker_69547'
> 060205 020211 Task 'task_m_1yvznp' has been lost.
> 060205 020212 Adding task 'task_m_1yvznp' to set for tracker  
> 'tracker_69547'
> 060205 021218 Task 'task_m_1yvznp' has been lost.
> 060205 021218 Task task_m_1yvznp has failed 4 times.  Aborting  
> owning job job_733lh6
> 060205 021218 Server connection on port 8010 from 127.0.0.1: exiting
>
> 2:25:31am - We see the first mention of the "missing" block in the  
> NameNode log. This block was apparently hosted on s2:
>
> 060205 022531 Pending transfer (block blk_1084437997772174917) from  
> s2:50010 to 1 destinations
>
> 2:25:31am  - We see evidence of the "missing" block being  
> successfully transferred in the DataNode log on s2. There is  
> absolutely nothing in the log between a nondescript entry at  
> 1:32:26am (which is after the time that tracker_69547 started  
> complaining about the missing block) and the block transfer  
> messages at 2:25:31am:
>
> 060205 013226 4479 Served block blk_4776786656991943447 to /
> 192.168.1.8
> 060205 022531 11 Starting thread to transfer block  
> blk_-1870083282504339828 to  
> [Lorg.apache.nutch.ndfs.DatanodeInfo;@1561437
> 060205 022531 11 Starting thread to transfer block  
> blk_1084437997772174917 to  
> [Lorg.apache.nutch.ndfs.DatanodeInfo;@b6b2a5
> 060205 022531 4480 Transmitted block blk_-1870083282504339828 to  
> s9/192.168.1.10:50010
> 060205 022531 4481 Transmitted block blk_1084437997772174917 to  
> s9/192.168.1.10:50010
>
>
> --
> ------------------------
> Chris Schneider
> TransPac Software, Inc.
> [hidden email]
> ------------------------

---------------------------------------------------------------
company:        http://www.media-style.com
forum:        http://www.text-mining.org
blog:            http://www.find23.net