NDFS java.io.IOException


NDFS java.io.IOException

Ordway, Ryan

        I'm doing some experimenting with NDFS to see if it will work
for my nutch cluster. It seems to do fine until my data nodes start
to run out of memory. This is a pilot project, so I'm using some
older systems -- P4/1.9GHz/512MB/20GB. Right now they're booting off
of a ramdisk image, but I'm going to be moving to a disk-based system
to see if that helps things.

        Anyhow, my database node was able to generate the database and
import about 4 million URLs just fine. It also generated segments from the
database without trouble: I generated segments for 4 fetchers and ran a
fetcher on each of my compute nodes.

        To clarify, I've got one master node that maintains the
database and acts as the NDFS name node. I've got four compute nodes that
perform fetches and act as data nodes.

        Things went fine for a good 36 hours or so, then I noticed that
the systems were starting to swap and their performance started to tank.
After a while, each node's fetch process started to die off with
errors like:

SEVERE error writing output:java.io.IOException: Could not obtain new
output block for file /user/root/segments/20050919121742-2/content/data

        ... stack trace ...

SEVERE error writing output:java.io.IOException: key out of order:
312670 after 312670

        ... stack trace ...

... A few more of these errors ...

Exception in thread "main" java.lang.RuntimeException: SEVERE error
logged. Exiting fetcher.

        And then the fetch dies.
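
The second error reports a key equal to its predecessor (312670 after 312670), which suggests the writer expects each key to be strictly greater than the one before it. That is a reading of the message, not of the Nutch source. A minimal sketch of the same check, assuming one integer key per line on stdin; the function name `check_key_order` is hypothetical:

```shell
#!/bin/sh
# Report the first key that is not strictly greater than its predecessor.
# Reads one integer key per line on stdin, prints a message in the style of
# the error quoted above, and returns 1 on the first violation.
# Assumption: keys must be strictly increasing (inferred from the log line).
check_key_order() {
    awk '
        NR > 1 && $1 + 0 <= prev + 0 {
            printf "key out of order: %s after %s\n", $1, prev
            bad = 1
            exit
        }
        { prev = $1 }
        END { exit bad + 0 }
    '
}
```

For example, `printf '1\n2\n2\n' | check_key_order` reports the repeated key, which is the situation a retried or partially failed append could produce.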

        Is there anything that can be done to prevent this, short of
adding more RAM to these systems?
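
Since the failures followed the nodes starting to swap, it may help to catch swapping early on each data node. A minimal sketch, assuming the nodes run Linux and expose `/proc/meminfo` with `SwapTotal` and `SwapFree` fields; the function names and the threshold are hypothetical and not part of Nutch:

```shell
#!/bin/sh
# Print swap currently in use, in kB, from a Linux-style meminfo file.
# Assumption: the data nodes run Linux and expose /proc/meminfo.
swap_used_kb() {
    awk '
        $1 == "SwapTotal:" { total = $2 }
        $1 == "SwapFree:"  { free = $2 }
        END { print total - free }
    ' "${1:-/proc/meminfo}"
}

# Warn when swap use exceeds a limit given in kB. The limit is a
# hypothetical tuning knob chosen by the operator, not a Nutch setting.
# Optional second argument: an alternative meminfo file (for testing).
warn_if_swapping() {
    limit_kb=$1
    used_kb=$(swap_used_kb "$2")
    if [ "$used_kb" -gt "$limit_kb" ]; then
        echo "WARNING: ${used_kb} kB of swap in use (limit ${limit_kb} kB)"
        return 1
    fi
    return 0
}
```

Run periodically on each data node (for instance from cron as `warn_if_swapping 65536`), this would flag a node before its fetcher starts failing.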

        Thanks,

        Ryan

--
Ryan Ordway
Unix Systems Administrator
Oregon State University Libraries
121 The Valley Library
Corvallis, OR 97331
E-mail: [hidden email], [hidden email]
Desk: 541.737.8972


Re: NDFS java.io.IOException

Doug Cutting-2
What version of Nutch are you using?

The version of NDFS in the mapred branch is much improved.  The crawling
code in that branch has also been re-written to be MapReduce-based, and
will automatically manage multi-machine fetching, db updates, indexing, etc.

There's not yet much documentation for this version however.  Probably
the best documentation is in this pdf, and it is spartan:

http://wiki.apache.org/nutch-data/attachments/Presentations/attachments/oscon05.pdf

Here's a quick cheat sheet:

svn co https://svn.apache.org/repos/asf/lucene/nutch/branches/mapred
cd mapred
ant

emacs conf/nutch-site.xml
# define fs.default.name to be masterHost:XXXX
# define mapred.job.tracker to be masterHost:YYYY

emacs conf/mapred-default.xml
# define mapred.map.tasks to be multiple of # of slave hosts
# define mapred.reduce.tasks to be # of slave hosts

# make a file with slave host names
echo slave1 >> ~/.slaves
echo slave2 >> ~/.slaves
echo slave3 >> ~/.slaves

# start all ndfs & mapred daemons
bin/start-all.sh

# make a directory with seed list file
mkdir seeds
echo http://lucene.apache.org/nutch/ > seeds/urls

# put seed directory in ndfs
bin/nutch ndfs -put seeds seeds

# crawl a bit
bin/nutch crawl seeds -depth 3

# monitor things from administrative interface
firefox masterHost:7845
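
The two configuration steps in the cheat sheet can be sketched as shell helpers. This is an illustration under assumptions, not the branch's documented layout: the `<nutch-conf>` root element is assumed to match conf/nutch-default.xml in the checkout, the function names are hypothetical, and the ports are passed in because the cheat sheet leaves XXXX and YYYY to the reader.

```shell
#!/bin/sh
# Emit a nutch-site.xml body for the two properties named in the cheat sheet.
# Arguments: master host, name node port, job tracker port. The ports are
# whatever you chose for XXXX and YYYY; none is assumed here.
# Assumption: the root element is <nutch-conf>; check conf/nutch-default.xml.
site_conf() {
    cat <<EOF
<?xml version="1.0"?>
<nutch-conf>
  <property>
    <name>fs.default.name</name>
    <value>$1:$2</value>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>$1:$3</value>
  </property>
</nutch-conf>
EOF
}

# Print "map_tasks reduce_tasks" for a slaves file: reduce tasks equal the
# number of slave hosts, and map tasks are a multiple of that number.
# The default multiplier of 2 is a hypothetical choice for illustration.
task_counts() {
    slaves=$(grep -c . "$1")
    echo "$((slaves * ${2:-2})) $slaves"
}
```

With the three slaves listed above, `task_counts ~/.slaves 4` prints `12 3`, the values to put in mapred.map.tasks and mapred.reduce.tasks. `site_conf masterHost "$FS_PORT" "$JT_PORT" > conf/nutch-site.xml` writes the site file once both port variables are set.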

If you try this, please tell us how it goes.

Doug

Re: NDFS java.io.IOException

Rod Taylor-2
On Tue, 2005-09-20 at 19:07 -0700, Doug Cutting wrote:
> What version of Nutch are you using?
>
> The version of NDFS in the mapred branch is much improved.  The crawling
> code in that branch has also been re-written to be MapReduce-based, and
> will automatically manage multi-machine fetching, db updates, indexing, etc.

I haven't looked at it, and wasn't concerned until I saw the word
"automatically", but will we still be able to crawl without indexing?

Secondly, will it still be possible to get the output dumped (i.e.
segread -dump) to a flat file in large chunks?


We use Nutch as a crawler only, then after taking a dump of the data we
remove the segment from the filesystem. This means we only have a couple
hundred GB of data around at any given time.

We do our crawling, db updates, etc. in one environment then
post-process the HTML retrieved in large chunks (segments of about 200k
pages) within another environment.
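
The dump-then-delete cycle described above can be sketched as a loop. Assumptions: segments are directories on a local path rather than in NDFS, and the dump command takes a segment path and writes to stdout. `bin/nutch segread -dump` is the command named in this thread, but its exact arguments are not confirmed here, so it is kept in an overridable variable. The function name is hypothetical.

```shell
#!/bin/sh
# Dump each segment under a directory to a flat file, then delete the
# segment only if its dump succeeded.
# Assumption: DUMP_CMD accepts a segment path and writes the dump to stdout.
DUMP_CMD=${DUMP_CMD:-"bin/nutch segread -dump"}

dump_and_remove() {
    segments_dir=$1
    out_dir=$2
    mkdir -p "$out_dir" || return 1
    for seg in "$segments_dir"/*/; do
        [ -d "$seg" ] || continue
        name=$(basename "$seg")
        # DUMP_CMD is intentionally unquoted so its arguments are split.
        if $DUMP_CMD "$seg" > "$out_dir/$name.dump"; then
            rm -rf "$seg"
        else
            echo "dump failed for $seg; segment kept" >&2
            rm -f "$out_dir/$name.dump"
        fi
    done
}
```

Keeping a segment whose dump failed means a crash in the dump step never costs crawled data, at the price of some disk space until the dump is retried.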

--
Rod Taylor <[hidden email]>


Re: NDFS java.io.IOException

Doug Cutting-2
Rod Taylor wrote:
> I haven't looked at it and wasn't concerned until I saw the
> "automatically" but will we still be able to crawl and not index?

Yes, but the sequence of commands has changed slightly.  Look at Crawl.java.

> Secondly, will it still be possible to get the output dumped (ie.
> segread -dump) to a flat file in large chunks?

In principle, yes, but I have not tested the segread code in the mapred
branch, and it may need to be updated, as the structure of segments has
changed a bit.

Doug

Re: NDFS java.io.IOException

Rod Taylor-2
> > Secondly, will it still be possible to get the output dumped (ie.
> > segread -dump) to a flat file in large chunks?
>
> In principle, yes, but I have not tested the segread code in the mapred
> branch, and it may need to be updated, as the structure of segments has
> changed a bit.

I'm not a Java programmer nor do I really understand what is going on,
but I took a crack at reimplementing the most basic version of the
segread code (full output with -dump to stdout).

It appears to function correctly with a single Nutch backend. I am sure
it is not correct to send data to STDOUT from the reduce() function,
but I'm not sure what other location would be more appropriate.

I am hoping that this will encourage someone to either finish it off or
tell me about the logic issues.

The attached SegmentReader.java goes into org.apache.nutch.crawl and you
may need to fiddle with the bin/nutch shell script to use it.

--
Rod Taylor <[hidden email]>

Attachment: SegmentReader.java (5K)

Re: NDFS java.io.IOException

Gal Nitzan
In reply to this post by Doug Cutting-2
Doug Cutting wrote:

> What version of Nutch are you using?
>
> The version of NDFS in the mapred branch is much improved.  The
> crawling code in that branch has also been re-written to be
> MapReduce-based, and will automatically manage multi-machine fetching,
> db updates, indexing, etc.
>
> There's not yet much documentation for this version however.  Probably
> the best documentation is in this pdf, and it is spartan:
>
> http://wiki.apache.org/nutch-data/attachments/Presentations/attachments/oscon05.pdf 
>
>
> Here's a quick cheat sheet:
>
> svn co https://svn.apache.org/repos/asf/lucene/nutch/branches/mapred
> cd mapred
> ant
>
> emacs conf/nutch-site.xml
> # define fs.default.name to be masterHost:XXXX
> # define mapred.job.tracker to be masterHost:YYYY
>
> emacs conf/mapred-default.xml
> # define mapred.map.tasks to be multiple of # of slave hosts
> # define mapred.reduce.tasks to be # of slave hosts
>
> # make a file with slave host names
> echo slave1 >> ~/.slaves
> echo slave2 >> ~/.slaves
> echo slave3 >> ~/.slaves
>
> # start all ndfs & mapred daemons
> bin/start-all.sh
>
> # make a directory with seed list file
> mkdir seeds
> echo http://lucene.apache.org/nutch/ > seeds/urls
>
> # put seed directory in ndfs
> bin/nutch ndfs -put seeds seeds
>
> # crawl a bit
> bin/nutch crawl seeds -depth 3
>
> # monitor things from administrative interface
> firefox masterHost:7845
>
> If you try this, please tell us how it goes.
>
> Doug

Hi,

This cheat sheet worked perfectly, first time!

And all I can say is wow. Looks great.

Gal.

HTTP ERROR: 500

Gal Nitzan
Hi,

I connected to Jetty on port 7845. When I click jobdetails.jsp I get:

    HTTP ERROR: 500

    Internal Server Error

    RequestURI=/jobdetails.jsp

    Powered by Jetty: <http://jetty.mortbay.org>

The job tracker works perfectly.

Regards,

Gal

Maintaining only one FAQ

Gal Nitzan
Hi,

I have extended the wiki FAQ, http://wiki.apache.org/nutch/FAQ?action=show , to
include the questions from the FAQ on Nutch's home page,
http://lucene.apache.org/nutch/faq.html

I propose replacing the current home page FAQ with the one in the wiki.
I believe there should be only one FAQ, and the wiki is easier to maintain.

Regards,

Gal

Re: Maintaining only one FAQ

Stefan Groschupf-2
1+

On 28.09.2005 at 12:12, Gal Nitzan wrote:

> Hi,
>
> I have enhanced the FAQ http://wiki.apache.org/nutch/FAQ?
> action=show  to contain the questions from the FAQ in nutchs' home  
> page http://lucene.apache.org/nutch/faq.html
>
> I propose to replace the current home page FAQ with the one in the  
> wiki. I believe there should be only one FAQ and it is easier to  
> maintain.
>
> Regards,
>
> Gal
>
>


Re: Maintaining only one FAQ

Jérôme Charron
+1

On 9/28/05, Stefan Groschupf <[hidden email]> wrote:

>
> 1+
>
> On 28.09.2005 at 12:12, Gal Nitzan wrote:
>
> > Hi,
> >
> > I have enhanced the FAQ http://wiki.apache.org/nutch/FAQ?
> > action=show to contain the questions from the FAQ in nutchs' home
> > page http://lucene.apache.org/nutch/faq.html
> >
> > I propose to replace the current home page FAQ with the one in the
> > wiki. I believe there should be only one FAQ and it is easier to
> > maintain.
> >
> > Regards,
> >
> > Gal
> >
> >
>
>


--
http://motrech.free.fr/
http://www.frutch.org/

Re: Maintaining only one FAQ

Paul van Brouwershaven
In reply to this post by Gal Nitzan
My comments:

"Are there any mailing lists available?"

It would be easier to read if the answer were laid out in a table here.

Also combine the questions "Are there any mailing lists available?" and
"Is there a mail archive?"

Like:

Listname | Subscribe | Unsubscribe | Online Archive | Download Archive


Gal Nitzan wrote:
> I have enhanced the FAQ http://wiki.apache.org/nutch/FAQ?action=show  to
> contain the questions from the FAQ in nutchs' home page
> http://lucene.apache.org/nutch/faq.html
>
> I propose to replace the current home page FAQ with the one in the wiki.
> I believe there should be only one FAQ and it is easier to maintain.
Yes, much better!

Re: [Nutch-general] Maintaining only one FAQ

Otis Gospodnetic-2-2
In reply to this post by Gal Nitzan
+1
And +1 for making it look like the existing Lucene FAQ.

Otis

--- Gal Nitzan <[hidden email]> wrote:

> Hi,
>
> I have enhanced the FAQ http://wiki.apache.org/nutch/FAQ?action=show 
> to
> contain the questions from the FAQ in nutchs' home page
> http://lucene.apache.org/nutch/faq.html
>
> I propose to replace the current home page FAQ with the one in the
> wiki.
> I believe there should be only one FAQ and it is easier to maintain.
>
> Regards,
>
> Gal


Re: Maintaining only one FAQ

Doug Cutting-2
In reply to this post by Gal Nitzan
+1

Gal Nitzan wrote:

> Hi,
>
> I have enhanced the FAQ http://wiki.apache.org/nutch/FAQ?action=show  to
> contain the questions from the FAQ in nutchs' home page
> http://lucene.apache.org/nutch/faq.html
>
> I propose to replace the current home page FAQ with the one in the wiki.
> I believe there should be only one FAQ and it is easier to maintain.
>
> Regards,
>
> Gal

Re: Maintaining only one FAQ - I cannot do it, only the webmaster can

Gal Nitzan
Doug Cutting wrote:

> +1
>
> Gal Nitzan wrote:
>> Hi,
>>
>> I have enhanced the FAQ http://wiki.apache.org/nutch/FAQ?action=show 
>> to contain the questions from the FAQ in nutchs' home page
>> http://lucene.apache.org/nutch/faq.html
>>
>> I propose to replace the current home page FAQ with the one in the
>> wiki. I believe there should be only one FAQ and it is easier to
>> maintain.
>>
>> Regards,
>>
>> Gal
>
> .
>


Doug - FAQ - Re: Maintaining only one FAQ

Jon Shoberg
In reply to this post by Doug Cutting-2
Doug Cutting wrote:

> +1
>
> Gal Nitzan wrote:
>
>> Hi,
>>
>> I have enhanced the FAQ http://wiki.apache.org/nutch/FAQ?action=show 
>> to contain the questions from the FAQ in nutchs' home page
>> http://lucene.apache.org/nutch/faq.html
>>
>> I propose to replace the current home page FAQ with the one in the
>> wiki. I believe there should be only one FAQ and it is easier to
>> maintain.
>>
>> Regards,
>>
>> Gal

Doug,

   I would be glad to assist with the FAQ.  Is there a way to start?

   BTW .. We're using Nutch for search on our site :)

   http://fisher.osu.edu

Best,

Jon Shoberg
Systems Developer
Fisher College of Business



Re: Doug - FAQ - Re: Maintaining only one FAQ

Gal Nitzan
Jon Shoberg wrote:

> Doug Cutting wrote:
>> +1
>>
>> Gal Nitzan wrote:
>>
>>> Hi,
>>>
>>> I have enhanced the FAQ
>>> http://wiki.apache.org/nutch/FAQ?action=show  to contain the
>>> questions from the FAQ in nutchs' home page
>>> http://lucene.apache.org/nutch/faq.html
>>>
>>> I propose to replace the current home page FAQ with the one in the
>>> wiki. I believe there should be only one FAQ and it is easier to
>>> maintain.
>>>
>>> Regards,
>>>
>>> Gal
>
> Doug,
>
>   I would be glad to assist with the FAQ.  Is there a way to start?
>
>   BTW .. We're using Nutch for search on our site :)
>
>   http://fisher.osu.edu
>
> Best,
>
> Jon Shoberg
> Systems Developer
> Fisher College of Business
>
>
>
> .
>
Hi Doug, yes, thank you.

I do not have access to the main Nutch site on apache.org.

The link on our home page currently points to:
http://lucene.apache.org/nutch/faq.html

It should point to: http://wiki.apache.org/nutch/FAQ

Regards,

Gal

Re: Doug - FAQ - Re: Maintaining only one FAQ

Piotr Kosiorowski
I will redeploy the site to point to the wiki. I am in the process of
making the 0.7.1 release, but it is taking much longer than I expected
because of lack of time. I will make this change during release
preparation; I hope to manage it today or over the weekend.
Regards,
Piotr


Gal Nitzan wrote:

> Jon Shoberg wrote:
>
>> Doug Cutting wrote:
>>
>>> +1
>>>
>>> Gal Nitzan wrote:
>>>
>>>> Hi,
>>>>
>>>> I have enhanced the FAQ
>>>> http://wiki.apache.org/nutch/FAQ?action=show  to contain the
>>>> questions from the FAQ in nutchs' home page
>>>> http://lucene.apache.org/nutch/faq.html
>>>>
>>>> I propose to replace the current home page FAQ with the one in the
>>>> wiki. I believe there should be only one FAQ and it is easier to
>>>> maintain.
>>>>
>>>> Regards,
>>>>
>>>> Gal
>>
>>
>> Doug,
>>
>>   I would be glad to assist with the FAQ.  Is there a way to start?
>>
>>   BTW .. We're using Nutch for search on our site :)
>>
>>   http://fisher.osu.edu
>>
>> Best,
>>
>> Jon Shoberg
>> Systems Developer
>> Fisher College of Business
>>
>>
>>
>> .
>>
> Hi Doug, yes thank you.
>
> I do not have an access to the main site of Nutch on apache.org
>
> The link on our home page points currently to:
> http://lucene.apache.org/nutch/faq.html
>
> It should point to: http://wiki.apache.org/nutch/FAQ
>
> Regards,
>
> Gal
>