Run Nutch in Eclipse- Wiki documentation -Query step 1.4.3

Rajinimaski
Hi Team,

    Initially I followed the steps in the Nutch wiki
tutorial <http://wiki.apache.org/nutch/NutchTutorial>
to set up Nutch from the binary distribution. That worked, and I was able
to crawl and index successfully.


Now I am trying to set up Nutch in Eclipse, and I am stuck at step 1.4.3
(Link <http://wiki.apache.org/nutch/RunNutchInEclipse#Configure_Nutch>),
quoted below:

   - 1. see the Tutorial and follow all configuration steps, ensure that
   you DO NOT undertake any crawling. The directory structure for Nutch trunk
   enables us to edit nutch-site.xml.template, nutch-default.xml and
   regex-urlfilter.txt.template in our /conf directory, these properties will
   then be automatically built into our /runtime build folder.
   - 2. ensure that you change the property "plugin.folders" to
   "./src/plugin" on $NUTCH_HOME/conf/nutch-site.xml.

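One way to sanity-check step 2 is to read "plugin.folders" back out of the configuration file and see where each entry points. The sketch below is only an illustration: the class name PluginFoldersCheck and the default path conf/nutch-site.xml are hypothetical, it uses only the JDK's XML parser rather than any Nutch API, and it assumes a relative path is resolved against the JVM's working directory, which may differ from how Nutch itself looks up plugin folders.

```java
import java.io.File;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of Nutch: reads one property from a
// Hadoop-style <configuration> file and reports where it points.
public class PluginFoldersCheck {

    /** Returns the value of the named property, or null if it is not set. */
    static String readProperty(File configFile, String propertyName) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(configFile);
        NodeList properties = doc.getElementsByTagName("property");
        for (int i = 0; i < properties.getLength(); i++) {
            Element property = (Element) properties.item(i);
            if (propertyName.equals(textOf(property, "name"))) {
                return textOf(property, "value");
            }
        }
        return null;
    }

    /** Returns the trimmed text of the first child element with the given tag, or null. */
    private static String textOf(Element parent, String tag) {
        NodeList nodes = parent.getElementsByTagName(tag);
        return nodes.getLength() == 0 ? null : nodes.item(0).getTextContent().trim();
    }

    public static void main(String[] args) throws Exception {
        // Assumed default location; pass the real path as the first argument.
        File config = new File(args.length > 0 ? args[0] : "conf/nutch-site.xml");
        if (!config.isFile()) {
            System.out.println("Config file not found: " + config.getAbsolutePath());
            return;
        }
        String value = readProperty(config, "plugin.folders");
        if (value == null) {
            System.out.println("plugin.folders is not set in " + config);
            return;
        }
        // The property may hold several comma-separated folders.
        for (String folder : value.split(",")) {
            File dir = new File(folder.trim());
            // Assumption: relative paths are resolved against the working directory.
            System.out.println(folder.trim() + " -> " + dir.getAbsolutePath()
                    + " (exists: " + dir.isDirectory() + ")");
        }
    }
}
```

Running it from the Eclipse project with the same working directory as the crawl launch configuration shows whether "./src/plugin" points at an existing folder in that environment.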

Step 1 points to the same tutorial that I followed earlier when I used the
Nutch binary distribution. My question is whether I should reuse that
setup (if so, where in the Eclipse Nutch project do I specify the location
of NUTCH_HOME?) or repeat the same steps and configure Nutch in the
trunk folder of the Eclipse workspace?

  When I run the crawl, I get a "Job failed!" message with the following error:

java.lang.RuntimeException: Error in configuring object
    at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)
    at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)
    at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:432)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
Exception in thread "main" java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1265)
    at org.apache.nutch.crawl.Injector.inject(Injector.java:281)
    at org.apache.nutch.crawl.Crawl.run(Crawl.java:127)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)

Regards
Rajani

Re: Run Nutch in Eclipse- Wiki documentation -Query step 1.4.3

lewis john mcgibbney
Hi Rajani,

I'm slightly confused here.

Can you summarize what is actually wrong? Do you think there is
something wrong with the wording of the tutorial?

Lewis


Re: Run Nutch in Eclipse- Wiki documentation -Query step 1.4.3

Rajinimaski
Hi Lewis,

     In the tutorial
<http://wiki.apache.org/nutch/RunNutchInEclipse#Configure_Nutch> there
is a step about configuring Nutch: "*see the Tutorial and follow
all configuration steps*".

Where does this configuration need to be done? Is it in the Eclipse setup,
whose directory structure includes trunk/conf, where we can edit
nutch-site.xml.template, nutch-default.xml and
regex-urlfilter.txt.template?


Also, after the step "Establish the Eclipse environment for Nutch"
<http://wiki.apache.org/nutch/RunNutchInEclipse#Establish_the_Eclipse_environment_for_Nutch>,
I see that two jar files are not referenced, which causes *import errors*
in these classes:

*org.apache.nutch.parse.html.TestDOMContentUtils*
import org.cyberneko.html.parsers.*;

*org.apache.nutch.parse.feed.FeedParser*

import com.sun.syndication.feed.synd.SyndCategory;
import com.sun.syndication.feed.synd.SyndContent;
import com.sun.syndication.feed.synd.SyndEntry;
import com.sun.syndication.feed.synd.SyndFeed;
import com.sun.syndication.feed.synd.SyndPerson;
import com.sun.syndication.io.SyndFeedInput;

*Should we download and add them separately?*
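
Before adding jars by hand, it can help to confirm which classes the run configuration really cannot see. The sketch below is a minimal, hypothetical check (the class name ClasspathCheck is made up for illustration): it asks the class loader for one class from each library behind the failing imports. com.sun.syndication.io.SyndFeedInput is taken from the imports above; org.cyberneko.html.parsers.DOMParser is assumed as a representative class for the wildcard import.

```java
// Hypothetical helper, not part of Nutch: reports whether given classes
// are visible on the classpath of the JVM that runs it.
public class ClasspathCheck {

    /** Returns true if the named class can be loaded from the current classpath. */
    static boolean isOnClasspath(String className) {
        try {
            // initialize=false: locate and load the class without running static initializers.
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | NoClassDefFoundError e) {
            // The class itself, or a class it depends on, could not be found.
            return false;
        }
    }

    public static void main(String[] args) {
        String[] required = {
            "org.cyberneko.html.parsers.DOMParser",  // assumed representative NekoHTML class
            "com.sun.syndication.io.SyndFeedInput"   // from the feed plugin imports
        };
        for (String name : required) {
            System.out.println((isOnClasspath(name) ? "found   " : "MISSING ") + name);
        }
    }
}
```

Run it from inside the Eclipse project so that it uses the project's build path; a "MISSING" line suggests the corresponding jar has not been added to that build path.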

If I remove the plugins and build the project, the build succeeds, but
when I run the application I get this error:

java.lang.RuntimeException: Error in configuring object
    at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)
    at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)
    at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:432)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
Exception in thread "main" java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1265)
    at org.apache.nutch.crawl.Injector.inject(Injector.java:281)
    at org.apache.nutch.crawl.Crawl.run(Crawl.java:127)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)
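
A "RuntimeException: Error in configuring object" at the top of a trace like this is typically a wrapper, and the informative part is usually the innermost "Caused by:" entry in the full log. The sketch below illustrates the idea of walking an exception's cause chain in plain Java. The class name RootCause is hypothetical, and the nested exceptions in main are simulated for illustration; they are not the actual cause of the failure reported here.

```java
import java.lang.reflect.InvocationTargetException;

// Hypothetical helper: follows getCause() links to the innermost exception.
public class RootCause {

    /** Returns the last throwable in the cause chain of t (t itself if it has no cause). */
    static Throwable rootCauseOf(Throwable t) {
        Throwable current = t;
        // Stop when there is no further cause; the second check guards against a self-reference.
        while (current.getCause() != null && current.getCause() != current) {
            current = current.getCause();
        }
        return current;
    }

    public static void main(String[] args) {
        // Simulated wrapper chain; the inner message is made up for this example.
        Throwable wrapped = new RuntimeException("Error in configuring object",
                new InvocationTargetException(
                        new IllegalStateException("hypothetical underlying cause")));
        Throwable root = rootCauseOf(wrapped);
        System.out.println(root.getClass().getName() + ": " + root.getMessage());
    }
}
```

In practice the same information can be read directly from the log by scrolling to the last "Caused by:" line below the wrapper exception.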


Regards
Rajani







Re: Run Nutch in Eclipse- Wiki documentation -Query step 1.4.3

vibhor007
This post has NOT been accepted by the mailing list yet.
I am having a problem retrieving data from segments.
I am using this command: bin/nutch readseg -list crawl/segments/* segmentAllContent

Output:
NAME GENERATED FETCHER START FETCHER END FETCHED PARSED
20130814112245 1 2013-08-14T11:22:52 2013-08-14T11:22:52 11
20130814112259 5 2013-08-14T11:23:03 2013-08-14T11:23:11 55
20130814112318 5 2013-08-14T11:23:24 2013-08-14T11:23:32 55
segmentAllContent 0 ? ? ?

Why is the ? sign showing?