cannot find nutch logs in distributed mode


cannot find nutch logs in distributed mode

srinir
Hi

I am running Nutch in distributed mode. I would like to see all Nutch logs
written to files. I only see the console output. Can I see the same
information logged to some log files?

When I run Nutch in local mode I do see the logs in the runtime/local/logs
directory. But when I run Nutch in distributed mode, I don't see them anywhere
except the console.

Can anyone help me with the settings that I need to change?

Thanks
Srini

Re: cannot find nutch logs in distributed mode

Sebastian Nagel
Hi Srini,

in distributed mode the bulk of Nutch's log output is kept in the Hadoop task logs.
Whether, for how long, and where these logs are kept depends on the
configuration of your Hadoop cluster.  You can easily find tutorials and
examples of how to configure this if you google for "hadoop task logs".

Be careful: the Nutch logs are usually huge.  The easiest way to get them
for a job is to run the following command on the master node:

  yarn logs -applicationId <app_id>
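As a concrete sketch (the job id below is the hypothetical one from the console
output quoted later in this thread; substitute your own):

```shell
# The job client prints a job id such as job_1500749038440_0003; the
# matching YARN application id just swaps the "job_" prefix.
JOB_ID=job_1500749038440_0003              # hypothetical job id
APP_ID=application_${JOB_ID#job_}          # application_1500749038440_0003
# The aggregated logs are usually huge, so write them straight to a file
# (this assumes log aggregation is enabled on the cluster):
yarn logs -applicationId "$APP_ID" > "${APP_ID}.log" 2>&1
```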

Best,
Sebastian

On 07/21/2017 10:09 PM, Srinivasan Ramaswamy wrote:



Re: cannot find nutch logs in distributed mode

srinir
Hi Sebastian

I am referring to the INFO messages that are printed to the console when Nutch
1.14 is running in distributed mode. For example:

Injecting seed URLs
/mnt/nutch/runtime/deploy/bin/nutch inject /user/hadoop/crawlDIR/crawldb
seed.txt
17/07/29 06:51:18 INFO crawl.Injector: Injector: starting at 2017-07-29
06:51:18
17/07/29 06:51:18 INFO crawl.Injector: Injector: crawlDb:
/user/hadoop/crawlDIR/crawldb
17/07/29 06:51:18 INFO crawl.Injector: Injector: urlDir: seed.txt
17/07/29 06:51:18 INFO crawl.Injector: Injector: Converting injected urls
to crawl db entries.
17/07/29 06:51:19 INFO client.RMProxy: Connecting to ResourceManager at
ip-*-*-*-*.ec2.internal/*.*.*.*:8032
17/07/29 06:51:20 INFO input.FileInputFormat: Total input paths to process
: 0
17/07/29 06:51:20 INFO input.FileInputFormat: Total input paths to process
: 1
.
.
17/07/29 06:51:20 INFO mapreduce.Job: Running job: job_1500749038440_0003
17/07/29 06:51:28 INFO mapreduce.Job: Job job_1500749038440_0003 running in
uber mode : false
17/07/29 06:51:28 INFO mapreduce.Job:  map 0% reduce 0%
17/07/29 06:51:33 INFO mapreduce.Job:  map 100% reduce 0%
17/07/29 06:51:38 INFO mapreduce.Job:  map 100% reduce 4%
17/07/29 06:51:40 INFO mapreduce.Job:  map 100% reduce 6%
17/07/29 06:51:41 INFO mapreduce.Job:  map 100% reduce 49%
17/07/29 06:51:42 INFO mapreduce.Job:  map 100% reduce 66%
17/07/29 06:51:43 INFO mapreduce.Job:  map 100% reduce 87%
17/07/29 06:51:44 INFO mapreduce.Job:  map 100% reduce 100%

I am running Nutch on an EMR cluster. I did check around the log
directories, and I don't see the messages from the console anywhere else.

One more thing I noticed: when I issue the command

*ps -ef | grep nutch*

hadoop    21616  18344  2 06:59 pts/1    00:00:09
/usr/lib/jvm/java-1.8.0-openjdk.x86_64/bin/java -Xmx1000m -server
-XX:OnOutOfMemoryError=kill -9 %p *-Dhadoop.log.dir=/usr/lib/hadoop/logs*
*-Dhadoop.log.file=hadoop.log* -Dhadoop.home.dir=/usr/lib/hadoop
-Dhadoop.id.str= *-Dhadoop.root.logger=INFO,console*
-Djava.library.path=:/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native
-Dhadoop.policy.file=hadoop-policy.xml -Djava.net.preferIPv4Stack=true
-Dhadoop.security.logger=INFO,NullAppender -Dsun.net.inetaddr.ttl=30
org.apache.hadoop.util.RunJar
/mnt/nutch/runtime/deploy/apache-nutch-1.14-SNAPSHOT.job
org.apache.nutch.fetcher.Fetcher -D mapreduce.map.java.opts=-Xmx2304m -D
mapreduce.map.memory.mb=2880 -D mapreduce.reduce.java.opts=-Xmx4608m -D
mapreduce.reduce.memory.mb=5760 -D mapreduce.job.reduces=12 -D
mapreduce.reduce.speculative=false -D mapreduce.map.speculative=false -D
mapreduce.map.output.compress=true -D fetcher.timelimit.mins=180
/user/hadoop/crawlDIR/segments/20170729065841 -noParsing -threads 100

The logger mentioned in the running process is console. How do I change it
to a log file rotated by log4j?

I tried modifying the conf/log4j.properties file to use the DRFA appender
instead of the cmdstdout logger, but that did not help either.

Any help would be appreciated.

Thanks
Srini

On Mon, Jul 24, 2017 at 12:52 AM, Sebastian Nagel <
[hidden email]> wrote:


Re: cannot find nutch logs in distributed mode

Sebastian Nagel
Hi Srini,

> I am referring to the INFO messages that are printed to the console when Nutch
> 1.14 is running in distributed mode. For example

Afaics, the only way to get the logs of the job client is to redirect the console output to a file,
e.g.,

/mnt/nutch/runtime/deploy/bin/nutch inject /user/hadoop/crawlDIR/crawldb seed.txt &>inject.log
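If you want to keep watching the output live while also capturing it, a common
variant (same setup-specific paths as above) pipes through tee:

```shell
# Capture the job client's console output in a file while still seeing it
# live on the terminal. stderr is merged in because the log lines go there.
/mnt/nutch/runtime/deploy/bin/nutch inject /user/hadoop/crawlDIR/crawldb seed.txt 2>&1 | tee inject.log
```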

> I am running Nutch on an EMR cluster.

If you're interested in the logs of task attempts, see:

http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-manage-view-web-log-files.html


Sebastian

On 07/29/2017 09:38 AM, Srinivasan Ramaswamy wrote:



Re: cannot find nutch logs in distributed mode

srinir
Thanks for your reply, Sebastian. I asked this question for the following
reasons:

* We were running the crawl script using nohup and redirected the output to
a local log file. In some weird/rare scenario (maybe our master node went
down at that time, I am not sure), the log file stopped updating but the
Nutch process kept running. We could not really see what Nutch was doing.

* I see that the Nutch code uses log4j, so I am wondering why it does not
all go to a log4j-rotated log file instead of just the console. The same
works well in local mode. Can you please explain why it writes only to the
console and not to a file?

* It also puzzles me why the running process shows
"-Dhadoop.root.logger=INFO,console" even though I changed
conf/log4j.properties to "log4j.rootLogger=INFO,DRFA".

Thanks
Srini

On Tue, Aug 1, 2017 at 7:51 AM, Sebastian Nagel <[hidden email]>
wrote:


Re: cannot find nutch logs in distributed mode

Sebastian Nagel
Hi Srini,

in local mode all log output from
- the job client
- the application master / job tracker
- the YARN containers (map-reduce task attempts)
ends up in the same file, simply because all of them run inside a single JVM.
In distributed mode they run as separate processes on different machines,
each writing to its own log files.  That makes logging more complex.
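As an aside on the "-Dhadoop.root.logger=INFO,console" flag seen in the
process listing earlier in this thread: the stock Hadoop launcher scripts set
that system property from the HADOOP_ROOT_LOGGER environment variable
(defaulting to INFO,console), and Hadoop's log4j.properties substitutes it
into log4j.rootLogger, which is why editing the rootLogger line alone may not
take effect. A hedged sketch, assuming your bin/hadoop behaves like the stock
scripts (the log directory is hypothetical):

```shell
# Point the job client's root logger at the daily-rolling-file appender
# instead of the console before starting Nutch. This relies on the launcher
# passing -Dhadoop.root.logger=${HADOOP_ROOT_LOGGER:-INFO,console} to the JVM.
export HADOOP_ROOT_LOGGER=INFO,DRFA
export HADOOP_LOG_DIR=/mnt/nutch/logs      # hypothetical log directory
/mnt/nutch/runtime/deploy/bin/nutch inject /user/hadoop/crawlDIR/crawldb seed.txt
```

Note this affects only the job client's JVM; the task attempts on the worker
nodes are still governed by the cluster's own log configuration.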

Have a look at

https://discuss.pivotal.io/hc/en-us/articles/201925118-How-to-Find-and-Review-Logs-for-Yarn-MapReduce-Jobs
That's a condensed introduction to the topic.

Please note that for more detailed questions, the Hadoop user list or a forum
dedicated to EMR is the better place.


Best,
Sebastian


On 08/02/2017 09:55 AM, Srinivasan Ramaswamy wrote:
