0.8 Intranet Crawl Output/Logging?

jared.dunne
I am using the Nutch 0.8 'crawl' command to crawl some content.  When I
run the crawl command, I don't see any output, but the crawl is
running.  Is there a way to see information about what the crawler is
doing?

I have tried setting 'fetcher.verbose' to 'true' in my nutch-site.xml,
but it made no difference to the behaviour.

I am trying to enable some plugins (the file protocol and parse-xml
plugins), but I can't tell whether they are being loaded correctly
without some output from Nutch.

Thanks!
Jared-

Re: 0.8 Intranet Crawl Output/Logging?

Ben Ogle
Look in the hadoop.log file under the nutch-0.8/logs dir. It should have that info.

Ben


Re: 0.8 Intranet Crawl Output/Logging?

wmelo
I have the same question as the original poster.  I know that the log
shows this information, but how can I watch things happen in real time,
as in Nutch 0.7.2, when you run the crawl command in the terminal?


Re: 0.8 Intranet Crawl Output/Logging?

Tomi N/A
On 9/13/06, wmelo <[hidden email]> wrote:
> I have the same question as the original poster.  I know that the log
> shows this information, but how can I watch things happen in real time,
> as in Nutch 0.7.2, when you run the crawl command in the terminal?

Try something like this (assuming you know what's good for you, so you
use a *n*x):
watch -n 1 "tail -n 20 /home/wmelo/nutch-0.8/logs/hadoop.log"

Replace the path with the location of your own "logs" directory, and
report back if there's a problem.
Hope it helps.

t.n.a.
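
The watch command above re-runs tail once a second, so a wrong path shows up as a repeated error instead of a log. A minimal sketch of that snapshot step as a shell function that checks the path first. The function name show_crawl_log and its arguments are made up for illustration, and the logs/hadoop.log location is the one mentioned earlier in this thread; adjust it if your install differs.

```shell
# show_crawl_log NUTCH_HOME [LINES]
# Hypothetical helper: print the last LINES lines (default 20) of
# NUTCH_HOME/logs/hadoop.log, or fail with a message if the file
# cannot be read.
show_crawl_log() {
  nutch_home="$1"
  lines="${2:-20}"
  log_file="$nutch_home/logs/hadoop.log"   # location assumed from this thread

  if [ ! -r "$log_file" ]; then
    echo "no readable log at $log_file" >&2
    return 1
  fi

  tail -n "$lines" "$log_file"
}
```

Run it as `show_crawl_log /home/wmelo/nutch-0.8 20` in the shell where the function is defined.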

Re: 0.8 Intranet Crawl Output/Logging?

Jim R. Wilson
If you don't know what's good for you, baretail can provide a suitable
Windows alternative.

http://www.baremetalsoft.com/baretail/

-- Jim


Re: 0.8 Intranet Crawl Output/Logging?

Jacob Brunson
In reply to this post by Tomi N/A
On my system, I run the crawl command in one shell while running this
command in another shell to monitor the crawl:
tail -f logs/hadoop.log
This does about the same thing as Tomi's watch command, but "tail -f"
is a little easier to remember.

--
http://JacobBrunson.com

RE: 0.8 Intranet Crawl Output/Logging?

jared.dunne
Everyone, thanks for the help with this.  I hope to return the
assistance once I am more familiar with 0.8.  I am using tail -f now to
monitor my test crawls.  It also looks like you can use
conf/hadoop-env.sh to redirect log file output to a different location
for each of your configurations.

One follow-up question:
Now that I can actually see the log, I am finding some of the output
rather annoying/noisy.  Specifically, I am referring to the Registered
Plugins and Registered Extension-Points output.  It's nice to see that
once at crawl start, but not with every step of the crawl.

So does anyone know if I can disable that output?  Here's the output to
which I refer:

2006-09-14 14:03:42,852 INFO  plugin.PluginRepository - Plugins: looking in: /var/nutch/nutch-0.8/plugins
2006-09-14 14:03:43,030 INFO  plugin.PluginRepository - Plugin Auto-activation mode: [true]
2006-09-14 14:03:43,030 INFO  plugin.PluginRepository - Registered Plugins:
2006-09-14 14:03:43,031 INFO  plugin.PluginRepository -         CyberNeko HTML Parser (lib-nekohtml)
2006-09-14 14:03:43,031 INFO  plugin.PluginRepository -         Site Query Filter (query-site)
2006-09-14 14:03:43,031 INFO  plugin.PluginRepository -         Html Parse Plug-in (parse-html)
[snip]
2006-09-14 14:03:43,031 INFO  plugin.PluginRepository - Registered Extension-Points:
2006-09-14 14:03:43,031 INFO  plugin.PluginRepository -         Nutch Summarizer (org.apache.nutch.searcher.Summarizer)
2006-09-14 14:03:43,031 INFO  plugin.PluginRepository -         Nutch
[snip]
Search Results Clustering Plugin (org.apache.nutch.clustering.OnlineClusterer)
2006-09-14 14:03:43,032 INFO  plugin.PluginRepository -         Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
2006-09-14 14:03:43,032 INFO  plugin.PluginRepository -         Nutch Content Parser (org.apache.nutch.parse.Parser)
[snip]

Jared-
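
The "Registered Plugins" listing above also bears on the original question of whether a plugin was loaded: each registered plugin appears on a PluginRepository line with its id in parentheses. A rough sketch of checking for one id, assuming the log format matches the excerpt above. The function name plugin_loaded is hypothetical, and the ids you pass (for example the id of the file protocol plugin) should be taken from your own plugin.includes setting.

```shell
# plugin_loaded LOG_FILE PLUGIN_ID
# Hypothetical helper: succeed (exit 0) if some PluginRepository line in
# LOG_FILE contains "(PLUGIN_ID)", as in the "Registered Plugins" listing.
plugin_loaded() {
  log_file="$1"
  plugin_id="$2"

  # -F treats both patterns as fixed strings, so dots and parentheses
  # are matched literally; -q suppresses output and only sets the status.
  grep -F 'plugin.PluginRepository' "$log_file" | grep -q -F -- "($plugin_id)"
}

# Example (path and id are placeholders):
#   plugin_loaded logs/hadoop.log parse-html && echo "parse-html registered"
```

A missing id does not prove the plugin is broken; it only shows that it was not listed as registered in that log.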


Re: 0.8 Intranet Crawl Output/Logging?

Renaud Richardet-3
Hello Jared,

[hidden email] wrote:

> Everyone, thanks for the help with this.  I hope to return the
> assistance once I am more familiar with 0.8.  I am using tail -f now to
> monitor my test crawls.  It also looks like you can use
> conf/hadoop-env.sh to redirect log file output to a different location
> for each of your configurations.
>
> One follow-up question:
> Now that I can actually see the log, I am finding some of the output
> rather annoying/noisy.  Specifically, I am referring to the Registered
> Plugins and Registered Extension-Points output.  It's nice to see that
> once at crawl start, but not with every step of the crawl.
>
> So does anyone know if I can disable that output?

Please see http://issues.apache.org/jira/browse/NUTCH-346

HTH,
Renaud

--
Renaud Richardet
COO America
Wyona    -   Open Source Content Management   -   Apache Lenya
office +1 857 776-3195                  mobile +1 617 230 9112
renaud.richardet <at> wyona.com           http://www.wyona.com


Re: 0.8 Intranet Crawl Output/Logging?

Tomi N/A
In reply to this post by jared.dunne
On 9/14/06, [hidden email] <[hidden email]> wrote:

> Everyone, thanks for the help with this.  I hope to return the
> assistance once I am more familiar with 0.8.  I am using tail -f now to
> monitor my test crawls.  It also looks like you can use
> conf/hadoop-env.sh to redirect log file output to a different location
> for each of your configurations.
>
> One follow-up question:
> Now that I can actually see the log, I am finding some of the output
> rather annoying/noisy.  Specifically, I am referring to the Registered
> Plugins and Registered Extension-Points output.  It's nice to see that
> once at crawl start, but not with every step of the crawl.
>
> So does anyone know if I can disable that output?  Here's the output to
> which I refer:
>
> 2006-09-14 14:03:42,852 INFO  plugin.PluginRepository - Plugins: looking in: /var/nutch/nutch-0.8/plugins
> 2006-09-14 14:03:43,030 INFO  plugin.PluginRepository - Plugin Auto-activation mode: [true]
> 2006-09-14 14:03:43,030 INFO  plugin.PluginRepository - Registered Plugins:

watch -n 1 "grep -v PluginRepository /home/wmelo/nutch-0.8/logs/hadoop.log | tail -n 20"

t.n.a.
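
The same filter can be combined with the "tail -f" approach from earlier in the thread, so that only new lines are filtered instead of re-reading the whole file every second. A small sketch, assuming a grep that supports --line-buffered (GNU and BSD grep do); the function name filter_plugin_noise is made up for illustration.

```shell
# filter_plugin_noise
# Hypothetical helper: copy stdin to stdout, dropping every line that
# mentions plugin.PluginRepository. --line-buffered makes grep flush
# after each line so output appears promptly when reading from a pipe.
filter_plugin_noise() {
  grep --line-buffered -F -v 'plugin.PluginRepository'
}

# Example (runs until interrupted with Ctrl-C; path is a placeholder):
#   tail -f logs/hadoop.log | filter_plugin_noise
```

This only hides the lines on screen; they are still written to hadoop.log.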