How to investigate recrawl issue


How to investigate recrawl issue

Matteo Diarena
Dear all,

I'm completely new to Apache Nutch. I started using it only a few days ago
and I am impressed by its capabilities.

I'm experiencing a small issue that I hope someone can help me fix:

I configured a test instance of Apache Nutch (1.9) to crawl a news website
using the following parameters:

 

<configuration>

<property>
  <name>http.agent.name</name>
  <value>NewsWatcher Agent</value>
</property>

<property>
  <name>fetcher.threads.per.queue</name>
  <value>50</value>
  <description></description>
</property>

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)|scoring-depth</value>
  <description></description>
</property>

<property>
  <name>db.fetch.interval.default</name>
  <value>300</value>
  <description></description>
</property>

</configuration>

 

I run the ./bin/crawl command from cron every five minutes with
_maxdepth_=2, because I want to frequently update my index with only the new
articles published on the homepage, without crawling the whole site.

The first run works fine, but afterwards the homepage no longer seems to be
updated.

According to the log file the whole process completes without errors, but the
new articles published on the homepage do not appear in my index.

When I inspect the crawldb with the readdb command, I always get the same
signature, even when the page has changed.

Can anyone help me understand how to investigate this issue?

Is there anything else I can check besides the log file?

Is there a debug option I can enable?

Thanks a lot to everybody in advance,

Matteo

 


Re: How to investigate recrawl issue

alxsss
There must be a config variable that sets timeModified to the current date when a URL is injected. You need to inject the homepage URL on each run.


hth
Alex.



-----Original Message-----
From: Matteo Diarena <[hidden email]>
To: user <[hidden email]>
Sent: Wed, Apr 29, 2015 1:46 pm
Subject: How to investigate recrawl issue




 


 

Re: How to investigate recrawl issue

Jeff Cocking
Nutch assigns every page a default interval for reindexing, which defaults
to 30 days. There are also adaptive parameters that increase or decrease
this interval. If you want a page reindexed that frequently, you need to
re-inject the page with the overwrite parameter set to true, and/or use a
plugin such as urlmeta to force a reindex interval.
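
As a rough illustration of the "overwrite" lever mentioned above, the fragment below is a minimal sketch for conf/nutch-site.xml. It assumes the property is named db.injector.overwrite, as listed in nutch-default.xml for recent 1.x releases; please check the name and its description in the nutch-default.xml that ships with your version before relying on it.

```xml
<configuration>
  <!-- Assumption: property name taken from nutch-default.xml (1.x); verify for your release. -->
  <property>
    <name>db.injector.overwrite</name>
    <value>true</value>
    <description>Assumed effect: re-injected seed URLs replace their existing CrawlDb records.</description>
  </property>
</configuration>
```

With a setting like this, re-injecting the homepage URL before each crawl run should reset its CrawlDb entry, so it becomes eligible for fetching again.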

Spend some time in the nutch-default.xml file. It has all the levers that
can be adjusted in Nutch.
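
For instance, the adaptive re-fetch levers could be overridden in conf/nutch-site.xml roughly as sketched here. The property names are the ones I recall from nutch-default.xml in Nutch 1.x, and the interval values (in seconds) are illustrative assumptions only, not recommendations; confirm both against your own nutch-default.xml.

```xml
<configuration>
  <!-- Assumption: switches from the default schedule to the adaptive one. -->
  <property>
    <name>db.fetch.schedule.class</name>
    <value>org.apache.nutch.crawl.AdaptiveFetchSchedule</value>
  </property>
  <!-- Illustrative lower bound on the re-fetch interval, in seconds. -->
  <property>
    <name>db.fetch.schedule.adaptive.min_interval</name>
    <value>300</value>
  </property>
  <!-- Illustrative upper bound on the re-fetch interval, in seconds. -->
  <property>
    <name>db.fetch.schedule.adaptive.max_interval</name>
    <value>86400</value>
  </property>
</configuration>
```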

jeff

On Wed, Apr 29, 2015 at 4:19 PM, <[hidden email]> wrote:

> There must be some config variable that allows to set  timeModified to
> current date when injected. You need to inject home page url on each run.