Re-crawl every 24 hours


Re-crawl every 24 hours

Ali rahmani
Dear Sir,
I am customizing Nutch 2.2 to crawl my seed list, which contains about 30 URLs. I need to crawl these URLs every 24 hours and fetch ONLY newly added links. I added the following properties to my nutch-site.xml file and run the command below:

<property>
  <name>db.fetch.interval.default</name>
  <value>1800</value>
  <description>The default number of seconds between re-fetches of a page (30 days).
  </description>
</property>

<property>
  <name>db.update.purge.404</name>
  <value>true</value>
  <description>If true, updatedb will add purge records with status DB_GONE
  from the CrawlDB.
  </description>
</property>


./crawl urls/ testdb http://localhost:8983/solr 2
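(For reference, the stock 2.x crawl script is invoked as

./crawl <seedDir> <crawlID> <solrURL> <numberOfRounds>

so the trailing 2 above is the number of generate/fetch rounds per run.)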


but whenever I run this command, Nutch crawls deeper and deeper.
Would you please tell me where the problem is?
Regards,

Re: Re-crawl every 24 hours

Ali Nazemian
Hi Ali,
I ran into the same problem recently; it is my concern too. I would
appreciate it if somebody could answer this question.
Best regards.


--
A.Nazemian

Re: Re-crawl every 24 hours

Julien Nioche-4
In reply to this post by Ali rahmani
<property>
  <name>db.fetch.interval.default</name>
  <value>1800</value>
  <description>The default number of seconds between re-fetches of a page
(30 days).
  </description>
</property>

means that a page which has already been fetched will be re-fetched after
30 minutes. This is what you want for the seeds, but it also applies to
the subpages you've already discovered in previous rounds.

What you could do is set a custom fetch interval for the seeds only
(see http://wiki.apache.org/nutch/bin/nutch%20inject for the use of
nutch.fetchInterval) and keep a larger value for db.fetch.interval.default.
This way the seeds would be revisited frequently but the subpages would not.
Note that this works only if the links to the pages you want to discover
are directly on the seed pages. If they sit at a deeper level, they'd be
discovered only when the page that mentions them is re-fetched (i.e., after
its nutch.fetchInterval has elapsed).
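For example, a minimal sketch (the one-day value of 86400 seconds is an
assumption to match a daily re-crawl): each seed line carries its metadata
after a tab,

http://www.example.com/	nutch.fetchInterval=86400

while nutch-site.xml keeps a much larger default for everything discovered
from there:

<property>
  <name>db.fetch.interval.default</name>
  <value>2592000</value>
  <description>30 days between re-fetches of discovered subpages.</description>
</property>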

HTH

Julien






--

Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble

Re: Re-crawl every 24 hours

alxsss
Hi,

Another way of doing this is to increase

db.fetch.interval.default

to x years and re-inject the original seed list each time. That way you will fetch only new pages during those x years: an injected URL's fetch time is set to the current time (I believe; you may want to double-check this first), while pages that have already been fetched will only be picked up again after x years.
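For example (a sketch; the two-year value is arbitrary), in nutch-site.xml:

<property>
  <name>db.fetch.interval.default</name>
  <value>63072000</value>
  <description>Roughly two years, so already-fetched pages are not rescheduled.</description>
</property>

and then re-run the inject step at the start of each daily cycle:

./nutch inject urls/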

HTH.
Alex

Re: Re-crawl every 24 hours

Ali Nazemian
In reply to this post by Julien Nioche-4
Dear Julien,
Hi,
Do you know of any step-by-step guide for this procedure? Is it the same
for Nutch 1.8?
Best regards.





--
A.Nazemian

Re: Re-crawl every 24 hours

Julien Nioche-4
Hi

This will work with 1.8 indeed. What procedure do you mean? Just add
nutch.fetchInterval to the seeds, that's all.
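For 1.8 that would be a seed line such as the following (tab-separated
metadata; the one-day value is an assumption),

http://www.example.com/	nutch.fetchInterval=86400

injected the usual way:

bin/nutch inject crawl/crawldb urls/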

J.





--

Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble

Re: Re-crawl every 24 hours

Ali rahmani
Hi Julien,
Would you please guide me on what a re-crawl script should look like? I follow the steps below (even after adding the nutch.fetchInterval parameter), and the crawler still goes deeper and deeper:

1) ./nutch inject urls/
2) Loop {
     ./nutch generate -topN 2000
     ./nutch fetch [CrawlID]
     ./nutch parse [CrawlID]
     ./nutch updatedb
   }
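Spelled out as a shell script, one cycle is roughly the following (the -all
flags are my reading of the Nutch 2.x usage messages, so please correct me
if that is wrong):

#!/bin/bash
# One shallow crawl cycle, run once every 24 hours.
./nutch inject urls/          # re-read the seed list
./nutch generate -topN 2000   # select URLs that are due for fetching
./nutch fetch -all            # fetch all generated batches
./nutch parse -all            # parse the fetched pages
./nutch updatedb              # write the results back into the crawl db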

It is worth mentioning that I run these steps again after 24 hours.
Regards,
A.R



RE: Re-crawl every 24 hours

Markus Jelsma-2
In reply to this post by Julien Nioche-4
That will work, but use nutch.fetchInterval.fixed instead if you use an adaptive fetch scheduler, which would otherwise adjust the interval over time.
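For example, a seed entry along these lines (tab-separated metadata; the
one-day value is an assumption):

http://www.example.com/	nutch.fetchInterval.fixed=86400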
