recrawling sites

recrawling sites

Suhail Ahmed
Hi,

How do I go about recrawling websites? Essentially, I want to run the
following tasks repeatedly:

[one-off task] inject the database with a URL list

1. create a segment with the initial list
2. fetch the segment
3. update the database
4. create a new segment with the outlinks from [2]
5. fetch the segment created in [4].

I basically want to repeat steps 2 through 5. How would I do this?

Thanks for the help

Suhail
RE: recrawling sites

Howie Wang

>1. create a segment with the initial list
>2. fetch the segment
>3. update the database
>4. create a new segment with the outlinks from [2]
>5. fetch the segment created in [4].
>
>I basically want to repeat steps 2 through 5. How would I do this?

Here's what I have in my script:

bin/nutch generate crawl.test/db crawl.test/segments -topN 20   # Create new segment
s1=`ls -d crawl.test/segments/2* | tail -1`                     # Pick the newest segment
bin/nutch fetch $s1                                             # Fetch it
bin/nutch updatedb crawl.test/db $s1                            # Updatedb with new links
bin/nutch analyze crawl.test/db 5
bin/nutch index $s1

Change the db and segments directories as needed and change topN to suit
your needs. The steps start at a different point than your step 2, but you
probably get the picture. See the Nutch tutorial for more info...
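
To repeat the cycle rather than run it once, the generate/fetch/updatedb steps can be wrapped in a loop. What follows is only a rough sketch under some assumptions: the variable names (NUTCH, DB, SEGMENTS) and the function names are made up for illustration and are not part of Nutch, and the analyze and index steps are left out so that they can be run once after the last round.

```shell
#!/bin/sh
# Sketch: repeat the generate/fetch/updatedb cycle a fixed number of times.
# NUTCH, DB and SEGMENTS are illustrative variables, not Nutch settings.
NUTCH=${NUTCH:-bin/nutch}
DB=${DB:-crawl.test/db}
SEGMENTS=${SEGMENTS:-crawl.test/segments}

# Print the newest segment directory. This relies on segment names being
# timestamps, so the last entry in sorted order is the most recent one.
latest_segment() {
    ls -d "$SEGMENTS"/2* | tail -1
}

# One round: generate a segment, fetch it, fold its links back into the db.
crawl_round() {
    "$NUTCH" generate "$DB" "$SEGMENTS" -topN 20 || return 1
    seg=$(latest_segment)
    [ -n "$seg" ] || return 1
    "$NUTCH" fetch "$seg" || return 1
    "$NUTCH" updatedb "$DB" "$seg" || return 1
}

# Run crawl_round the requested number of times, stopping at the first failure.
recrawl() {
    rounds=$1
    i=1
    while [ "$i" -le "$rounds" ]; do
        crawl_round || return 1
        i=$((i + 1))
    done
}
```

With these definitions loaded, calling "recrawl 3" would run three rounds; each round fetches pages discovered by the previous one, which corresponds to repeating steps 2 through 5.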


RE: recrawling sites

Chirag Chaman
In reply to this post by Suhail Ahmed
Suhail,

The default Nutch crawl process already does this: it refetches pages
every 30 days.
Look at the Nutch wiki and documentation. To recrawl the links, specify the
link depth.
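
As a rough illustration of specifying the link depth, the one-step crawl command can be wrapped as below. This is a sketch that assumes the crawl command takes -dir and -depth as described in the Nutch tutorial; the NUTCH variable, the crawl_to_depth function and the example arguments (urls, crawl.test, 3) are hypothetical, so check the usage message of your Nutch version before relying on it.

```shell
#!/bin/sh
# Sketch: one-step crawl to a chosen link depth.
# NUTCH and crawl_to_depth are illustrative names, not part of Nutch.
NUTCH=${NUTCH:-bin/nutch}

# crawl_to_depth URL_FILE CRAWL_DIR DEPTH
# Assumes "crawl" accepts -dir and -depth, as in the Nutch tutorial.
crawl_to_depth() {
    if [ "$#" -ne 3 ]; then
        echo "usage: crawl_to_depth URL_FILE CRAWL_DIR DEPTH" >&2
        return 2
    fi
    "$NUTCH" crawl "$1" -dir "$2" -depth "$3"
}
```

For example, "crawl_to_depth urls crawl.test 3" would follow links up to three levels away from the seed list in the (hypothetical) file urls.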

CC-
 

-----Original Message-----
From: Suhail Ahmed [mailto:[hidden email]]
Sent: Monday, May 30, 2005 12:44 PM
To: [hidden email]
Subject: recrawling sites

Hi,

How do I go about recrawling websites? Essentially I want to repeat the
following tasks repeatedly:

[one off task] inject the database with a url list

1. create a segment with the initial list
2. fetch the segment
3. update the database
4. create a new segment with the outlinks from [2]
5. fetch the segment created in [4].

I basically want to repeat steps 2 through 5. How would I do this?

Thanks for the help

Suhail