crawl returned just one url


Rizwan Raza
Hi guys:

I am using Nutch 1.2. I crawled www.thedogplace.com using the command
below:

./bin/nutch crawl urls -dir crawl-thedogplace -depth 3 >& crawl.log

It finished the crawl successfully.

I then checked the stats to see how many URLs it retrieved, and I was
surprised to see it returned only 1 URL.

From the readdb command below

./bin/nutch readdb crawl/crawldb -dump outdir

it dumped the following output

http://www.thedogplace.com/ Version: 7
Status: 4 (db_redir_temp)
Fetch time: Sun Jan 23 01:54:24 CST 2011
Modified time: Wed Dec 31 18:00:00 CST 1969
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 1.0
Signature: null
Metadata: _pst_: temp_moved(13), lastModified=0:
http://www.thedogplace.com/Main/Default.aspx?p=5&s=0

I was expecting the crawl to bring back multiple URLs, but it brought only
http://www.thedogplace.com/Main/Default.aspx?p=5&s=0

Is there anything I am missing?

Thanks
-rizwan

Re: crawl returned just one url

Alex McLintock
On 25 December 2010 03:59, Rizwan Raza <[hidden email]> wrote:

>
> http://www.thedogplace.com/ Version: 7
> Status: 4 (db_redir_temp)
> Fetch time: Sun Jan 23 01:54:24 CST 2011
> Modified time: Wed Dec 31 18:00:00 CST 1969
> Retries since fetch: 0
> Retry interval: 2592000 seconds (30 days)
> Score: 1.0
> Signature: null
> Metadata: _pst_: temp_moved(13), lastModified=0:
> http://www.thedogplace.com/Main/Default.aspx?p=5&s=0
>
>

I am guessing that because it got the "This page has moved" response from the
server, it had no further links to crawl into.
What happens if you start your crawl from
"http://www.thedogplace.com/Main/Default.aspx" instead?

Note that Nutch doesn't cope well with anything after the "?". By default,
the regexp URL filter skips URLs that contain a "?", so a redirect target
like the one in your dump is likely to be filtered out. If
you can convert the DogPlace site to clean URLs, that might help.
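
To make the filtering point concrete, here is a rough Python sketch (not Nutch code) of how a first-match regexp URL filter would treat the two URLs from the dump. The rule list is an assumption modeled on the stock "skip probable queries" rule, and the function name accepts is made up for illustration, so check the filter files under conf/ in your own install before relying on it.

```python
import re

# Assumed rules, in the "+pattern / -pattern" style of Nutch's regexp filter files.
# The first rule is modeled on the stock "skip probable queries" rule;
# your own conf/ files may differ.
RULES = [
    "-[?*!@=]",  # reject URLs containing any of these characters
    "+.",        # accept everything else
]


def accepts(url, rules=RULES):
    """Return True if the first rule matching the URL is a '+' rule."""
    for rule in rules:
        sign, pattern = rule[0], rule[1:]
        # A rule applies if its pattern is found anywhere in the URL.
        if re.search(pattern, url):
            return sign == "+"
    # No rule matched: treat the URL as rejected.
    return False


if __name__ == "__main__":
    for url in (
        "http://www.thedogplace.com/",
        "http://www.thedogplace.com/Main/Default.aspx?p=5&s=0",
    ):
        print(url, "->", "kept" if accepts(url) else "filtered out")
```

If that first rule is what is dropping the redirect target, removing it or adding a narrower "+" rule for this site above it in the filter file should let the URL through.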

Alex