Problem crawling in Nutch 0.9

Problem crawling in Nutch 0.9

Annona Keene
I recently upgraded to 0.9 and have started running into a problem. I began with a single URL and crawled with a depth of 10, assuming I would get every page on my site. The same configuration worked for me in 0.8. However, I noticed that one URL I was especially interested in was not in the index. I added that URL explicitly and crawled again, and it still was not in the index. I checked the logs, and the URL is being fetched. Then I tried a lower depth, and it worked: with a depth of 6, the URL does appear in the index. Any ideas on what could be causing this? I'm very confused.

Thanks,
Ann




       
Re: Problem crawling in Nutch 0.9

Briggs
Just curious, did you happen to limit the number of urls using the
"topN" switch?

Re: Problem crawling in Nutch 0.9

Annona Keene
I didn't set topN with the crawl command, and from what I can see in the code, it defaults to Integer.MAX_VALUE, which should be more than enough. I'm only looking at about 2100 pages.
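
For anyone following along, here is a rough sketch of why topN is worth ruling out. It is a toy model, not Nutch's actual code: the class `CrawlRoundsSketch`, the method `crawl`, and the tiny `site` link map are all made up for illustration. It assumes that depth is the number of generate/fetch/update rounds and that topN caps how many URLs are selected per round. It also takes URLs in discovery order, whereas Nutch ranks candidates by score.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class CrawlRoundsSketch {

    /**
     * Toy model of a depth/topN crawl (hypothetical, not Nutch's implementation).
     * Each round selects at most topN unfetched URLs, "fetches" them, and
     * queues their outlinks for later rounds.
     */
    public static Set<String> crawl(Map<String, List<String>> outlinks,
                                    String seed, int depth, int topN) {
        Set<String> fetched = new LinkedHashSet<>();
        Set<String> frontier = new LinkedHashSet<>();
        frontier.add(seed);

        for (int round = 0; round < depth && !frontier.isEmpty(); round++) {
            // Generate: take at most topN URLs from the frontier.
            List<String> batch = new ArrayList<>();
            for (String url : frontier) {
                if (batch.size() >= topN) {
                    break;
                }
                batch.add(url);
            }

            // Fetch and update: record the batch, then queue new outlinks.
            frontier.removeAll(batch);
            fetched.addAll(batch);
            for (String url : batch) {
                for (String link : outlinks.getOrDefault(url, List.of())) {
                    if (!fetched.contains(link)) {
                        frontier.add(link);
                    }
                }
            }
        }
        return fetched;
    }

    public static void main(String[] args) {
        // Made-up five-page site: "/" links to two pages, each with one child.
        Map<String, List<String>> site = Map.of(
                "/", List.of("/a", "/b"),
                "/a", List.of("/a/1"),
                "/b", List.of("/b/1"));

        // No effective cap: all 5 pages are reached within 3 rounds.
        System.out.println(crawl(site, "/", 3, Integer.MAX_VALUE).size()); // 5
        // Cap of 1 URL per round: only 3 pages are fetched in 3 rounds.
        System.out.println(crawl(site, "/", 3, 1).size());                 // 3
        // Fewer rounds: the deepest pages are never reached.
        System.out.println(crawl(site, "/", 2, Integer.MAX_VALUE).size()); // 3
    }
}
```

In this model a small topN or a small depth can leave pages out, but an uncapped topN with more rounds never fetches fewer pages, so it does not account for a page going missing at depth 10 when it shows up at depth 6.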

I also never experienced this problem with 0.8. I've checked nutch-default.xml, and I can't see any setting that would make it fetch a URL but not index it, especially only at higher depths.

Any other ideas?

Thanks,
Ann



----- Original Message ----
From: Briggs <[hidden email]>
To: [hidden email]
Sent: Monday, May 14, 2007 4:18:46 PM
Subject: Re: Problem crawling in Nutch 0.9

Just curious, did you happen to limit the number of urls using the
"topN" switch?
