Crawl not crawling entire page

Mike Howarth
I was wondering if anyone could help me.

I'm currently trying to get nutch to crawl a site I have. At the moment I'm pointing nutch at the root URL, e.g. http://www.example.com

I know I have over 130 links on the index page; however, nutch is only finding 87 links. It appears that nutch simply stops crawling, and hadoop.log doesn't give any indication of why this might happen.

I've amended my crawl URL filter file to look like this:

# The url filter file used by the crawl command.

# Better for intranet crawling.
# Be sure to change MY.DOMAIN.NAME to your domain name.

# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'.  The first matching pattern in the file
# determines whether a URL is included or ignored.  If no pattern
# matches, the URL is ignored.

# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png|js)$

# skip URLs containing certain characters as probable queries, etc.
#-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
#-.*(/.+?)/.*?\1/.*?\1/

# accept hosts in MY.DOMAIN.NAME
-^https:\/\/.*
+.

# skip everything else
#-^https://.*
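
For comparison, the stock crawl-urlfilter.txt that ships with Nutch scopes the crawl to a single domain with rules along these lines (MY.DOMAIN.NAME is a placeholder to replace with your own domain):

# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/

# skip everything else
-.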

Re: Crawl not crawling entire page

Ratnesh,V2Solutions India
Hi,
It may be that the depth you specify isn't enough to reach the desired page links, so try adjusting the depth, threads, and topN settings when you run the crawl.

e.g. bin/nutch crawl urldir -dir crawl-dir -depth 20 -threads 10 -topN 50

Try increasing these values; you might get better results. If I learn anything more about this, I'll let you know.
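
A full run along those lines, starting from a fresh seed directory, might look like this (the file and directory names here are just an example):

# put the site's root URL into a seed file (hypothetical paths)
mkdir -p urldir
echo "http://www.example.com" > urldir/seed.txt

# one-step crawl: higher -depth and -topN values let more of the site
# be fetched on each round
bin/nutch crawl urldir -dir crawl-dir -depth 20 -threads 10 -topN 50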

Thanks

Re: Crawl not crawling entire page

Mike Howarth
Thanks for the response

I've already played around with differing depths, generally from 3 to 10, and have seen no distinguishable difference in results.

I've also tried running the crawl both with and without the topN flag, with little difference.

Any more ideas?

Re: Crawl not crawling entire page

Dennis Kubes
Nutch by default will only parse the first 65536 bytes of an HTTP response. You can change this to your desired limit via the http.content.limit configuration variable.

Another question is whether some of the links are duplicates?
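
One quick way to check what actually ended up in the crawl db (hypothetical paths, assuming the crawl output is in crawl-dir) is:

bin/nutch readdb crawl-dir/crawldb -stats

This prints the total number of URLs known to the crawl db along with their fetch-status counts, which makes it easier to see whether links are being dropped or just collapsing as duplicates.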

Dennis Kubes

Re: Crawl not crawling entire page

Annona Keene
In reply to this post by Mike Howarth

I fought with a similar problem for quite a while. I suggest changing two things in your nutch-site.xml.

Setting http.content.limit to -1 will prevent nutch from truncating the page. As long as your pages aren't so big that you're going to kill the machine you're using, removing the truncation should work.

<property>
  <name>http.content.limit</name>
  <value>-1</value>
  <description>The length limit for downloaded content, in bytes.
  If this value is nonnegative (>=0), content longer than it will be truncated;
  otherwise, no truncation at all.
  </description>
</property>

Second, by default, nutch only crawls the first 100 links it encounters on a page. So if you set db.max.outlinks.per.page to -1, it will crawl all the links.

<property>
  <name>db.max.outlinks.per.page</name>
  <value>-1</value>
  <description>The maximum number of outlinks that we'll process for a page.
  If this value is nonnegative (>=0), at most db.max.outlinks.per.page outlinks
  will be processed for a page; otherwise, all outlinks will be processed.
  </description>
</property>
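
Putting the two overrides together, a minimal conf/nutch-site.xml would look roughly like this (descriptions omitted; same property names and values as above):

<?xml version="1.0"?>
<configuration>
  <!-- don't truncate downloaded content -->
  <property>
    <name>http.content.limit</name>
    <value>-1</value>
  </property>
  <!-- process every outlink found on a page -->
  <property>
    <name>db.max.outlinks.per.page</name>
    <value>-1</value>
  </property>
</configuration>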


I hope this helps!

Ann

Re: Crawl not crawling entire page

Mike Howarth
Well, db.max.outlinks.per.page appears to have made a world of difference.

I'm now getting nutch crawling deeply through the site.

Many thanks for all your input; I'm sure I'll be back asking more questions soon!
