Can't index some pages

Can't index some pages

Michael Plax
Hello,

Question summary:
Q: How can I set up the crawler to index an entire web site?

I'm trying to run a crawl with the command from the tutorial.

1. In the urls file I have the start page (index.html).
2. In the configuration file conf/crawl-urlfilter.txt I changed the domain.
3. I run: $ bin/nutch crawl urls -dir crawledtottaly -depth 10 >& crawl.log
4. Crawling is finished.
5. I run: bin/nutch readdb crawledtottaly/db -stats
   output:
  $ bin/nutch readdb crawledtottaly/db -stats
  run java in C:\Sun\AppServer\jdk
  060118 155526 parsing file:/C:/nutch/conf/nutch-default.xml
  060118 155526 parsing file:/C:/nutch/conf/nutch-site.xml
  060118 155526 No FS indicated, using default:local
  Stats for org.apache.nutch.db.WebDBReader@1b3f829
  -------------------------------
  Number of pages: 63
  Number of links: 3906
6. I get fewer pages than I expected.

What I did:
0. I read http://www.mail-archive.com/nutch-user@.../msg02458.html
1. I changed the depth to 10, 100, 1000 - same results.
2. I changed the start page to a page that did not appear - I do get that page indexed.
    output:
  $ bin/nutch readdb crawledtottaly/db -stats
  run java in C:\Sun\AppServer\jdk
  060118 162103 parsing file:/C:/nutch/conf/nutch-default.xml
  060118 162103 parsing file:/C:/nutch/conf/nutch-site.xml
  060118 162103 No FS indicated, using default:local
  Stats for org.apache.nutch.db.WebDBReader@1b3f829
  -------------------------------
  Number of pages: 64
  Number of links: 3906
This page appears at depth 3 from index.html.
 
Q: How can I set up the crawler to index an entire web site?

Thank you
Michael

P.S.
I have attached the configuration files.

============================
urls
============================
http://www.totallyfurniture.com/index.html


============================
crawl-urlfilter.txt
============================
# The url filter file used by the crawl command.

# Better for intranet crawling.
# Be sure to change MY.DOMAIN.NAME to your domain name.

# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'.  The first matching pattern in the file
# determines whether a URL is included or ignored.  If no pattern
# matches, the URL is ignored.

# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png|PNG)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]


# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*totallyfurniture.com/
+^http://([a-z0-9]*\.)*yahoo.net/


# skip everything else
-.

Re: Can't index some pages

Steven Yelton
Is it not catching all the outbound links?

db.max.outlinks.per.page

I think the default is 100.  I had to bump it up significantly to index
a reference site...
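
Overriding it in conf/nutch-site.xml would look roughly like this (only a sketch: the value 500 is just an example, the name/value property format is the same as in nutch-default.xml, and the root element should match whatever your nutch-default.xml uses):

  <property>
    <name>db.max.outlinks.per.page</name>
    <!-- default is 100; pick a value larger than the largest page's outlink count -->
    <value>500</value>
  </property>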

Steven

Michael Plax wrote:

> [...]

Re: Can't index some pages

Doug Cutting-2
In reply to this post by Michael Plax
Michael Plax wrote:

> Q: How can I set up the crawler to index an entire web site?
> [...]
>   Number of pages: 63
>   Number of links: 3906
> 6. I get fewer pages than I expected.

This is a common question, but there's not a common answer.  The problem
could be that urls are blocked by your url filter, or by
http.max.delays, or something else.
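
(If http.max.delays is the culprit, raising it takes the same name/value form in the conf XML; this is only a sketch, the value 100 is arbitrary, and whether it helps depends on why fetches are failing:)

  <property>
    <name>http.max.delays</name>
    <value>100</value>
  </property>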

What might help is if the fetcher and crawl db printed more detailed
statistics.  In particular, the fetcher could categorize failures and
periodically print a list of failure counts by category.  The crawl db
updater could also list the number of urls that are filtered.

In the meantime, please examine the logs, particularly watching for
errors while fetching.

Doug

Re: Can't index some pages

Michael Plax
In reply to this post by Steven Yelton
Thank you very much,

I changed db.max.outlinks.per.page and db.max.anchor.length to 200 and I got
the whole web site indexed.
This particular web site has more than 100 outbound links per page.
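
For reference, the two overrides look roughly like this in the conf XML (a sketch; the values are the ones I used, and the property format is the same as in nutch-default.xml):

  <property>
    <name>db.max.outlinks.per.page</name>
    <value>200</value>
  </property>
  <property>
    <name>db.max.anchor.length</name>
    <value>200</value>
  </property>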

Michael

----- Original Message -----
From: "Steven Yelton" <[hidden email]>
To: <[hidden email]>
Sent: Thursday, January 19, 2006 5:29 AM
Subject: Re: Can't index some pages


> Is it not catching all the outbound links?
>
> db.max.outlinks.per.page
>
> I think the default is 100.  I had to bump it up significantly to index a
> reference site...
>
> Steven
>
> Michael Plax wrote:
>
>> [...]


Re: Can't index some pages

kangas
Doug, would it make sense to print a LOG.info() message every time  
the fetcher bumps into one of these "db.max" limits? This would help  
users find out when they need to adjust their configuration.

I can prepare a patch if it seems sensible.

--Matt

On Jan 19, 2006, at 5:34 PM, Michael Plax wrote:

> Thank you very much,
>
> I changed db.max.outlinks.per.page and db.max.anchor.length to 200
> and I got the whole web site indexed.
> This particular web site has more than 100 outbound links per page.
>
> Michael
>
> [...]

--
Matt Kangas / [hidden email]



Re: Can't index some pages

Doug Cutting-2
Matt Kangas wrote:
> Doug, would it make sense to print a LOG.info() message every time the
> fetcher bumps into one of these "db.max" limits? This would help users
> find out when they need to adjust their configuration.
>
> I can prepare a patch if it seems sensible.

Sure, this is sensible.  But it's not done by the fetcher; the limit is applied
when the links are read, during the db update.

Doug