Unable to crawl all links

Unable to crawl all links

Amitabha Banerjee
Hi folks,
I am unable to crawl all the links on my website. For some reason, only one
or two links are picked up by Nutch.

Here is the website I am trying to index: http://www.knowmydestination.com

All links on this website are internal.

My crawl-urlfilter does not block any kind of internal link. It looks like
this:

# accept hosts in MY.DOMAIN.NAME
+^http://www.knowmydestination.com/

# skip everything else
-.

My urls seed file contains:  http://www.knowmydestination.com/

When I run:
bin/nutch crawl urls -dir crawl.kmd -depth 3 -topN 100

Nutch only crawls one link:
http://www.knowmydestination.com/articles/cheapfares.html

Can anyone help me figure this out?

/Amitab

Re: Unable to crawl all links

Kevin MacDonald-3
Your crawl-urlfilter is too specific. It is a regular expression that needs
to match every URL you want to hit. Try

+^http://([a-z0-9]*\.)*\S*

That will allow almost any URL.
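
A side note on the original rule: the unescaped dots in
+^http://www.knowmydestination.com/ happen to match the literal dots anyway
(an unescaped "." in a regular expression matches any character), so that
rule by itself is not what blocks the crawl. If you would rather stay
restricted to the one site, a stricter sketch of the same filter would be:

=============================
# accept anything on the site, with the dots escaped
+^http://www\.knowmydestination\.com/

# skip everything else
-.
=============================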

Kevin


Re: Unable to crawl all links

vishal vachhani
In reply to this post by Amitabha Banerjee
Hi Amitabha,
Look at nutch-default.xml and change the following property in order to
crawl the whole site:

db.ignore.internal.links -- by default it is true, but it should be false in
your case.

--Vishal
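
As a sketch, assuming you put the override in conf/nutch-site.xml (the usual
place for local changes) rather than editing nutch-default.xml itself, the
entry would look something like:

=============================
<configuration>
  <property>
    <name>db.ignore.internal.links</name>
    <!-- false = keep links to the same host, so internal pages reach the crawl db -->
    <value>false</value>
  </property>
</configuration>
=============================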


--
Thanks and Regards,
Vishal Vachhani
M.tech, CSE dept
Indian Institute of Technology, Bombay
http://www.cse.iitb.ac.in/~vishalv

Re: Unable to crawl all links

Chetan Patel
Hi Vishal,

I have the same problem as Amitabha.

I did as per your instructions, but Nutch still does not crawl all URLs.

Please help me.

Thanks in advance.

Regards,
Chetan Patel


Re: Unable to crawl all links

Kevin MacDonald-3
Dig into the code. Look at Fetcher.run() and Fetcher.handleRedirect(). Put
extra logging lines around the filters and normalizers to see whether your
URLs are showing up but being removed or altered. You can also disable all
normalizers by modifying the 'plugin.includes' property: copy the property
from nutch-default.xml to nutch-site.xml and remove the normalizer plugins.
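
A sketch of that override in nutch-site.xml -- the exact default value varies
by Nutch version, so copy yours from nutch-default.xml and just drop the
urlnormalizer-* entries rather than pasting this verbatim:

=============================
<configuration>
  <property>
    <name>plugin.includes</name>
    <!-- the 0.9-era default list minus urlnormalizer-(pass|regex|basic) -->
    <value>protocol-http|urlfilter-regex|parse-(text|html|js)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic</value>
  </property>
</configuration>
=============================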


Re: Unable to crawl all links

Chetan Patel
Hi All,

I have read the crawl db and it displays the result below.
==========================================
$ bin/nutch readdb mytest/crawldb -stats
CrawlDb statistics start: mytest/crawldb
Statistics for CrawlDb: mytest/crawldb
TOTAL urls:     345
retry 0:        345
min score:      0.0
avg score:      0.028
max score:      1.055
status 1 (db_unfetched):        285
status 2 (db_fetched):  48
status 3 (db_gone):     5
status 4 (db_redir_temp):       4
status 5 (db_redir_perm):       3
CrawlDb statistics: done
==========================================

You can see there are 345 URLs in total. Of those, Nutch has fetched only 48.
I need to fetch all 345 URLs.

I have also tried Kevin's solution.

Please help me; I am new to Nutch.

Thanks.

Regards,
Chetan Patel
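
One way to see exactly which URLs are stuck as db_unfetched is to dump the
crawl db and grep the plain-text output -- a sketch, assuming the same
'mytest' directory as above (the dump layout varies slightly between
versions, but the URL line immediately precedes its Status line):

=============================
$ bin/nutch readdb mytest/crawldb -dump crawldb-dump
$ grep -B1 db_unfetched crawldb-dump/part-00000
=============================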

Reply | Threaded
Open this post in threaded view
|

RE: Unable to crawl all links

Edward Quick

Chetan, if you haven't already done this, check your crawl-urlfilter.txt (or regex-urlfilter.txt if you're running nutch fetch) and comment out the line below:

# skip URLs containing certain characters as probable queries, etc.
#-[?*!@=]

Ed.




RE: Unable to crawl all links

Chetan Patel
Hi,

Here is my crawl-urlfilter.txt file
=============================
# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$

# accept everything else
+.
=============================

Please let me know how I can crawl all URLs.

Thank you.

Regards,
Chetan Patel



Re: Unable to crawl all links

vishal vachhani
Hi Chetan,
Check the properties called "db.ignore.external.links" and
"db.ignore.internal.links" in nutch-default.xml. A description of what each
one means is given in the file; you can act accordingly if not all pages
from the crawl URL are being crawled.

--vishal


RE: Unable to crawl all links

Edward Quick
In reply to this post by Chetan Patel


Well, that filter should cover most of it! I'm not one of the experts here,
but I think the 285 unfetched URLs in your stats mean that you need to run
'nutch fetch' on your most recent segment to fetch the new links, e.g. try
something like this:

nutch fetch crawl/segments/200809251253 > crawl.log

then go through crawl.log to see which fetches are failing, reconfigure, and
re-run.
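
More generally, unfetched entries in the crawl db are picked up by the next
generate/fetch/updatedb round. A sketch of one round, with 'mytest' standing
in for whatever -dir you used:

=============================
# generate a fetch list from the unfetched URLs in the crawl db
bin/nutch generate mytest/crawldb mytest/segments -topN 1000

# fetch the newest segment
s=`ls -d mytest/segments/* | tail -1`
bin/nutch fetch $s

# fold the fetch results back into the crawl db
bin/nutch updatedb mytest/crawldb $s
=============================

Repeat until the db_unfetched count stops shrinking.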



