Nutch crawl not fetching portions of site

Andrew Libby
Hello,

I have what I assume is a simple user issue with nutch-0.8-dev.  I'm using
Nutch to do a single-site crawl on a Fedora Core 4 Linux machine.  The site
I'm crawling consists of a Perl application (Catalyst, to be specific) and
PHP applications (an app called Gallery and an instance of MediaWiki).

The issue I'm having is that Nutch does not seem to crawl the gallery
section of the site.  There are links from the main site to the gallery,
and I've listed the top-level gallery URL in the initial URL list I pass
to nutch crawl.

Sorry for the length of the message, but I wanted to provide as much
information about the problem as I could.

Nutch does crawl the wiki and Perl sections of the site.

Crawl Command:

nutch crawl urls -dir ../nutch-index -depth 25 -topN 10000

The urls dir contains one file called urls.txt:

http://www.philadelphiariders.com/
http://www.philadelphiariders.com/c/dmoz/Top.html
http://www.philadelphiariders.com/gallery/

The only change I've made to crawl-urlfilter.txt is:

+^http://www.philadelphiariders.com/

which replaced the example regex rule that was there out of the box.

In the index output, I see a reference to the gallery:

 Indexing [http://www.philadelphiariders.com/gallery/] with analyzer org.apache.nutch.analysis.NutchDocumentAnalyzer@6833f2 (null)

But the rest of the gallery is not referenced in index output.

The command  ./bin/nutch readdb ../nutch-index/crawldb -dump ./dumpdata
shows only these two entries referencing the gallery.  Does the status of
view_album.php have anything to do with my issue?

http://www.philadelphiariders.com/gallery/  Version: 4
Status: 2 (DB_fetched)
Fetch time: Tue May 16 15:20:15 EDT 2006
Modified time: Wed Dec 31 19:00:00 EST 1969
Retries since fetch: 0
Retry interval: 30.0 days
Score: 316.0114
Signature: b7619f18442c6f356f802ba7847dc127

http://www.philadelphiariders.com/gallery/view_album.php    Version: 4
Status: 3 (DB_gone)
Fetch time: Sun Apr 16 15:21:12 EDT 2006
Modified time: Wed Dec 31 19:00:00 EST 1969
Retries since fetch: 0
Retry interval: 30.0 days
Score: 2.0824916
Signature: null
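
For a larger crawldb, scanning the dump by eye gets tedious. The following is a minimal sketch, assuming the text dump has the same layout as the two records above (a URL line containing "Version:", followed by a "Status:" line). The names SAMPLE, URL_LINE, STATUS_LINE and statuses are made up for this illustration and are not part of Nutch.

```python
import re
from collections import Counter

# Sample text copied from the readdb dump above (trimmed to the fields used here).
SAMPLE = """\
http://www.philadelphiariders.com/gallery/  Version: 4
Status: 2 (DB_fetched)
Fetch time: Tue May 16 15:20:15 EDT 2006

http://www.philadelphiariders.com/gallery/view_album.php    Version: 4
Status: 3 (DB_gone)
Fetch time: Sun Apr 16 15:21:12 EDT 2006
"""

# Assumed record layout: "<url> <whitespace> Version: N", then "Status: N (NAME)".
URL_LINE = re.compile(r"^(\S+)\s+Version:")
STATUS_LINE = re.compile(r"^Status:\s+\d+\s+\((\w+)\)")

def statuses(dump_text):
    """Map each URL in a crawldb text dump to its status name."""
    result = {}
    current = None
    for line in dump_text.splitlines():
        m = URL_LINE.match(line)
        if m:
            current = m.group(1)
            continue
        m = STATUS_LINE.match(line)
        if m and current is not None:
            result[current] = m.group(1)
    return result

if __name__ == "__main__":
    by_url = statuses(SAMPLE)
    # Tally of status names, then every entry that was not fetched.
    print(Counter(by_url.values()))
    for url, status in by_url.items():
        if status != "DB_fetched":
            print(status, url)
```

To run it against a real dump, read the dump file's contents into a string and pass that to statuses() in place of SAMPLE.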

Links that are not indexed are in the linkdb:

./bin/nutch readlinkdb ../nutch-index/linkdb -dump joe

yields:

http://philadelphiariders.com/gallery/2005-Events   Inlinks:
 fromUrl: http://www.philadelphiariders.com/gallery/ anchor:

http://philadelphiariders.com/gallery/2006-Events   Inlinks:
 fromUrl: http://www.philadelphiariders.com/gallery/ anchor:
 fromUrl: http://www.philadelphiariders.com/gallery/ anchor: 2006 Events

http://philadelphiariders.com/gallery/2nd-Sunday-Rides  Inlinks:
 fromUrl: http://www.philadelphiariders.com/gallery/ anchor:
 fromUrl: http://www.philadelphiariders.com/c/ anchor: 2nd Sunday Rides
 fromUrl: http://www.philadelphiariders.com/gallery/ anchor: 2nd Sunday Rides

http://philadelphiariders.com/gallery/April-2006    Inlinks:
 fromUrl: http://www.philadelphiariders.com/c/ anchor: April 2006

http://philadelphiariders.com/gallery/Marilyns-Photos   Inlinks:
 fromUrl: http://www.philadelphiariders.com/c/ anchor: Marilyn's Photos

http://philadelphiariders.com/gallery/Rider-Gallery Inlinks:
 fromUrl: http://www.philadelphiariders.com/gallery/ anchor: Rider Gallery
 fromUrl: http://www.philadelphiariders.com/gallery/ anchor:

Also, a lot of the navigation in the Gallery application makes use of
GET parameters.  To follow links containing these, would I need to tweak
crawl-urlfilter.txt to remove the following line?

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

I don't think this is the whole problem, because the root URL
for the gallery has been fetched/indexed.  This page contains
links that are not queries (i.e. they do not contain ?).
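
To see what the two quoted rules do to a given URL, here is a minimal sketch of a first-match-wins regex filter. It assumes the behaviour described in the comments of the stock crawl-urlfilter.txt: the first pattern found in the URL decides, and a URL matching no pattern is ignored. It is a simplified model, not Nutch's own code. RULES holds only the two lines quoted in this message (the real file has more rules, and their order matters), and the query URL in the example is hypothetical.

```python
import re

# Simplified model of a first-match-wins regex URL filter.
# Only the two rules quoted in this message are listed, with the skip rule
# first; this ordering is an assumption about the real file.
RULES = [
    ("-", re.compile(r"[?*!@=]")),
    ("+", re.compile(r"^http://www.philadelphiariders.com/")),
]

def accepts(url, rules=RULES):
    """Return True if the first rule whose pattern is found in url is a '+' rule."""
    for sign, pattern in rules:
        if pattern.search(url):
            return sign == "+"
    return False  # assumption: a URL that matches no rule is dropped

if __name__ == "__main__":
    for url in [
        "http://www.philadelphiariders.com/gallery/",
        # Hypothetical query-style URL, for illustration only.
        "http://www.philadelphiariders.com/gallery/view_album.php?page=2",
    ]:
        print(accepts(url), url)
```

Under these assumptions the gallery root is accepted and the query-style URL is rejected by the -[?*!@=] rule before the accept rule is ever consulted.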

Thanks in advance for any help you can offer.

Andy

--
Andrew Libby                                  
[hidden email]
http://philadelphiariders.com/



Re: Nutch crawl not fetching portions of site

Dennis Kubes
It is possible that the URL filter is preventing the links from being
crawled, especially if they contain characters such as ? or ; (e.g. a
PHP session ID).  Can you post an example of a link?

Dennis


Re: Nutch crawl not fetching portions of site

Andrew Libby

The odd part is that they are in the linkdb, which they would not be if
they were being rejected by the filter, am I right?  The output in the
initial message I sent shows a few of these:

./bin/nutch readlinkdb ../nutch-index/linkdb -dump joe

yields:

http://philadelphiariders.com/gallery/2005-Events   Inlinks:
 fromUrl: http://www.philadelphiariders.com/gallery/ anchor:

http://philadelphiariders.com/gallery/2006-Events   Inlinks:
 fromUrl: http://www.philadelphiariders.com/gallery/ anchor:
 fromUrl: http://www.philadelphiariders.com/gallery/ anchor: 2006 Events

http://philadelphiariders.com/gallery/2nd-Sunday-Rides  Inlinks:
 fromUrl: http://www.philadelphiariders.com/gallery/ anchor:
 fromUrl: http://www.philadelphiariders.com/c/ anchor: 2nd Sunday Rides
 fromUrl: http://www.philadelphiariders.com/gallery/ anchor: 2nd Sunday Rides

http://philadelphiariders.com/gallery/April-2006    Inlinks:
 fromUrl: http://www.philadelphiariders.com/c/ anchor: April 2006

http://philadelphiariders.com/gallery/Marilyns-Photos   Inlinks:
 fromUrl: http://www.philadelphiariders.com/c/ anchor: Marilyn's Photos


Is my understanding/assumption accurate?  As you can see, the links
above do not contain the query character '?', but the Gallery application
does use it in page navigation.
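
Whether this Nutch version filters URLs before they reach the linkdb is not confirmed anywhere in this thread, so presence in the linkdb may not prove that a URL passes the filter. A more direct check is to test each link target against the accept rule itself. The following is a minimal sketch, assuming the linkdb text dump has the layout shown above (target lines ending in "Inlinks:", indented fromUrl lines). The names ACCEPT, SAMPLE, TARGET_LINE, targets and unmatched are made up for this illustration.

```python
import re

# The accept rule quoted in the original post.
ACCEPT = re.compile(r"^http://www.philadelphiariders.com/")

# A few lines copied from the linkdb dump above.
SAMPLE = """\
http://philadelphiariders.com/gallery/2005-Events   Inlinks:
 fromUrl: http://www.philadelphiariders.com/gallery/ anchor:

http://philadelphiariders.com/gallery/April-2006    Inlinks:
 fromUrl: http://www.philadelphiariders.com/c/ anchor: April 2006
"""

# Target lines start in column 0; fromUrl lines are indented and are skipped.
TARGET_LINE = re.compile(r"^(\S+)\s+Inlinks:")

def targets(dump_text):
    """Return the link targets (lines ending in 'Inlinks:') of a linkdb text dump."""
    found = []
    for line in dump_text.splitlines():
        m = TARGET_LINE.match(line)
        if m:
            found.append(m.group(1))
    return found

def unmatched(dump_text, accept=ACCEPT):
    """Return the targets that the accept pattern does not match."""
    return [url for url in targets(dump_text) if not accept.search(url)]

if __name__ == "__main__":
    for url in unmatched(SAMPLE):
        print("not matched by accept rule:", url)
```

On the sample, both targets are reported: they begin with http://philadelphiariders.com/ (no www.), while the accept rule requires http://www.philadelphiariders.com/. That difference may be worth ruling out before looking further at the query-character rule.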

Thanks.

Andy





--
Andrew Libby                                  
[hidden email]
http://philadelphiariders.com/