Intranet crawl and re-fetch - newbie question

isabelle.moulinier
Hello,

I have a newbie question:

I have launched and completed an intranet crawl (bin/nutch crawl mySite myDB).
Since I would like to recrawl in a few days, I changed the Nutch default fetch interval parameter to 3 days (instead of 30).
How do I perform the recrawl? Do I just launch a new intranet crawl using the same parameters?
If I do, will the fetch only download new or modified pages, or will it download everything again?

Thanks for any help

Isabelle

[hidden email]
Ph: 651 687 3424




Re: Intranet crawl and re-fetch - newbie question

Jack.Tang
Hi

I was working with Nutch a month ago, then got interrupted, and here I am again.
One question I would like confirmed: does the Nutch hosted in svn support recrawling now?
If yes, could you please tell me the config params? Thanks

/Jack


Re: Intranet crawl and re-fetch - newbie question

Piotr Kosiorowski
In reply to this post by isabelle.moulinier
As far as I know, crawl (named Intranet crawling in the tutorial) assumes
you refetch everything from scratch every time you run it. Whole Web
crawling lets you control what you crawl and recrawl in more detail,
but some parameters might not work as I would expect (e.g.
-refetchonly). Support for checking whether a page was modified since the
last fetch is currently missing (although, as I understand it, there is some
work going on in this direction: http://issues.apache.org/jira/browse/NUTCH-61 )
Regards
Piotr




Re: Intranet crawl and re-fetch - newbie question

Daniel D.-2
Hi,

I have run some tests to verify how -refetchonly behaves (as nobody has confirmed this yet) and would like to share the results with you. I have also added some questions at the end.

I'm using Nutch v6.
For test purposes I modified the code to write a log file with some URL information. In test 2 I also changed the code to modify the fetchinterval (see below).

Test 1:
I created a DB and injected 3 URLs. The re-fetch interval was set to 1 (db.default.fetch.interval).
1. I ran the fetch. I'm attaching log_10_7_days.txt so you can see the results of the fetch. Please pay attention to the nextFetch date: even though fetchInterval is 1, the nextFetch date was 7 days away. I think this nextFetch is being read from the fetchlist. (Question #1)
2. I updated the DB.
3. I created the segments with the -refetchonly option. The results of nutch fetchlist -dumpurls … are attached as test1_dumpurls.txt.
You can see that only new URLs were included. But URLs of the following form were not included: http://www.webct.com/software/viewpage?name=software_campus_edition or http://v.extreme-dm.com/?login=cguilfor (Question #2)
4. I ran the fetch on the new segment (created in step 3). The results are in log_10_7_refetch.txt. You will see that all URLs from test1_dumpurls.txt were fetched, but no outlinks were recorded. (Question #3)
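
For anyone who wants to compare the attached logs, here is a minimal Python sketch for summarising them. It assumes the record layout visible in the attachments: a line of `=` characters, followed by `URL:`, `fetchInterval:`, `nextFetch:` and `Outlinks Count:` lines. The function names and the abridged sample are illustrative only; this is not Nutch code, and the date parsing assumes English day and month names.

```python
import re
from datetime import datetime

# Abridged sample in the layout of the attached logs (outlink lines mostly omitted).
SAMPLE_LOG = """\
==================================================
URL: http://www.hypermail.org/
Number of OUT links: 0
fetchInterval: 1
nextFetch: Tue Jun 07 22:49:28 EDT 2005
Score: 1.0
NextScore: 1.0


Number of anchors: 0
Outlinks Count: 18
Outlink: toUrl: http://www.hypermail.org/docs.html anchor: Documentation

==================================================
URL: http://www.eltweb.com/
Number of OUT links: 0
fetchInterval: 1
nextFetch: Tue Jun 07 23:07:57 EDT 2005
Score: 1.0
NextScore: 1.0


Number of anchors: 0
Outlinks Count: 0
"""


def parse_records(text):
    """Split the log on separator lines and turn each record into a dict."""
    records = []
    for chunk in re.split(r"^={10,}[ \t]*$", text, flags=re.MULTILINE):
        fields = {}
        for line in chunk.splitlines():
            key, sep, value = line.partition(": ")
            key = key.strip()
            # Keep only the first occurrence of a key (e.g. the first "Outlink").
            if sep and key not in fields:
                fields[key] = value.strip()
        if "URL" in fields:
            records.append(fields)
    return records


def summarize(record):
    """Extract the fields relevant to questions 1 and 3 from one record."""
    parts = record["nextFetch"].split()
    del parts[4]  # drop the zone abbreviation (e.g. "EDT"); strptime may not know it
    next_fetch = datetime.strptime(" ".join(parts), "%a %b %d %H:%M:%S %Y")
    return {
        "url": record["URL"],
        "fetch_interval": int(record["fetchInterval"]),
        "next_fetch": next_fetch,
        "outlinks": int(record["Outlinks Count"]),
    }


if __name__ == "__main__":
    for rec in parse_records(SAMPLE_LOG):
        info = summarize(rec)
        flag = "NO OUTLINKS" if info["outlinks"] == 0 else ""
        print(info["url"], info["fetch_interval"], info["next_fetch"], flag)
```

Running the same summary over the first-fetch log and the refetch log makes it easy to see which URLs lost their outlinks and how far ahead nextFetch sits for each page.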


Test 2: After realizing that nextFetch was 7 days away, I modified the code to ignore the value loaded from the fetchlist and keep nextFetch equal to the current time (assigned at initialization).

I created a DB and injected 3 URLs. The re-fetch interval was set to 1 (db.default.fetch.interval).
1. I ran the fetch. I'm attaching log_10_0_days.txt so you can see the results of the fetch. Please pay attention to the nextFetch date.
2. I updated the DB.
3. I created the segments with the -refetchonly option. The results of nutch fetchlist -dumpurls … are attached as test2_dumpurls.txt. Note that even though the current time had passed the nextFetch date, I found exactly the same list of URLs as in test 1!
     You can see that only new URLs were included. But URLs of the following form were not included: http://www.webct.com/software/viewpage?name=software_campus_edition or http://v.extreme-dm.com/?login=cguilfor (Question #2)
4. I ran the fetch on the new segment (created in step 3). The results are in log_10_0_refetch.txt. You will see that all URLs from test2_dumpurls.txt were fetched, but no outlinks were recorded. (Question #3)

Questions:
1. Why, when db.default.fetch.interval is 1, is the Page object's nextFetch variable 7 days away?
2. Why did creating the segments with -refetchonly exclude URLs of the following form (I think those containing a question mark): http://www.webct.com/software/viewpage?name=software_campus_edition or http://v.extreme-dm.com/?login=cguilfor
3. Why does fetching the fetchlist created with -refetchonly not store outlinks in the results?
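
On question 2, one thing that may be worth checking is the URL filter configuration. This is an assumption, not something confirmed in this thread: the regex URL filter files shipped with Nutch have typically included a rule that rejects URLs containing characters such as `?` and `=`, treating them as probable queries. The Python sketch below only illustrates how a first-match-wins rule list of that style would treat the URLs above; the `RULES` list and the `accepts` function are hypothetical and are not Nutch code.

```python
import re

# Hypothetical rule list in the style of a regex URL filter file:
# a leading '+' accepts, a leading '-' rejects, and the first rule
# whose pattern is found anywhere in the URL decides the outcome.
RULES = [
    "-^(file|ftp|mailto):",  # skip non-http schemes
    "-[?*!@=]",              # skip URLs that look like queries
    "+.",                    # accept everything else
]


def accepts(url, rules=RULES):
    """Return True if the first matching rule is an accept rule."""
    for rule in rules:
        sign, pattern = rule[0], rule[1:]
        if re.search(pattern, url):
            return sign == "+"
    return False  # no rule matched: reject


if __name__ == "__main__":
    for url in [
        "http://www.webct.com/software/viewpage?name=software_campus_edition",
        "http://v.extreme-dm.com/?login=cguilfor",
        "http://www.webct.com/students",
    ]:
        print(url, "accepted" if accepts(url) else "rejected")
```

If the filter files in use contain a rule like the second one, relaxing or removing it and regenerating the fetchlist would show whether it accounts for the missing URLs.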

I hope my results will help to understand how it works.

Guys, please find time to answer those questions, as this would greatly help me in my work.

Thanks,
Daniel.


Log_10_1_0_days.txt (11K) Download Attachment
Log_10_1_0_refetch.txt (19K) Download Attachment
Log_10_1_7_days.txt (11K) Download Attachment
Log_10_1_7_refetch.txt (7K) Download Attachment

Re: Intranet crawl and re-fetch - newbie question

Piotr Kosiorowski
Hello Daniel,
I raised the -refetchonly question on the nutch-dev list two days ago (subject:
-refetchonly investigation). I described my tests and code findings
there. If you are interested you can check it there, but for me the most
important part is Doug's answer, so I will cite it here:
<cite>
The original rationale for the "-refetchonly" option was to permit
indexing of all of the urls known to the database, with anchor text,
but without fetching them.  Thus one can, e.g., provide an index of 10M
urls while only actually fetching 1M urls.  I have never actually used
this feature myself.  I don't know whether other folks have ever used
it successfully, nor whether such a feature is in fact desired.
</cite>

I do not personally find such a feature useful, but maybe somebody
does. I would like to add a feature that allows one to generate a
fetchlist containing only urls that were already fetched (and,
for symmetry, the opposite: urls that were never fetched). At the
moment I am a bit busy with my personal life and work, but I have it on
my TODO list (I will get back to your questions then too).
Regards
Piotr


Daniel D. wrote:

> ------------------------------------------------------------------------
>
>
> ==================================================
> URL: http://www.hypermail.org/
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 07 22:49:28 EDT 2005
> Score: 1.0
> NextScore: 1.0
>
>
> Number of anchors: 0
> Outlinks Count: 18
> Outlink: toUrl: http://www.hypermail.org/docs.html anchor: Documentation
> Outlink: toUrl: http://dev.hypermail.org/openfaq/ anchor: OpenFAQ
> Outlink: toUrl: http://www.hypermail.org/lists.html anchor: Mailing Lists
> Outlink: toUrl: http://www.hypermail.org/mail-archive/archives.html anchor: Mailing List Archives
> Outlink: toUrl: http://www.hypermail.org/dist anchor: Download Hypermail Software
> Outlink: toUrl: http://www.hypermail.org/cvs.html anchor: CVS Server Access
> Outlink: toUrl: http://cvsweb.hypermail.org/ anchor: Browsing the CVS Baseline
> Outlink: toUrl: http://www.hypermail.org/submit-patches.html anchor: Submitting Patches
> Outlink: toUrl: mailto:[hidden email] anchor: Suggestions
> Outlink: toUrl: http://www.hypermail.org/using.html anchor: Lists Using Hypermail
> Outlink: toUrl: http://www.hypermail.org/net-resources.html anchor: Net.Resources
> Outlink: toUrl: http://www.hypermail.org/others.html anchor: The Others
> Outlink: toUrl: http://www.hypermail.org/credits.html anchor: Credits
> Outlink: toUrl: http://www.hypermail.org/copyright.html anchor: Copyright
> Outlink: toUrl: http://home.netscape.com/comprod/mirror/index.html anchor: Download
> Outlink: toUrl: http://www.hypermail.org/navbar.html anchor:
> Outlink: toUrl: http://www.hypermail.org/firstpage.html anchor:
> Outlink: toUrl: http://www.hypermail.org/search.html anchor:
>
> ==================================================
> URL: http://www.powa.org/
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 07 22:49:29 EDT 2005
> Score: 1.0
> NextScore: 1.0
>
>
> Number of anchors: 0
> Outlinks Count: 16
> Outlink: toUrl: http://my.powa.org/modules.php?name=Your_Account anchor: Login
> Outlink: toUrl: http://www.webenglishteacher.com/ anchor:
> Outlink: toUrl: http://www.eltweb.com/ anchor:
> Outlink: toUrl: http://www.rockhillpress.com/ anchor:
> Outlink: toUrl: http://members.tripod.com/~DoctorAhClem/ahclem.html anchor:
> Outlink: toUrl: http://webcrawler.com/select/ anchor:
> Outlink: toUrl: http://www.excite.com/apple/guide/Arts_and_Humanities/Books_and_Literature/Writing/reviews.html anchor:
> Outlink: toUrl: http://www.studyweb.com/ anchor:
> Outlink: toUrl: http://www.schoolzone.co.uk/ anchor:
> Outlink: toUrl: http://www.awesomelibrary.org/ratings.html anchor:
> Outlink: toUrl: http://www.kn.pacbell.com/wired/bluewebn/ anchor:
> Outlink: toUrl: http://www.homeworkspot.com/high/english/essaywriting.htm anchor:
> Outlink: toUrl: http://www.links2go.com/topic/Writing anchor:
> Outlink: toUrl: http://www.cs.wisc.edu/scout/report anchor:
> Outlink: toUrl: http://v.extreme-dm.com/?login=cguilfor anchor:
> Outlink: toUrl: mailto:[hidden email] anchor: Chuck Guilford
>
> ==================================================
> URL: http://www.webct.com/
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 07 22:49:30 EDT 2005
> Score: 1.0
> NextScore: 1.0
>
>
> Number of anchors: 0
> Outlinks Count: 62
> Outlink: toUrl: http://www.webct.com/entrypage anchor:
> Outlink: toUrl: http://www.webct.com/entrypage anchor: Home
> Outlink: toUrl: http://www.webct.com/software anchor: Software
> Outlink: toUrl: http://www.webct.com/services anchor: Services
> Outlink: toUrl: http://www.webct.com/techsupport anchor: Support
> Outlink: toUrl: http://www.webct.com/success anchor: Customer Success
> Outlink: toUrl: http://www.webct.com/content anchor: Digital Content
> Outlink: toUrl: http://www.webct.com/powerlinks anchor: WebCT PowerLinks
> Outlink: toUrl: http://www.webct.com/vision anchor: Vision
> Outlink: toUrl: http://www.webct.com/company anchor: About Us
> Outlink: toUrl: http://www.webct.com/software/viewpage?name=software_campus_edition anchor: WebCT Campus Edition
> Outlink: toUrl: http://www.webct.com/software/viewpage?name=software_vista anchor: WebCT Vista
> Outlink: toUrl: http://www.webct.com/software/viewpage?name=software_languages anchor: Languages
> Outlink: toUrl: http://www.webct.com/services/viewpage?name=services_technical_solutions anchor: Technical Solutions
> Outlink: toUrl: http://www.webct.com/services/viewpage?name=services_professional_development anchor: Professional Development
> Outlink: toUrl: http://www.webct.com/services/viewpage?name=services_hosting anchor: Hosting Services
> Outlink: toUrl: http://www.webct.com/services/viewpage?name=services_support_options anchor: Support Services
> Outlink: toUrl: http://www.webct.com/services/viewpage?name=services_expanding_campus_edition anchor: WebCT Campus Edition
> Outlink: toUrl: http://www.webct.com/services/viewpage?name=services_getting_started_vista anchor: WebCT Vista
> Outlink: toUrl: http://www.webct.com/support anchor: WebCT Support
> Outlink: toUrl: http://www.webct.com/ask_drc anchor: Ask Dr. C
> Outlink: toUrl: http://www.webct.com/support/viewpage?name=company_documentation_index anchor: Documentation
> Outlink: toUrl: http://download.webct.com/ anchor: Software Downloads
> Outlink: toUrl: http://www.webct.com/techsupport/viewpage?name=techsupport_license_faq anchor: License Keys
> Outlink: toUrl: http://www.webct.com/success/viewpage?name=success_case_studies anchor: Case Studies
> Outlink: toUrl: http://www.webct.com/exemplary anchor: Exemplary Courses
> Outlink: toUrl: http://www.webct.com/institutes anchor: WebCT Institutes
> Outlink: toUrl: http://www.webct.com/worldwide anchor: WebCT Worldwide
> Outlink: toUrl: http://www.webct.com/content/viewpage?name=content_instructors anchor: Instructors
> Outlink: toUrl: http://www.webct.com/content/viewpage?name=content_admin anchor: Administrators
> Outlink: toUrl: http://www.webct.com/content/viewpage?name=content_access anchor: Student Access Codes
> Outlink: toUrl: http://www.webct.com/content/viewpage?name=content_customer_care anchor: Help
> Outlink: toUrl: http://www.webct.com/powerlinks/viewpage?name=powerlinks_network anchor: PowerLinks Network
> Outlink: toUrl: http://www.webct.com/powerlinks/viewpage?name=powerlinks_showcase anchor: PowerLinks Showcase
> Outlink: toUrl: http://www.webct.com/developers anchor: Vista Developers Network
> Outlink: toUrl: http://www.webct.com/company/viewpage?name=company_webct_customers anchor: Customers
> Outlink: toUrl: http://www.webct.com/company/viewpage?name=company_management_team anchor: Leadership
> Outlink: toUrl: http://www.webct.com/company/viewpage?name=company_manage_investors anchor: Investors
> Outlink: toUrl: http://www.webct.com/company/viewpage?name=company_partners anchor: Partners
> Outlink: toUrl: http://www.webct.com/company/viewpage?name=company_press_kit anchor: Press
> Outlink: toUrl: http://www.webct.com/company/viewpage?name=company_events anchor: Events
> Outlink: toUrl: http://www.webct.com/company/viewpage?name=company_jobs anchor: Jobs
> Outlink: toUrl: http://www.webct.com/company/viewpage?name=company_contact_us anchor: Contact Us
> Outlink: toUrl: http://www.webct.com/ce6 anchor:
> Outlink: toUrl: http://www.webct.com/service/ViewContent?contentID=26162711 anchor: Innovative e-learning project
> Outlink: toUrl: http://www.webct.com/service/ViewContent?contentID=26052806 anchor: WebCT to give sneak preview
> Outlink: toUrl: http://www.webct.com/2005 anchor: WebCT Impact 2005
> Outlink: toUrl: http://www.webct.com/company/service/selectnewsletters anchor: Subscribe to WebCT Newsletter
> Outlink: toUrl: http://www.webct.com/vision anchor: Learn how WebCT can help your institution achieve learning without limits
> Outlink: toUrl: http://www.webct.com/events anchor: Events
> Outlink: toUrl: http://www.webct.com/software/viewpage?name=software_languages anchor: Languages
> Outlink: toUrl: http://www.webct.com/students anchor: Students
> Outlink: toUrl: http://www.webct.com/faculty anchor: Faculty
> Outlink: toUrl: http://www.webct.com/workshops anchor: Online Workshops
> Outlink: toUrl: http://www.webct.com/seminars anchor: Online Seminars
> Outlink: toUrl: http://www.webct.com/ask_drc anchor: Ask Dr. C
> Outlink: toUrl: http://www.webct.com/ce6 anchor: CE 6 Upgrade
> Outlink: toUrl: http://www.webct.com/company/viewpage?name=company_contact_us anchor: Contact Us
> Outlink: toUrl: http://www.webct.com/communities/servicepolicy anchor: Terms of Service
> Outlink: toUrl: http://www.webct.com/communities/privacypolicy anchor: Privacy Policy
> Outlink: toUrl: http://www.webct.com/company/viewpage?name=company_how_to_apply anchor: Employment
> Outlink: toUrl: http://www.webct.com/communities/viewpage?name=communities_site_map anchor: Site Map
>
>
> ------------------------------------------------------------------------
>
>
> ==================================================
> URL: http://www.eltweb.com/
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 07 23:07:57 EDT 2005
> Score: 1.0
> NextScore: 1.0
>
>
> Number of anchors: 0
> Outlinks Count: 0
>
> ==================================================
> URL: http://www.webct.com/students
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 07 23:07:57 EDT 2005
> Score: 1.0
> NextScore: 1.0
>
>
> Number of anchors: 1
> Anchors: Students
> Outlinks Count: 0
>
> ==================================================
> URL: http://www.hypermail.org/navbar.html
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 07 23:07:57 EDT 2005
> Score: 1.0
> NextScore: 1.0
>
>
> Number of anchors: 0
> Outlinks Count: 0
>
> ==================================================
> URL: http://www.cs.wisc.edu/scout/report
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 07 23:07:57 EDT 2005
> Score: 1.0
> NextScore: 1.0
>
>
> Number of anchors: 0
> Outlinks Count: 0
>
> ==================================================
> URL: http://www.webct.com/worldwide
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 07 23:07:57 EDT 2005
> Score: 1.0
> NextScore: 1.0
>
>
> Number of anchors: 1
> Anchors: WebCT Worldwide
> Outlinks Count: 0
>
> ==================================================
> URL: http://www.webct.com/techsupport
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 07 23:07:57 EDT 2005
> Score: 1.0
> NextScore: 1.0
>
>
> Number of anchors: 1
> Anchors: Support
> Outlinks Count: 0
>
> ==================================================
> URL: http://www.webct.com/support
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 07 23:07:57 EDT 2005
> Score: 1.0
> NextScore: 1.0
>
>
> Number of anchors: 1
> Anchors: WebCT Support
> Outlinks Count: 0
>
> ==================================================
> URL: http://www.kn.pacbell.com/wired/bluewebn/
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 07 23:07:57 EDT 2005
> Score: 1.0
> NextScore: 1.0
>
>
> Number of anchors: 0
> Outlinks Count: 0
>
> ==================================================
> URL: http://www.rockhillpress.com/
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 07 23:07:57 EDT 2005
> Score: 1.0
> NextScore: 1.0
>
>
> Number of anchors: 0
> Outlinks Count: 0
>
> ==================================================
> URL: http://www.hypermail.org/mail-archive/archives.html
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 07 23:07:57 EDT 2005
> Score: 1.0
> NextScore: 1.0
>
>
> Number of anchors: 1
> Anchors: Mailing List Archives
> Outlinks Count: 0
>
> ==================================================
> URL: http://www.webct.com/powerlinks
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 07 23:07:57 EDT 2005
> Score: 1.0
> NextScore: 1.0
>
>
> Number of anchors: 1
> Anchors: WebCT PowerLinks
> Outlinks Count: 0
>
> ==================================================
> URL: http://dev.hypermail.org/openfaq/
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 07 23:07:57 EDT 2005
> Score: 1.0
> NextScore: 1.0
>
>
> Number of anchors: 1
> Anchors: OpenFAQ
> Outlinks Count: 0
>
> ==================================================
> URL: http://www.hypermail.org/dist
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 07 23:07:57 EDT 2005
> Score: 1.0
> NextScore: 1.0
>
>
> Number of anchors: 1
> Anchors: Download Hypermail Software
> Outlinks Count: 0
>
> ==================================================
> URL: http://www.hypermail.org/credits.html
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 07 23:07:57 EDT 2005
> Score: 1.0
> NextScore: 1.0
>
>
> Number of anchors: 1
> Anchors: Credits
> Outlinks Count: 0
>
> ==================================================
> URL: http://www.hypermail.org/copyright.html
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 07 23:07:57 EDT 2005
> Score: 1.0
> NextScore: 1.0
>
>
> Number of anchors: 1
> Anchors: Copyright
> Outlinks Count: 0
>
> ==================================================
> URL: http://www.hypermail.org/search.html
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 07 23:07:57 EDT 2005
> Score: 1.0
> NextScore: 1.0
>
>
> Number of anchors: 0
> Outlinks Count: 0
>
> ==================================================
> URL: http://www.webct.com/company/service/selectnewsletters
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 07 23:07:57 EDT 2005
> Score: 1.0
> NextScore: 1.0
>
>
> Number of anchors: 1
> Anchors: Subscribe to WebCT Newsletter
> Outlinks Count: 0
>
> ==================================================
> URL: http://www.webct.com/company
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 07 23:07:57 EDT 2005
> Score: 1.0
> NextScore: 1.0
>
>
> Number of anchors: 1
> Anchors: About Us
> Outlinks Count: 0
>
> ==================================================
> URL: http://www.webct.com/developers
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 07 23:07:57 EDT 2005
> Score: 1.0
> NextScore: 1.0
>
>
> Number of anchors: 1
> Anchors: Vista Developers Network
> Outlinks Count: 0
>
> ==================================================
> URL: http://www.excite.com/apple/guide/Arts_and_Humanities/Books_and_Literature/Writing/reviews.html
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 07 23:07:57 EDT 2005
> Score: 1.0
> NextScore: 1.0
>
>
> Number of anchors: 0
> Outlinks Count: 0
>
> ==================================================
> URL: http://www.webct.com/services
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 07 23:07:57 EDT 2005
> Score: 1.0
> NextScore: 1.0
>
>
> Number of anchors: 1
> Anchors: Services
> Outlinks Count: 0
>
> ==================================================
> URL: http://members.tripod.com/~DoctorAhClem/ahclem.html
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 07 23:07:57 EDT 2005
> Score: 1.0
> NextScore: 1.0
>
>
> Number of anchors: 0
> Outlinks Count: 0
>
> ==================================================
> URL: http://www.hypermail.org/firstpage.html
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 07 23:07:57 EDT 2005
> Score: 1.0
> NextScore: 1.0
>
>
> Number of anchors: 0
> Outlinks Count: 0
>
> ==================================================
> URL: http://www.homeworkspot.com/high/english/essaywriting.htm
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 07 23:07:57 EDT 2005
> Score: 1.0
> NextScore: 1.0
>
>
> Number of anchors: 0
> Outlinks Count: 0
>
> ==================================================
> URL: http://www.hypermail.org/using.html
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 07 23:07:57 EDT 2005
> Score: 1.0
> NextScore: 1.0
>
>
> Number of anchors: 1
> Anchors: Lists Using Hypermail
> Outlinks Count: 0
>
> ==================================================
> URL: http://www.webct.com/communities/privacypolicy
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 07 23:07:57 EDT 2005
> Score: 1.0
> NextScore: 1.0
>
>
> Number of anchors: 1
> Anchors: Privacy Policy
> Outlinks Count: 0
>
> ==================================================
> URL: http://home.netscape.com/comprod/mirror/index.html
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 07 23:07:57 EDT 2005
> Score: 1.0
> NextScore: 1.0
>
>
> Number of anchors: 1
> Anchors: Download
> Outlinks Count: 0
>
> ==================================================
> URL: http://www.links2go.com/topic/Writing
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 07 23:07:57 EDT 2005
> Score: 1.0
> NextScore: 1.0
>
>
> Number of anchors: 0
> Outlinks Count: 0
>
> ==================================================
> URL: http://www.webct.com/seminars
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 07 23:07:57 EDT 2005
> Score: 1.0
> NextScore: 1.0
>
>
> Number of anchors: 1
> Anchors: Online Seminars
> Outlinks Count: 0
>
> ==================================================
> URL: http://www.webct.com/institutes
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 07 23:07:57 EDT 2005
> Score: 1.0
> NextScore: 1.0
>
>
> Number of anchors: 1
> Anchors: WebCT Institutes
> Outlinks Count: 0
>
> ==================================================
> URL: http://www.webct.com/events
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 07 23:07:57 EDT 2005
> Score: 1.0
> NextScore: 1.0
>
>
> Number of anchors: 1
> Anchors: Events
> Outlinks Count: 0
>
> ==================================================
> URL: http://www.webct.com/ce6
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 07 23:07:57 EDT 2005
> Score: 1.0
> NextScore: 1.0
>
>
> Number of anchors: 0
> Outlinks Count: 0
>
> ==================================================
> URL: http://www.hypermail.org/lists.html
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 07 23:07:57 EDT 2005
> Score: 1.0
> NextScore: 1.0
>
>
> Number of anchors: 1
> Anchors: Mailing Lists
> Outlinks Count: 0
>
> ==================================================
> URL: http://www.webct.com/success
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 07 23:07:57 EDT 2005
> Score: 1.0
> NextScore: 1.0
>
>
> Number of anchors: 1
> Anchors: Customer Success
> Outlinks Count: 0
>
> ==================================================
> URL: http://www.studyweb.com/
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 07 23:07:58 EDT 2005
> Score: 1.0
> NextScore: 1.0
>
>
> Number of anchors: 0
> Outlinks Count: 0
>
> ==================================================
> URL: http://www.webct.com/2005
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 07 23:07:58 EDT 2005
> Score: 1.0
> NextScore: 1.0
>
>
> Number of anchors: 1
> Anchors: WebCT Impact 2005
> Outlinks Count: 0
>
> ==================================================
> URL: http://www.webct.com/entrypage
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 07 23:07:58 EDT 2005
> Score: 1.0
> NextScore: 1.0
>
>
> Number of anchors: 0
> Outlinks Count: 0
>
> ==================================================
> URL: http://download.webct.com/
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 07 23:07:58 EDT 2005
> Score: 1.0
> NextScore: 1.0
>
>
> Number of anchors: 1
> Anchors: Software Downloads
> Outlinks Count: 0
>
> ==================================================
> URL: http://www.hypermail.org/cvs.html
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 07 23:07:58 EDT 2005
> Score: 1.0
> NextScore: 1.0
>
>
> Number of anchors: 1
> Anchors: CVS Server Access
> Outlinks Count: 0
>
> ==================================================
> URL: http://www.hypermail.org/net-resources.html
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 07 23:07:58 EDT 2005
> Score: 1.0
> NextScore: 1.0
>
>
> Number of anchors: 1
> Anchors: Net.Resources
> Outlinks Count: 0
>
> ==================================================
> URL: http://cvsweb.hypermail.org/
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 07 23:07:58 EDT 2005
> Score: 1.0
> NextScore: 1.0
>
>
> Number of anchors: 1
> Anchors: Browsing the CVS Baseline
> Outlinks Count: 0
>
> ==================================================
> URL: http://www.webct.com/exemplary
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 07 23:07:58 EDT 2005
> Score: 1.0
> NextScore: 1.0
>
>
> Number of anchors: 1
> Anchors: Exemplary Courses
> Outlinks Count: 0
>
> ==================================================
> URL: http://www.hypermail.org/docs.html
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 07 23:07:58 EDT 2005
> Score: 1.0
> NextScore: 1.0
>
>
> Number of anchors: 1
> Anchors: Documentation
> Outlinks Count: 0
>
> ==================================================
> URL: http://www.awesomelibrary.org/ratings.html
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 07 23:07:58 EDT 2005
> Score: 1.0
> NextScore: 1.0
>
>
> Number of anchors: 0
> Outlinks Count: 0
>
> ==================================================
> URL: http://www.webct.com/ask_drc
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 07 23:07:58 EDT 2005
> Score: 1.0
> NextScore: 1.0
>
>
> Number of anchors: 1
> Anchors: Ask Dr. C
> Outlinks Count: 0
>
> ==================================================
> URL: http://www.schoolzone.co.uk/
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 07 23:07:58 EDT 2005
> Score: 1.0
> NextScore: 1.0
>
>
> Number of anchors: 0
> Outlinks Count: 0
>
> ==================================================
> URL: http://www.webct.com/vision
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 07 23:07:58 EDT 2005
> Score: 1.0
> NextScore: 1.0
>
>
> Number of anchors: 1
> Anchors: Vision
> Outlinks Count: 0
>
> ==================================================
> URL: http://www.webct.com/content
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 07 23:07:58 EDT 2005
> Score: 1.0
> NextScore: 1.0
>
>
> Number of anchors: 1
> Anchors: Digital Content
> Outlinks Count: 0
>
> ==================================================
> URL: http://www.webct.com/software
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 07 23:07:58 EDT 2005
> Score: 1.0
> NextScore: 1.0
>
>
> Number of anchors: 1
> Anchors: Software
> Outlinks Count: 0
>
> ==================================================
> URL: http://webcrawler.com/select/
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 07 23:07:58 EDT 2005
> Score: 1.0
> NextScore: 1.0
>
>
> Number of anchors: 0
> Outlinks Count: 0
>
> ==================================================
> URL: http://www.hypermail.org/submit-patches.html
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 07 23:07:58 EDT 2005
> Score: 1.0
> NextScore: 1.0
>
>
> Number of anchors: 1
> Anchors: Submitting Patches
> Outlinks Count: 0
>
> ==================================================
> URL: http://www.webct.com/communities/servicepolicy
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 07 23:07:58 EDT 2005
> Score: 1.0
> NextScore: 1.0
>
>
> Number of anchors: 1
> Anchors: Terms of Service
> Outlinks Count: 0
>
> ==================================================
> URL: http://www.hypermail.org/others.html
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 07 23:07:58 EDT 2005
> Score: 1.0
> NextScore: 1.0
>
>
> Number of anchors: 1
> Anchors: The Others
> Outlinks Count: 0
>
> ==================================================
> URL: http://www.webct.com/workshops
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 07 23:07:58 EDT 2005
> Score: 1.0
> NextScore: 1.0
>
>
> Number of anchors: 1
> Anchors: Online Workshops
> Outlinks Count: 0
>
> ==================================================
> URL: http://www.webenglishteacher.com/
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 07 23:07:58 EDT 2005
> Score: 1.0
> NextScore: 1.0
>
>
> Number of anchors: 0
> Outlinks Count: 0
>
> ==================================================
> URL: http://www.webct.com/faculty
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 07 23:07:58 EDT 2005
> Score: 1.0
> NextScore: 1.0
>
>
> Number of anchors: 1
> Anchors: Faculty
> Outlinks Count: 0
>
>
> ------------------------------------------------------------------------
>
>
> ==================================================
> URL: http://www.hypermail.org/
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 14 20:59:25 EDT 2005
> Score: 1.0
> NextScore: 1.0
>
>
> Number of anchors: 0
> Outlinks Count: 18
> Outlink: toUrl: http://www.hypermail.org/docs.html anchor: Documentation
> Outlink: toUrl: http://dev.hypermail.org/openfaq/ anchor: OpenFAQ
> Outlink: toUrl: http://www.hypermail.org/lists.html anchor: Mailing Lists
> Outlink: toUrl: http://www.hypermail.org/mail-archive/archives.html anchor: Mailing List Archives
> Outlink: toUrl: http://www.hypermail.org/dist anchor: Download Hypermail Software
> Outlink: toUrl: http://www.hypermail.org/cvs.html anchor: CVS Server Access
> Outlink: toUrl: http://cvsweb.hypermail.org/ anchor: Browsing the CVS Baseline
> Outlink: toUrl: http://www.hypermail.org/submit-patches.html anchor: Submitting Patches
> Outlink: toUrl: mailto:[hidden email] anchor: Suggestions
> Outlink: toUrl: http://www.hypermail.org/using.html anchor: Lists Using Hypermail
> Outlink: toUrl: http://www.hypermail.org/net-resources.html anchor: Net.Resources
> Outlink: toUrl: http://www.hypermail.org/others.html anchor: The Others
> Outlink: toUrl: http://www.hypermail.org/credits.html anchor: Credits
> Outlink: toUrl: http://www.hypermail.org/copyright.html anchor: Copyright
> Outlink: toUrl: http://home.netscape.com/comprod/mirror/index.html anchor: Download
> Outlink: toUrl: http://www.hypermail.org/navbar.html anchor:
> Outlink: toUrl: http://www.hypermail.org/firstpage.html anchor:
> Outlink: toUrl: http://www.hypermail.org/search.html anchor:
>
> ==================================================
> URL: http://www.powa.org/
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 14 20:59:25 EDT 2005
> Score: 1.0
> NextScore: 1.0
>
>
> Number of anchors: 0
> Outlinks Count: 16
> Outlink: toUrl: http://my.powa.org/modules.php?name=Your_Account anchor: Login
> Outlink: toUrl: http://www.webenglishteacher.com/ anchor:
> Outlink: toUrl: http://www.eltweb.com/ anchor:
> Outlink: toUrl: http://www.rockhillpress.com/ anchor:
> Outlink: toUrl: http://members.tripod.com/~DoctorAhClem/ahclem.html anchor:
> Outlink: toUrl: http://webcrawler.com/select/ anchor:
> Outlink: toUrl: http://www.excite.com/apple/guide/Arts_and_Humanities/Books_and_Literature/Writing/reviews.html anchor:
> Outlink: toUrl: http://www.studyweb.com/ anchor:
> Outlink: toUrl: http://www.schoolzone.co.uk/ anchor:
> Outlink: toUrl: http://www.awesomelibrary.org/ratings.html anchor:
> Outlink: toUrl: http://www.kn.pacbell.com/wired/bluewebn/ anchor:
> Outlink: toUrl: http://www.homeworkspot.com/high/english/essaywriting.htm anchor:
> Outlink: toUrl: http://www.links2go.com/topic/Writing anchor:
> Outlink: toUrl: http://www.cs.wisc.edu/scout/report anchor:
> Outlink: toUrl: http://v.extreme-dm.com/?login=cguilfor anchor:
> Outlink: toUrl: mailto:[hidden email] anchor: Chuck Guilford
>
> ==================================================
> URL: http://www.webct.com/
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 14 20:59:25 EDT 2005
> Score: 1.0
> NextScore: 1.0
>
>
> Number of anchors: 0
> Outlinks Count: 62
> Outlink: toUrl: http://www.webct.com/entrypage anchor:
> Outlink: toUrl: http://www.webct.com/entrypage anchor: Home
> Outlink: toUrl: http://www.webct.com/software anchor: Software
> Outlink: toUrl: http://www.webct.com/services anchor: Services
> Outlink: toUrl: http://www.webct.com/techsupport anchor: Support
> Outlink: toUrl: http://www.webct.com/success anchor: Customer Success
> Outlink: toUrl: http://www.webct.com/content anchor: Digital Content
> Outlink: toUrl: http://www.webct.com/powerlinks anchor: WebCT PowerLinks
> Outlink: toUrl: http://www.webct.com/vision anchor: Vision
> Outlink: toUrl: http://www.webct.com/company anchor: About Us
> Outlink: toUrl: http://www.webct.com/software/viewpage?name=software_campus_edition anchor: WebCT Campus Edition
> Outlink: toUrl: http://www.webct.com/software/viewpage?name=software_vista anchor: WebCT Vista
> Outlink: toUrl: http://www.webct.com/software/viewpage?name=software_languages anchor: Languages
> Outlink: toUrl: http://www.webct.com/services/viewpage?name=services_technical_solutions anchor: Technical Solutions
> Outlink: toUrl: http://www.webct.com/services/viewpage?name=services_professional_development anchor: Professional Development
> Outlink: toUrl: http://www.webct.com/services/viewpage?name=services_hosting anchor: Hosting Services
> Outlink: toUrl: http://www.webct.com/services/viewpage?name=services_support_options anchor: Support Services
> Outlink: toUrl: http://www.webct.com/services/viewpage?name=services_expanding_campus_edition anchor: WebCT Campus Edition
> Outlink: toUrl: http://www.webct.com/services/viewpage?name=services_getting_started_vista anchor: WebCT Vista
> Outlink: toUrl: http://www.webct.com/support anchor: WebCT Support
> Outlink: toUrl: http://www.webct.com/ask_drc anchor: Ask Dr. C
> Outlink: toUrl: http://www.webct.com/support/viewpage?name=company_documentation_index anchor: Documentation
> Outlink: toUrl: http://download.webct.com/ anchor: Software Downloads
> Outlink: toUrl: http://www.webct.com/techsupport/viewpage?name=techsupport_license_faq anchor: License Keys
> Outlink: toUrl: http://www.webct.com/success/viewpage?name=success_case_studies anchor: Case Studies
> Outlink: toUrl: http://www.webct.com/exemplary anchor: Exemplary Courses
> Outlink: toUrl: http://www.webct.com/institutes anchor: WebCT Institutes
> Outlink: toUrl: http://www.webct.com/worldwide anchor: WebCT Worldwide
> Outlink: toUrl: http://www.webct.com/content/viewpage?name=content_instructors anchor: Instructors
> Outlink: toUrl: http://www.webct.com/content/viewpage?name=content_admin anchor: Administrators
> Outlink: toUrl: http://www.webct.com/content/viewpage?name=content_access anchor: Student Access Codes
> Outlink: toUrl: http://www.webct.com/content/viewpage?name=content_customer_care anchor: Help
> Outlink: toUrl: http://www.webct.com/powerlinks/viewpage?name=powerlinks_network anchor: PowerLinks Network
> Outlink: toUrl: http://www.webct.com/powerlinks/viewpage?name=powerlinks_showcase anchor: PowerLinks Showcase
> Outlink: toUrl: http://www.webct.com/developers anchor: Vista Developers Network
> Outlink: toUrl: http://www.webct.com/company/viewpage?name=company_webct_customers anchor: Customers
> Outlink: toUrl: http://www.webct.com/company/viewpage?name=company_management_team anchor: Leadership
> Outlink: toUrl: http://www.webct.com/company/viewpage?name=company_manage_investors anchor: Investors
> Outlink: toUrl: http://www.webct.com/company/viewpage?name=company_partners anchor: Partners
> Outlink: toUrl: http://www.webct.com/company/viewpage?name=company_press_kit anchor: Press
> Outlink: toUrl: http://www.webct.com/company/viewpage?name=company_events anchor: Events
> Outlink: toUrl: http://www.webct.com/company/viewpage?name=company_jobs anchor: Jobs
> Outlink: toUrl: http://www.webct.com/company/viewpage?name=company_contact_us anchor: Contact Us
> Outlink: toUrl: http://www.webct.com/ce6 anchor:
> Outlink: toUrl: http://www.webct.com/service/ViewContent?contentID=26162711 anchor: Innovative e-learning project
> Outlink: toUrl: http://www.webct.com/service/ViewContent?contentID=26052806 anchor: WebCT to give sneak preview
> Outlink: toUrl: http://www.webct.com/2005 anchor: WebCT Impact 2005
> Outlink: toUrl: http://www.webct.com/company/service/selectnewsletters anchor: Subscribe to WebCT Newsletter
> Outlink: toUrl: http://www.webct.com/vision anchor: Learn how WebCT can help your institution achieve learning without limits
> Outlink: toUrl: http://www.webct.com/events anchor: Events
> Outlink: toUrl: http://www.webct.com/software/viewpage?name=software_languages anchor: Languages
> Outlink: toUrl: http://www.webct.com/students anchor: Students
> Outlink: toUrl: http://www.webct.com/faculty anchor: Faculty
> Outlink: toUrl: http://www.webct.com/workshops anchor: Online Workshops
> Outlink: toUrl: http://www.webct.com/seminars anchor: Online Seminars
> Outlink: toUrl: http://www.webct.com/ask_drc anchor: Ask Dr. C
> Outlink: toUrl: http://www.webct.com/ce6 anchor: CE 6 Upgrade
> Outlink: toUrl: http://www.webct.com/company/viewpage?name=company_contact_us anchor: Contact Us
> Outlink: toUrl: http://www.webct.com/communities/servicepolicy anchor: Terms of Service
> Outlink: toUrl: http://www.webct.com/communities/privacypolicy anchor: Privacy Policy
> Outlink: toUrl: http://www.webct.com/company/viewpage?name=company_how_to_apply anchor: Employment
> Outlink: toUrl: http://www.webct.com/communities/viewpage?name=communities_site_map anchor: Site Map
>
>
> ------------------------------------------------------------------------
>
>
> ==================================================
> URL: http://www.eltweb.com/
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 14 21:39:08 EDT 2005
> Score: 1.0
> NextScore: 1.0
>
>
> Number of anchors: 0
> Outlinks Count: 0
>
> ==================================================
> URL: http://www.webct.com/students
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 14 21:39:08 EDT 2005
> Score: 1.0
> NextScore: 1.0
>
>
> Number of anchors: 1
> Anchors: Students
> Outlinks Count: 0
>
> ==================================================
> URL: http://www.hypermail.org/navbar.html
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 14 21:39:08 EDT 2005
> Score: 1.0
> NextScore: 1.0
>
>
> Number of anchors: 0
> Outlinks Count: 0
>
> ==================================================
> URL: http://www.cs.wisc.edu/scout/report
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 14 21:39:08 EDT 2005
> Score: 1.0
> NextScore: 1.0
>
>
> Number of anchors: 0
> Outlinks Count: 0
>
> ==================================================
> URL: http://www.webct.com/worldwide
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 14 21:39:08 EDT 2005
> Score: 1.0
> NextScore: 1.0
>
>
> Number of anchors: 1
> Anchors: WebCT Worldwide
> Outlinks Count: 0
>
> ==================================================
> URL: http://www.webct.com/techsupport
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 14 21:39:08 EDT 2005
> Score: 1.0
> NextScore: 1.0
>
>
> Number of anchors: 1
> Anchors: Support
> Outlinks Count: 0
>
> ==================================================
> URL: http://www.webct.com/support
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 14 21:39:08 EDT 2005
> Score: 1.0
> NextScore: 1.0
>
>
> Number of anchors: 1
> Anchors: WebCT Support
> Outlinks Count: 0
>
> ==================================================
> URL: http://www.kn.pacbell.com/wired/bluewebn/
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 14 21:39:08 EDT 2005
> Score: 1.0
> NextScore: 1.0
>
>
> Number of anchors: 0
> Outlinks Count: 0
>
> ==================================================
> URL: http://www.rockhillpress.com/
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 14 21:39:08 EDT 2005
> Score: 1.0
> NextScore: 1.0
>
>
> Number of anchors: 0
> Outlinks Count: 0
>
> ==================================================
> URL: http://www.hypermail.org/mail-archive/archives.html
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 14 21:39:08 EDT 2005
> Score: 1.0
> NextScore: 1.0
>
>
> Number of anchors: 1
> Anchors: Mailing List Archives
> Outlinks Count: 0
>
> ==================================================
> URL: http://www.webct.com/powerlinks
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 14 21:39:08 EDT 2005
> Score: 1.0
> NextScore: 1.0
>
>
> Number of anchors: 1
> Anchors: WebCT PowerLinks
> Outlinks Count: 0
>
> ==================================================
> URL: http://dev.hypermail.org/openfaq/
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 14 21:39:08 EDT 2005
> Score: 1.0
> NextScore: 1.0
>
>
> Number of anchors: 1
> Anchors: OpenFAQ
> Outlinks Count: 0
>
> ==================================================
> URL: http://www.hypermail.org/dist
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 14 21:39:08 EDT 2005
> Score: 1.0
> NextScore: 1.0
>
>
> Number of anchors: 1
> Anchors: Download Hypermail Software
> Outlinks Count: 0
>
> ==================================================
> URL: http://www.hypermail.org/credits.html
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 14 21:39:08 EDT 2005
> Score: 1.0
> NextScore: 1.0
>
>
> Number of anchors: 1
> Anchors: Credits
> Outlinks Count: 0
>
> ==================================================
> URL: http://www.hypermail.org/copyright.html
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 14 21:39:08 EDT 2005
> Score: 1.0
> NextScore: 1.0
>
>
> Number of anchors: 1
> Anchors: Copyright
> Outlinks Count: 0
>
> ==================================================
> URL: http://www.hypermail.org/search.html
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 14 21:39:08 EDT 2005
> Score: 1.0
> NextScore: 1.0
>
>
> Number of anchors: 0
> Outlinks Count: 0
>
> ==================================================
> URL: http://www.webct.com/company/service/selectnewsletters
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 14 21:39:08 EDT 2005
> Score: 1.0
> NextScore: 1.0
>
>
> Number of anchors: 1
> Anchors: Subscribe to WebCT Newsletter
> Outlinks Count: 0
>
> ==================================================
> URL: http://www.webct.com/company
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 14 21:39:08 EDT 2005
> Score: 1.0
> NextScore: 1.0
>
>
> Number of anchors: 1
> Anchors: About Us
> Outlinks Count: 0
>
> ==================================================
> URL: http://www.webct.com/developers
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 14 21:39:08 EDT 2005
> Score: 1.0
> NextScore: 1.0
>
>
> Number of anchors: 1
> Anchors: Vista Developers Network
> Outlinks Count: 0
>
> ==================================================
> URL: http://www.excite.com/apple/guide/Arts_and_Humanities/Books_and_Literature/Writing/reviews.html
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 14 21:39:08 EDT 2005
> Score: 1.0
> NextScore: 1.0
>
>
> Number of anchors: 0
> Outlinks Count: 0
>
> ==================================================
> URL: http://www.webct.com/services
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 14 21:39:08 EDT 2005
> Score: 1.0
> NextScore: 1.0
>
>
> Number of anchors: 1
> Anchors: Services
> Outlinks Count: 0
>
> ==================================================
> URL: http://members.tripod.com/~DoctorAhClem/ahclem.html
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 14 21:39:08 EDT 2005
> Score: 1.0
> NextScore: 1.0
>
>
> Number of anchors: 0
> Outlinks Count: 0
>
> ==================================================
> URL: http://www.hypermail.org/firstpage.html
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 14 21:39:08 EDT 2005
> Score: 1.0
> NextScore: 1.0
>
>
> Number of anchors: 0
> Outlinks Count: 0
>
> ==================================================
> URL: http://www.homeworkspot.com/high/english/essaywriting.htm
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 14 21:39:08 EDT 2005
> Score: 1.0
> NextScore: 1.0
>
>
> Number of anchors: 0
> Outlinks Count: 0
>
> ==================================================
> URL: http://www.hypermail.org/using.html
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 14 21:39:08 EDT 2005
> Score: 1.0
> NextScore: 1.0
>
>
> Number of anchors: 1
> Anchors: Lists Using Hypermail
> Outlinks Count: 0
>
> ==================================================
> URL: http://www.webct.com/communities/privacypolicy
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 14 21:39:08 EDT 2005
> Score: 1.0
> NextScore: 1.0
>
>
> Number of anchors: 1
> Anchors: Privacy Policy
> Outlinks Count: 0
>
> ==================================================
> URL: http://home.netscape.com/comprod/mirror/index.html
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 14 21:39:08 EDT 2005
> Score: 1.0
> NextScore: 1.0
>
>
> Number of anchors: 1
> Anchors: Download
> Outlinks Count: 0
>
> ==================================================
> URL: http://www.links2go.com/topic/Writing
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 14 21:39:08 EDT 2005
> Score: 1.0
> NextScore: 1.0
>
>
> Number of anchors: 0
> Outlinks Count: 0
>
> ==================================================
> URL: http://www.webct.com/seminars
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 14 21:39:08 EDT 2005
> Score: 1.0
> NextScore: 1.0
>
>
> Number of anchors: 1
> Anchors: Online Seminars
> Outlinks Count: 0
>
> ==================================================
> URL: http://www.webct.com/institutes
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 14 21:39:08 EDT 2005
> Score: 1.0
> NextScore: 1.0
>
>
> Number of anchors: 1
> Anchors: WebCT Institutes
> Outlinks Count: 0
>
> ==================================================
> URL: http://www.webct.com/events
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 14 21:39:08 EDT 2005
> Score: 1.0
> NextScore: 1.0
>
>
> Number of anchors: 1
> Anchors: Events
> Outlinks Count: 0
>
> ==================================================
> URL: http://www.webct.com/ce6
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 14 21:39:08 EDT 2005
> Score: 1.0
> NextScore: 1.0
>
>
> Number of anchors: 0
> Outlinks Count: 0
>
> ==================================================
> URL: http://www.hypermail.org/lists.html
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 14 21:39:08 EDT 2005
> Score: 1.0
> NextScore: 1.0
>
>
> Number of anchors: 1
> Anchors: Mailing Lists
> Outlinks Count: 0
>
> ==================================================
> URL: http://www.webct.com/success
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 14 21:39:08 EDT 2005
> Score: 1.0
> NextScore: 1.0
>
>
> Number of anchors: 1
> Anchors: Customer Success
> Outlinks Count: 0
>
> ==================================================
> URL: http://www.studyweb.com/
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 14 21:39:08 EDT 2005
> Score: 1.0
> NextScore: 1.0
>
>
> Number of anchors: 0
> Outlinks Count: 0
>
> ==================================================
> URL: http://www.webct.com/2005
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 14 21:39:08 EDT 2005
> Score: 1.0
> NextScore: 1.0
>
>
> Number of anchors: 1
> Anchors: WebCT Impact 2005
> Outlinks Count: 0
>
> ==================================================
> URL: http://www.webct.com/entrypage
> Number of OUT links: 0
> fetchInterval: 1
> nextFetch: Tue Jun 14

Re: Intranet crawl and re-fetch - newbie question

Daniel D.-2
Hi Piotr,

 Thanks for the information.

You are right, those URLs (generated with -refetchonly) are not being
fetched. In my bullet #4 I said that they were fetched because I was misled
by the presence of data files (even though they were very small and I didn't
check their content).

I'm trying to understand how to start with an initial set of URLs and continue
fetching new URLs and re-fetching existing URLs (when they are due for re-fetch).

I will also post the questions below to the nutch-dev list.

 
   1. I have set db.default.fetch.interval to 1 (in nutch-default.xml),
   but I have noticed that the fetchInterval field in the Page object is being
   set to current time + 7 days while the URL link data is being read from the
   fetchlist. Can somebody explain why, or am I not reading the code correctly?
   2. I have modified the code to ignore the fetchInterval value coming from the
   fetchlist, meaning that fetchInterval stays equal to the initial value -
   current time. After I run the following commands: fetch, db update and
   generate db segments, I get a new fetchlist, but this list doesn't include my
   original sites, even though their next fetch time should already be in the
   past. Can somebody help me understand when those URLs will be fetched?
   3. It looks like the fetcher fails to extract links from http://www.eltweb.com.
   I know that there are some formats (apparently some HTML variations too)
   that are not supported. Where can I find information on what is currently
   supported?
   4. Some of the out-links discovered during the fetch (for instance
   http://www.webct.com/software/viewpage?name=software_campus_edition or
   http://v.extreme-dm.com/?login=cguilfor ) are being ignored (not
   included in the next fetchlist after executing the [generate db segments]
   command). Is there a known reason for this? Is there some documentation
   describing the supported URL types?
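
To make the expectation behind questions 1 and 2 concrete, here is a minimal sketch of the scheduling rule I am assuming: after a fetch, a page's next fetch time is the fetch time plus its interval in days, and a page belongs in a new fetchlist once that time has passed. This is only an illustration of the expected behaviour, not Nutch code; PageRecord, mark_fetched and due_for_fetch are hypothetical names, and the real generator may apply other rules.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class PageRecord:
    """Hypothetical stand-in for a page entry; not Nutch's Page class."""
    url: str
    fetch_interval_days: int
    next_fetch: datetime


def mark_fetched(page, fetched_at):
    """Assumed rule: the next fetch is due one interval after this fetch."""
    page.next_fetch = fetched_at + timedelta(days=page.fetch_interval_days)


def due_for_fetch(pages, now):
    """Return the URLs whose next fetch time is not in the future."""
    return [p.url for p in pages if p.next_fetch <= now]


start = datetime(2005, 6, 7, 23, 0)
pages = [
    PageRecord("http://www.hypermail.org/", 1, start),  # interval of 1 day
    PageRecord("http://www.webct.com/", 7, start),      # interval of 7 days
]
for page in pages:
    mark_fetched(page, start)

one_day_later = due_for_fetch(pages, start + timedelta(days=1))
one_week_later = due_for_fetch(pages, start + timedelta(days=7))
print(one_day_later)
print(one_week_later)
```

Under this assumed rule, a page with an interval of 1 day is due again the following day, which is what I expected from my setting. The dumps quoted above instead show nextFetch moving from Jun 07 to Jun 14 with fetchInterval: 1, which matches the 7-day behaviour described in question 1.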

Thanks,

Daniel


On 6/8/05, Piotr Kosiorowski <[hidden email]> wrote:

>
> Hello Daniel,
> I raised -refetchonly question on nutch-dev list two days ago (subject:
> -refetchonly investigation). I have described my tests and code findings
> there. If you are interested you can check it there, but for me the most
> important part is Doug's answer, so I will cite it here:
> <cite>
> The original rationale for the "-refetchonly" option was to permit
> indexing of all of the urls known to the database, with anchor text,
> but without fetching them. Thus one can, e.g., provide an index of 10M
> urls while only actually fetching 1M urls. I have never actually used
> this feature myself. I don't know whether other folks have ever used
> it successfully, nor whether such a feature is in fact desired.
> </cite>
>
> I do not personally find such a feature useful, but maybe it is for
> somebody. I would like to add a feature that allows one to generate a
> fetchlist that would contain only urls that were already fetched (and
> for symmetry the opposite - urls that were never fetched) - but at the
> moment I am a bit busy with my personal life and work. I have it on
> my TODO list (I will get back to your questions then too).
> Regards
> Piotr
>
>
>
>