Fetch not finishing everything in its list?

Rod Taylor-2
As you can see from the log below, the percentage complete stays very low
until it suddenly jumps to fully complete. This started happening with some
segments about a week ago. Others go through their full list of ~10,000
URLs. It appears to occur whether I use a generate.max.per.host
directive or leave it out. Plugins are as defined by default.

There are no errors logged at either the jobtracker or the tasktracker.
It happens whether I use a datanode/namenode configuration or the local
filesystem.

A full log for this task is attached.

051110 214542 task_m_8pwl0q  Parsing [http://www.nebrodibandb.it/chiesemonum.html] with [org.apache.nutch.parse.html.HtmlParser@106caf16]
051110 214543 task_m_8pwl0q  Parsing [http://www.nyc-architecture.com/SOH/SOH017.htm] with [org.apache.nutch.parse.html.HtmlParser@106caf16]
051110 214543 task_m_8pwl0q  Parsing [http://www.town.ocean-city.md.us/Recreation/Forms/CampRegistrationForm.html] with [org.apache.nutch.parse.html.HtmlParser@106caf16]
051110 214543 task_m_8pwl0q 0.0022044207% 470 pages, 71 errors, 9.4 pages/s, 781 kb/s,
051110 214544 task_m_8pwl0q 0.0022044207% 470 pages, 71 errors, 9.2 pages/s, 766 kb/s,
051110 214545 task_m_8pwl0q 0.0022044207% 470 pages, 71 errors, 9.0 pages/s, 751 kb/s,
051110 214546 task_m_8pwl0q 0.0022044207% 470 pages, 71 errors, 8.9 pages/s, 737 kb/s,
051110 214547 task_m_8pwl0q org.apache.nutch.protocol.RetryLater: Exceeded http.max.delays: retry later.
051110 214547 task_m_8pwl0q at org.apache.nutch.protocol.httpclient.Http.blockAddr(Http.java:133)
051110 214547 task_m_8pwl0q at org.apache.nutch.protocol.httpclient.Http.getProtocolOutput(Http.java:201)
051110 214547 task_m_8pwl0q at org.apache.nutch.protocol.httpclient.Http.getProtocolOutput(Http.java:182)
051110 214547 task_m_8pwl0q at org.apache.nutch.crawl.Fetcher$FetcherThread.run(Fetcher.java:114)
051110 214547 task_m_8pwl0q  fetch of http://www.thisisjersey.com/section/familynotices.html failed with: org.apache.nutch.protocol.RetryLater: Exceeded http.max.delays: retry later.
051110 214547 task_m_8pwl0q org.apache.nutch.protocol.RetryLater: Exceeded http.max.delays: retry later.
051110 214547 task_m_8pwl0q at org.apache.nutch.protocol.httpclient.Http.blockAddr(Http.java:133)
051110 214547 task_m_8pwl0q at org.apache.nutch.protocol.httpclient.Http.getProtocolOutput(Http.java:201)
051110 214547 task_m_8pwl0q at org.apache.nutch.protocol.httpclient.Http.getProtocolOutput(Http.java:182)
051110 214547 task_m_8pwl0q at org.apache.nutch.crawl.Fetcher$FetcherThread.run(Fetcher.java:114)
051110 214547 task_m_8pwl0q  fetch of http://www.thisisjersey.com/section/sale.html failed with: org.apache.nutch.protocol.RetryLater: Exceeded http.max.delays: retry later.
051110 214547 task_m_8pwl0q  Parsing [http://www.thisisjersey.com/itprofessionals/] with [org.apache.nutch.parse.html.HtmlParser@106caf16]
051110 214548 task_m_8pwl0q 0.0022044207% 471 pages, 73 errors, 8.7 pages/s, 727 kb/s,
051110 214549 task_m_8pwl0q 0.0022044207% 471 pages, 73 errors, 8.6 pages/s, 713 kb/s,
051110 214550 task_m_8pwl0q  Parsing [http://www.geocities.com/redzombies/] with [org.apache.nutch.parse.html.HtmlParser@106caf16]
051110 214550 task_m_8pwl0q 0.0022044207% 471 pages, 73 errors, 8.6 pages/s, 713 kb/s,
051110 214551 task_m_8pwl0q 0.0022044207% 472 pages, 73 errors, 8.3 pages/s, 689 kb/s,
051110 214551 task_m_8pwl0q  Parsing [http://www.communitytransport.com/events/2005/pdfs/brochure05.pdf] with [org.apache.nutch.parse.text.TextParser@5f0ab09f]
051110 214552 task_m_8pwl0q 0.0022044207% 473 pages, 73 errors, 8.2 pages/s, 680 kb/s,
051110 214552 task_m_8pwl0q 0.0022044207% 473 pages, 73 errors, 8.2 pages/s, 680 kb/s,
051110 214552 Task task_m_8pwl0q is done.
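
Both failures in the excerpt above hit the same host and carry the same RetryLater exception, so a per-host tally of the full task log may help show whether the missing URLs cluster on a few busy hosts. The following is only a minimal sketch in Python: the regular expressions are inferred from the line format in the excerpt, the function name `summarize` and the `SAMPLE` list are hypothetical, and the patterns may need adjusting for other Nutch versions.

```python
import re
from collections import Counter
from urllib.parse import urlparse

# Assumed line formats, inferred from the excerpt above (not from Nutch docs):
#   "... fetch of <url> failed with: <exception class>: <message>"
#   "... <progress>% <n> pages, <m> errors, ..."
FAILED = re.compile(r"fetch of (\S+) failed with: ([\w.$]+)")
STATUS = re.compile(r"(\d+) pages, (\d+) errors")


def summarize(lines):
    """Return (Counter of (host, exception) failures, last (pages, errors) seen)."""
    failures = Counter()
    last_status = None
    for line in lines:
        m = FAILED.search(line)
        if m:
            # Group failures by the host part of the failed URL.
            host = urlparse(m.group(1)).netloc
            failures[(host, m.group(2))] += 1
            continue
        m = STATUS.search(line)
        if m:
            # Remember the most recent page/error counters.
            last_status = (int(m.group(1)), int(m.group(2)))
    return failures, last_status


# A few lines copied from the excerpt, used here only as sample input.
SAMPLE = [
    "051110 214546 task_m_8pwl0q 0.0022044207% 470 pages, 71 errors, 8.9 pages/s, 737 kb/s,",
    "051110 214547 task_m_8pwl0q  fetch of http://www.thisisjersey.com/section/familynotices.html failed with: org.apache.nutch.protocol.RetryLater: Exceeded http.max.delays: retry later.",
    "051110 214547 task_m_8pwl0q  fetch of http://www.thisisjersey.com/section/sale.html failed with: org.apache.nutch.protocol.RetryLater: Exceeded http.max.delays: retry later.",
    "051110 214548 task_m_8pwl0q 0.0022044207% 471 pages, 73 errors, 8.7 pages/s, 727 kb/s,",
]

if __name__ == "__main__":
    failures, last_status = summarize(SAMPLE)
    for (host, exc), count in failures.most_common():
        print(host, exc, count)
    print("last status (pages, errors):", last_status)
```

To run it on a whole log, pass an open file to `summarize` instead of `SAMPLE`, for example the decompressed attachment (the local file name is up to you).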
--
Rod Taylor <[hidden email]>

Attachment: task.log.gz (29K)

Re: Fetch not finishing everything in its list?

Rod Taylor-2
On Thu, 2005-11-10 at 21:58 -0500, Rod Taylor wrote:
> As you can see from the log below, the percentage complete stays very low
> until it suddenly jumps to fully complete. This started happening with some
> segments about a week ago. Others go through their full list of ~10,000
> URLs. It appears to occur whether I use a generate.max.per.host
> directive or leave it out. Plugins are as defined by default.

As an added note, switching from httpclient to http seems to be enough
to get all of the segments to download completely.

--
Rod Taylor <[hidden email]>


Re: Fetch not finishing everything in its list?

Michael-49
In reply to this post by Rod Taylor-2
I'm running the latest mapred svn and have the same problem; switching to
httpclient helped.

RT> As you can see from the log below, the percentage complete stays very low
RT> until it suddenly jumps to fully complete. This started happening with some
RT> segments about a week ago. Others go through their full list of ~10,000
RT> URLs. It appears to occur whether I use a generate.max.per.host
RT> directive or leave it out. Plugins are as defined by default.

RT> There are no errors logged at either the jobtracker or the tasktracker.
RT> It happens whether I use a datanode/namenode configuration or the local
RT> filesystem.



Michael