I am developing an app that runs dockerized Nutch 1.x instances, receives
crawl requests via Celery, and indexes the results into Solr 6.6.0. The app
indexes images (using the protocol-selenium plugin to fetch dynamic content).
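For context, the worker side of the setup looks roughly like the sketch below. Everything here is an assumption about my wiring, not Nutch's API: the function names, the seed/crawl paths, and the timeout are illustrative; only the `bin/crawl -i <seed_dir> <crawl_dir> <num_rounds>` invocation is the standard Nutch 1.x helper. In the real app `run_crawl` is registered as a Celery task.

```python
# Hypothetical sketch of how the Celery worker shells out to Nutch.
# All names and values here are assumptions for illustration.
import subprocess


def build_crawl_command(seed_dir, crawl_dir, rounds):
    # Nutch 1.x helper script: bin/crawl -i <seed_dir> <crawl_dir> <num_rounds>
    # (-i indexes each round's results into the configured Solr endpoint)
    return ["bin/crawl", "-i", seed_dir, crawl_dir, str(rounds)]


def run_crawl(seed_dir, crawl_dir, rounds=3):
    # check=True turns a non-zero exit code into an exception, so a
    # silently dying Nutch does not leave Celery believing the task is
    # still active; timeout= guards against an indefinite hang.
    return subprocess.run(
        build_crawl_command(seed_dir, crawl_dir, rounds),
        check=True,
        timeout=6 * 3600,  # illustrative cap, not a recommendation
    )
```
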
However, while small crawl tasks are indexed properly, I have had no
success with a slightly bigger one: when I ask the dockerized app to crawl
a website that (after 3 iterations of the crawl script) needs to fetch
~5000 links, Nutch inside the Docker container simply stops working. The
last entries I see in hadoop.log come from the fetcher; there are no
exceptions, save for one that does not occur when I run the very same
crawl task (successfully) on the host machine.
The exception (pastebin with the full stack trace):
org.apache.commons.httpclient.NoHttpResponseException: The server
some.site.web failed to respond
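In case the timeouts themselves are part of the problem, these are the kinds of properties one might raise in conf/nutch-site.xml to make the fetcher more tolerant of slow responses from inside the container. The property names exist in nutch-default.xml; the values below are purely illustrative, not recommendations.

```xml
<!-- Illustrative overrides for conf/nutch-site.xml; values are examples only. -->
<property>
  <name>http.timeout</name>
  <value>30000</value> <!-- ms; default is 10000 -->
</property>
<property>
  <name>db.fetch.retry.max</name>
  <value>5</value> <!-- default is 3 -->
</property>
```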
I doubt that failing to fetch a couple of links would put Nutch into this
crashed-but-not-really state. I say "not really" because Celery still
reports the task as active, but htop and docker stats make it quite
obvious that Nutch has ceased doing anything productive. Let me restate
that this does not happen when I run the task outside of Docker.
Has anyone here stumbled upon anything similar, or does anyone have
experience running bigger crawls on dockerized Nutch?