Nutch or Heritrix?

Svein Yngvar Willassen
Hello folks,

We are in the starting phase of a project, and we are currently wondering
whether Heritrix or Nutch is the better choice of crawler for us.

Our project:

Basically, we're going to set up Hadoop and crawl the web for images.
We will then run our own indexing software on the images stored in HDFS,
using the Map/Reduce facility in Hadoop. We will not use any indexing other
than our own.

Some particular questions:

- Which crawler will handle crawling for images best?
- Which crawler will best adapt to a distributed crawling system, in which we
  use many servers conducting crawling together?
- Which crawler is/will be under most active development?


Any views on this?


Best Regards,

Svein Willassen

Re: Nutch or Heritrix?

Otis Gospodnetic-2
Hello Svein,

Quick answers to your questions:
- Nutch does not include an image crawler, though some people started working on one a long time ago, and Archive.org is sponsoring this work/project.

- Nutch has a distributed fetcher. I am not sure about Heritrix.

- Nutch is being worked on, but not very aggressively at the moment. I think Heritrix development may be in a similar state.

I know of a company that is using a modified version of Nutch for image crawling.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
From: Svein Yngvar Willassen <[hidden email]>
To: [hidden email]
Sent: Saturday, April 5, 2008 3:35:26 PM
Subject: Nutch or Heritrix?
