I want to ask a general question about Nutch.
Nutch performs a crawl in distinct, stepwise phases: in the normal
procedure you first run the generate step, then the fetch step, and
then the updatedb step. (I use the word 'depth' here to mean one
round of this three-step procedure.) But when other concepts, such as
re-visiting of sites, are added to Nutch, the notion of depth not
only becomes meaningless, it actually gets in the way of implementing
a re-visit policy, because you cannot re-visit any site until the
current fetch step is over, and a fetch step can last forever.
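For concreteness, one such round looks roughly like this with the
Nutch 1.x command-line tools (the crawl/crawldb and crawl/segments
paths are just example locations, and depending on configuration a
separate parse step may also sit between fetch and updatedb):

```shell
# One crawl round ("depth") as a sequence of distinct steps.
# Paths below are examples, not anything Nutch mandates.

# 1. generate: select a fetch list from the crawldb into a new segment
bin/nutch generate crawl/crawldb crawl/segments

# pick up the segment directory that generate just created
SEGMENT=$(ls -d crawl/segments/* | tail -1)

# 2. fetch: download the pages listed in that segment
bin/nutch fetch "$SEGMENT"

# 3. updatedb: fold the fetch results back into the crawldb
bin/nutch updatedb crawl/crawldb "$SEGMENT"
```

The point of the question is that nothing can re-enter the crawldb
until step 3 of the current round completes, so a long-running fetch
blocks any re-visit scheduling.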
So what is the rationale behind this design? Is it really that
important to keep the steps distinct?