After playing around to figure out the best place to resolve the IPs of
freshly discovered URLs, I agree with Andrzej that
ParseOutputFormat isn't the best place.
The problem is that ParseOutputFormat is not multithreaded, and we
definitely need many threads for IP lookup.
I think an IP-resolving MapRunnable that preprocesses the segment
data (after fetching, before the crawldb update) may be a better place:
+ less data to process (as opposed to processing a complete crawldb)
+ good DNS cache usage, since many new URLs will point to the same
host (internal links)
- we may look up URLs we already have in the crawldb.
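
To make the threading and caching points concrete, here is a rough sketch in plain Java. It is not Nutch or Hadoop code and is not wired into a MapRunnable. The class name HostResolverSketch, the resolveAll method, and the pluggable lookup function are all made up for illustration. It reduces the URLs to distinct hosts first, so internal links cost one lookup per host, and then resolves those hosts on a fixed-size thread pool.

```java
import java.net.InetAddress;
import java.net.URI;
import java.net.UnknownHostException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;

// Hypothetical helper, not part of Nutch: resolves the hosts of a batch of URLs
// with several threads and remembers the result per host.
class HostResolverSketch {
    private final ConcurrentMap<String, String> cache = new ConcurrentHashMap<>();
    private final Function<String, String> lookup;
    private final int threads;

    HostResolverSketch(int threads, Function<String, String> lookup) {
        this.threads = threads;
        this.lookup = lookup;
    }

    // Real DNS lookup; returns an empty string when the host cannot be resolved.
    static String dnsLookup(String host) {
        try {
            return InetAddress.getByName(host).getHostAddress();
        } catch (UnknownHostException e) {
            return "";
        }
    }

    // Extracts the host part of a URL, or null if the URL cannot be parsed.
    static String hostOf(String url) {
        try {
            return URI.create(url).getHost();
        } catch (IllegalArgumentException e) {
            return null;
        }
    }

    // Returns url -> ip for every URL that has a parseable host.
    Map<String, String> resolveAll(List<String> urls) {
        // Collapse the URLs to distinct hosts so each host is looked up once.
        Set<String> hosts = new LinkedHashSet<>();
        for (String url : urls) {
            String host = hostOf(url);
            if (host != null) {
                hosts.add(host);
            }
        }

        List<Callable<Void>> tasks = new ArrayList<>();
        for (String host : hosts) {
            tasks.add(() -> {
                // Skip hosts already resolved by an earlier batch.
                if (!cache.containsKey(host)) {
                    cache.put(host, lookup.apply(host));
                }
                return null;
            });
        }

        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            pool.invokeAll(tasks); // blocks until every lookup has finished
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            pool.shutdown();
        }

        Map<String, String> result = new LinkedHashMap<>();
        for (String url : urls) {
            String host = hostOf(url);
            if (host != null) {
                result.put(url, cache.get(host));
            }
        }
        return result;
    }
}
```

With real DNS this would be used as new HostResolverSketch(50, HostResolverSketch::dnsLookup).resolveAll(urls); the thread count of 50 is an arbitrary example value. The sketch does nothing about the minus point above: URLs that are already in the crawldb would still be looked up unless they are filtered out beforehand.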
I'm new to this mailing list and to Nutch. I have read a lot
about Nutch, and I can already build an index and run some queries. However, I
only get results from HTML files. I've tried to index MS Word and PDF documents,
but I can only build the index; I have problems with the search. For searching
I'm using the application that comes with Nutch. Has anyone had the same problem?
Please excuse my bad English. I'm Brazilian... :)