Preventing overlapped search results.

Brian Hill-3

I'm new to Nutch, but I couldn't find this in the archives or docs and
it has me stumped.

I have two websites that I need to index in Nutch. I am presently
running two separate crawls to index these sites, but a single link is
screwing up my search results.

I have two flat files in my Nutch directory, "Domain1" and "Domain2".
Each of these files contains the appropriate starting URL for each of
the two sites, and the two crawls generate completely separate database
folders, which are in turn called by two independent Nutch frontend
installations in Tomcat.

My problem is with the crawl-urlfilter.txt file. Because this is a local
search, I need to limit the domains and the file contains these lines:

# accept hosts in MY.DOMAIN.NAME

This would work perfectly EXCEPT that there is a single link on the site to the homepage of the site. Nutch is
following this link, and as a result the domain1 search results are
bringing up the full AND sites.
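
To make the situation concrete, here is a rough sketch of how a filter file of this kind behaves. It assumes the usual Nutch regex URL filter semantics: each rule is a regex prefixed with + (accept) or - (reject), rules are tried top to bottom, and the first match decides. The host names domain1.example and domain2.example are placeholders, not the real sites, and parse_rules/accepts are hypothetical helpers written for illustration, not Nutch code.

```python
import re

def parse_rules(text):
    """Parse '+regex' / '-regex' lines; skip blank lines and '#' comments."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # True for an accept rule, False for anything else (treated as reject)
        rules.append((line[0] == "+", re.compile(line[1:])))
    return rules

def accepts(rules, url):
    """First matching rule decides; a URL matching no rule is rejected."""
    for allow, pattern in rules:
        if pattern.search(url):
            return allow
    return False

# One filter file shared by both crawls (placeholder hosts, an assumption).
SHARED = r"""
# accept hosts in either site
+^http://([a-z0-9]*\.)*domain1\.example/
+^http://([a-z0-9]*\.)*domain2\.example/
# skip everything else
-.
"""

# A filter that only the Domain1 crawl would use (also an assumption).
DOMAIN1_ONLY = r"""
+^http://([a-z0-9]*\.)*domain1\.example/
-.
"""

link = "http://www.domain2.example/"  # the single cross-site link
print(accepts(parse_rules(SHARED), link))        # True
print(accepts(parse_rules(DOMAIN1_ONLY), link))  # False
```

With the shared rules, the cross-site link passes the filter, so a crawl seeded from one site can wander into the other. With a filter that lists only one site's hosts, the same link is rejected.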

What's the best way to deal with this problem? When I run the Domain1
Nutch search, I need the results to be limited to the,, and websites. Likewise,
if I add a reciprocal link to, I need users of THAT search
interface to receive results only relevant to that domain.

PLEASE don't tell me I need two independent Nutch installations! Your
help is appreciated.

Brian Hill