Re: nutch trunk filtering URLs in invertlinks even if -noFilter is on?
(Copied from nutch-user, this is more a dev topic now)
> It's not an issue with readseg or readlinkdb themselves, because a
> segment fetched in the older nutch (using the exact same
> configuration) shows png links in trunk's readlinkdb. It appears
> the fetcher now only parses URLs that pass the filters into the
I checked the diffs between my old version (mid-December 06) and trunk's
ParseOutputFormat. It appears that parsing now puts the outlink
URLs through the URLFilters. I confirmed this by taking .png out of
my URLFilters and re-running a crawl -- pngs now appear in the
1) Was it a bug that URLs that would not pass URLFilters got into the
linkdb for analysis?
2) If so, why is there a -noFilter option for invertlinks? The outlinks
have already been filtered at parse time whether you like it or not, so
-noFilter can never have any effect.
There needs to be a way to have the linkdb reflect all URLs
(unfiltered) for further analysis. I suggest a -noFilterOutlinks
option (default off) for the fetch command, since fetch parses by
default. It would simply skip the filter call in ParseOutputFormat,
if my theory is correct.
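
To make the proposal concrete, here is a rough standalone sketch of the behavior I have in mind. It is not the actual ParseOutputFormat code: the class name, the collectOutlinks method, the noFilterOutlinks parameter and the REJECT_PNG filter are all made up for illustration, and the only thing borrowed from Nutch is the convention that a filter returns null for a rejected URL.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.UnaryOperator;

public class OutlinkFilterSketch {

    // Hypothetical stand-in for the URL filter chain:
    // returns the URL if it is accepted, null if it is rejected.
    static final UnaryOperator<String> REJECT_PNG =
        url -> url.toLowerCase().endsWith(".png") ? null : url;

    // Collects the outlinks that would be written out at parse time.
    // When noFilterOutlinks is true (the proposed flag), the filter is
    // skipped entirely and every outlink is kept.
    static List<String> collectOutlinks(List<String> outlinks,
                                        UnaryOperator<String> filter,
                                        boolean noFilterOutlinks) {
        List<String> kept = new ArrayList<>();
        for (String url : outlinks) {
            String result = noFilterOutlinks ? url : filter.apply(url);
            if (result != null) {
                kept.add(result);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> links = Arrays.asList(
            "http://example.com/a.html",
            "http://example.com/logo.png");

        // Default behavior: the png outlink is dropped.
        System.out.println(collectOutlinks(links, REJECT_PNG, false));
        // With the proposed flag: both outlinks survive.
        System.out.println(collectOutlinks(links, REJECT_PNG, true));
    }
}
```

With the flag off this matches what trunk appears to do today; with it on, the unfiltered outlinks would reach the segment and -noFilter in invertlinks would become meaningful again.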