Re: nutch trunk filtering URLs in invertlinks even if -noFilter is on?



Brian Whitman
(Copied from nutch-user; this is more of a dev topic now.)
> It's not an issue with readseg or readlinkdb themselves, because a  
> segment fetched in the older nutch (using the exact same  
> configuration) expels png links in trunk's readlinkdb. It appears  
> the fetcher now only parses URLs that pass the filters into the  
> segment.


I diffed ParseOutputFormat between my old version (mid-December 06) and
trunk. It appears that the parse step now puts the outlink URLs through
the URLFilters before they are written to the segment. I confirmed this
by taking .png out of my URLFilters and re-running a crawl: pngs now
appear in the readlinkdb output.

1) Was it a bug that URLs that would not pass URLFilters got into the  
linkdb for analysis?

2) If so, why is there a -noFilter option for readlinkdb? The linkdb  
has already been filtered whether you like it or not. -noFilter will  
never have any effect.

There needs to be a way to have the linkdb reflect all URLs
(unfiltered) for further analysis. I suggest a -noFilterOutlinks
option (default off) for the fetch command, since fetch parses by
default. If my theory is correct, this would simply skip the filter
call in ParseOutputFormat.
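
To make the proposal concrete, here is a rough sketch of the behavior I have in mind. This is not actual Nutch code: the class name, the selectOutlinks method, the filterOutlinks flag, and the REJECT_PNG predicate are all made up for illustration, and the predicate only stands in for whatever the configured URLFilters would decide.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

public class OutlinkFilterSketch {

    // Hypothetical stand-in for the URL filter chain: true means the URL is accepted.
    static final Predicate<String> REJECT_PNG =
            url -> !url.toLowerCase().endsWith(".png");

    // Returns the outlinks that should be written out.
    // When filterOutlinks is false (the proposed -noFilterOutlinks case),
    // the filter is never consulted and every outlink is kept.
    static List<String> selectOutlinks(List<String> outlinks,
                                       Predicate<String> filter,
                                       boolean filterOutlinks) {
        List<String> kept = new ArrayList<>();
        for (String url : outlinks) {
            if (!filterOutlinks || filter.test(url)) {
                kept.add(url);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> outlinks = Arrays.asList(
                "http://example.com/a.html",
                "http://example.com/logo.png");

        // Filtering on: the .png outlink is dropped.
        System.out.println(selectOutlinks(outlinks, REJECT_PNG, true));
        // Filtering off: both outlinks are kept.
        System.out.println(selectOutlinks(outlinks, REJECT_PNG, false));
    }
}
```

The point is only that the filter call becomes conditional on one flag, so the default behavior stays as it is in trunk today.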





Re: nutch trunk filtering URLs in invertlinks even if -noFilter is on?

Brian Whitman

On Sep 23, 2007, at 11:38 AM, Brian Whitman wrote:
>
> 2) If so, why is there a -noFilter option for readlinkdb?
>

My mistake; that should read:

> 2) If so, why is there a -noFilter option for invertlinks?