Nutch searcher keeps reading CVS directories

Nutch searcher keeps reading CVS directories

afan0804
Hi All,

My problem occurs when this code is called:
Summary[] summaries = nbean.getSummary(details, query);
where nbean is a NutchBean, query is a Query object, and details is a HitDetails[] array.
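
For context, the surrounding code looks roughly like this (a minimal sketch assuming the standard NutchBean search flow; the crawl path and query string here are placeholders):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.nutch.searcher.*;
    import org.apache.nutch.util.NutchConfiguration;

    public class SearchSketch {
      public static void main(String[] args) throws Exception {
        Configuration conf = NutchConfiguration.create();
        // NutchBean locates the crawl folder via the searcher.dir property
        conf.set("searcher.dir", "C:/crawl");             // placeholder path
        NutchBean nbean = new NutchBean(conf);

        Query query = Query.parse("apache", conf);        // placeholder query
        Hits hits = nbean.search(query, 10);              // top 10 hits

        Hit[] shown = hits.getHits(0, hits.getLength());
        HitDetails[] details = nbean.getDetails(shown);

        // This is the call that triggers the segment reads that fail below
        Summary[] summaries = nbean.getSummary(details, query);
      }
    }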

I get this message:
[9/5/08 16:37:07:203 MDT] 00000034 SystemErr     R 08/09/05 16:37:07 FATAL searcher.FetchedSegments: java.io.FileNotFoundException: C:/[path to crawl folder]/segments/20080828123423/parse_text/CVS/data

Since this code is checked into CVS, each directory level contains an auto-generated CVS directory.  My guess is that Nutch is reading those CVS directories as part of the segment and looking for the "data" file, which does not exist in a CVS directory.

I wish to ignore those CVS directories instead of removing them (since they are needed by CVS).

It seems that the path to the segment sub-directory is processed in:
org.apache.nutch.searcher.FetchedSegments
    private MapFile.Reader[] getReaders(String subDir) throws IOException {
      return MapFileOutputFormat.getReaders(fs, new Path(segmentDir, subDir), this.conf);
    }

I have tried passing in C:/[path to crawl folder]/segments/20080828123423/parse_text/part-00000, but then the error becomes
[9/5/08 14:34:08:453 MDT] 0000002a SystemErr     R 08/09/05 14:34:08 FATAL searcher.FetchedSegments: java.io.FileNotFoundException: C:/[path to crawl folder]/segments/20080828123423/parse_text/part-00000/CVS/data

Any ideas?  Is it possible to get Hadoop to ignore directories named "CVS"?  Or is there a way I can point directly to the data file?

Thank you very much,
Angela Fan

Re: Nutch searcher keeps reading CVS directories

Dennis Kubes-2
It looks like your segment data is in CVS as well?  Is that what you
really want?  Maybe so; I guess it depends on the project.  The error,
though, is a tricky one, as you would have to change Hadoop code,
specifically the MapFileOutputFormat.getReaders method, to use
listStatus(ArrayList<FileStatus> results, Path f, PathFilter filter)
instead of the current fs.listStatus(dir).  So it is doable but difficult.
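
To make that concrete, here is a rough sketch of a patched copy of
getReaders that hands a PathFilter to fs.listStatus so that directories
named CVS are skipped when the part files are enumerated (this assumes
Hadoop 0.18-era APIs; the helper class name is made up, and the
internals of your Hadoop version may differ):

    import java.io.IOException;
    import java.util.Arrays;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.*;
    import org.apache.hadoop.io.MapFile;

    public class CvsSafeReaders {

      // Accept every path except entries named "CVS"
      private static final PathFilter NO_CVS = new PathFilter() {
        public boolean accept(Path path) {
          return !"CVS".equals(path.getName());
        }
      };

      // Patched copy of MapFileOutputFormat.getReaders with the filter applied
      public static MapFile.Reader[] getReaders(FileSystem fs, Path dir,
          Configuration conf) throws IOException {
        Path[] names = FileUtil.stat2Paths(fs.listStatus(dir, NO_CVS));
        Arrays.sort(names);  // keep the part files in partition order
        MapFile.Reader[] parts = new MapFile.Reader[names.length];
        for (int i = 0; i < names.length; i++) {
          parts[i] = new MapFile.Reader(fs, names[i].toString(), conf);
        }
        return parts;
      }
    }

FetchedSegments.getReaders could then call this helper instead of
MapFileOutputFormat.getReaders, which would avoid having to patch
Hadoop itself.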

Dennis

Re: Nutch searcher keeps reading CVS directories

afan0804
Alright, I shall try that.  Thank you very much for your help!

Angela