IOException: not a file with invertlinks/index

Ben Ogle
Hi all, I am having problems recrawling our intranet. Something in the recrawl script (is it invertlinks?) creates a crawldir\linkdb\current\linkdb-merge-<number> folder, which has a part-00000 folder under it. When the indexer is invoked, it looks for crawldir\linkdb\current\linkdb-merge-<number>\data, but that file doesn't exist there because it's in the part-00000 directory. How do I get the indexer to look in the part-00000 dir? Is it a configuration error?

I am running a Python port of the recrawl script on a Windows 2000 machine without Cygwin; the crawldir and Nutch 0.8 are on a Windows 2003 server that I have very limited access to. Here's what hadoop.log says about it:

2006-09-07 13:02:39,696 INFO  indexer.Indexer - Indexer: starting
2006-09-07 13:02:39,696 INFO  indexer.Indexer - Indexer: linkdb: F:/nutch-0.8/intranet-crawl/linkdb
2006-09-07 13:02:40,696 INFO  indexer.Indexer - Indexer: adding segment: F:/nutch-0.8/intranet-crawl/segments/20060907130151
2006-09-07 13:02:50,804 WARN  mapred.LocalJobRunner - job_fn20sr
java.io.IOException: Not a file: F:/nutch-0.8/intranet-crawl/linkdb/current/linkdb-merge-216906667/data
        at org.apache.hadoop.mapred.InputFormatBase.getSplits(InputFormatBase.java:121)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:80)

If I move the contents of linkdb-merge-216906667/part-00000 to linkdb-merge-216906667, indexing works OK (well, it won't delete _0.f0, but that's another issue).
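
For anyone who wants to script that manual workaround, here is a minimal Python sketch of the move. It only automates the step described above and does not address whatever causes the nested layout in the first place. The function name flatten_part_dir and its part_name parameter are made up for illustration and are not part of Nutch or Hadoop. The post does not say whether the emptied part-00000 folder was deleted afterwards, so the sketch leaves it in place.

```python
import shutil
from pathlib import Path


def flatten_part_dir(merge_dir, part_name="part-00000"):
    """Move every entry in merge_dir/part_name up into merge_dir.

    Hypothetical helper, not a Nutch API. Returns the names of the
    entries that were moved, in sorted order. The (now empty)
    part directory is left in place.
    """
    merge_dir = Path(merge_dir)
    part_dir = merge_dir / part_name
    if not part_dir.is_dir():
        return []  # nothing to flatten

    moved = []
    for entry in sorted(part_dir.iterdir()):
        target = merge_dir / entry.name
        if target.exists():
            # Do not silently overwrite files that are already there.
            raise FileExistsError(f"refusing to overwrite {target}")
        shutil.move(str(entry), str(target))
        moved.append(entry.name)
    return moved
```

With the paths from the log above, the call would look like flatten_part_dir("F:/nutch-0.8/intranet-crawl/linkdb/current/linkdb-merge-216906667"), run after invertlinks and before the index step.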

The same thing happens when this linkdb-merge-* directory exists already and I run invertlinks.
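
To see whether such a directory is already sitting there before invertlinks runs, a recrawl script could list them first. This is a rough sketch under the directory layout described in this post (linkdb/current/linkdb-merge-<number>); find_merge_dirs is a hypothetical name, and what to do with the directories it finds is left open.

```python
from pathlib import Path


def find_merge_dirs(linkdb_dir):
    """Return the linkdb-merge-* directories under linkdb_dir/current.

    Hypothetical helper, not a Nutch API. Returns a sorted list of
    Path objects, or an empty list if 'current' does not exist.
    """
    current = Path(linkdb_dir) / "current"
    if not current.is_dir():
        return []
    # Only directories count; stray files with a matching name are skipped.
    return sorted(p for p in current.glob("linkdb-merge-*") if p.is_dir())
```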

What am I doing wrong? I haven't been able to find anyone else with these issues, so I must be doing something wrong.

Ben

Re: IOException: not a file with invertlinks/index

maximus1
Hey Ben,

Did you find a solution? I'm having the same problem with Cygwin and nutch-0.9.

Thanks mate
Cornelius

