errors with parsing and indexing

errors with parsing and indexing

Doğacan Güney
Hi,

After hadoop-0.9.1, parsing and indexing don't seem to work.
Parsing while fetching works fine, but running parse as a separate
job creates an essentially empty parse_data directory (it contains
index files but no data files). I am looking into this, but so far I
haven't been able to find the source of the error.
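
For reference, the failing case is the standalone parse job, roughly
as in the sketch below (a minimal sketch: the ParseSegment constructor
and the parse(Path) call are my reading of the current API, so treat
the exact signatures as assumptions):

import org.apache.hadoop.fs.Path;
import org.apache.nutch.parse.ParseSegment;
import org.apache.nutch.util.NutchConfiguration;

// Run parsing as its own MapReduce job over an already-fetched
// segment, instead of parsing inline during the fetch. This is the
// job that leaves parse_data with index files but no data files.
public class StandaloneParse {
  public static void main(String[] args) throws Exception {
    ParseSegment parser = new ParseSegment(NutchConfiguration.create());
    parser.parse(new Path(args[0]));  // e.g. crawl/segments/20061214123456
  }
}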

Also, indexing fails in Indexer.OutputFormat.getRecordWriter. The fs
parameter seems to be an instance of PhasedFileSystem, which throws
exceptions on delete and {start,complete}LocalOutput. The following
patch should fix it, but it may not be the best way of doing this.

Index: src/java/org/apache/nutch/indexer/Indexer.java
===================================================================
--- src/java/org/apache/nutch/indexer/Indexer.java    (revision 487240)
+++ src/java/org/apache/nutch/indexer/Indexer.java    (working copy)
@@ -94,11 +94,15 @@
      final Path temp =
        job.getLocalPath("index/_"+Integer.toString(new Random().nextInt()));
 
-      fs.delete(perm);                            // delete old, if any
-
+      final FileSystem dfs = FileSystem.get(job);
+    
+      if (dfs.exists(perm)) {
+        dfs.delete(perm);                            // delete old, if any
+      }
+    
       final AnalyzerFactory factory = new AnalyzerFactory(job);
       final IndexWriter writer =                  // build locally first
-        new IndexWriter(fs.startLocalOutput(perm, temp).toString(),
+        new IndexWriter(dfs.startLocalOutput(perm, temp).toString(),
                         new NutchDocumentAnalyzer(job), true);
 
       writer.setMergeFactor(job.getInt("indexer.mergeFactor", 10));
@@ -146,7 +150,7 @@
               // optimize & close index
               writer.optimize();
               writer.close();
-              fs.completeLocalOutput(perm, temp);   // copy to dfs
+              dfs.completeLocalOutput(perm, temp);
               fs.createNewFile(new Path(perm, DONE_NAME));
             } finally {
               closed = true;
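
The idea is simply to resolve the real DFS from the job configuration
and run the delete/local-output steps against that, instead of against
the PhasedFileSystem wrapper handed to getRecordWriter. In isolation
it looks like the sketch below (illustration only; the class and
method names are made up, while the FileSystem calls are the same
ones the patch uses):

import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;

// Illustration of the workaround: build output locally, then promote
// it to DFS, always through the real filesystem.
public class LocalBuildThenPromote {
  public static void promote(JobConf job, Path perm, Path temp)
      throws IOException {
    final FileSystem dfs = FileSystem.get(job); // real DFS, not the wrapper
    if (dfs.exists(perm)) {
      dfs.delete(perm);                         // delete old output, if any
    }
    Path local = dfs.startLocalOutput(perm, temp); // where to build locally
    // ... write the index files under 'local' ...
    dfs.completeLocalOutput(perm, temp);        // copy finished index to DFS
  }
}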

Re: errors with parsing and indexing

Doğacan Güney
Doğacan Güney wrote:

> Hi,
>
> After hadoop-0.9.1, parsing and indexing don't seem to work.
> [...]
>
Sorry about the patch, it got garbled somehow. I am attaching it; I
hope the mailing list doesn't drop attachments.


Index: src/java/org/apache/nutch/indexer/Indexer.java
===================================================================
--- src/java/org/apache/nutch/indexer/Indexer.java (revision 487240)
+++ src/java/org/apache/nutch/indexer/Indexer.java (working copy)
@@ -94,11 +94,15 @@
       final Path temp =
         job.getLocalPath("index/_"+Integer.toString(new Random().nextInt()));
 
-      fs.delete(perm);                            // delete old, if any
-
+      final FileSystem dfs = FileSystem.get(job);
+      
+      if (dfs.exists(perm)) {
+        dfs.delete(perm);                            // delete old, if any
+      }
+      
       final AnalyzerFactory factory = new AnalyzerFactory(job);
       final IndexWriter writer =                  // build locally first
-        new IndexWriter(fs.startLocalOutput(perm, temp).toString(),
+        new IndexWriter(dfs.startLocalOutput(perm, temp).toString(),
                         new NutchDocumentAnalyzer(job), true);
 
       writer.setMergeFactor(job.getInt("indexer.mergeFactor", 10));
@@ -146,7 +150,7 @@
               // optimize & close index
               writer.optimize();
               writer.close();
-              fs.completeLocalOutput(perm, temp);   // copy to dfs
+              dfs.completeLocalOutput(perm, temp);
               fs.createNewFile(new Path(perm, DONE_NAME));
             } finally {
               closed = true;

Re: errors with parsing and indexing

Zaheed Haque
Hi:

Please attach the patch to a JIRA issue; my mail account gives me
trouble with attachments.

Kind regards
Zaheed

On 12/14/06, Doğacan Güney <[hidden email]> wrote:

> Doğacan Güney wrote:
> > Hi,
> >
> > After hadoop-0.9.1, parsing and indexing don't seem to work.
> > [...]
>
> Sorry about the patch, it got garbled somehow. I am attaching it; I
> hope the mailing list doesn't drop attachments.