[jira] Created: (NUTCH-159) Specify temp/working directory for crawl

JIRA jira@apache.org
Specify temp/working directory for crawl
----------------------------------------

         Key: NUTCH-159
         URL: http://issues.apache.org/jira/browse/NUTCH-159
     Project: Nutch
        Type: Bug
  Components: fetcher, indexer  
    Versions: 0.8-dev    
 Environment: Linux/Debian
    Reporter: byron miller


I ran a crawl of 100k web pages and got:

org.apache.nutch.fs.FSError: java.io.IOException: No space left on device
        at org.apache.nutch.fs.LocalFileSystem$LocalNFSFileOutputStream.write(LocalFileSystem.java:149)
        at org.apache.nutch.fs.FileUtil.copyContents(FileUtil.java:65)
        at org.apache.nutch.fs.LocalFileSystem.renameRaw(LocalFileSystem.java:178)
        at org.apache.nutch.fs.NutchFileSystem.rename(NutchFileSystem.java:224)
        at org.apache.nutch.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:80)
Caused by: java.io.IOException: No space left on device
        at java.io.FileOutputStream.writeBytes(Native Method)
        at java.io.FileOutputStream.write(FileOutputStream.java:260)
        at org.apache.nutch.fs.LocalFileSystem$LocalNFSFileOutputStream.write(LocalFileSystem.java:147)
        ... 4 more
Exception in thread "main" java.io.IOException: Job failed!
        at org.apache.nutch.mapred.JobClient.runJob(JobClient.java:308)
        at org.apache.nutch.crawl.Fetcher.fetch(Fetcher.java:335)
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:107)
byron@db02:/data/nutch$ df -k


It appears the crawl created a /tmp/nutch directory that filled up, even though I specified a db directory.

We need either a command-line parameter or a globally configurable temp (work area) directory for the Nutch instance, so that crawls won't fail.


--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira

[jira] Commented: (NUTCH-159) Specify temp/working directory for crawl

JIRA jira@apache.org
    [ http://issues.apache.org/jira/browse/NUTCH-159?page=comments#action_12361541 ]

Doug Cutting commented on NUTCH-159:
------------------------------------

mapred.local.dir is the thing to set. If that fails, then there is a bug. What did you have this set to?
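
For illustration, an override of this property might look like the sketch below. The path /data/nutch/mapred/local is a hypothetical example (any local directory on a volume with enough free space), and placing the override in conf/nutch-site.xml is an assumption about where site-specific settings live in this version.

```xml
<!-- Hypothetical override: the path below is an example, not a recommended value. -->
<property>
  <name>mapred.local.dir</name>
  <value>/data/nutch/mapred/local</value>
  <description>Local directory where MapReduce keeps its intermediate
  data files; point it at a volume with enough free space.
  </description>
</property>
```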

[jira] Commented: (NUTCH-159) Specify temp/working directory for crawl

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org
    [ http://issues.apache.org/jira/browse/NUTCH-159?page=comments#action_12361545 ]

byron miller commented on NUTCH-159:
------------------------------------

While it's from the mapred trunk, it is a non-NDFS, local-only instance. mapred.temp.dir was left at its default (which didn't exist):


<property>
  <name>mapred.temp.dir</name>
  <value>/tmp/nutch/mapred/temp</value>
  <description>A shared directory for temporary files.
  </description>
</property>

I'm going to modify this and re-run my fetch and let you know how that works.  


[jira] Commented: (NUTCH-159) Specify temp/working directory for crawl

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org
    [ http://issues.apache.org/jira/browse/NUTCH-159?page=comments#action_12362392 ]

Paul Baclace commented on NUTCH-159:
------------------------------------

mapred.temp.dir and mapred.local.dir are used for different purposes.

I think this is a sysadmin usability bug that really means:

1. Defaults for these settings should be documented (of course).
2. It should be clear whether a path is abstract (applies to NDFS or local FS depending on fs.default.name), local FS only, or NDFS-only (if any). Config attribute names should consistently indicate this.
3. There should be some clues as to how much space might be needed (some of this is in transition, however).
4. When the space is exhausted, the error message should indicate the path(s) in question and the config param that is used to specify it.

Separately, I am preparing a patch that will do (4) for mapred.local.dir.
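
The last point in the list above (naming the path and the config parameter in the error) could be sketched roughly as follows. This is not the patch mentioned here; the class `DiskSpaceErrors` and the method `describe` are hypothetical names used only for illustration, and the sketch relies solely on standard `java.io` calls.

```java
import java.io.File;
import java.io.IOException;

// Hypothetical helper, not part of Nutch: wraps a low-level I/O failure
// with the directory involved and the config parameter that selects it.
class DiskSpaceErrors {

    static IOException describe(IOException cause, File dir, String configParam) {
        // getUsableSpace() returns 0 if the path does not name a partition.
        String message = "Write failed under " + dir.getAbsolutePath()
                + " (configured by " + configParam
                + ", usable bytes: " + dir.getUsableSpace() + "): "
                + cause.getMessage();
        // Keep the original exception as the cause so the stack trace survives.
        return new IOException(message, cause);
    }
}
```

A caller that catches the "No space left on device" exception while writing under the directory named by mapred.local.dir could rethrow `DiskSpaceErrors.describe(e, dir, "mapred.local.dir")`, so the log names both the full path and the setting to change.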

