[jira] Created: (NUTCH-148) org.apache.nutch.tools.CrawlTool throws error while doing deleteduplicates


[jira] Created: (NUTCH-148) org.apache.nutch.tools.CrawlTool throws error while doing deleteduplicates

JIRA jira@apache.org
org.apache.nutch.tools.CrawlTool throws error while doing deleteduplicates
--------------------------------------------------------------------------

         Key: NUTCH-148
         URL: http://issues.apache.org/jira/browse/NUTCH-148
     Project: Nutch
        Type: Bug
  Components: indexer  
    Versions: 0.8-dev    
 Environment: Windows XP Home
    Reporter: raghavendra prabhu


I get the following error while running org.apache.nutch.tools.CrawlTool.

The error actually occurs in the deleteduplicates step:

051223 001121 Reading url hashes...
051223 001121 Sorting url hashes...
051223 001121 Deleting url duplicates...
051223 001121 Error moving bad file G:\apache-tomcat-5.5.12\webapps\crux\WEB-INF\classes\ddup-workingdir\ddup-20051223001121: java.io.IOException: CreateProcess: df -k  G:\apache-tomcat-5.5.12\webapps\crux\WEB-INF\classes\ddup-workingdir\ddup-20051223001121 error=2

The error is thrown in NFSDataInputStream.java. The exception is:

org.apache.nutch.fs.ChecksumException: Checksum error: G:\apache-tomcat-5.5.12\webapps\crux\WEB-INF\classes\ddup-workingdir\ddup-20051223001121 at 0

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira


[jira] Commented: (NUTCH-148) org.apache.nutch.tools.CrawlTool throws error while doing deleteduplicates

JIRA jira@apache.org
    [ http://issues.apache.org/jira/browse/NUTCH-148?page=comments#action_12361128 ]

Piotr Kosiorowski commented on NUTCH-148:
-----------------------------------------

Do you have Cygwin installed?
Is 'df' working in your Cygwin installation?
Do you run the crawl from a Cygwin shell?

Nutch requires Cygwin on Windows.
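
The `CreateProcess ... error=2` in the report is the Windows "file not found" error code, which is consistent with the JVM being unable to find a `df` executable on its PATH. A minimal sketch of how one might check this from Java follows. The class and method names (`DfCheck`, `canLaunch`) are hypothetical and are not part of Nutch; the sketch only uses the standard `ProcessBuilder` API.

```java
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper, not part of Nutch: checks whether a command can be
// started from this JVM, i.e. whether it is visible on the JVM's PATH.
final class DfCheck {

    /** Returns true if the command could be started, false if launching it failed. */
    static boolean canLaunch(String... command) {
        try {
            Process process = new ProcessBuilder(command)
                    .redirectErrorStream(true) // merge stderr into stdout
                    .start();                  // throws IOException if the program is not found
            // Drain the output so the child cannot block on a full pipe.
            try (InputStream out = process.getInputStream()) {
                while (out.read() != -1) {
                    // discard
                }
            }
            process.waitFor();
            return true;
        } catch (IOException e) {
            // This is the kind of failure shown in the report (CreateProcess ... error=2).
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("df can be launched: " + canLaunch("df", "-k", "."));
    }
}
```

If this prints `false` when started the same way as the crawl (for example from Tomcat or a plain Windows command prompt), the Cygwin `bin` directory is probably not on the PATH of that process.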



[jira] Commented: (NUTCH-148) org.apache.nutch.tools.CrawlTool throws error while doing deleteduplicates

JIRA jira@apache.org
    [ http://issues.apache.org/jira/browse/NUTCH-148?page=comments#action_12361197 ]

raghavendra prabhu commented on NUTCH-148:
------------------------------------------

Does nutch-0.8-dev require Cygwin?

Until now I had been using nutch-0.7.1.

I have also raised another bug: org.apache.nutch.crawl.Crawl runs in a loop.

Is that also because of Cygwin?

Can you please confirm?

Questions:
1) Does nutch-0.8-dev have a dependency on Cygwin?
2) Was this dependency there in nutch-0.7?

Thanks for responding so quickly.



[jira] Commented: (NUTCH-148) org.apache.nutch.tools.CrawlTool throws error while doing deleteduplicates

JIRA jira@apache.org
    [ http://issues.apache.org/jira/browse/NUTCH-148?page=comments#action_12361199 ]

Stefan Groschupf commented on NUTCH-148:
----------------------------------------

Nutch requires Cygwin or a Unix operating system for 0.7 and 0.8.



[jira] Commented: (NUTCH-148) org.apache.nutch.tools.CrawlTool throws error while doing deleteduplicates

JIRA jira@apache.org
    [ http://issues.apache.org/jira/browse/NUTCH-148?page=comments#action_12361206 ]

Piotr Kosiorowski commented on NUTCH-148:
-----------------------------------------

The 'df' command is required for NDFS operation, so if you were not using NDFS and the Nutch shell scripts in 0.7.1, you were able to run it on Windows without Cygwin. Now the majority of tools use NDFS, so Cygwin is required on Windows. I would assume the other bug is also Cygwin-related; please test it with Cygwin and report whether that fixes the issue.
In future, if in doubt, it is better to ask on the nutch-user mailing list rather than create a JIRA issue first. I will close both your issues now, assuming they are Cygwin-related. If you find that it still does not work with Cygwin, please reopen.
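
To illustrate why a filesystem layer might shell out to `df -k` at all, here is a rough sketch of reading the free-space figure from its output. This is not the Nutch or NDFS source code; the class name `DfOutput`, the method `availableKb`, and the sample output are assumptions made for the example. It assumes the common layout of a header line followed by a data line whose fourth column is the available space in 1 KiB blocks, and it does not handle filesystem names that contain spaces.

```java
// Hypothetical sketch, not the actual NDFS code: extracts the "Available"
// column (in 1 KiB blocks) from the text printed by `df -k <path>`.
final class DfOutput {

    static long availableKb(String dfOutput) {
        String[] lines = dfOutput.trim().split("\\r?\\n");
        if (lines.length < 2) {
            throw new IllegalArgumentException("expected a header line and a data line");
        }
        // Some df implementations wrap a long filesystem name onto its own line,
        // so join everything after the header before splitting into fields.
        StringBuilder data = new StringBuilder();
        for (int i = 1; i < lines.length; i++) {
            data.append(lines[i]).append(' ');
        }
        String[] fields = data.toString().trim().split("\\s+");
        if (fields.length < 6) {
            throw new IllegalArgumentException("unexpected df output: " + dfOutput);
        }
        // Columns: filesystem, 1K-blocks, used, available, capacity, mount point
        return Long.parseLong(fields[3]);
    }

    public static void main(String[] args) {
        // Made-up sample output for illustration only.
        String sample =
                "Filesystem     1K-blocks     Used Available Use% Mounted on\n"
              + "/dev/sda1       41152736 12345678  26693632  32% /\n";
        System.out.println(availableKb(sample));
    }
}
```

Code of this kind only works if a `df` executable can be started at all, which on Windows in practice means running with Cygwin on the PATH.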




[jira] Closed: (NUTCH-148) org.apache.nutch.tools.CrawlTool throws error while doing deleteduplicates

JIRA jira@apache.org
     [ http://issues.apache.org/jira/browse/NUTCH-148?page=all ]
     
Piotr Kosiorowski closed NUTCH-148:
-----------------------------------

    Resolution: Invalid

