[jira] Created: (NUTCH-191) InputFormat used in job must be in JobTracker classpath (not loaded from job JAR)

Nick Burch (Jira)
InputFormat used in job must be in JobTracker classpath (not loaded from job JAR)
---------------------------------------------------------------------------------

         Key: NUTCH-191
         URL: http://issues.apache.org/jira/browse/NUTCH-191
     Project: Nutch
        Type: Bug
    Versions: 0.8-dev    
 Environment: ~20 node nutch mapreduce environment, running SVN trunk, on Linux
    Reporter: Bryan Pendleton
    Priority: Minor


During development, I've been creating/tweaking custom InputFormat implementations. However, when you try to run a job against a running cluster, you get:
  Exception in thread "main" java.io.IOException: java.lang.RuntimeException: java.lang.RuntimeException: java.lang.ClassNotFoundException: my.custom.InputFormat
          at org.apache.nutch.ipc.Client.call(Client.java:294)
          at org.apache.nutch.ipc.RPC$Invoker.invoke(RPC.java:127)
          at $Proxy0.submitJob(Unknown Source)
          at org.apache.nutch.mapred.JobClient.submitJob(JobClient.java:259)
          at org.apache.nutch.mapred.JobClient.runJob(JobClient.java:288)
          at com.parc.uir.wikipedia.WikipediaJob.main(WikipediaJob.java:85)

This error goes away if I restart the TaskTrackers/JobTracker with a classpath that includes the needed code. Other classes (Mapper, Reducer) appear to be loaded from the jar file specified in the JobConf, but the InputFormat is not. Obviously, it's less than ideal to have to restart the JobTracker whenever a job-specific class changes.
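
The failure mode described above can be illustrated outside Nutch. The sketch below is not Nutch code. It only shows that a class name resolved through a JVM's own class loader fails with ClassNotFoundException when the class is not on that JVM's classpath, which is the situation the JobTracker is in when the InputFormat exists only in the job JAR. The class ClasspathLookupSketch and the helper canLoad are hypothetical names; my.custom.InputFormat is the placeholder name from the stack trace.

```java
// Illustrative sketch only; not taken from the Nutch source tree.
class ClasspathLookupSketch {

    // Returns true if the named class is visible to the given class loader.
    // A long-running daemon resolves class names this way, using only the
    // classpath it was started with.
    static boolean canLoad(String className, ClassLoader loader) {
        try {
            // initialize=false: look the class up without running static initializers
            Class.forName(className, false, loader);
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        ClassLoader daemonLoader = ClasspathLookupSketch.class.getClassLoader();
        // A class shipped with the JVM is always visible.
        System.out.println("java.lang.String: "
                + canLoad("java.lang.String", daemonLoader));
        // A class that exists only in a job JAR is not visible unless that
        // JAR was on the classpath when the JVM started.
        System.out.println("my.custom.InputFormat: "
                + canLoad("my.custom.InputFormat", daemonLoader));
    }
}
```

Restarting the daemon with the extra code on its classpath changes the second lookup from a failure to a success, which matches the workaround described in the report.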

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira

[jira] Commented: (NUTCH-191) InputFormat used in job must be in JobTracker classpath (not loaded from job JAR)

Nick Burch (Jira)
    [ http://issues.apache.org/jira/browse/NUTCH-191?page=comments#action_12364678 ]

Doug Cutting commented on NUTCH-191:
------------------------------------

We've thus far avoided loading job-specific code in the JobTracker and TaskTracker, in order to keep these more reliable.  File splitting is performed by the JobTracker.  So if you're overriding InputFormat.getSplits(), then fixing this is harder.  But if you're simply overriding getRecordReader(), then this should be easier to fix.  In that case one could fix this by moving getSplits() to a new interface that's used only by the JobTracker.  If this is important to you, please submit a patch to this effect.
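
One way to read this suggestion is to separate the two responsibilities into two interfaces, so that the daemon that computes splits needs only the splitting interface, while the code that reads records can come from the job JAR. The sketch below is a simplified illustration under that assumption. Every type in it (Split, Splitter, RecordReaderFactory, EvenSplitter, DescribingReaderFactory) is hypothetical and is not one of the org.apache.nutch.mapred interfaces.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;

// Illustrative sketch only. All types here are hypothetical and simplified.
class SplitInterfaceSketch {

    // A byte range of one input file.
    static final class Split {
        final String file;
        final long start;
        final long length;

        Split(String file, long start, long length) {
            this.file = file;
            this.start = start;
            this.length = length;
        }
    }

    // Tracker-side responsibility: decide how the input is divided.
    interface Splitter {
        List<Split> getSplits(String file, long fileLength, int numSplits);
    }

    // Task-side responsibility: turn one split into records.
    // Only this interface would need a job-specific implementation.
    interface RecordReaderFactory {
        Iterator<String> getRecordReader(Split split);
    }

    // Stock splitter that could live on the tracker's own classpath.
    static final class EvenSplitter implements Splitter {
        public List<Split> getSplits(String file, long fileLength, int numSplits) {
            if (numSplits < 1) {
                throw new IllegalArgumentException("numSplits must be >= 1");
            }
            List<Split> splits = new ArrayList<Split>();
            long chunk = (fileLength + numSplits - 1) / numSplits; // ceiling division
            for (long start = 0; start < fileLength; start += chunk) {
                splits.add(new Split(file, start, Math.min(chunk, fileLength - start)));
            }
            return splits;
        }
    }

    // Job-specific stand-in: emits one record that describes the split.
    static final class DescribingReaderFactory implements RecordReaderFactory {
        public Iterator<String> getRecordReader(Split split) {
            String record = split.file + "@" + split.start + "+" + split.length;
            return Collections.singletonList(record).iterator();
        }
    }

    public static void main(String[] args) {
        Splitter splitter = new EvenSplitter();
        RecordReaderFactory readers = new DescribingReaderFactory();
        for (Split s : splitter.getSplits("part-00000", 2500, 3)) {
            System.out.println(readers.getRecordReader(s).next());
        }
    }
}
```

With this separation, a job that changes only how records are read never asks the tracker to load a job-specific class. A job that needs custom splitting would still hit the original problem.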


[jira] Commented: (NUTCH-191) InputFormat used in job must be in JobTracker classpath (not loaded from job JAR)

Nick Burch (Jira)
    [ http://issues.apache.org/jira/browse/NUTCH-191?page=comments#action_12364739 ]

Owen O'Malley commented on NUTCH-191:
-------------------------------------

Wouldn't it be appropriate to make input splitting into a task, so that getSplits could be run by the TaskTrackerChild? That way the current interfaces would remain and the user could override it from the job.jar.

An example where we would find it useful is where the map input is coming from external servers over sockets. getSplits could return splits of the form FileSplit("host:port", 0, 1000), and the RecordReader would need to know how to translate that name into a data stream.
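
To make the socket example concrete, the sketch below shows how a reader might turn a split name of the form host:port into a connection target. It is a hedged illustration: NamedSplit and SocketSplitSketch are hypothetical stand-ins, not Nutch's FileSplit or RecordReader, and the host name in main is made up. How a socket source would interpret the two numeric fields is left open.

```java
import java.net.InetSocketAddress;

// Illustrative sketch only; the types are hypothetical stand-ins.
class SocketSplitSketch {

    // Mirrors the shape FileSplit("host:port", 0, 1000) from the comment above.
    static final class NamedSplit {
        final String name;
        final long start;
        final long length;

        NamedSplit(String name, long start, long length) {
            this.name = name;
            this.start = start;
            this.length = length;
        }
    }

    // Parses "host:port" into an unresolved address.
    // No DNS lookup is done and no connection is opened here.
    static InetSocketAddress toAddress(NamedSplit split) {
        int colon = split.name.lastIndexOf(':');
        if (colon <= 0 || colon == split.name.length() - 1) {
            throw new IllegalArgumentException("expected host:port, got " + split.name);
        }
        String host = split.name.substring(0, colon);
        // NumberFormatException is a subclass of IllegalArgumentException.
        int port = Integer.parseInt(split.name.substring(colon + 1));
        return InetSocketAddress.createUnresolved(host, port);
    }

    public static void main(String[] args) {
        NamedSplit split = new NamedSplit("data1.example.org:9000", 0, 1000);
        InetSocketAddress target = toAddress(split);
        System.out.println(target.getHostString() + " port " + target.getPort());
        // A real reader would open a java.net.Socket to this address and
        // stream the records covered by the split.
    }
}
```

The point of the example is that the translation logic is job-specific, so it has to be loadable wherever getSplits and the RecordReader run.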


[jira] Commented: (NUTCH-191) InputFormat used in job must be in JobTracker classpath (not loaded from job JAR)

Nick Burch (Jira)
    [ http://issues.apache.org/jira/browse/NUTCH-191?page=comments#action_12364743 ]

Bryan Pendleton commented on NUTCH-191:
---------------------------------------

I think the reason to keep getSplits() in the JobTracker is that the result of getSplits() determines the actual number of map tasks that are run, and the JobTracker does more setup and tracking *after* getSplits(). How would you separate that out?


[jira] Commented: (NUTCH-191) InputFormat used in job must be in JobTracker classpath (not loaded from job JAR)

Nick Burch (Jira)
    [ http://issues.apache.org/jira/browse/NUTCH-191?page=comments#action_12364773 ]

Owen O'Malley commented on NUTCH-191:
-------------------------------------

I would schedule the getSplits task and, when it completed, schedule the map tasks. It would be pretty parallel to the way the completion of the map tasks causes the reduces to be scheduled. I think the right place to hook it would be in JobTracker.JobInProgress.completedTask(String). One difference that I'm aware of is that until getSplits returns, you don't have any idea how many maps will be needed, so you can't create the map tasks when the job is created.
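
The scheduling idea can be sketched as a small state machine. This is a toy model, not the JobTracker code: JobSketch, its task-id strings, and the completedSplitTask method that reports the split count are all hypothetical. Only the method name completedTask(String) echoes the hook mentioned above.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative sketch only; a toy model of the proposed scheduling order.
class SplitTaskSchedulingSketch {

    static final class JobSketch {
        private final int numReduces;
        private final Set<String> pendingMaps = new HashSet<String>();
        private final List<String> scheduled = new ArrayList<String>();
        private boolean splitsKnown = false;
        private boolean reducesScheduled = false;

        JobSketch(int numReduces) {
            this.numReduces = numReduces;
            // At job creation only the split task can be scheduled,
            // because the number of maps is not known yet.
            scheduled.add("split_0");
        }

        // Called when the split task finishes and reports its split count.
        // Only now can the map tasks be created.
        void completedSplitTask(int numSplits) {
            for (int i = 0; i < numSplits; i++) {
                pendingMaps.add("map_" + i);
                scheduled.add("map_" + i);
            }
            splitsKnown = true;
            maybeScheduleReduces();
        }

        // Called when a map task finishes; the last one triggers the reduces.
        void completedTask(String taskId) {
            pendingMaps.remove(taskId);
            maybeScheduleReduces();
        }

        private void maybeScheduleReduces() {
            if (splitsKnown && pendingMaps.isEmpty() && !reducesScheduled) {
                for (int i = 0; i < numReduces; i++) {
                    scheduled.add("reduce_" + i);
                }
                reducesScheduled = true;
            }
        }

        // Returns a copy of the task ids in the order they were scheduled.
        List<String> scheduled() {
            return new ArrayList<String>(scheduled);
        }
    }

    public static void main(String[] args) {
        JobSketch job = new JobSketch(1);
        job.completedSplitTask(2);
        job.completedTask("map_0");
        job.completedTask("map_1");
        System.out.println(job.scheduled());
    }
}
```

The model makes the stated difference visible: the constructor can schedule only the split task, and map tasks appear only after the split count is reported.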
