Set variables in mapper

Erik Test
Hi,

I'm trying to set a variable in my mapper class by reading an argument from
the command line and then passing the entry to the mapper from main. Is this
possible?

  public static void main(String[] args) throws Exception
  {
    JobConf conf = new JobConf(DistanceCalc2.class);
    conf.setJobName("Calculate Distances");

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(DoubleWritable.class);

    conf.setMapperClass(Map.class);
    //conf.setReducerClass(Reduce.class);

    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    Map.setN(args[2]);

    JobClient.runJob(conf);
  }//main


  public static class Map extends MapReduceBase
    implements Mapper<LongWritable, Text,
      Text, DoubleWritable>
        {
               ...
               private static int N;

               ...

               public void map(LongWritable key, Text value,
                 OutputCollector<Text, DoubleWritable> output,
                  Reporter reporter) throws IOException
                {
                    ....
                    dim = tokens.length / N;
                    ...
                }

               public static void setN(String newN)
               {
                  N = Integer.parseInt(newN);
               }
        }

I've tried the code above but I get an error saying that I'm dividing by
zero. Obviously, the argument I enter for N isn't being set as specified.
Erik

Re: Set variables in mapper

Edward Capriolo
On Mon, Aug 2, 2010 at 12:17 PM, Erik Test <[hidden email]> wrote:

> Hi,
>
> I'm trying to set a variable in my mapper class by reading an argument from
> the command line and then passing the entry to the mapper from main. Is this
> possible?
>
> I've tried the code above but I get an error saying that I'm dividing by
> zero. Obviously, the argument I enter for N isn't being set as specified.
> Erik
>

You can pass variables to the Job using the JobConf class.

In your Driver class:
jobConf.set("clone_path", clonePath);

Then in your mapper / reducer override configure:

  private JobConf jobConf;
  public void configure(JobConf jobConf) {
        super.configure(jobConf);
        this.jobConf=jobConf;
  }
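
A note on why the static setter in the original post most likely fails: main() runs in the client JVM, while map tasks on a cluster run in separate JVMs, where the static N still has its default value of 0. The job configuration, by contrast, is shipped to the tasks. The sketch below imitates that hand-off with the plain JDK only. MiniConf, DistanceTask and the key "distcalc.n" are hypothetical stand-ins for JobConf, the mapper and a property name; none of this is Hadoop API.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.Properties;

// Stand-in for the driver/task split. Nothing here is Hadoop API:
// MiniConf, DistanceTask and "distcalc.n" are hypothetical names.
class ConfHandoff {

    /** Plays the role of JobConf: key/value pairs that can be shipped as bytes. */
    static class MiniConf {
        private final Properties props = new Properties();

        void setInt(String name, int value) {
            props.setProperty(name, Integer.toString(value));
        }

        int getInt(String name, int defaultValue) {
            String raw = props.getProperty(name);
            return raw == null ? defaultValue : Integer.parseInt(raw.trim());
        }

        byte[] toBytes() throws IOException {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            props.store(out, null);
            return out.toByteArray();
        }

        static MiniConf fromBytes(byte[] data) throws IOException {
            MiniConf conf = new MiniConf();
            conf.props.load(new ByteArrayInputStream(data));
            return conf;
        }
    }

    /** Plays the role of the mapper: reads N once in configure(), uses it per record. */
    static class DistanceTask {
        private int n; // instance field, filled in on the task side

        void configure(MiniConf conf) {
            n = conf.getInt("distcalc.n", 1);
        }

        int dim(String[] tokens) {
            return tokens.length / n;
        }
    }

    /** Driver side: store N in the configuration and serialize it. */
    static byte[] submit(int n) throws IOException {
        MiniConf conf = new MiniConf();
        conf.setInt("distcalc.n", n);
        return conf.toBytes();
    }

    /** Task side: rebuild the configuration and configure a fresh task object. */
    static DistanceTask launch(byte[] shipped) throws IOException {
        DistanceTask task = new DistanceTask();
        task.configure(MiniConf.fromBytes(shipped));
        return task;
    }

    public static void main(String[] args) throws IOException {
        DistanceTask task = launch(submit(3));
        System.out.println(task.dim(new String[] {"a", "b", "c", "d", "e", "f"}));
    }
}
```

The only state that reaches the task object is what was written into the configuration before submission, which is the same reason the value has to go through JobConf in the real job.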

Re: Set variables in mapper

Harsh J
And since it is an integer you're looking for, use the utility methods
JobConf.setInt and JobConf.getInt on your JobConf instance:

int n = Integer.parseInt(args[2]);
conf.setInt("your.pack.some.name", n);

And in the Mapper's "@Override public void configure(JobConf conf)", do:
N = conf.getInt("your.pack.some.name", 1 /* Or other default value */);
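
Since N is used as a divisor, a default of 1 only covers the case where the property is missing; an explicit 0 on the command line would still cause the division by zero. One option is to validate the argument in the driver before storing it. A minimal sketch, assuming N is the third command-line argument as in the original post; DivisorArg is a hypothetical helper, not part of Hadoop:

```java
// Hypothetical helper, not part of Hadoop: validate the divisor in the driver
// before it is stored in the job configuration.
class DivisorArg {

    /** Returns args[index] as a positive int, or throws IllegalArgumentException. */
    static int parse(String[] args, int index) {
        if (index >= args.length) {
            throw new IllegalArgumentException("missing argument " + index + " (expected N)");
        }
        int n;
        try {
            n = Integer.parseInt(args[index].trim());
        } catch (NumberFormatException e) {
            throw new IllegalArgumentException("N must be an integer, got: " + args[index], e);
        }
        if (n <= 0) {
            throw new IllegalArgumentException("N must be positive, got: " + n);
        }
        return n;
    }

    public static void main(String[] args) {
        // Same argument layout as the original post: input path, output path, N.
        System.out.println(parse(new String[] {"in", "out", "4"}, 2));
    }
}
```

The returned value is what would then be passed to setInt in the driver.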

On Mon, Aug 2, 2010 at 9:53 PM, Edward Capriolo <[hidden email]> wrote:

> You can pass variables to the Job using the JobConf class.
>
> In your Driver class:
> jobConf.set("clone_path", clonePath);
>
> Then in your mapper / reducer override configure:
>
>  private JobConf jobConf;
>  public void configure(JobConf jobConf) {
>        super.configure(jobConf);
>        this.jobConf=jobConf;
>  }
>



--
Harsh J
www.harshj.com

Re: Set variables in mapper

Hemanth Yamijala
Hi,

It would also be worthwhile to look at the Tool interface
(http://hadoop.apache.org/common/docs/r0.20.2/mapred_tutorial.html#Tool),
which the bundled MapReduce example programs use as well.
This would allow any arguments to be passed using the
-Dvar.name=var.value convention on the command line.

Thanks
Hemanth
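
To illustrate the -Dvar.name=var.value convention: when a driver implements Tool and is started through ToolRunner, the generic options are parsed into the job configuration first, and run() receives only the remaining arguments. The sketch below imitates just that splitting step with the plain JDK. DashDArgs and the property name "distcalc.n" are hypothetical, and the real parser in Hadoop handles more options than this.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Stand-in for the generic-option handling that Tool/ToolRunner provide.
// DashDArgs is a hypothetical class, not Hadoop code.
class DashDArgs {
    final Map<String, String> properties = new LinkedHashMap<>();
    final List<String> remaining = new ArrayList<>();

    static DashDArgs parse(String[] args) {
        DashDArgs parsed = new DashDArgs();
        int i = 0;
        // In this sketch, options are only recognised before the first positional argument.
        while (i < args.length && args[i].startsWith("-D")) {
            String pair = args[i].substring(2);
            if (pair.isEmpty()) { // "-D name=value" form: the pair is the next argument
                if (i + 1 >= args.length) {
                    throw new IllegalArgumentException("-D without name=value");
                }
                pair = args[++i];
            }
            int eq = pair.indexOf('=');
            if (eq <= 0) {
                throw new IllegalArgumentException("expected name=value, got: " + pair);
            }
            parsed.properties.put(pair.substring(0, eq), pair.substring(eq + 1));
            i++;
        }
        // Everything else is left for the application, e.g. input and output paths.
        for (; i < args.length; i++) {
            parsed.remaining.add(args[i]);
        }
        return parsed;
    }

    public static void main(String[] args) {
        DashDArgs parsed = parse(new String[] {"-Ddistcalc.n=4", "in", "out"});
        System.out.println(parsed.properties + " " + parsed.remaining);
    }
}
```

With the real Tool interface the parsed values end up in the configuration, so the mapper can read them in configure() without the driver handling args[2] itself.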

On Mon, Aug 2, 2010 at 10:33 PM, Harsh J <[hidden email]> wrote:

> And since it is an integer you're looking for, use the utility methods
> JobConf.setInt and JobConf.getInt on your JobConf instance:
>
> int n = Integer.parseInt(args[2]);
> conf.setInt("your.pack.some.name", n);
>
> And in the Mapper's "@Override public void configure(JobConf conf)", do:
> N = conf.getInt("your.pack.some.name", 1 /* Or other default value */);

Re: Set variables in mapper

Owen O'Malley
In reply to this post by Erik Test

On Aug 2, 2010, at 9:17 AM, Erik Test wrote:

> I'm trying to set a variable in my mapper class by reading an argument from
> the command line and then passing the entry to the mapper from main. Is this
> possible?

Others have already answered with the current solution of using  
JobConf to store the value. I should also note that I plan to  
implement MAPREDUCE-1183 for 0.22. It will allow you to do this  
directly like:

job.setMapper(new MyMapper(someIntegerParameter));

which will serialize MyMapper's state, including the integer  
parameter, and store it as part of your job.

-- Owen

Re: Set variables in mapper

Erik Test
Really? This seems pretty nice.

In the future, with your implementation, would the value always have to be
wrapped in a MyMapper instance? How would parameters be removed if
necessary?

Erik


On 3 August 2010 02:37, Owen O'Malley <[hidden email]> wrote:

> Others have already answered with the current solution of using JobConf to
> store the value. I should also note that I plan to implement MAPREDUCE-1183
> for 0.22. It will allow you to do this directly like:
>
> job.setMapper(new MyMapper(someIntegerParameter));
>
> which will serialize MyMapper's state, including the integer parameter, and
> store it as part of your job.
>
> -- Owen
>

Re: Set variables in mapper

Owen O'Malley

On Aug 3, 2010, at 6:12 AM, Erik Test wrote:

> Really? This seems pretty nice.
>
> In the future, with your implementation, would the value always have to be
> wrapped in a MyMapper instance? How would parameters be removed if
> necessary?

Sorry, I wasn't clear. I mean that if you make the sub-classes of  
Mapper serializable, the framework will serialize them for you and  
deserialize them on the cluster.

So a fuller example would look like:

public class MyMapper extends Mapper<IntWritable,Text,IntWritable,Text>
    implements Writable {
   int param;

   public MyMapper() { param = 0; }
   public MyMapper(int param) { this.param = param; }

   public void map(IntWritable key, Text value, Context context) {...}

   // Writable declares DataInput/DataOutput rather than the stream classes
   public void readFields(DataInput in) throws IOException {
     param = in.readInt();
   }

   public void write(DataOutput out) throws IOException {
      out.writeInt(param);
   }
}

You won't need to use Writable; you could also use ProtocolBuffers, Thrift,
or Avro. Where this comes in really handy is in places like the
InputFormats and OutputFormats. It enables you to replace the current:

job.setInputFormatClass(SequenceFileInputFormat.class);
FileInputFormat.setInputPaths(job, inDir);
job.setOutputFormatClass(SequenceFileOutputFormat.class);
FileOutputFormat.setOutputPath(job, outDir);

with the more natural:

job.setInputFormat(new SequenceFileInputFormat(inDir));
job.setOutputFormat(new SequenceFileOutputFormat(outDir));

Is that clearer now?

-- Owen
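
For readers who want to see the serialize/deserialize round trip in isolation, here is a minimal sketch using only the JDK. It does not use MAPREDUCE-1183 or any Hadoop class; ParamHolder is a hypothetical class that only mirrors the write/readFields shape of the example above.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;

// Plain-JDK imitation of the round trip described above.
// ParamHolder is hypothetical; it only mirrors the shape of a Writable.
class ParamHolder {
    int param;

    ParamHolder() { this.param = 0; } // lets the receiving side create an empty instance
    ParamHolder(int param) { this.param = param; }

    void write(DataOutput out) throws IOException {
        out.writeInt(param);
    }

    void readFields(DataInput in) throws IOException {
        param = in.readInt();
    }

    /** Sending side: turn the object's state into bytes. */
    static byte[] serialize(ParamHolder holder) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            holder.write(out);
        }
        return bytes.toByteArray();
    }

    /** Receiving side: create an empty instance, then fill it from the bytes. */
    static ParamHolder deserialize(byte[] data) throws IOException {
        ParamHolder holder = new ParamHolder();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            holder.readFields(in);
        }
        return holder;
    }

    public static void main(String[] args) throws IOException {
        ParamHolder copy = deserialize(serialize(new ParamHolder(7)));
        System.out.println(copy.param);
    }
}
```

The no-argument constructor matters here because the receiving side has nothing but the bytes: it first needs an empty object and only then restores the fields.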

Re: Set variables in mapper

Erik Test
Oh, OK. Yes, this is clear now. Thanks for the explanation.
Erik


On 3 August 2010 11:34, Owen O'Malley <[hidden email]> wrote:

> Sorry, I wasn't clear. I mean that if you make the sub-classes of Mapper
> serializable, the framework will serialize them for you and deserialize them
> on the cluster.
>
> -- Owen
>