Improve IdentityMapper code for wordcount

xeonmailinglist-gmail
Hi,

I have written a map method [1] that reads the map output of the wordcount example [2]. It does not use the IdentityMapper.class that MapReduce offers, but it is the only way I have found to get a working IdentityMapper for wordcount. The only problem is that this mapper takes much more time than I would like. I am starting to think that I may be doing some redundant work. Can anyone help me improve my IdentityMapper code?

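[1] Identity mapper

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;

// MyMapper is not shown in this post.
public class WordCountIdentityMapper extends MyMapper<LongWritable, Text, Text, IntWritable> {
    private Text word = new Text();

    // Each input line holds a word followed by its count.
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        word.set(itr.nextToken());
        Integer val = Integer.valueOf(itr.nextToken());
        context.write(word, new IntWritable(val));
    }

    // Feeds every input record to map().
    public void run(Context context) throws IOException, InterruptedException {
        while (context.nextKeyValue()) {
            map(context.getCurrentKey(), context.getCurrentValue(), context);
        }
    }
}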


[2] Map class that generated the map output

public static class MyMap extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());

        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);
        }
    }

    public void run(Context context) throws IOException, InterruptedException {
        try {
            while (context.nextKeyValue()) {
                map(context.getCurrentKey(), context.getCurrentValue(), context);
            }
        } finally {
            cleanup(context);
        }
    }
}


Thanks,

Re: Improve IdentityMapper code for wordcount

Chris Nauroth

Hello,

One quick win could be to change this line:

context.write(word, new IntWritable(val));

Instead of instantiating an IntWritable on each iteration, instantiate it once as a member variable (like what you’ve already done for word) and call IntWritable#set on each iteration.  This might save some object allocation and garbage collection churn.
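
As a rough illustration of the reuse pattern described above, here is a minimal sketch that runs without Hadoop on the classpath. `IntBox` and `Sink` are hypothetical stand-ins for `IntWritable` and for `Context#write`; they are assumptions made for this example, not Hadoop classes. The sketch also assumes that the sink copies the value out during the call, which is what makes reusing a single mutable object safe.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

public class ReuseSketch {
    /** Hypothetical stand-in for a mutable, IntWritable-like holder. */
    static final class IntBox {
        private int value;
        void set(int v) { value = v; }
        int get() { return value; }
    }

    /** Hypothetical stand-in for Context#write; it copies the value out immediately. */
    static final class Sink {
        final List<String> records = new ArrayList<>();
        void write(String word, IntBox count) {
            records.add(word + "=" + count.get());
        }
    }

    // Allocated once as a member and reused for every record.
    private final IntBox count = new IntBox();

    void map(String line, Sink sink) {
        StringTokenizer itr = new StringTokenizer(line);
        String word = itr.nextToken();
        // parseInt returns a primitive int, so no Integer object is created either.
        count.set(Integer.parseInt(itr.nextToken()));
        sink.write(word, count);
    }

    public static void main(String[] args) {
        ReuseSketch mapper = new ReuseSketch();
        Sink sink = new Sink();
        for (String line : new String[] {"hello\t3", "world\t5"}) {
            mapper.map(line, sink);
        }
        System.out.println(sink.records);
    }
}
```

In the mapper from the original post, the equivalent change would presumably be a member such as `private IntWritable count = new IntWritable();` followed by `count.set(...)` and `context.write(word, count)` inside `map`.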

Beyond that, I’d recommend either profiling or looking at the overall workflow to see if something can be changed at a higher level.  You mentioned that this is consuming the output of the wordcount example job.  Perhaps you can change the wordcount job’s code to write a more efficient representation of the data for your use case.  For example, if the counts were stored as binary integers directly, then the second job wouldn’t have to pay the cost of re-parsing them from strings by calling Integer#valueOf.
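
To make the point about binary integers concrete, here is a small sketch using only the Java standard library. It writes a (word, count) record with the count as a 4-byte integer and reads it back without any string parsing. The class and method names are hypothetical and chosen for this example. In a real Hadoop workflow, one option to look at is having the first job write with `SequenceFileOutputFormat` and the second job read with `SequenceFileInputFormat`, so that keys and values arrive already typed as `Text` and `IntWritable`; whether that fits this particular job is an assumption, not something the thread confirms.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

public class BinaryCountSketch {
    /** Writes one (word, count) record with the count as a 4-byte binary int. */
    static byte[] encode(String word, int count) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeUTF(word);   // 2-byte length prefix + encoded characters
            out.writeInt(count);  // always 4 bytes, regardless of magnitude
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Reads the count back directly; no tokenizing or Integer parsing is needed. */
    static int decodeCount(byte[] record) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(record))) {
            in.readUTF();         // skip over the word
            return in.readInt();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] record = encode("hello", 1234567);
        System.out.println(decodeCount(record));
    }
}
```

If the intermediate data were typed this way, the second job might not need a custom `map` method at all, since the default `map` in Hadoop's `Mapper` base class passes each key and value through unchanged.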

I hope this helps.

--Chris Nauroth

From: xeon Mailinglist <[hidden email]>
Date: Sunday, August 21, 2016 at 3:56 AM
To: "[hidden email]" <[hidden email]>
Subject: Improve IdentityMapper code for wordcount

 
