[HADOOP] Terasort for numbers

Teodor Macicas
Hi all,


I am using Hadoop 0.20.2 and I want to sort a huge amount of data.
I've read about Terasort [from examples], but it currently uses 10-byte
char keys.
Changing the keys from char to integer wasn't a good solution, as Terasort
builds a trie for creating total order partitions. I got stuck when I
tried to change the char trie to one suitable for number keys.

Then I gave Sort [also from examples] a try and it did work for
integer keys, but without total order partitioning. At the end of the
day, the final result cannot be created just by putting together all the
reducers' outputs. Each reducer sorts only a subset of the data and no
merging occurs between reducers.
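
The point about reducer outputs can be illustrated without Hadoop. The
following is only a sketch in plain Java (the class and method names are
hypothetical, and the split points are made up): when keys are routed to
reducers by range rather than by hash, each reducer's sorted output covers a
disjoint interval, so concatenating the outputs in reducer order gives a
globally sorted result.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration; none of these names come from Hadoop.
final class RangePartitionSketch {

    // Returns the reducer index for a key, given distinct split points sorted
    // ascending. n split points define n + 1 ranges; a key equal to a split
    // point goes to the range above it.
    static int partition(double key, double[] splitPoints) {
        int pos = Arrays.binarySearch(splitPoints, key);
        // For an absent key, binarySearch returns -(insertionPoint) - 1.
        return pos >= 0 ? pos + 1 : -pos - 1;
    }

    // Simulates the shuffle and the per-reducer sort, then concatenates the
    // reducer outputs in reducer order.
    static List<Double> sortByRanges(double[] keys, double[] splitPoints) {
        List<List<Double>> reducers = new ArrayList<List<Double>>();
        for (int i = 0; i <= splitPoints.length; i++) {
            reducers.add(new ArrayList<Double>());
        }
        for (double key : keys) {
            reducers.get(partition(key, splitPoints)).add(key);
        }
        List<Double> result = new ArrayList<Double>();
        for (List<Double> reducer : reducers) {
            Collections.sort(reducer);
            result.addAll(reducer);
        }
        return result;
    }

    public static void main(String[] args) {
        double[] keys = {123.45, -34.56, 752.10, 10.25};
        double[] splitPoints = {0.0, 100.0}; // assumed values, three reducers
        System.out.println(sortByRanges(keys, splitPoints));
    }
}
```

With a hash-based partitioner the same concatenation would generally not be
sorted, which is the behaviour described above for the Sort example.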

Can anyone please advise me on what to use, and how, in order to sort a
huge amount of real numbers?
Looking forward to your replies.


Thank you.
Best,
Teodor

Re: [HADOOP] Terasort for numbers

Alex Kozlov
Hi Teodor,

I am not clear on what you call 'real numbers'. Terasort does work on bytes
(10-byte key and 90-byte payload). The actual 'meaning' of the bytes
really does not matter, as Hadoop uses binary comparators on the raw value.

Total order partitioning should also work with any WritableComparable key
(if it doesn't, it's a bug).

My guess is that your problem is converting a char trie to
WritableComparable. Can you provide more background? Are the strings of
fixed length?

Alex K

Re: [HADOOP] Terasort for numbers

Teodor Macicas
Hi Alex,

Thank you for your quick reply, and sorry for not being clearer.
The job I want to do is simply to sort data that has numbers [doubles] as
keys [0]. I noticed that Terasort uses a 10-byte char key. How can I use
it for my particular job?
Do I need to change Terasort?

[0] example of workload:
123.45    payload1
-34.56     payload2
752.10    payload3
10.25      payload4
....

Does this make sense now?

Regards,
Teodor

Re: [HADOOP] Terasort for numbers

Alex Kozlov
Hi Teodor,

I see the problem now: there is no simple binary comparator for
DoubleWritable <http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/DoubleWritable.html>.
So you can do 2 things:

1. Convert your doubles to ints (or bytes). Say, if the precision is always 2
decimal places, represent the number as 100 x double. The problem is
then reduced to sorting integers.

2. Use DoubleWritable as the key and the payload as the value. You can use the
generic TotalOrderPartitioner <http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/lib/TotalOrderPartitioner.html>,
which does not use tries. You can also just use a generic MR job with
DoubleWritable keys: MR will sort the keys for you with an identity mapper and
identity reducer.

Option 2 is slightly less efficient since the code will need to call
Double.longBitsToDouble each time, but I don't see an easy way to avoid this
with the IEEE 754 encoding.
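
One commonly used workaround, not proposed in the thread and offered here
only as a sketch, is to store the double in an order-preserving byte form so
that a plain byte comparison gives the numeric order without decoding. The
class and method names below are hypothetical and nothing here is a Hadoop
API. It relies on the IEEE 754 layout: for non-negative values the raw bits
already increase with the value, and for negative values they decrease.

```java
// Hypothetical helper, not part of Hadoop.
final class SortableDoubleSketch {

    // Maps a double to 8 bytes whose unsigned lexicographic order matches
    // the numeric order of the doubles.
    static byte[] encode(double value) {
        long bits = Double.doubleToLongBits(value);
        // Non-negative values: set the sign bit so they sort above negatives.
        // Negative values: invert all bits so larger magnitudes sort first.
        long sortable = bits < 0 ? ~bits : bits ^ Long.MIN_VALUE;
        byte[] out = new byte[8];
        for (int i = 7; i >= 0; i--) {
            out[i] = (byte) sortable; // big-endian: most significant byte first
            sortable >>>= 8;
        }
        return out;
    }

    // Unsigned byte-by-byte comparison, the kind a raw comparator performs.
    static int compareBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int x = a[i] & 0xff;
            int y = b[i] & 0xff;
            if (x != y) {
                return x < y ? -1 : 1;
            }
        }
        return Integer.compare(a.length, b.length);
    }

    public static void main(String[] args) {
        double[] keys = {123.45, -34.56, 752.10, 10.25};
        for (double a : keys) {
            for (double b : keys) {
                int raw = Integer.signum(compareBytes(encode(a), encode(b)));
                int numeric = Integer.signum(Double.compare(a, b));
                if (raw != numeric) {
                    throw new AssertionError(a + " vs " + b);
                }
            }
        }
        System.out.println("byte order matches numeric order");
    }
}
```

Whether this is worth the extra encoding step compared with simply using
DoubleWritable keys depends on the workload; it is not measured here.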

Alex K

Re: [HADOOP] Terasort for numbers

Teodor Macicas
Hi Alex,

Thank you again.
Yes, I'm also considering your first suggestion. But that would only
help me 'reduce' the problem from floating-point numbers to integers, and I
still do not know how to use Terasort with integer keys!

I've tried to use the generic TotalOrderPartitioner instead of the one
nested in the Terasort class, but I received a lot of errors [0]. I
tried to modify TeraInputFormat and TeraOutputFormat (and all nested
classes), and I kept getting errors.

Now, it's not clear to me what I have to change in order to make
your second solution work. Moreover, I was unable to find a generic
MR in my Hadoop 0.20.2 version.
I'd prefer the first solution, so can you please give me some tips on
how to use Terasort for integers?

P.S.: I've used a trick with fixed-length char keys and the program
worked for this kind of workload [1]. I think using integer keys instead
of this trick would be faster.

[0] java.io.IOException: wrong key class:
org.apache.hadoop.io.DoubleWritable is not class org.apache.hadoop.io.Text

[1] it worked for this:
0000123.45 payload1
0005120.55 payload2
0000003.77 payload3
...
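
The fixed-length key trick in [1] can be sketched in plain Java as follows
(the class name and format string are assumptions, chosen to reproduce the
10-character keys shown above). The sketch also shows a limitation: string
order matches numeric order for non-negative values, but not for negative
ones such as the -34.56 in the earlier example workload.

```java
import java.util.Locale;

// Hypothetical helper, not part of TeraSort.
final class PaddedKeySketch {

    // Formats a value as a zero-padded key, 10 characters wide, 2 decimals.
    // Values too large for the width produce longer strings and break the order.
    static String pad(double value) {
        return String.format(Locale.ROOT, "%010.2f", value);
    }

    public static void main(String[] args) {
        System.out.println(pad(123.45)); // prints 0000123.45
        System.out.println(pad(3.77));   // prints 0000003.77
        // Non-negative keys: string order agrees with numeric order.
        System.out.println(pad(3.77).compareTo(pad(123.45)) < 0);    // prints true
        // Negative keys: -752.10 is numerically smaller than -34.56,
        // but its padded string compares as larger.
        System.out.println(pad(-752.10).compareTo(pad(-34.56)) < 0); // prints false
    }
}
```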

Best,
Teodor

Re: [HADOOP] Terasort for numbers

Alex Kozlov
Hi Teodor,

Certainly, org.apache.hadoop.io.DoubleWritable and org.apache.hadoop.io.Text
are different classes. For approach (1), you just need to construct a
byte[10] array from the integer, create a new Text(byte[]) from it, and
write it together with the value to a sequence file.
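
A minimal sketch of the byte[10] construction for approach (1), in plain
Java. The class name, the choice to put the value in the last 8 bytes, and
the sign-bit flip are all assumptions made for illustration; the thread does
not specify a layout. The flip is there so that negative values sort before
positive ones when the bytes are compared as unsigned values. Wrapping the
result in a Text and writing the sequence file is left out.

```java
// Hypothetical helper, not part of TeraSort or Hadoop.
final class ScaledKeySketch {
    static final int KEY_LENGTH = 10; // TeraSort-style fixed key width

    // Assumes exactly two decimal places, as in the example workload.
    static long scale(double value) {
        return Math.round(value * 100.0);
    }

    // Writes the scaled value big-endian into the last 8 of the 10 bytes;
    // the first 2 bytes stay zero. Flipping the sign bit makes negatives
    // sort before positives under unsigned byte comparison.
    static byte[] toKey(long scaled) {
        long biased = scaled ^ Long.MIN_VALUE;
        byte[] key = new byte[KEY_LENGTH];
        for (int i = KEY_LENGTH - 1; i >= KEY_LENGTH - 8; i--) {
            key[i] = (byte) biased;
            biased >>>= 8;
        }
        return key;
    }

    // Unsigned byte-by-byte comparison of two keys.
    static int compareBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int x = a[i] & 0xff;
            int y = b[i] & 0xff;
            if (x != y) {
                return x < y ? -1 : 1;
            }
        }
        return Integer.compare(a.length, b.length);
    }

    public static void main(String[] args) {
        byte[] low = toKey(scale(-34.56));
        byte[] high = toKey(scale(10.25));
        System.out.println(compareBytes(low, high) < 0); // prints true
    }
}
```

The resulting bytes are arbitrary binary data, so they are not suitable for
a line-oriented text input file.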

Since TeraSort was created specifically for benchmarking purposes, I
think it might make sense for you to start with approach (2). Just
create a SequenceFile<DoubleWritable,Text> file with your <key,value> data
and run a simple MR job with an identity mapper and identity reducer. I can
send you an example of MR code, but there are plenty out
there <http://hadoop.apache.org/common/docs/r0.20.2/mapred_tutorial.html>.
One of them is TeraSort.java:run() itself, but you may want to use the new
mapreduce API <http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapreduce/Job.html>.
Once you are comfortable with the MR framework, you can optimize it further.

Another good source of information is Tom White's 'Hadoop: The Definitive
Guide', particularly on the TotalOrderPartitioner.

Let me know if you have any further questions.

Alex K

Re: [HADOOP] Terasort for numbers

Teodor Macicas
Hi Alex,

Why are you suggesting SequenceFiles? That implies changing the
TeraInputFormat class, right?

Your second approach is similar to the Sort example from Hadoop. The
disadvantage of using it is that I don't have total order partitioning,
and thus more operations are necessary for creating the final result.

Regards,
Teodor

Re: [HADOOP] Terasort for numbers

Alex Kozlov
On Mon, Aug 2, 2010 at 3:41 PM, Teodor Macicas <[hidden email]> wrote:

> Hi Alex,
>
> Why are you suggesting SequenceFiles? That implies changing the
> TeraInputFormat class, right?

Because a text input file will not work for arbitrary bytes, which can
contain newline bytes, for example. Yes, the old TeraInputFormat will not
work.

> Your second approach is similar to the Sort example from Hadoop. The
> disadvantage of using it is that I don't have total order partitioning,
> and thus more operations are necessary for creating the final result.

There is a generic total order partitioner: I provided the links. See the
HTDG book as well.
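
A total order partitioner needs split points that divide the key space into
ranges of roughly equal size. The following is only a rough sketch, in plain
Java, of picking such split points from a sample of the keys; the class and
method names are hypothetical and this is not Hadoop's own sampling code.

```java
import java.util.Arrays;

// Hypothetical illustration; not Hadoop's sampler.
final class SplitPointSketch {

    // Picks numPartitions - 1 split points at evenly spaced positions in the
    // sorted sample. Assumes the sample holds at least numPartitions keys.
    static double[] chooseSplitPoints(double[] sample, int numPartitions) {
        double[] sorted = sample.clone();
        Arrays.sort(sorted);
        double[] splits = new double[numPartitions - 1];
        for (int i = 1; i < numPartitions; i++) {
            int index = (int) ((long) i * sorted.length / numPartitions);
            splits[i - 1] = sorted[index];
        }
        return splits;
    }

    public static void main(String[] args) {
        double[] sample = {5, 1, 4, 2, 3, 6};
        // Three partitions need two split points.
        System.out.println(Arrays.toString(chooseSplitPoints(sample, 3)));
    }
}
```

The quality of the split depends on how representative the sample is; a
skewed sample gives reducers of uneven size, though the output is still
totally ordered.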

