[jira] Created: (HADOOP-1283) Eliminate internal UTF8 to String and vice versa conversions in the name-node.


ASF GitHub Bot (Jira)
Eliminate internal UTF8 to String and vice versa conversions in the name-node.
------------------------------------------------------------------------------

                 Key: HADOOP-1283
                 URL: https://issues.apache.org/jira/browse/HADOOP-1283
             Project: Hadoop
          Issue Type: Improvement
          Components: dfs
    Affects Versions: 0.12.0
            Reporter: Konstantin Shvachko
             Fix For: 0.13.0


The name-node code converts between UTF8 and String internally. One example:
NameNode.complete(String src, String clientName)
then it calls
FSNamesystem.completeFile(new UTF8(src), new UTF8(clientName));
which in turn finally calls
FSDirectory.addNode(path.toString(), newNode )
and in another place
FSDirectory.getNode(src.toString())

So we have several conversions of the same parameter back and forth during computation.
We should keep the parameter type consistent within different methods.
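
A minimal sketch of the round trip described above, in plain Java with no Hadoop dependency. ByteName is a hypothetical stand-in for a byte-backed string type such as UTF8, and the method names are illustrative only; this is not the name-node code. It only counts how many conversions one call pays for when the parameter type changes between layers.

```java
import java.nio.charset.StandardCharsets;

public class ConversionChainSketch {
    // Counts every String <-> bytes conversion performed below.
    static int conversions = 0;

    // Hypothetical stand-in for a byte-backed string type such as UTF8.
    static final class ByteName {
        final byte[] bytes;

        ByteName(String s) {
            conversions++; // String -> bytes
            bytes = s.getBytes(StandardCharsets.UTF_8);
        }

        @Override
        public String toString() {
            conversions++; // bytes -> String
            return new String(bytes, StandardCharsets.UTF_8);
        }
    }

    // Mixed style: Strings at the entry point, bytes in the middle layer,
    // Strings again at the directory layer.
    static void completeMixed(String src, String clientName) {
        ByteName path = new ByteName(src);          // conversion 1
        ByteName holder = new ByteName(clientName); // conversion 2
        String forAdd = path.toString();            // conversion 3 (directory insert)
        String forGet = path.toString();            // conversion 4 (directory lookup)
        if (!forAdd.equals(forGet) || holder.bytes.length == 0) {
            throw new IllegalStateException("unexpected round-trip result");
        }
    }

    // Consistent style: the same type is passed through every layer.
    static void completeConsistent(String src, String clientName) {
        if (src.isEmpty() || clientName.isEmpty()) {
            throw new IllegalArgumentException("empty argument");
        }
    }

    public static void main(String[] args) {
        conversions = 0;
        completeMixed("/user/foo/part-0", "client-1");
        System.out.println("mixed: " + conversions + " conversions");

        conversions = 0;
        completeConsistent("/user/foo/part-0", "client-1");
        System.out.println("consistent: " + conversions + " conversions");
    }
}
```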

The question is, which type should be used: String or Text.
From previous discussions I remember that Text is more efficient in space and time for non-ASCII
data. Here we mostly deal with file names and network addresses, which are ASCII.
Does it make sense to use Text in this case?

UTF8 is also used as a key in two maps: pendingCreates and leases.
This should be replaced too.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-1283) Eliminate internal UTF8 to String and vice versa conversions in the name-node.

ASF GitHub Bot (Jira)

    [ https://issues.apache.org/jira/browse/HADOOP-1283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12502469 ]

Konstantin Shvachko commented on HADOOP-1283:
---------------------------------------------

UTF8 elimination proposal.
1. Protocols (ClientProtocol, DatanodeProtocol) - no change - parameters will remain Strings.
2. RPC - no change - will convert Strings into UTF8s, send, and convert them back to Strings on the server.
3. Internal name-node method parameters will be Strings.
4. Internal name-node data structures should use BytesWritable instead of UTF8 and String.
5. EditsLog and FSImage entries will use BytesWritable instead of UTF8 for serialization into files.

[jira] Assigned: (HADOOP-1283) Eliminate internal UTF8 to String and vice versa conversions in the name-node.

ASF GitHub Bot (Jira)

     [ https://issues.apache.org/jira/browse/HADOOP-1283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Konstantin Shvachko reassigned HADOOP-1283:
-------------------------------------------

    Assignee: Konstantin Shvachko

[jira] Updated: (HADOOP-1283) Eliminate internal UTF8 to String and vice versa conversions in the name-node.

ASF GitHub Bot (Jira)

     [ https://issues.apache.org/jira/browse/HADOOP-1283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Konstantin Shvachko updated HADOOP-1283:
----------------------------------------

    Attachment: EliminateUTF8.patch

This patch implements everything in the proposal above except item 5 (EditsLog and FSImage serialization). I don't want to change the image and edits log format at this point.
AFAIK UTF8 and BytesWritable serializations differ only in the type of the length field.
UTF8 uses short, while in BytesWritable it is integer.
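
To make the size difference concrete, here is a small sketch in plain Java (java.io only, no Hadoop classes). It writes the same payload once with a 2-byte length prefix and once with a 4-byte length prefix, which is the layout difference described above. The class and method names are made up for illustration, and the sketch does not claim to reproduce the exact bytes either Hadoop class emits.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class LengthPrefixSketch {
    // Payload preceded by a 2-byte (short) length, the UTF8-style layout.
    static byte[] withShortLength(byte[] payload) {
        if (payload.length > 0xFFFF) {
            throw new IllegalArgumentException("payload too long for a 2-byte length");
        }
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buf);
            out.writeShort(payload.length);
            out.write(payload);
            out.flush();
            return buf.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Payload preceded by a 4-byte (int) length, the BytesWritable-style layout.
    static byte[] withIntLength(byte[] payload) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buf);
            out.writeInt(payload.length);
            out.write(payload);
            out.flush();
            return buf.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] name = "/user/data".getBytes(StandardCharsets.UTF_8);
        System.out.println("short prefix: " + withShortLength(name).length + " bytes");
        System.out.println("int prefix:   " + withIntLength(name).length + " bytes");
    }
}
```

The two encodings differ by two bytes per entry, and the short prefix caps a single entry at 65535 bytes.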

For the name-node in-memory structures I use a subclass of BytesWritable called StringBytesWritable.
It mostly contains conversion methods from/to String.
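
For readers without the patch at hand, the idea can be sketched as follows. This is not the StringBytesWritable class from the patch: StringBytesSketch is a hypothetical, Hadoop-free stand-in that holds the encoded bytes, converts from and to String, and defines equals() and hashCode() on the byte content so that it can serve as a map key (as UTF8 does today in pendingCreates and leases). The choice of standard UTF-8 as the encoding is an assumption.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class StringBytesSketch {
    private final byte[] bytes;

    // String -> bytes, done once at construction.
    public StringBytesSketch(String s) {
        this.bytes = s.getBytes(StandardCharsets.UTF_8);
    }

    // bytes -> String, done only when a caller actually needs a String.
    public String getString() {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public int length() {
        return bytes.length;
    }

    // Content-based equality so two instances built from equal strings match as keys.
    @Override
    public boolean equals(Object o) {
        return o instanceof StringBytesSketch
                && Arrays.equals(bytes, ((StringBytesSketch) o).bytes);
    }

    @Override
    public int hashCode() {
        return Arrays.hashCode(bytes);
    }

    public static void main(String[] args) {
        // Hypothetical lease table keyed by the byte-backed holder name.
        Map<StringBytesSketch, String> leases = new HashMap<>();
        leases.put(new StringBytesSketch("client-1"), "/user/foo/part-0");
        System.out.println(leases.get(new StringBytesSketch("client-1")));
    }
}
```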

I removed the implementations of the deprecated obtainLock() and releaseLock() methods in FSNamesystem.
The methods now return OPERATION_FAILED.
Let me know if we need to keep the implementations. Otherwise we should remove the methods and the related
name-node data structures, such as activeLocks.


[jira] Commented: (HADOOP-1283) Eliminate internal UTF8 to String and vice versa conversions in the name-node.

ASF GitHub Bot (Jira)

    [ https://issues.apache.org/jira/browse/HADOOP-1283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12505894 ]

Doug Cutting commented on HADOOP-1283:
--------------------------------------

> AFAIK UTF8 and BytesWritable serializations differ only in the type of the length field.

I think UTF8 may also use "modified UTF-8" when encoding Strings.

http://java.sun.com/javase/6/docs/api/java/io/DataInput.html#modified-utf-8

Note that, for back-compatibility, we can still read files written with UTF8 once UTF8 is gone by using DataInput, since the format is identical.  UTF8's implementation was optimized, but should be equivalent.
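
As an illustration of the "modified UTF-8" point, the sketch below uses only java.io. DataOutputStream.writeUTF writes a 2-byte length followed by modified UTF-8, in which the character U+0000 is encoded as the two bytes 0xC0 0x80 rather than a single zero byte, and DataInputStream.readUTF reads that format back. The class and method names are illustrative, and the sketch says nothing about the exact bytes Hadoop's UTF8 class produces.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class ModifiedUtf8Sketch {
    // Encode with DataOutput.writeUTF: 2-byte length, then modified UTF-8.
    static byte[] writeUtf(String s) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buf);
            out.writeUTF(s);
            out.flush();
            return buf.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Decode with DataInput.readUTF.
    static String readUtf(byte[] data) {
        try {
            return new DataInputStream(new ByteArrayInputStream(data)).readUTF();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String s = "a\0b"; // contains U+0000
        byte[] modified = writeUtf(s);
        byte[] standard = s.getBytes(StandardCharsets.UTF_8);

        // Modified UTF-8 spends two bytes on U+0000; standard UTF-8 spends one.
        System.out.println("writeUTF bytes (incl. 2-byte length): " + modified.length);
        System.out.println("standard UTF-8 bytes: " + standard.length);
        System.out.println("round trip ok: " + s.equals(readUtf(modified)));
    }
}
```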

[jira] Updated: (HADOOP-1283) Eliminate internal UTF8 to String and vice versa conversions in the name-node.

ASF GitHub Bot (Jira)

     [ https://issues.apache.org/jira/browse/HADOOP-1283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Konstantin Shvachko updated HADOOP-1283:
----------------------------------------

    Attachment: EliminateUTF8-2.patch

I just updated the patch to reflect latest changes to the trunk.

[jira] Updated: (HADOOP-1283) Eliminate internal UTF8 to String and vice versa conversions in the name-node.

ASF GitHub Bot (Jira)

     [ https://issues.apache.org/jira/browse/HADOOP-1283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Konstantin Shvachko updated HADOOP-1283:
----------------------------------------

    Status: Patch Available  (was: Open)

[jira] Commented: (HADOOP-1283) Eliminate internal UTF8 to String and vice versa conversions in the name-node.

ASF GitHub Bot (Jira)

    [ https://issues.apache.org/jira/browse/HADOOP-1283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12506753 ]

Hadoop QA commented on HADOOP-1283:
-----------------------------------

+1

http://issues.apache.org/jira/secure/attachment/12360240/EliminateUTF8-2.patch applied and successfully tested against trunk revision r549284.

Test results:   http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/314/testReport/
Console output: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/314/console

[jira] Updated: (HADOOP-1283) Eliminate internal UTF8 to String and vice versa conversions in the name-node.

ASF GitHub Bot (Jira)

     [ https://issues.apache.org/jira/browse/HADOOP-1283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doug Cutting updated HADOOP-1283:
---------------------------------

       Resolution: Fixed
    Fix Version/s: 0.14.0
           Status: Resolved  (was: Patch Available)

I just committed this.  Thanks, Konstantin!

[jira] Commented: (HADOOP-1283) Eliminate internal UTF8 to String and vice versa conversions in the name-node.

ASF GitHub Bot (Jira)

    [ https://issues.apache.org/jira/browse/HADOOP-1283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12507209 ]

Hudson commented on HADOOP-1283:
--------------------------------

Integrated in Hadoop-Nightly #132 (See [http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Nightly/132/])
