[jira] Created: (HADOOP-985) Namenode should identify DataNodes as ip:port instead of hostname:port

[jira] Created: (HADOOP-985) Namenode should identify DataNodes as ip:port instead of hostname:port

Sebastian Nagel (Jira)
Namenode should identify DataNodes as ip:port instead of hostname:port
----------------------------------------------------------------------

                 Key: HADOOP-985
                 URL: https://issues.apache.org/jira/browse/HADOOP-985
             Project: Hadoop
          Issue Type: Improvement
            Reporter: Raghu Angadi



Right now the NameNode keeps track of DataNodes by "hostname:port". One proposal is to keep track of datanodes by "ip:port" instead. Various concerns have been expressed regarding hostnames and ips. Please add your experiences here so that we have a better idea of what we should fix.

How should we calculate the datanode ip:

            1) Just like how we currently calculate the hostname, with "dfs.datanode.dns.interface" and "dfs.datanode.dns.nameserver". If the interface is specified incorrectly, the datanode could report an ip like 127.0.0.1, which might or might not be intended.

            2) The namenode can use the remote socket address when the datanode registers. Not sure how easy it is to get this address in RPC, or whether this is desirable.

            3) The namenode could just resolve the hostname when a datanode registers. It could print a warning if the resolved ip and the reported ip don't match.
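
For illustration only, option 3 could look roughly like the sketch below. The class and method names (RegistrationCheckSketch, resolvesTo, checkRegistration) are hypothetical and not part of Hadoop; the sketch only assumes the standard java.net.InetAddress API, and a real namenode would use its own logging rather than System.err.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

/** Hypothetical helper for option 3; not Hadoop code. */
class RegistrationCheckSketch {

  /** Returns true if any address the hostname resolves to equals the reported ip. */
  static boolean resolvesTo(String hostname, String reportedIp) {
    try {
      InetAddress reported = InetAddress.getByName(reportedIp);
      // A hostname may map to several addresses; accept a match on any of them.
      for (InetAddress candidate : InetAddress.getAllByName(hostname)) {
        if (candidate.equals(reported)) {
          return true;
        }
      }
    } catch (UnknownHostException e) {
      // Resolution failed, so the name cannot be confirmed: treat as a mismatch.
    }
    return false;
  }

  /** Prints a warning when the resolved and reported addresses disagree. */
  static void checkRegistration(String hostname, String reportedIp) {
    if (!resolvesTo(hostname, reportedIp)) {
      System.err.println("WARN: datanode " + hostname
          + " reported ip " + reportedIp + ", which does not match its resolved address");
    }
  }
}
```

Note that this puts a DNS lookup on the registration path, which is the cost this option carries.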

One advantage of using IPs is that DFSClient does not need to resolve them when it connects to a datanode. This could save a few milliseconds for each block. Also, DFSClient should check all of its own ips to see whether a given ip is local.
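
A minimal sketch of the "is this ip local?" check, assuming only the standard java.net API. LocalAddressSketch and isLocal are hypothetical names, not DFSClient code.

```java
import java.net.InetAddress;
import java.net.NetworkInterface;
import java.net.SocketException;
import java.net.UnknownHostException;

/** Hypothetical helper; not part of DFSClient. */
class LocalAddressSketch {

  /** Returns true if addr is a loopback address or is bound to a local interface. */
  static boolean isLocal(InetAddress addr) {
    if (addr.isLoopbackAddress()) {
      return true;
    }
    try {
      // getByInetAddress returns null when no local interface carries this address.
      return NetworkInterface.getByInetAddress(addr) != null;
    } catch (SocketException e) {
      return false;
    }
  }

  /** Convenience overload for an ip given as a literal string. */
  static boolean isLocal(String ipLiteral) {
    try {
      return isLocal(InetAddress.getByName(ipLiteral));
    } catch (UnknownHostException e) {
      return false;
    }
  }
}
```

Because the input is already an ip, no DNS lookup is involved in the comparison.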

As far as I can see, the namenode does not resolve any DNS names in normal operation, since it does not actively contact datanodes. In that sense, I am not sure this would change Namenode performance at all.

Thoughts?



--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Assigned: (HADOOP-985) Namenode should identify DataNodes as ip:port instead of hostname:port

Sebastian Nagel (Jira)

     [ https://issues.apache.org/jira/browse/HADOOP-985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Raghu Angadi reassigned HADOOP-985:
-----------------------------------

    Assignee: Raghu Angadi

[jira] Updated: (HADOOP-985) Namenode should identify DataNodes as ip:port instead of hostname:port

Sebastian Nagel (Jira)
In reply to this post by Sebastian Nagel (Jira)

     [ https://issues.apache.org/jira/browse/HADOOP-985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Raghu Angadi updated HADOOP-985:
--------------------------------

          Component/s: dfs
        Fix Version/s: 0.12.0
    Affects Version/s: 0.11.0

[jira] Commented: (HADOOP-985) Namenode should identify DataNodes as ip:port instead of hostname:port

Sebastian Nagel (Jira)
In reply to this post by Sebastian Nagel (Jira)

    [ https://issues.apache.org/jira/browse/HADOOP-985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12470793 ]

Marco Nicosia commented on HADOOP-985:
--------------------------------------

I support option #2 (determining remote IP from the socket). From my comment on HADOOP-685:

I know it's not trivial, but I'd prefer that the nameNode record the IP address of a connection. That way there's no DNS involved at any level in the transaction, and we know exactly which interface/IP address is being used. Additionally, there's no worrying about /etc/hosts, or dhcp, or whatnot. It works for the entire time the dataNode's up, and making network connections.

Regarding option #1: On the dataNode's side, determining which IP address to use is even harder than determining the administrative hostname, since you don't know what route packets will take to get to the nameNode. On some OSes (Solaris), if you have interface IPs and VIPs on that interface, you can't control which IP address will be used.

Regarding option #3: On startup, massive clusters really pound on the nameNode, delaying startup. The nameNode's already very busy. Worse, I'd hate if the cluster had extended difficulties coming up because DNS lookups were either slow or busted entirely.


[jira] Commented: (HADOOP-985) Namenode should identify DataNodes as ip:port instead of hostname:port

Sebastian Nagel (Jira)
In reply to this post by Sebastian Nagel (Jira)

    [ https://issues.apache.org/jira/browse/HADOOP-985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12471003 ]

Raghu Angadi commented on HADOOP-985:
-------------------------------------


I prefer #2 as well. This could be the default behavior and if dfs.datanode.dns.interface is specified, then we can use the ip of the specific interface (this might be required for some special cases).

Instead of modifying RPC so that the namenode sees the remote ip, the datanode can report both its ip and hostname. The datanode can open a UDP socket to the namenode and check the local ip of the socket; I think it does not even need to send any packets. Either way, it does not need the namenode to be up, nor does it have to wait for a namenode response.
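
The UDP idea can be sketched as follows, assuming only the standard java.net.DatagramSocket API. OutboundAddressSketch and localAddressFor are hypothetical names. Connecting a datagram socket sends no packets; it only asks the OS to pick a route, and with it a local address. The exact result can vary by platform (some return a wildcard address), so treat this as a sketch rather than a guaranteed technique.

```java
import java.net.DatagramSocket;
import java.net.InetSocketAddress;
import java.net.SocketException;

/** Hypothetical helper; not Hadoop code. */
class OutboundAddressSketch {

  /**
   * Returns the local ip the OS would use to reach the given host and port,
   * or null if the socket could not be created or connected.
   */
  static String localAddressFor(String namenodeHost, int namenodePort) {
    try (DatagramSocket socket = new DatagramSocket()) {
      // For UDP, connect() only records the peer and selects a route; nothing is sent.
      socket.connect(new InetSocketAddress(namenodeHost, namenodePort));
      return socket.getLocalAddress().getHostAddress();
    } catch (SocketException e) {
      return null;
    }
  }
}
```

If namenodeHost is a hostname rather than an ip literal, constructing the InetSocketAddress triggers a DNS lookup on the datanode side.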

The datanode can resolve that ip to get its hostname. This won't always match 'hostname -f'. I will check how exactly we currently get the hostname.




[jira] Commented: (HADOOP-985) Namenode should identify DataNodes as ip:port instead of hostname:port

Sebastian Nagel (Jira)
In reply to this post by Sebastian Nagel (Jira)

    [ https://issues.apache.org/jira/browse/HADOOP-985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12471031 ]

Owen O'Malley commented on HADOOP-985:
--------------------------------------

I think the best way to support rpc calls being able to find the IP address of the caller would be to have a static method in RPC that uses a thread-local variable to return the IP address of the caller. Clearly the RPC framework would set the variable before calling the method on the server and clear it when it was done. Something like:

  /**
   * Get the host ip address of the caller. Only valid on the server while running the remote procedure.
   * @return the dotted ip address of the caller or NULL if not in an RPC call
   */
  public static String getHostAddress() { ... }
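
A minimal sketch of how such a thread-local could be wired up. RpcSketch and handleCall are hypothetical names standing in for the RPC server's dispatch code; only getHostAddress() comes from the proposal above, and the real framework may set and clear the value differently.

```java
import java.util.function.Supplier;

/** Hypothetical stand-in for the RPC framework; not Hadoop code. */
class RpcSketch {

  // One value per handler thread, so concurrent calls never see each other's caller.
  private static final ThreadLocal<String> CALLER_ADDRESS = new ThreadLocal<>();

  /** Returns the caller's ip, or null when not inside an RPC call. */
  static String getHostAddress() {
    return CALLER_ADDRESS.get();
  }

  /** Assumed dispatch step: set the caller ip, run the method, always clear afterwards. */
  static <T> T handleCall(String remoteIp, Supplier<T> method) {
    CALLER_ADDRESS.set(remoteIp);
    try {
      return method.get();
    } finally {
      CALLER_ADDRESS.remove();
    }
  }
}
```

Clearing in a finally block matters because handler threads are typically reused; without it, a later call on the same thread could observe a stale address.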

[jira] Commented: (HADOOP-985) Namenode should identify DataNodes as ip:port instead of hostname:port

Sebastian Nagel (Jira)
In reply to this post by Sebastian Nagel (Jira)

    [ https://issues.apache.org/jira/browse/HADOOP-985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12471033 ]

Doug Cutting commented on HADOOP-985:
-------------------------------------

I think Owen's design is good: a static method that references a thread local.  I'd put the static method on Server, though, not RPC, and call it getClientAddress().


[jira] Commented: (HADOOP-985) Namenode should identify DataNodes as ip:port instead of hostname:port

Sebastian Nagel (Jira)
In reply to this post by Sebastian Nagel (Jira)

    [ https://issues.apache.org/jira/browse/HADOOP-985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12471046 ]

Raghu Angadi commented on HADOOP-985:
-------------------------------------


I was thinking of a thread local as well, but was not sure whether it was normal practice. Will do that.

Regarding hostname, should we just let the Datanode behave pretty much as it does now and not bother resolving it on the Namenode?


Updated: (HADOOP-985) Namenode should identify DataNodes as ip:port instead of hostname:port

Sebastian Nagel (Jira)
In reply to this post by Sebastian Nagel (Jira)

     [ https://issues.apache.org/jira/browse/HADOOP-985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Raghu Angadi updated HADOOP-985:
--------------------------------

    Attachment: dfshealth.html


With this fix, what we display on the dfs front page changes. The href for each datanode will now have the ip address. See the attached dfshealth.html. The following comment in dfshealth.jsp describes what we display:

    /* Say the datanode is dn1.hadoop.apache.org with ip 192.168.0.5
       we use:
       1) d.getHostName():d.getPort() to display.
           Domain and port are stripped if they are common across the nodes.
           i.e. "dn1"
       2) d.getHostName():d.getPort() for "title".
          i.e. "dn1.hadoop.apache.org:50010"
       3) d.getHost():d.getInfoPort() for url.
          i.e. "http://192.168.0.5:50075/..."
          Note that "d.getHost():d.getPort()" is what DFS clients use
          to interact with datanodes.
    */
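
The stripping rule in item 1 can be sketched as below. This is not the dfshealth.jsp code: DisplayNameSketch, suffixOf and displayNames are hypothetical names, and the sketch makes a simplifying assumption that the domain and port are stripped together, and only when the whole ".domain:port" suffix is identical on every node.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper; not the dfshealth.jsp implementation. */
class DisplayNameSketch {

  /** Returns everything from the first '.', e.g. ".hadoop.apache.org:50010", or "" if none. */
  static String suffixOf(String hostPort) {
    int dot = hostPort.indexOf('.');
    return dot < 0 ? "" : hostPort.substring(dot);
  }

  /** Strips the ".domain:port" suffix from each name when all names share it. */
  static List<String> displayNames(List<String> hostPorts) {
    List<String> out = new ArrayList<>();
    if (hostPorts.isEmpty()) {
      return out;
    }
    String common = suffixOf(hostPorts.get(0));
    for (String hostPort : hostPorts) {
      if (!suffixOf(hostPort).equals(common)) {
        common = ""; // suffixes differ, so nothing is stripped
        break;
      }
    }
    for (String hostPort : hostPorts) {
      out.add(hostPort.substring(0, hostPort.length() - common.length()));
    }
    return out;
  }
}
```

With "dn1.hadoop.apache.org:50010" and "dn2.hadoop.apache.org:50010" as input, the display names come out as "dn1" and "dn2".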

Yes, the datanode hrefs don't look good. But one advantage is that we can easily see what the namenode and clients see.

Commented: (HADOOP-985) Namenode should identify DataNodes as ip:port instead of hostname:port

Sebastian Nagel (Jira)
In reply to this post by Sebastian Nagel (Jira)

    [ https://issues.apache.org/jira/browse/HADOOP-985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12471851 ]

Raghu Angadi commented on HADOOP-985:
-------------------------------------


Ok, I switched (2) and (3) above. "title" (hover) shows 192.168.0.5:50010 and href will have hostname.


Updated: (HADOOP-985) Namenode should identify DataNodes as ip:port instead of hostname:port

Sebastian Nagel (Jira)
In reply to this post by Sebastian Nagel (Jira)

     [ https://issues.apache.org/jira/browse/HADOOP-985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Raghu Angadi updated HADOOP-985:
--------------------------------

    Attachment: HADOOP-985-1.patch

Attached a patch for using ips in the namenode. It adds an extra field, hostName, to DatanodeID, but the field is not serialized.
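
To illustrate what "not serialized" means here, a rough sketch follows. DatanodeIdSketch is a hypothetical class, not the patched DatanodeID; the field layout and the write/readFields method shapes are assumptions chosen to resemble a typical DataOutput/DataInput based format.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Hypothetical stand-in for DatanodeID; not the code from the patch. */
class DatanodeIdSketch {
  String name;     // "ip:port"; written to the stream
  String hostName; // kept in memory for display only; never written

  DatanodeIdSketch() {}

  DatanodeIdSketch(String name, String hostName) {
    this.name = name;
    this.hostName = hostName;
  }

  void write(DataOutput out) throws IOException {
    out.writeUTF(name); // hostName is deliberately left out
  }

  void readFields(DataInput in) throws IOException {
    name = in.readUTF(); // hostName stays whatever it was (null on a fresh object)
  }

  /** Serializes and deserializes an id to show which fields survive. */
  static DatanodeIdSketch roundTrip(DatanodeIdSketch id) {
    try {
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      id.write(new DataOutputStream(bytes));
      DatanodeIdSketch copy = new DatanodeIdSketch();
      copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
      return copy;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

In this sketch, a receiver that deserializes the id sees only the "ip:port" name, so the stream format is the same as it would be without the extra field.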

I tested with a deliberately wrong config so that each datanode gets "localhost" as its hostname. Namenode web page lists "localhost" for all the nodes but the cluster just-works :).

Updated: (HADOOP-985) Namenode should identify DataNodes as ip:port instead of hostname:port

Sebastian Nagel (Jira)
In reply to this post by Sebastian Nagel (Jira)

     [ https://issues.apache.org/jira/browse/HADOOP-985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Raghu Angadi updated HADOOP-985:
--------------------------------

    Attachment: HADOOP-985-2.patch

2.patch : minor typo fix.


> Namenode should identify DataNodes as ip:port instead of hostname:port
> ----------------------------------------------------------------------
>
>                 Key: HADOOP-985
>                 URL: https://issues.apache.org/jira/browse/HADOOP-985
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: dfs
>    Affects Versions: 0.11.0
>            Reporter: Raghu Angadi
>         Assigned To: Raghu Angadi
>             Fix For: 0.12.0
>
>         Attachments: dfshealth.html, HADOOP-985-1.patch, HADOOP-985-2.patch
>
>
> Right now the NameNode keeps track of DataNodes by "hostname:port". One proposal is to track datanodes by "ip:port" instead. Various concerns have been expressed regarding hostnames and ips. Please add your experiences here so that we have a better idea of what we should fix.
> How should we calculate the datanode ip:
>             1) Just like how we currently calculate the hostname, with "dfs.datanode.dns.interface" and "dfs.datanode.dns.nameserver". If the interface is specified wrongly, the datanode could report an ip like 127.0.0.1, which might or might not be intended.
>             2) The Namenode can use the remote socket address when the datanode registers. Not sure how easy it is to get this address in RPC, or whether this is desirable.
>             3) The Namenode could just resolve the hostname when a datanode registers. It could print a warning if the resolved ip and the reported ip don't match.
> One advantage of using IPs is that DFSClient does not need to resolve them when it connects to a datanode. This could save a few milliseconds for each block. Also, DFSClient should check all its ips to see whether a given ip is local.
> As far as I can see, the namenode does not resolve any DNS names in normal operation, since it does not actively contact datanodes. In that sense, I am not sure this would change Namenode performance at all.
> Thoughts?
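
The client-side check mentioned in the description (DFSClient comparing a datanode ip against all of its own ips) could look roughly like the sketch below. This is only an illustration under assumptions: LocalAddressCheck and its methods are hypothetical names and are not taken from any HADOOP-985 patch. It relies only on the standard java.net API.

```java
import java.net.InetAddress;
import java.net.NetworkInterface;
import java.net.SocketException;
import java.net.UnknownHostException;
import java.util.Collections;
import java.util.Enumeration;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper, not part of any HADOOP-985 patch.
final class LocalAddressCheck {

    // Collect every address bound to every network interface on this machine.
    static Set<InetAddress> localAddresses() {
        Set<InetAddress> result = new HashSet<>();
        try {
            Enumeration<NetworkInterface> ifaces = NetworkInterface.getNetworkInterfaces();
            if (ifaces == null) {
                return result; // no interfaces reported
            }
            for (NetworkInterface iface : Collections.list(ifaces)) {
                result.addAll(Collections.list(iface.getInetAddresses()));
            }
        } catch (SocketException e) {
            // Could not enumerate interfaces: treat every address as remote.
        }
        return result;
    }

    // True if datanodeIp (a literal such as "10.0.0.5") is one of our addresses.
    // An ip literal is parsed directly, so no DNS lookup happens here.
    static boolean isLocal(String datanodeIp, Set<InetAddress> local) {
        try {
            return local.contains(InetAddress.getByName(datanodeIp));
        } catch (UnknownHostException e) {
            return false; // malformed literal: cannot be local
        }
    }
}
```

A client would presumably compute localAddresses() once at startup and reuse the set for every block, so the per-block cost is a single hash lookup.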

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Commented: (HADOOP-985) Namenode should identify DataNodes as ip:port instead of hostname:port

Sebastian Nagel (Jira)
In reply to this post by Sebastian Nagel (Jira)

    [ https://issues.apache.org/jira/browse/HADOOP-985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12473558 ]

Hairong Kuang commented on HADOOP-985:
--------------------------------------

The open request takes the client host name as a parameter. Upon receiving an open request, the name node searches the datanode map to find the descriptor of the data node that runs on the client machine. Now that DatanodeDescriptor contains the ip address rather than the host name, this search always returns null.
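
The mismatch described above can be reproduced with a toy model. The sketch below is not the namenode's actual data structure; DatanodeLookupDemo and its methods are hypothetical names, and the addresses are made up. It only shows why a map populated with ips cannot answer a lookup by host name.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified stand-in for the namenode's datanode map; all names are hypothetical.
final class DatanodeLookupDemo {
    // Keyed by whatever identifier each datanode registered with.
    private final Map<String, String> descriptorsByHost = new HashMap<>();

    void register(String host, String descriptor) {
        descriptorsByHost.put(host, descriptor);
    }

    // Returns the descriptor of the datanode on the client's machine, or null.
    String findLocalDatanode(String clientMachine) {
        return descriptorsByHost.get(clientMachine);
    }
}
```

If a datanode registers as "10.0.0.5" and the client passes "node5.example.com", findLocalDatanode returns null even though both strings refer to the same machine. Either both sides must use the same kind of identifier, or the argument has to go away.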

Updated: (HADOOP-985) Namenode should identify DataNodes as ip:port instead of hostname:port

Sebastian Nagel (Jira)
In reply to this post by Sebastian Nagel (Jira)

     [ https://issues.apache.org/jira/browse/HADOOP-985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Raghu Angadi updated HADOOP-985:
--------------------------------

    Attachment: HADOOP-985-3.patch

Attached 3.patch. The updated patch removes the 'clientMachine' argument from ClientProtocol's open() and create(). This argument was part of the rack-aware patch.



Updated: (HADOOP-985) Namenode should identify DataNodes as ip:port instead of hostname:port

Sebastian Nagel (Jira)
In reply to this post by Sebastian Nagel (Jira)

     [ https://issues.apache.org/jira/browse/HADOOP-985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Raghu Angadi updated HADOOP-985:
--------------------------------

    Attachment: HADOOP-985-4.patch

Thanks Hairong.
Minor change in 4.patch.


Commented: (HADOOP-985) Namenode should identify DataNodes as ip:port instead of hostname:port

Sebastian Nagel (Jira)
In reply to this post by Sebastian Nagel (Jira)

    [ https://issues.apache.org/jira/browse/HADOOP-985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12473574 ]

Hairong Kuang commented on HADOOP-985:
--------------------------------------

The patch looks good. I have two comments:

1. ClientProtocolVersionNumber should be bumped, since the syntax of the open and create requests has changed.
2. DatanodeID contains the fields that need to be saved to disk. Since the new field hostName does not need to be serialized, it might be better placed in DatanodeDescriptor.
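
The second point can be sketched as follows. The classes below are simplified stand-ins with hypothetical names (SketchDatanodeID, SketchDatanodeDescriptor), and the real Hadoop classes carry more fields. The write/readFields pair only imitates the shape of Hadoop's serialization methods, using plain java.io. The point is that a field declared in the descriptor subclass, and never written by write(), cannot end up on disk.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Simplified stand-in for DatanodeID: holds only the persisted fields.
class SketchDatanodeID {
    protected String name;      // "ip:port"
    protected String storageID;

    SketchDatanodeID(String name, String storageID) {
        this.name = name;
        this.storageID = storageID;
    }

    String getName() { return name; }

    void write(DataOutput out) throws IOException {
        out.writeUTF(name);
        out.writeUTF(storageID);
    }

    void readFields(DataInput in) throws IOException {
        name = in.readUTF();
        storageID = in.readUTF();
    }
}

// Simplified stand-in for DatanodeDescriptor: adds runtime-only state.
class SketchDatanodeDescriptor extends SketchDatanodeID {
    private String hostName; // not written by write(), so it never reaches disk

    SketchDatanodeDescriptor(String name, String storageID, String hostName) {
        super(name, storageID);
        this.hostName = hostName;
    }

    String getHostName() { return hostName; }

    // Serialize a descriptor and read it back to show which fields survive.
    static SketchDatanodeDescriptor roundTrip(SketchDatanodeDescriptor d) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            d.write(new DataOutputStream(bytes));
            SketchDatanodeDescriptor copy = new SketchDatanodeDescriptor("", "", null);
            copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

After a round trip the copy keeps its "ip:port" name but has a null hostName, which is the behavior one would want if the host name is only needed while the namenode is running.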

Commented: (HADOOP-985) Namenode should identify DataNodes as ip:port instead of hostname:port

Sebastian Nagel (Jira)
In reply to this post by Sebastian Nagel (Jira)

    [ https://issues.apache.org/jira/browse/HADOOP-985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12473585 ]

Raghu Angadi commented on HADOOP-985:
-------------------------------------


Thanks Hairong. I will include both in a new patch.

This changes what DFS returns for getDatanodeHints(), which is ultimately used by mapreduce. Two options for handling this:

a) We can modify getDatanodeHints() to return what it used to return before this patch, i.e. return descriptor.getHostName() instead of descriptor.getHost(). The advantage is that no changes are necessary in mapreduce, but it does not conform to the 'ip everywhere' policy.

b) Make the job tracker and task tracker also deal in ips. I am not sure yet how intrusive this change is.

My preference is (a). Comments?
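
For illustration, option (a) amounts to something like the sketch below. HintsSketch, its nested Descriptor class and hintsFor are hypothetical names, not the real getDatanodeHints() code; only getHost() and getHostName() mirror the accessors named above, and the addresses are made up.

```java
import java.util.List;

// Hypothetical sketch of option (a); not the actual DFS implementation.
final class HintsSketch {

    // Minimal descriptor carrying both identifiers of a datanode.
    static final class Descriptor {
        private final String host;     // ip, used inside DFS
        private final String hostName; // name, still expected by mapreduce

        Descriptor(String host, String hostName) {
            this.host = host;
            this.hostName = hostName;
        }

        String getHost() { return host; }

        String getHostName() { return hostName; }
    }

    // Build the hints for one block: one host name per replica location.
    static String[] hintsFor(List<Descriptor> replicas) {
        String[] hints = new String[replicas.size()];
        for (int i = 0; i < hints.length; i++) {
            // Option (a): hand out host names, exactly as before the ip change.
            hints[i] = replicas.get(i).getHostName();
        }
        return hints;
    }
}
```

Option (b) would correspond to calling getHost() in the loop instead, at which point every consumer of the hints would have to compare ips rather than names.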



Commented: (HADOOP-985) Namenode should identify DataNodes as ip:port instead of hostname:port

Sebastian Nagel (Jira)
In reply to this post by Sebastian Nagel (Jira)

    [ https://issues.apache.org/jira/browse/HADOOP-985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12473630 ]

Hairong Kuang commented on HADOOP-985:
--------------------------------------

I also prefer option (a). I would open another jira issue to investigate the use of ip in mapred.

Updated: (HADOOP-985) Namenode should identify DataNodes as ip:port instead of hostname:port

Sebastian Nagel (Jira)
In reply to this post by Sebastian Nagel (Jira)

     [ https://issues.apache.org/jira/browse/HADOOP-985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Raghu Angadi updated HADOOP-985:
--------------------------------

    Attachment: HADOOP-985-5.patch


5.patch: includes the changes Hairong suggested.

We now send the hostname for hints. Thanks Owen; I verified that the job tracker correctly assigns the jobs.



Updated: (HADOOP-985) Namenode should identify DataNodes as ip:port instead of hostname:port

Sebastian Nagel (Jira)
In reply to this post by Sebastian Nagel (Jira)

     [ https://issues.apache.org/jira/browse/HADOOP-985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Raghu Angadi updated HADOOP-985:
--------------------------------

    Status: Patch Available  (was: Open)
