While installing Hadoop on a cluster, I ran into the following problem.
All nodes came up and worked fine, except that the secondary namenode failed
during doCheckpoint, and with a puzzling HTTP 403 error at that.
// secondary namenode log
2011-10-24 17:09:12,255 INFO org.apache.hadoop.security.UserGroupInformation: Initiating re-login for hadoop/[email protected]
2011-10-24 17:09:22,917 ERROR org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: Exception in doCheckpoint:
2011-10-24 17:09:22,918 ERROR org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: java.io.IOException: Server returned HTTP response code: 403 for URL: https://hz169-91.i.site.com:50475/getimage?getimage=1
...
My first suspicion was a Kerberos authentication problem, but the secondary
namenode had already authenticated with Kerberos.
Then I suspected the namenode was rejecting the secondary namenode's request,
but the namenode log showed that the request had passed authentication and
authorization as well. (hadoop/[email protected] is the secondary namenode's
Kerberos principal, and hadoop/[email protected] is the namenode's Kerberos
principal.)
// namenode log
2011-10-25 11:24:33,927 WARN org.apache.hadoop.hdfs.server.namenode.GetImageServlet: Received non-NN/SNN request for image or edits from 123.58.169.92
2011-10-25 11:27:40,033 INFO SecurityLogger.org.apache.hadoop.ipc.Server: Auth successfull for hadoop/[email protected]
2011-10-25 11:27:40,100 INFO SecurityLogger.org.apache.hadoop.security.authorize.ServiceAuthorizationManager: Authorization successfull for hadoop/[email protected] for protocol=interface org.apache.hadoop.hdfs.server.protocol.NamenodeProtocol
2011-10-25 11:27:40,101 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Roll Edit Log from 123.58.169.92
...
To make testing easier, you can shrink the checkpoint period (in seconds):
// hdfs-site.xml
<property>
  <name>fs.checkpoint.period</name>
  <value>5</value>
</property>
After that came all kinds of suspicions, guesses, and attempts, none of which
led anywhere. There is plenty of Hadoop material online, but very little of it
covers securing Hadoop with Kerberos.
So I decided to help myself: find the code that raises this error and add some
logging to it, starting the analysis from doGet.
The code was downloaded from here. A gripe about Cloudera: the cdh3u1 source is
really hard to find; I guessed my way to it from the directory structure.
public void doGet(final HttpServletRequest request,
                  final HttpServletResponse response
                  ) throws ServletException, IOException {
  Map<String, String[]> pmap = request.getParameterMap();
  try {
    ServletContext context = getServletContext();
    final FSImage nnImage = (FSImage)context.getAttribute("name.system.image");
    final TransferFsImage ff = new TransferFsImage(pmap, request, response);
    final Configuration conf =
        (Configuration)getServletContext().getAttribute(JspHelper.CURRENT_CONF);

    if (UserGroupInformation.isSecurityEnabled() &&
        !isValidRequestor(request.getRemoteUser(), conf)) {
      response.sendError(HttpServletResponse.SC_FORBIDDEN,
          "Only Namenode and Secondary Namenode may access this servlet");
      LOG.warn("Received non-NN/SNN request for image or edits from "
          + request.getRemoteHost());
      return;
    }

private boolean isValidRequestor(String remoteUser, Configuration conf)
    throws IOException {
  if (remoteUser == null) {
    LOG.warn("Received null remoteUser while authorizing access to getImage servlet");
    return false;
  }

  String[] validRequestors = {
      SecurityUtil.getServerPrincipal(conf
          .get(DFS_NAMENODE_KRB_HTTPS_USER_NAME_KEY), NameNode.getAddress(
          conf).getHostName()),
      SecurityUtil.getServerPrincipal(conf.get(DFS_NAMENODE_USER_NAME_KEY),
          NameNode.getAddress(conf).getHostName()),
      SecurityUtil.getServerPrincipal(conf
          .get(DFS_SECONDARY_NAMENODE_KRB_HTTPS_USER_NAME_KEY),
          SecondaryNameNode.getHttpAddress(conf).getHostName()),
      SecurityUtil.getServerPrincipal(conf
          .get(DFS_SECONDARY_NAMENODE_USER_NAME_KEY), SecondaryNameNode
          .getHttpAddress(conf).getHostName()) };

  for (String v : validRequestors) {
    if (v != null && v.equals(remoteUser)) {
      if (LOG.isDebugEnabled()) LOG.debug("isValidRequestor is allowing: " + remoteUser);
      return true;
    }
  }
  if (LOG.isDebugEnabled()) LOG.debug("isValidRequestor is rejecting: " + remoteUser);
  return false;
}
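The exact log statements added before recompiling are not shown here; a minimal
sketch, assuming they were plain LOG.warn calls dropped into isValidRequestor
(the marker strings are taken from the namenode output further down):

// Hypothetical debug lines inside isValidRequestor(); the real ones are not
// shown in this post, but the "***" markers correspond to the log output below.
LOG.warn("*********** RemoteUser is " + remoteUser);
for (String v : validRequestors) {
  LOG.warn("******** validRequestors = " + v);
}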
After compiling, simply replace the corresponding .class files inside
hadoop-core-0.20.2-cdh3u1.jar. The three .class files to replace:
org/apache/hadoop/hdfs/server/namenode/GetImageServlet$1$1.class
org/apache/hadoop/hdfs/server/namenode/GetImageServlet$1.class
org/apache/hadoop/hdfs/server/namenode/GetImageServlet.class
The resulting log output:
// namenode log
2011-10-25 15:53:33,927 WARN org.apache.hadoop.hdfs.server.namenode.GetImageServlet: Received non-NN/SNN request for image or edits from 123.58.169.92
2011-10-25 15:53:38,969 INFO SecurityLogger.org.apache.hadoop.ipc.Server: Auth successfull for hadoop/[email protected]
2011-10-25 15:53:39,067 INFO SecurityLogger.org.apache.hadoop.security.authorize.ServiceAuthorizationManager: Authorization successfull for hadoop/[email protected] for protocol=interface org.apache.hadoop.hdfs.server.protocol.NamenodeProtocol
2011-10-25 15:53:39,068 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Roll Edit Log from 123.58.169.92
2011-10-25 15:53:39,083 WARN org.apache.hadoop.hdfs.server.namenode.GetImageServlet: *********** RemoteUser is hadoop/[email protected]
2011-10-25 15:53:49,296 WARN org.apache.hadoop.hdfs.server.namenode.GetImageServlet: ******** validRequestors = hadoop/[email protected]
2011-10-25 15:53:49,297 WARN org.apache.hadoop.hdfs.server.namenode.GetImageServlet: ******** validRequestors = hadoop/[email protected]
2011-10-25 15:53:49,297 WARN org.apache.hadoop.hdfs.server.namenode.GetImageServlet: ******** validRequestors = host/[email protected]
2011-10-25 15:53:49,297 WARN org.apache.hadoop.hdfs.server.namenode.GetImageServlet: ******** validRequestors = hadoop/[email protected]
2011-10-25 15:53:49,298 WARN org.apache.hadoop.hdfs.server.namenode.GetImageServlet: Received non-NN/SNN request for image or edits from 123.58.169.92
Clearly, the remoteUser (hadoop/[email protected]) does not match the
validRequestor entry computed for the secondary namenode
(hadoop/[email protected]): the host component of the principal differs.
Then I remembered how the principals are configured in hdfs-site.xml:
// hdfs-site.xml
<property>
  <name>dfs.secondary.namenode.kerberos.principal</name>
  <value>hadoop/[email protected]</value>
</property>
<property>
  <name>dfs.secondary.namenode.kerberos.https.principal</name>
  <value>host/[email protected]</value>
</property>
The _HOST substitution had to be what was going wrong. As a quick test I
replaced _HOST with hz169-92.i.site.com (see the sketch below), restarted, and
the problem went away!
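For the record, this temporary workaround just hardcodes the secondary
namenode's hostname, presumably in both of the properties shown above; a sketch
of what the changed values would look like, with YOUR-REALM standing in for the
cluster's actual Kerberos realm:

// hdfs-site.xml (temporary workaround; reverted once the real fix at the end is in place)
<property>
  <name>dfs.secondary.namenode.kerberos.principal</name>
  <value>hadoop/hz169-92.i.site.com@YOUR-REALM</value>
</property>
<property>
  <name>dfs.secondary.namenode.kerberos.https.principal</name>
  <value>host/hz169-92.i.site.com@YOUR-REALM</value>
</property>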
The problem was worked around, but why did the substitution go wrong in the
first place? When the secondary namenode started up, its Kerberos login
succeeded and the login user was hadoop/[email protected]; in other words,
at that point the _HOST substitution was correct (there it was being resolved
on the secondary namenode itself).
Back to the code:
      SecurityUtil.getServerPrincipal(conf
          .get(DFS_SECONDARY_NAMENODE_USER_NAME_KEY), SecondaryNameNode
          .getHttpAddress(conf).getHostName()) };

public static InetSocketAddress getHttpAddress(Configuration conf) {
  String infoAddr = NetUtils.getServerAddress(conf,
      "dfs.secondary.info.bindAddress",
      "dfs.secondary.info.port",
      "dfs.secondary.http.address");
  return NetUtils.createSocketAddr(infoAddr);
}
Since dfs.secondary.http.address was not set, getHttpAddress() falls back to
the default 0.0.0.0:50090, so getHostName() returns 0.0.0.0. So what does
SecurityUtil.getServerPrincipal() do when it is handed 0.0.0.0?
public static String getServerPrincipal(String principalConfig,
    String hostname) throws IOException {
  String[] components = getComponents(principalConfig);
  if (components == null || components.length != 3
      || !components[1].equals(HOSTNAME_PATTERN)) {
    return principalConfig;
  } else {
    return replacePattern(components, hostname);
  }
}

private static String replacePattern(String[] components, String hostname)
    throws IOException {
  String fqdn = hostname;
  if (fqdn == null || fqdn.equals("") || fqdn.equals("0.0.0.0")) {
    fqdn = getLocalHostName();
  }
  return components[0] + "/" + fqdn + "@" + components[2];
}

static String getLocalHostName() throws UnknownHostException {
  return InetAddress.getLocalHost().getCanonicalHostName();
}
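So when the hostname is 0.0.0.0, replacePattern() silently substitutes the
hostname of whichever machine evaluates it: on the namenode that is hz169-91,
not the secondary namenode's hz169-92, so the computed SNN principal can never
match. A minimal standalone sketch of that substitution (not the Hadoop classes
themselves; EXAMPLE.COM is a placeholder realm):

import java.net.InetAddress;
import java.net.UnknownHostException;

// Standalone sketch mimicking SecurityUtil.replacePattern(); hostnames are the
// ones from this cluster, EXAMPLE.COM is a made-up realm for illustration.
public class PrincipalSubstitutionDemo {

    // Build user/host@realm the way replacePattern() does: use the given
    // hostname, falling back to the local canonical hostname for null/""/0.0.0.0.
    static String resolve(String user, String hostname, String realm)
            throws UnknownHostException {
        String fqdn = hostname;
        if (fqdn == null || fqdn.isEmpty() || fqdn.equals("0.0.0.0")) {
            fqdn = InetAddress.getLocalHost().getCanonicalHostName();
        }
        return user + "/" + fqdn + "@" + realm;
    }

    public static void main(String[] args) throws UnknownHostException {
        // dfs.secondary.http.address unset -> getHttpAddress() yields 0.0.0.0,
        // so the machine running this substitutes its *own* hostname; on the
        // namenode that produces a hz169-91 principal, never the SNN's hz169-92.
        System.out.println(resolve("hadoop", "0.0.0.0", "EXAMPLE.COM"));

        // With dfs.secondary.http.address set to hz169-92.i.site.com:50090,
        // the secondary namenode's real hostname is substituted instead.
        System.out.println(resolve("hadoop", "hz169-92.i.site.com", "EXAMPLE.COM"));
    }
}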
OK: configure dfs.secondary.http.address, change the principal instance
(hostname) back to _HOST, restart, and the problem is properly solved!
// hdfs-site.xml
<property>
  <name>dfs.secondary.http.address</name>
  <value>hz169-92.i.site.com:50090</value>
  <description>
    The secondary namenode http server address and port.
    If the port is 0 then the server will start on a free port.
  </description>
</property>