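Environment: AIX 6.1 + Oracle 11gR2 RAC
Symptom: after the server was rebooted, the Oracle RAC clusterware services failed to start.
1. Starting the stack with crsctl fails with the following errors: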
# /oracle/grid_home/bin/crsctl start crs
CRS-4124: Oracle High Availability Services startup failed.
CRS-4000: Command Start failed, or completed with errors.
#
2. Checking the background processes shows that only ohasd.bin reboot is running:
# ps -ef|grep d.bin
root 9175248 31457456 0 18:11:19 pts/3 0:00 grep d.bin
root 10092704 1 0 18:07:34 - 0:00 /oracle/grid_home/bin/ohasd.bin reboot
3. crsctl_root.log shows the following:
Oracle Database 11g Clusterware Release 11.2.0.3.0 - Production Copyright 1996, 2011 Oracle. All rights reserved.
2015-12-21 18:47:49.957: [ OCRMSG][1]prom_waitconnect: CONN NOT ESTABLISHED (0,29,1,2)
2015-12-21 18:47:49.957: [ OCRMSG][1]GIPC error [29] msg [gipcretConnectionRefused]
2015-12-21 18:47:49.958: [ OCRMSG][1]prom_connect: error while waiting for connection complete [24]
2015-12-21 19:08:04.063: [ OCRMSG][1]prom_waitconnect: CONN NOT ESTABLISHED (0,29,1,2)
2015-12-21 19:08:04.063: [ OCRMSG][1]GIPC error [29] msg [gipcretConnectionRefused]
2015-12-21 19:08:04.063: [ OCRMSG][1]prom_connect: error while waiting for connection complete [24]
2015-12-21 19:25:27.773: [ OCRMSG][1]prom_waitconnect: CONN NOT ESTABLISHED (0,29,1,2)
2015-12-21 19:25:27.773: [ OCRMSG][1]GIPC error [29] msg [gipcretConnectionRefused]
4. Tracing the ohasd.bin process with truss shows it retrying the same open once per second:
open("/tmp/.oracle/npohasd", O_WRONLY|O_NONBLOCK) = -1 ENXIO (No such device or address)
rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], 0}, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
nanosleep({1, 0}, {1, 0}) = 0
open("/tmp/.oracle/npohasd", O_WRONLY|O_NONBLOCK) = -1 ENXIO (No such device or address)
rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], 0}, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
nanosleep({1, 0}, {1, 0}) = 0
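For context: on POSIX systems, a non-blocking write-only open of a named pipe fails with ENXIO when no process has the pipe open for reading. The repeated failure on /tmp/.oracle/npohasd therefore suggests that the pipe's reader was not running. The sketch below checks both conditions. It assumes that the reader is the /etc/init.ohasd script spawned by init, which is how 11gR2 is commonly described but is not stated in this post. The function names are illustrative only.

```sh
#!/bin/sh
# Sketch only: helper names are hypothetical, not part of any Oracle tooling.

# Succeeds if the given path exists and is a named pipe (FIFO).
is_named_pipe() {
    [ -p "$1" ]
}

# Reads `ps -ef` style output on stdin and succeeds if an init.ohasd
# process is listed (ASSUMED here to be the reader of the pipe).
has_init_ohasd() {
    grep -v 'grep' | grep 'init\.ohasd' >/dev/null
}

# Prints a one-line diagnosis for the pipe path given as $1,
# using the process listing supplied on stdin.
diagnose_npohasd() {
    pipe=$1
    if ! is_named_pipe "$pipe"; then
        echo "missing: $pipe is not a named pipe"
    elif has_init_ohasd; then
        echo "ok: init.ohasd is running"
    else
        echo "no reader: init.ohasd is not running"
    fi
}
```

A possible invocation on the affected node would be: ps -ef | diagnose_npohasd /tmp/.oracle/npohasd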
5. ohasd.log shows:
Created alert : (:OHAS00117:) : TIMED OUT WAITING FOR OHASD MONITOR
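Analysis: Troubleshooting showed that the network and storage were healthy and that there were no zombie processes. The logs contained no specific error either. After a while ohasd.log only reported the timeout above, which suggested that ohasd was hung. The cause turned out to be the installation assistant (install_assist) entry in /etc/inittab. The Oracle clusterware startup entry runs at the same run level as install_assist, and install_assist is started before it. The installation assistant waits for manual input on the console. We were connected over ssh, so its prompt was never seen or answered. Because the entry uses the wait action, init does not move on to the entries after it until it exits, so the clusterware startup was blocked.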
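The ordering problem described in the analysis can be checked with a small script. This is a minimal sketch, assuming the clusterware entry in /etc/inittab is the one whose command contains init.ohasd (for example an h1 entry). That entry is not shown in this post, so adjust the pattern if yours differs. The function name is hypothetical.

```sh
#!/bin/sh
# Sketch only: prints the ids of entries in the inittab-style file $1 that
# use the "wait" action at run level $2 and appear BEFORE the clusterware
# entry (ASSUMED to be the line containing "init.ohasd").
blocking_entries_before_ohasd() {
    awk -F: -v level="$2" '
        /^:/ || /^#/ || NF < 4 { next }   # skip commented-out or malformed lines
        $0 ~ /init\.ohasd/ { exit }       # stop at the clusterware entry
        # an empty run-level field means the entry applies to all levels
        $3 == "wait" && ($2 == "" || index($2, level) > 0) { print $1 }
    ' "$1"
}
```

Running blocking_entries_before_ohasd /etc/inittab 2 lists every entry that init must wait for before it reaches the clusterware entry. Any entry in that list that needs console input is a candidate for the hang seen here.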
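Solution: prevent the installation assistant from starting by commenting out the following line in /etc/inittab, then reboot the server.
# grep install /etc/inittab
install_assist:2:wait:/usr/sbin/install_assist </dev/console >/dev/console 2>&1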
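One way to apply the fix is sketched below. On AIX, an inittab entry is commented out by placing a colon at the start of the line, not a hash. The helper name is hypothetical, and the function only writes the modified copy to standard output, so the real file is untouched until you choose to replace it.

```sh
#!/bin/sh
# Sketch only: writes a copy of the inittab-style file $1 to stdout with the
# entry whose identifier is $2 commented out by a leading colon
# (the AIX inittab comment convention).
comment_out_entry() {
    awk -v id="$2" '
        index($0, id ":") == 1 { print ":" $0; next }  # line starts with "id:"
        { print }                                      # all other lines unchanged
    ' "$1"
}
```

A cautious sequence would be to copy /etc/inittab to a backup, run comment_out_entry on the backup with install_assist as the identifier, review the output, and only then write it back. AIX also ships the lsitab, chitab and rmitab commands for inspecting, changing and removing inittab entries, which avoid editing the file by hand.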
http://www.ibm.com/developerworks/cn/aix/redbooks/test191-3/