RAC Nodes Reboot due to hardware error(CPU)

Time: 2016-04-07 00:35:32 | Author: ohsdba |
Unless otherwise noted, all articles on this site are original. Reposting is welcome; please credit the source and author.

A Machine Check Exception (MCE) is a type of computer hardware error that occurs when a computer's central processing unit detects a hardware problem.

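When the processor detects a hardware error it records the error in a log, and some of these errors can be ignored. The root cause of such errors is usually hard to pin down. The errors below occurred on an X3850 server. The server's LCD panel showed normal status and IMM reported nothing abnormal, yet the server kept rebooting or going down. One clue stood out: on the other machine, /var/log/mcelog was 0 bytes, while on the problem server the log was very large. The contents of mcelog were as follows: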

Hardware event. This is not a software error.
MCE 0
CPU 28 BANK 2
ADDR 1ffff81470f70
TIME 1404889254 Wed Jul  9 15:00:54 2014
MCG status:
MCi status:
Corrected error
Error enabled
MCi_ADDR register valid
MCA: Instruction CACHE Level-0 Instruction-Fetch Error
STATUS 9400004000040150 MCGSTATUS 0
MCGCAP 1000c18 APICID e0 SOCKETID 3
CPUID Vendor Intel Family 6 Model 47
Hardware event. This is not a software error.
MCE 0
CPU 28 BANK 2
ADDR 1ffff8127cff0
TIME 1404889257 Wed Jul  9 15:00:57 2014
MCG status:
MCi status:
Corrected error
Error enabled
MCi_ADDR register valid
MCA: Instruction CACHE Level-0 Instruction-Fetch Error
STATUS 9400004000040150 MCGSTATUS 0
MCGCAP 1000c18 APICID e0 SOCKETID 3
CPUID Vendor Intel Family 6 Model 47
Hardware event. This is not a software error.
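
The flags that mcelog prints ("Corrected error", "Error enabled", "MCi_ADDR register valid") can be cross-checked against the raw STATUS value in the records above. The following is a minimal sketch, assuming the architectural bit layout of IA32_MCi_STATUS and the cache-hierarchy error code format described in the Intel SDM; `decode_status` and the lookup tables are hypothetical names for illustration, not part of mcelog.

```python
# Flag bits of IA32_MCi_STATUS (architectural layout, per the Intel SDM).
STATUS_FLAGS = {
    63: "VAL",    # register contains a valid error
    62: "OVER",   # a previous error was overwritten
    61: "UC",     # error was NOT corrected by hardware
    60: "EN",     # error reporting was enabled
    59: "MISCV",  # MCi_MISC register valid
    58: "ADDRV",  # MCi_ADDR register valid
    57: "PCC",    # processor context corrupt
}

# Sub-fields of a cache-hierarchy MCA error code: 000F 0001 RRRR TTLL.
REQUEST = {0: "ERR", 1: "RD", 2: "WR", 3: "DRD", 4: "DWR",
           5: "IRD", 6: "PREFETCH", 7: "EVICT", 8: "SNOOP"}
TRANSACTION = {0: "instruction", 1: "data", 2: "generic"}
LEVEL = {0: "L0", 1: "L1", 2: "L2", 3: "generic"}


def decode_status(status):
    """Decode a raw MCi_STATUS value (hypothetical helper, not part of mcelog)."""
    flags = [name for bit, name in STATUS_FLAGS.items() if (status >> bit) & 1]
    mca_code = status & 0xFFFF  # low 16 bits hold the MCA error code
    result = {
        "flags": flags,
        "corrected": "UC" not in flags,
        "mca_code": mca_code,
    }
    # Cache-hierarchy errors have bits 15:8 equal to 0000 0001 (bit 12 ignored).
    if (mca_code & 0xEF00) == 0x0100:
        result["cache_error"] = {
            "request": REQUEST.get((mca_code >> 4) & 0xF, "unknown"),
            "transaction": TRANSACTION.get((mca_code >> 2) & 0x3, "unknown"),
            "level": LEVEL[mca_code & 0x3],
        }
    return result


# STATUS value copied from the log records above.
info = decode_status(0x9400004000040150)
print(info["flags"], info["corrected"], info.get("cache_error"))
```

For the STATUS value in this log, the UC bit is clear, which agrees with mcelog reporting a corrected error, and the error code decodes to an instruction fetch at cache level 0.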

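The problem machine is one node of a RAC cluster, so it was of course possible that the cluster itself was causing the reboots. We analyzed the cluster logs, operating system logs, and database logs around each reboot time and found that RAC was not the cause. But with IMM (http://192.168.70.125 USERID/PASSW0RD) and the LCD panel both showing normal status, can the problem be identified from these errors alone? In my view there were 2 possibilities: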

1. Memory error

Judging from mcelog, the errors are related to memory addresses.

2. CPU error

mcelog also shows CPU information: the failing unit is always CPU 28 BANK 2.
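
To confirm that the errors really do cluster on one CPU and bank, the records in a large mcelog file can be tallied. The sketch below assumes the record layout shown in the excerpt above (a line of the form `CPU <n> BANK <n>` in each record); `count_mce_sources` is a hypothetical helper name, and the sample text is taken from the log in this article.

```python
import re
from collections import Counter


def count_mce_sources(log_text):
    """Count MCE records per (cpu, bank) pair in mcelog text output."""
    pattern = re.compile(r"^CPU (\d+) BANK (\d+)", re.MULTILINE)
    return Counter((int(cpu), int(bank)) for cpu, bank in pattern.findall(log_text))


# Two abbreviated records copied from the log excerpt in this article.
sample = """Hardware event. This is not a software error.
MCE 0
CPU 28 BANK 2
ADDR 1ffff81470f70
Hardware event. This is not a software error.
MCE 0
CPU 28 BANK 2
ADDR 1ffff8127cff0
"""

counts = count_mce_sources(sample)
for (cpu, bank), n in counts.most_common():
    print(f"CPU {cpu} BANK {bank}: {n} record(s)")
```

On a real system the text would come from reading /var/log/mcelog. If a single (CPU, BANK) pair dominates the tally, as it does here, that points toward the processor rather than a memory module.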

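After the memory was replaced, the log kept growing (on Linux the mcelog file is normally 0 bytes). We then replaced the second CPU (cpu 28; 16 cores per CPU). After a period of observation, the server has been running stably and the problem has not recurred.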

Conclusion: the culprit is not always memory; CPUs can fail too.

Reference

http://www.mcelog.org/

https://en.wikipedia.org/wiki/Machine-check_exception

http://www.oracle.com/technetwork/cn/articles/servers-storage-admin/fault-management-linux-2005816-zhs.html
