HAIP Failure Prevents a RAC Node from Starting: A Solution

A reader asked me about a problem with his 11.2.0.2 RAC on AIX, which had no patch or PSU installed. After a reboot, one of the nodes could not start normally. The ocssd log showed the following:

    2014-08-09 14:21:46.094: [CSSD][5414]clssnmSendingThread: sent 4 join msgs to all nodes
    2014-08-09 14:21:46.421: [CSSD][4900]clssgmWaitOnEventValue: after CmInfo State val 3, eval 1 waited 0s
    2014-08-09 14:21:47.042: [CSSD][4129]clssnmvDHBValidateNCopy: node 1, rac01, has a disk HB, but no network HB, DHB has rcfg 217016033, wrtcnt, 255958157, LATS 1518247992, lastSeqNo 255958154, uniqueness 1406064021, timestamp 1407565306/1501758072
    2014-08-09 14:21:47.051: [CSSD][3358]clssnmvDHBValidateNCopy: node 1, rac01, has a disk HB, but no network HB, DHB has rcfg 217016033, wrtcnt, 255958158, LATS 1518248002, lastSeqNo 255958155, uniqueness 1406064021, timestamp 1407565306/1501758190
    2014-08-09 14:21:47.421: [CSSD][4900]clssgmWaitOnEventValue: after CmInfo State val 3, eval 1 waited 0
    2014-08-09 14:21:48.042: [CSSD][4129]clssnmvDHBValidateNCopy: node 1, rac01, has a disk HB, but no network HB, DHB has rcfg 217016033, wrtcnt, 255958160, LATS 1518248993, lastSeqNo 255958157, uniqueness 1406064021, timestamp 1407565307/1501759080
    2014-08-09 14:21:48.052: [CSSD][3358]clssnmvDHBValidateNCopy: node 1, rac01, has a disk HB, but no network HB, DHB has rcfg 217016033, wrtcnt, 255958161, LATS 1518249002, lastSeqNo 255958158, uniqueness 1406064021, timestamp 1407565307/1501759191
    2014-08-09 14:21:48.421: [CSSD][4900]clssgmWaitOnEventValue: after CmInfo State val 3, eval 1 waited 0
    2014-08-09 14:21:49.043: [CSSD][4129]clssnmvDHBValidateNCopy: node 1, rac01, has a disk HB, but no network HB, DHB has rcfg 217016033, wrtcnt, 255958163, LATS 1518249993, lastSeqNo 255958160, uniqueness 1406064021, timestamp 1407565308/1501760082
    2014-08-09 14:21:49.056: [CSSD][3358]clssnmvDHBValidateNCopy: node 1, rac01, has a disk HB, but no network HB, DHB has rcfg 217016033, wrtcnt, 255958164, LATS 1518250007, lastSeqNo 255958161, uniqueness 1406064021, timestamp 1407565308/1501760193
    2014-08-09 14:21:49.421: [CSSD][4900]clssgmWaitOnEventValue: after CmInfo State val 3, eval 1 waited 0
    2014-08-09 14:21:50.044: [CSSD][4129]clssnmvDHBValidateNCopy: node 1, rac01, has a disk HB, but no network HB, DHB has rcfg 217016033, wrtcnt, 255958166, LATS 1518250994, lastSeqNo 255958163, uniqueness 1406064021, timestamp 1407565309/1501761090
    2014-08-09 14:21:50.057: [CSSD][3358]clssnmvDHBValidateNCopy: node 1, rac01, has a disk HB, but no network HB, DHB has rcfg 217016033, wrtcnt, 255958167, LATS 1518251007, lastSeqNo 255958164, uniqueness 1406064021, timestamp 1407565309/1501761195
    2014-08-09 14:21:50.421: [CSSD][4900]clssgmWaitOnEventValue: after CmInfo State val 3, eval 1 waited 0
    2014-08-09 14:21:51.046: [CSSD][4129]clssnmvDHBValidateNCopy: node 1, rac01, has a disk HB, but no network HB, DHB has rcfg 217016033, wrtcnt, 255958169, LATS 1518251996, lastSeqNo 255958166, uniqueness 1406064021, timestamp 1407565310/1501762100
    2014-08-09 14:21:51.057: [CSSD][3358]clssnmvDHBValidateNCopy: node 1, rac01, has a disk HB, but no network HB, DHB has rcfg 217016033, wrtcnt, 255958170, LATS 1518252008, lastSeqNo 255958167, uniqueness 1406064021, timestamp 1407565310/1501762205
    2014-08-09 14:21:51.102: [CSSD][5414]clssnmSendingThread: sending join msg to all nodes
    2014-08-09 14:21:51.102: [CSSD][5414]clssnmSendingThread: sent 5 join msgs to all nodes
    2014-08-09 14:21:51.421: [CSSD][4900]clssgmWaitOnEventValue: after CmInfo State val 3, eval 1 waited 0
    2014-08-09 14:21:52.050: [CSSD][4129]clssnmvDHBValidateNCopy: node 1, rac01, has a disk HB, but no network HB, DHB has rcfg 217016033, wrtcnt, 255958172, LATS 1518253000, lastSeqNo 255958169, uniqueness 1406064021, timestamp 1407565311/1501763110
    2014-08-09 14:21:52.058: [CSSD][3358]clssnmvDHBValidateNCopy: node 1, rac01, has a disk HB, but no network HB, DHB has rcfg 217016033, wrtcnt, 255958173, LATS 1518253008, lastSeqNo 255958170, uniqueness 1406064021, timestamp 1407565311/1501763230
    2014-08-09 14:21:52.089: [CSSD][5671]clssnmRcfgMgrThread: Local Join
    2014-08-09 14:21:52.089: [CSSD][5671]clssnmLocalJoinEvent: begin on node(2), waittime 193000
    2014-08-09 14:21:52.089: [CSSD][5671]clssnmLocalJoinEvent: set curtime (1518253039) for my node
    2014-08-09 14:21:52.089: [CSSD][5671]clssnmLocalJoinEvent: scanning 32 nodes
    2014-08-09 14:21:52.089: [CSSD][5671]clssnmLocalJoinEvent: Node rac01, number 1, is in an existing cluster with disk state 3
    2014-08-09 14:21:52.090: [CSSD][5671]clssnmLocalJoinEvent: takeover aborted due to cluster member node found on disk
    2014-08-09 14:21:52.431: [CSSD][4900]clssgmWaitOnEventValue: after CmInfo State val 3, eval 1 waited 0

From this output it is easy to conclude that the heartbeat is the problem. That reading is not wrong, but the heartbeat involved here is not the traditional private interconnect as we usually understand it. I asked him to run the following query on a node where CRS was healthy, and the result explains the cause:

    SQL> select name,ip_address from v$cluster_interconnects;

    NAME            IP_ADDRESS
    --------------- ----------------
    en0             169.254.116.242

Why is the interconnect IP in the 169 subnet? It clearly does not match what we configured in /etc/hosts. The answer is the HAIP feature introduced in Oracle 11gR2. Oracle added HAIP so that interconnect redundancy could be provided by Oracle's own technology instead of depending on third-party mechanisms such as Linux bonding. In releases before HAIP, if the interconnect NICs were bonded at the OS level, Oracle simply used the OS bonding. Since HAIP was introduced, if no interconnect redundancy is configured at the OS level, Oracle's own HAIP is enabled. So although you configured a private address of your own, Oracle actually uses the 169.254 subnet. You can confirm this in the alert log, so I will not go into detail here.

On the healthy node an IP in the 169 subnet is present on the interconnect interface, while on the problem node no 169 address can be seen.

Oracle MOS offers one fix, which is to start the HAIP resource manually:

    crsctl start res -init

In testing, this did not work either, even when run as root.

According to the Oracle MOS documentation, a failure to start HAIP usually comes down to one of the following:

1) A faulty interconnect NIC
2) Multicast not working
3) A firewall or similar
4) An Oracle bug

A faulty interconnect NIC is easy to rule out: if there is only one interconnect NIC, pinging the other node's private IP is enough to verify it.
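The comparison between the healthy node and the problem node comes down to whether any interface address falls in the link-local 169.254.0.0/16 range that HAIP uses. As a rough sketch only (it was not part of the original troubleshooting; the helper name `haip_addresses` and the problem-node sample address are assumptions made for illustration), the check could be written with Python's standard `ipaddress` module:

```python
import ipaddress

# HAIP addresses come from the IPv4 link-local block 169.254.0.0/16.
HAIP_NETWORK = ipaddress.ip_network("169.254.0.0/16")


def haip_addresses(addresses):
    """Return the address strings that fall inside 169.254.0.0/16.

    `addresses` is any iterable of IPv4 address strings, for example the
    IP_ADDRESS column of v$cluster_interconnects or addresses copied from
    ifconfig output. Hypothetical helper, for illustration only.
    """
    return [
        addr for addr in addresses
        if ipaddress.ip_address(addr.strip()) in HAIP_NETWORK
    ]


# Address reported on the healthy node in this article.
healthy_node = ["169.254.116.242"]
# Hypothetical private address on a node where HAIP did not come up.
problem_node = ["192.168.1.2"]

print(haip_addresses(healthy_node))  # non-empty list: a HAIP address is present
print(haip_addresses(problem_node))  # empty list: no HAIP address on this node
```

An empty result on one node while the other node returns an address matches the symptom described above.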

Multicast problems can be checked with the mcasttest.pl script provided by Oracle (see Grid Infrastructure Startup During Patching, Install or Upgrade May Fail Due to Multicasting Requirement (ID 1212703.1)). My result was as follows:

    $ ./mcasttest.pl -n rac02,rac01 -i en0
    # Setup for node rac02 #
    Checking node access 'rac02'
    Checking node login 'rac02'
    Checking/Creating Directory /tmp/mcasttest for binary on node 'rac02'
    Distributing mcast2 binary to node 'rac02'
    # Setup for node rac01 #
    Checking node access 'rac01'
    Checking node login 'rac01'
    Checking/Creating Directory /tmp/mcasttest for binary on node 'rac01'
    Distributing mcast2 binary to node 'rac01'
    # testing Multicast on all nodes #
    Test for Multicast address 230.0.1.0
    Aug 11 21:39:39 | Multicast Failed for en0 using address 230.0.1.0:42000
    Test for Multicast address 224.0.0.251
    Aug 11 21:40:09 | Multicast Failed for en0 using address 224.0.0.251:42001
    $

The script reports that multicast fails on both the 230 and the 224 addresses, but that does not necessarily mean multicast is the cause, even though searching ocssd.log for the keyword mcast does turn up related messages. In fact, in my own 11.2.0.3 Linux RAC environment, CRS starts normally even when the mcasttest.pl test fails.

Since this reader's platform is AIX, I ruled out a firewall problem. That leaves Bug 9974223 as the most likely cause. If you look up HAIP, you will find that the feature has quite a few Oracle bugs; the MOS note on known HAIP issues in 11gR2/12c Grid Infrastructure (1640865.1) alone records 12 HAIP-related bugs.

Because his node 1 cannot be touched, we have to keep the number of operations to a minimum for safety. As for HAIP, if multiple interconnect NICs are not in use, I think it can simply be disabled. The MOS documentation I checked yesterday specifically says that it cannot be disabled, but my testing shows that it can. My test went as follows:

    [root@rac1 bin]# ./crsctl modify res -attr "ENABLED=0" -init
    [root@rac1 bin]# ./crsctl stop crs
    CRS-2791: Starting shutdown of Oracle High Availability Services-managed resources on 'rac1'
    CRS-2673: Attempting to stop 'ora.crsd' on 'rac1'
    CRS-2790: Starting shutdown of Cluster Ready Services-managed resources on 'rac1'
    CRS-2673: Attempting to stop 'ora.oc4j' on 'rac1'
    CRS-2673: Attempting to stop 'ora.cvu' on 'rac1'
    CRS-2673: Attempting to stop 'ora.LISTENER_SCAN1.lsnr' on 'rac1'
    CRS-2673: Attempting to stop '' on 'rac1'
    CRS-2673: Attempting to stop '' on 'rac1'
    CRS-2673: Attempting to stop 'ora.rac1.vip' on 'rac1'
    CRS-2677: Stop of 'ora.rac1.vip' on 'rac1' succeeded
    CRS-2672: Attempting to start 'ora.rac1.vip' on 'rac2'
    CRS-2677: Stop of 'ora.LISTENER_SCAN1.lsnr' on 'rac1' succeeded
    CRS-2673: Attempting to stop 'ora.scan1.vip' on 'rac1'
    CRS-2677: Stop of 'ora.scan1.vip' on 'rac1' succeeded
    CRS-2672: Attempting to start 'ora.scan1.vip' on 'rac2'
    CRS-2676: Start of 'ora.rac1.vip' on 'rac2' succeeded
    CRS-2676: Start of 'ora.scan1.vip' on 'rac2' succeeded
    CRS-2672: Attempting to start 'ora.LISTENER_SCAN1.lsnr' on 'rac2'
    CRS-2676: Start of 'ora.LISTENER_SCAN1.lsnr' on 'rac2' succeeded
    CRS-2677: Stop of '' on 'rac1' succeeded
    CRS-2677: Stop of 'ora.oc4j' on 'rac1' succeeded
    CRS-2677: Stop of 'ora.cvu' on 'rac1' succeeded
    CRS-2677: Stop of '' on 'rac1' succeeded
    CRS-2673: Attempting to stop 'ora.asm' on 'rac1'
    CRS-2677: Stop of 'ora.asm' on 'rac1' succeeded
    CRS-2673: Attempting to stop 'ora.ons' on 'rac1'
    CRS-2677: Stop of 'ora.ons' on 'rac1' succeeded
    CRS-2673: Attempting to stop 'work' on 'rac1'
    CRS-2677: Stop of 'work' on 'rac1' succeeded
    CRS-2792: Shutdown of Cluster Ready Services-managed resources on 'rac1' has completed
    CRS-2677: Stop of 'ora.crsd' on 'rac1' succeeded
    CRS-2673: Attempting to stop '' on 'rac1'
    CRS-2673: Attempting to stop 'ora.ctssd' on 'rac1'
    CRS-2673: Attempting to stop 'ora.evmd' on 'rac1'
    CRS-2673: Attempting to stop 'ora.asm' on 'rac1'
    CRS-2673: Attempting to stop 'ora.mdnsd' on 'rac1'
    CRS-2677: Stop of 'ora.mdnsd' on 'rac1' succeeded
    CRS-2677: Stop of 'ora.evmd' on 'rac1' succeeded
    CRS-2677: Stop of 'ora.ctssd' on 'rac1' succeeded
    CRS-2677: Stop of 'ora.asm' on 'rac1' succeeded
    CRS-2673: Attempting to stop '' on 'rac1'
    CRS-2677: Stop of '' on 'rac1' succeeded
    CRS-2673: Attempting to stop 'ora.cssd' on 'rac1'
    CRS-2677: Stop of 'ora.cssd' on 'rac1' succeeded
    CRS-2673: Attempting to stop 'ora.crf' on 'rac1'
    CRS-2677: Stop of '' on 'rac1' succeeded
    CRS-2677: Stop of 'ora.crf' on 'rac1' succeeded
    CRS-2673: Attempting to stop 'ora.gipcd' on 'rac1'
    CRS-2677: Stop of 'ora.gipcd' on 'rac1' succeeded
    CRS-2673: Attempting to stop 'ora.gpnpd' on 'rac1'
    CRS-2677: Stop of 'ora.gpnpd' on 'rac1' succeeded
    CRS-2793: Shutdown of Oracle High Availability Services-managed resources on 'rac1' has completed
    CRS-4133: Oracle High Availability Services has been stopped.
    [root@rac1 bin]# ./crsctl start crs
    CRS-4123: Oracle High Availability Services has been started.
    [root@rac1 bin]# ./crsctl check crs
    CRS-4638: Oracle High Availability Services is online
    CRS-4537: Cluster Ready Services is online
    CRS-4529: Cluster Synchronization Services is online
    CRS-4533: Event Manager is online
    [root@rac1 bin]# ./crsctl stat res -t -init
    NAME               TARGET  STATE    SERVER  STATE_DETAILS
    Cluster Resources
    ora.asm          1 ONLINE  ONLINE   rac1    Started
                     1 ONLINE  OFFLINE
    ora.crf          1 ONLINE  ONLINE   rac1
    ora.crsd         1 ONLINE  ONLINE   rac1
    ora.cssd         1 ONLINE  ONLINE   rac1
    ora.cssdmonitor  1 ONLINE  ONLINE   rac1
    ora.ctssd        1 ONLINE  ONLINE   rac1    ACTI
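The `crsctl check crs` step in the test above is easy to script when the same verification has to be repeated after each restart. The following is only a minimal sketch under stated assumptions: the function name `services_not_online` is made up for illustration, the output is assumed to have been captured to a string already, and a stack component is treated as healthy only if its line ends with "is online", as in the four lines shown in this article.

```python
def services_not_online(check_output):
    """Return the CRS-* lines that do not end with 'is online'.

    `check_output` is the captured text of `crsctl check crs`.
    Hypothetical helper, for illustration only.
    """
    problems = []
    for line in check_output.splitlines():
        line = line.strip()
        if not line.startswith("CRS-"):
            continue  # skip shell prompts and blank lines
        if not line.endswith("is online"):
            problems.append(line)
    return problems


# Output copied from the test run in this article.
sample = """\
CRS-4638: Oracle High Availability Services is online
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online
"""

print(services_not_online(sample))  # prints []
```

An empty list means every reported component was online. The sketch says nothing about individual resources such as HAIP, which still have to be read from `crsctl stat res -t -init`.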
