Tags: IT consulting services, PowerHA, AIX 6.1

Errors during HACMP configuration synchronization?

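The environment is as follows:

AIX 6.1 + HACMP 6.1. The two nodes are configured as follows:

Node 1:

# hostname
LPAR1

The two network adapters, en0 and en1, are configured as follows: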
en0:

HOSTNAME [LPAR1]
Internet ADDRESS (dotted decimal) [192.168.10.111]
Network MASK (dotted decimal) [255.255.255.0]

Network INTERFACE en0
NAMESERVER

     Internet ADDRESS (dotted decimal)         []
     DOMAIN Name                               []

Default Gateway

 Address (dotted decimal or symbolic name)     [192.168.1.1]

en1:

HOSTNAME [LPAR1]
Internet ADDRESS (dotted decimal) [192.168.20.111]
Network MASK (dotted decimal) [255.255.255.0]

Network INTERFACE en1
NAMESERVER

     Internet ADDRESS (dotted decimal)         []
     DOMAIN Name                               []

Default Gateway

 Address (dotted decimal or symbolic name)     [192.168.1.1]
 Cost                                          [0]

Node 2 configuration:

# hostname
LPAR2

The two network adapters, en0 and en1, are configured as follows:
en0:

HOSTNAME [LPAR2]
Internet ADDRESS (dotted decimal) [192.168.10.112]
Network MASK (dotted decimal) [255.255.255.0]

Network INTERFACE en0
NAMESERVER

     Internet ADDRESS (dotted decimal)         []
     DOMAIN Name                               []

Default Gateway

 Address (dotted decimal or symbolic name)     [192.168.1.1]
 Cost                                          [0]                                                                                     #
 Do Active Dead Gateway Detection?              no                                                                                    +

Your CABLE Type N/A +
START Now no

en1:

HOSTNAME [LPAR2]

Internet ADDRESS (dotted decimal) [192.168.20.112]
Network MASK (dotted decimal) [255.255.255.0]

Network INTERFACE en1
NAMESERVER

     Internet ADDRESS (dotted decimal)         []
     DOMAIN Name                               []

Default Gateway

 Address (dotted decimal or symbolic name)     [192.168.1.1]
 Cost                                          [0]                                                                                     #
 Do Active Dead Gateway Detection?              no                                                                                    +

Your CABLE Type N/A +
START Now no

The /etc/hosts file on both nodes contains:

# boot ip
192.168.10.111 LPAR1_boot
192.168.10.112 LPAR2_boot

# standby ip
192.168.20.111 LPAR1_standby
192.168.20.112 LPAR2_standby

# persistent ip
192.168.1.111 LPAR1
192.168.1.112 LPAR2

# service ip
192.168.1.110 LPAR_srv

The problems I am now running into are:
1. After completing the HACMP configuration, I ran smit hacmp -> Extended Configuration -> Extended Verification and Synchronization with the following options:

                                                    [Entry Fields]

Verify, Synchronize or Both [Both] +
Automatically correct errors found during [Yes] +
verification?
Force synchronization if verification fails? [No] +
Verify changes only? [No] +
Logging [Standard]
The final result was: OK
However, the log below the result showed these errors:
rshexec: cannot connect to node LPAR1
Could not run clfilecollection -u on node LPAR1.
rshexec: cannot connect to node LPAR2
Could not run clfilecollection -u on node LPAR2.

Verification has completed normally.
rshexec: cannot connect to node LPAR1
ERROR: Cannot refresh clcomdES subsystem on node LPAR1rshexec: cannot connect to node LPAR2
ERROR: Cannot refresh clcomdES subsystem on node LPAR2

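Does this error affect the HACMP configuration? How can it be resolved?
2. After running the above, I found that /etc/hosts had been automatically modified to look like this:

# boot ip
192.168.10.112 LPAR2_boot

# standby ip
192.168.20.112 LPAR2_standby

# persistent ip
192.168.1.111 LPAR1
192.168.1.112 LPAR2

# service ip
192.168.1.110 LPAR_srv
192.168.10.111 LPAR1_boot LPAR1
192.168.20.111 LPAR1_standby LPAR1
Aliases were added. What mechanism adds these aliases?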
3. After completing the configuration above, I ran smit clstart and chose to start both nodes.
The result was OK, but the log below it showed:
migcheck[475]: cl_connect() error, nodename=LPAR1, rc=-1
migcheck[475]: cl_connect() error, nodename=LPAR2, rc=-1

WARNING: A communication error was encountered trying to get the VRMF from remote nodes. Please make sure clcomd is running
Following the hint, I checked clcomd:

# lssrc -s clcomd

Subsystem Group PID Status
clcomd caa 4980916 active
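Both nodes show active. If clcomd is active, why does the warning above appear?

After starting the services as in step 3, I checked the IP addresses.
On node LPAR1:

# ifconfig -a|more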

en0: flags=1e084863,c0<UP,BROADCAST,NOTRAILERS,RUNNING,SIMPLEX,MULTICAST,GROUPRT,64BIT,CHECKSUM_OFFLOAD(ACTIVE),LARGESEND,CHAIN>

    inet 192.168.10.111 netmask 0xffffff00 broadcast 192.168.10.255
    inet 192.168.1.111 netmask 0xffffff00 broadcast 192.168.1.255
     tcp_sendspace 131072 tcp_recvspace 65536 rfc1323 0

en1: flags=1e084863,c0<UP,BROADCAST,NOTRAILERS,RUNNING,SIMPLEX,MULTICAST,GROUPRT,64BIT,CHECKSUM_OFFLOAD(ACTIVE),LARGESEND,CHAIN>

    inet 192.168.20.111 netmask 0xffffff00 broadcast 192.168.20.255
    inet 192.168.1.110 netmask 0xffffff00 broadcast 192.168.1.255
     tcp_sendspace 131072 tcp_recvspace 65536 rfc1323 0

lo0: flags=e08084b,c0<UP,BROADCAST,LOOPBACK,RUNNING,SIMPLEX,MULTICAST,GROUPRT,64BIT,LARGESEND,CHAIN>

    inet 127.0.0.1 netmask 0xff000000 broadcast 127.255.255.255
    inet6 ::1%1/0
     tcp_sendspace 131072 tcp_recvspace 131072 rfc1323 1

On node LPAR2:

# ifconfig -a|more

en0: flags=1e084863,c0<UP,BROADCAST,NOTRAILERS,RUNNING,SIMPLEX,MULTICAST,GROUPRT,64BIT,CHECKSUM_OFFLOAD(ACTIVE),LARGESEND,CHAIN>

    inet 192.168.10.112 netmask 0xffffff00 broadcast 192.168.10.255
     tcp_sendspace 131072 tcp_recvspace 65536 rfc1323 0

en1: flags=1e084863,c0<UP,BROADCAST,NOTRAILERS,RUNNING,SIMPLEX,MULTICAST,GROUPRT,64BIT,CHECKSUM_OFFLOAD(ACTIVE),LARGESEND,CHAIN>

    inet 192.168.20.112 netmask 0xffffff00 broadcast 192.168.20.255
    inet 192.168.1.112 netmask 0xffffff00 broadcast 192.168.1.255
     tcp_sendspace 131072 tcp_recvspace 65536 rfc1323 0

(...)
The IP addresses showed nothing abnormal.
Using smit hacmp -> System Management (C-SPOC) -> HACMP Services -> Show Cluster Services, the services are reported as follows:

Status of the RSCT subsystems used by HACMP:
Subsystem Group PID Status
topsvcs topsvcs 9633858 active
grpsvcs grpsvcs 13172936 active
grpglsm grpsvcs inoperative
emsvcs emsvcs 7733330 active
emaixos emsvcs inoperative
ctrmc rsct 5112004 active

Status of the HACMP subsystems:
Subsystem Group PID Status
clcomdES clcomdES 4063414 active
clstrmgrES cluster 6815944 active

Status of the optional HACMP subsystems:
Subsystem Group PID Status
clinfoES cluster 4128932 active
At first glance these statuses all look normal, but stopping cluster services on LPAR1 failed with:
Command: failed stdout: yes stderr: no
cl_clstop: ERROR: Node LPAR1 has 1 event(s) outstanding as reported by command 'lssrc -ls clstrmgrES' and cannot be stopped until all outstanding events have completed. The stop request has been aborted for all nodes. Please wait for all nodes to stabalize before attempting to stop cluster services again.
As suggested, I ran lssrc -ls clstrmgrES, with the following result:

# lssrc -ls clstrmgrES

Current state: ST_RP_FAILED
sccsid = "@(#)36 1.135.6.5 src/43haes/usr/sbin/cluster/hacmprd/main.C, hacmp.pe, 53haes_r610, 1442A_hacmp610 9/11/14 13:15:08"
i_local_nodeid 0, i_local_siteid -1, my_handle 1
ml_idx[1]=0 ml_idx[2]=1
tp is 20459278
Events on event queue:
te_type 4, te_nodeid 1, te_network -1
There are 0 events on the Ibcast queue
There are 0 events on the RM Ibcast queue
CLversion: 11
local node vrmf is 6111
cluster fix level is "1"
The following timer(s) are currently active:
Event error node list: LPAR1
Current DNP values
DNP Values for NodeId - 1 NodeName - LPAR1

PgSpFree = 128613  PvPctBusy = 0  PctTotalTimeIdle = 99.652258

DNP Values for NodeId - 2 NodeName - LPAR2

PgSpFree = 128973  PvPctBusy = 0  PctTotalTimeIdle = 99.790585

What is causing this?
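
Editorial sketch: when waiting for the cluster to settle before retrying a stop, it can help to pull the relevant fields out of the `lssrc -ls clstrmgrES` output instead of reading it by eye. The Python below is only an illustration. The function name `parse_clstrmgr_status` is hypothetical, and the line formats are assumed from the single output sample shown above, so other HACMP levels may print them differently.

```python
import re

def parse_clstrmgr_status(text):
    """Extract state, queued events and error nodes from `lssrc -ls clstrmgrES` text."""
    state = re.search(r"^Current state:\s*(\S+)", text, re.MULTILINE)
    errors = re.search(r"^Event error node list:[ \t]*(.*)$", text, re.MULTILINE)
    queued = re.findall(r"^te_type\s+(\d+),\s*te_nodeid\s+(\d+)", text, re.MULTILINE)
    return {
        "state": state.group(1) if state else None,
        "error_nodes": errors.group(1).split() if errors else [],
        "queued_events": [(int(t), int(n)) for t, n in queued],
    }

# Excerpt of the output posted in the question.
SAMPLE = """\
Current state: ST_RP_FAILED
Events on event queue:
te_type 4, te_nodeid 1, te_network -1
There are 0 events on the Ibcast queue
Event error node list: LPAR1
"""

status = parse_clstrmgr_status(SAMPLE)
print(status)
```

For the sample, this reports the state ST_RP_FAILED, one queued event for node id 1, and LPAR1 in the event error node list, which matches the "1 event(s) outstanding" message from cl_clstop.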

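wangmj, System Operations Engineer, CES

The alias issue in the hosts file is most likely caused by the node names you used when configuring HA.
After you brought the services up, your two persistent IPs ended up on adapters in different VLANs, which also does not look right.
As for the clcomd service, I suggest checking the official documentation for how this part should be configured. I have never configured version 6, so I am not sure myself.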

Banking · 2018-01-16
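
Editorial sketch: the remark about the two persistent IPs sitting on different adapters can be checked against the `ifconfig -a` output in the question. The Python below is a rough illustration, not a PowerHA tool. The helper names `addresses_by_interface` and `interface_holding` are hypothetical, the flag lists in the samples are abbreviated, and the parser assumes the AIX-style `inet ... netmask 0x...` lines shown in the question.

```python
import ipaddress
import re

def addresses_by_interface(ifconfig_text):
    """Map each interface name to the IPv4 interfaces listed under it."""
    result = {}
    current = None
    for line in ifconfig_text.splitlines():
        header = re.match(r"^(\w+): flags=", line)
        if header:
            current = header.group(1)
            result[current] = []
            continue
        inet = re.match(r"^\s+inet (\S+) netmask (0x[0-9a-fA-F]{8})", line)
        if inet and current is not None:
            # Convert the hex netmask (e.g. 0xffffff00) to dotted decimal.
            mask = str(ipaddress.IPv4Address(int(inet.group(2), 16)))
            result[current].append(ipaddress.ip_interface(f"{inet.group(1)}/{mask}"))
    return result

def interface_holding(addr_map, ip):
    """Return the name of the interface that carries `ip`, or None."""
    for name, ifaces in addr_map.items():
        if any(str(iface.ip) == ip for iface in ifaces):
            return name
    return None

# Samples taken from the question, flags abbreviated.
LPAR1 = """\
en0: flags=1e084863,c0<UP,BROADCAST,RUNNING>
    inet 192.168.10.111 netmask 0xffffff00 broadcast 192.168.10.255
    inet 192.168.1.111 netmask 0xffffff00 broadcast 192.168.1.255
en1: flags=1e084863,c0<UP,BROADCAST,RUNNING>
    inet 192.168.20.111 netmask 0xffffff00 broadcast 192.168.20.255
    inet 192.168.1.110 netmask 0xffffff00 broadcast 192.168.1.255
"""
LPAR2 = """\
en0: flags=1e084863,c0<UP,BROADCAST,RUNNING>
    inet 192.168.10.112 netmask 0xffffff00 broadcast 192.168.10.255
en1: flags=1e084863,c0<UP,BROADCAST,RUNNING>
    inet 192.168.20.112 netmask 0xffffff00 broadcast 192.168.20.255
    inet 192.168.1.112 netmask 0xffffff00 broadcast 192.168.1.255
"""

lpar1 = addresses_by_interface(LPAR1)
lpar2 = addresses_by_interface(LPAR2)
print("LPAR1 persistent IP on", interface_holding(lpar1, "192.168.1.111"))
print("LPAR2 persistent IP on", interface_holding(lpar2, "192.168.1.112"))
```

With the posted output, the persistent address of LPAR1 is found on en0 (alongside the 192.168.10.0/24 boot address) and that of LPAR2 on en1 (alongside 192.168.20.0/24).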

chunchun2012
So which part of the configuration is wrong? Any pointers would be appreciated, thanks.
2018-01-16
wangmj
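I get the feeling you have not yet worked out what the boot, standby, and persistent IPs in HA are actually for. There is no need to apply them mechanically; they are just names. I suggest downloading the official Redbook and going through the PowerHA 6 configuration steps once. I have only used versions 5 and 7, so I cannot really say how exactly version 6 should be configured.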
2018-01-16
chunchun2012
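This is a test environment, and the names were chosen just to make things easy to tell apart. I spent quite a while earlier working out what these IPs are for. The boot IP is the IP the machine comes up with, that is, the one specified in the adapter configuration. The standby IP can be thought of as a backup of the boot type; it serves the same purpose as the boot IP, mainly internal communication between the two nodes. The persistent IP is bound to a node and can move between adapters on that node; it is mainly used for node management. The service IP is the one that serves clients and can move between the cluster nodes, keeping the service available. I followed the PowerHA documentation during configuration, and I just cannot tell where it went wrong.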
2018-01-16
wangmj
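Since this is a test environment, I suggest deleting the existing setup and configuring from scratch. When planning the hosts file, map the original boot IPs directly to the host names, like this:
# boot ip
192.168.10.111 LPAR1
192.168.10.112 LPAR2
# standby ip
192.168.20.111 LPAR1_st
192.168.20.112 LPAR2_st
# persistent ip
192.168.1.111 LPAR1_per
192.168.1.112 LPAR2_per
# service ip
192.168.1.110 LPAR_srv
Make sure that resolving a host name returns the correct address. Because the persistent IP is defined inside HA, it is not reachable before the first HA synchronization, so the names used to define the nodes should have a working IP from the start. Once the hosts file is in place, restart clcomd and verify that it is OK. As for why you cannot stop the cluster after starting it, take a careful look at your hacmp.out log.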
2018-01-17
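
Editorial sketch: one way to check that each host name maps to a single address is to scan the hosts file for names that appear on more than one line. The Python below is a minimal illustration under that assumption. The function name `names_with_multiple_addresses` is hypothetical, and the sketch says nothing about which entry the AIX resolver would actually pick.

```python
from collections import defaultdict

def names_with_multiple_addresses(hosts_text):
    """Return {name: [addresses]} for names mapped to more than one address."""
    seen = defaultdict(list)
    for raw in hosts_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        address, *names = line.split()
        for name in names:
            if address not in seen[name]:
                seen[name].append(address)
    return {name: addrs for name, addrs in seen.items() if len(addrs) > 1}

# The hosts file as it looked after synchronization (from the question).
MODIFIED = """\
# boot ip
192.168.10.112 LPAR2_boot
# standby ip
192.168.20.112 LPAR2_standby
# persistent ip
192.168.1.111 LPAR1
192.168.1.112 LPAR2
# service ip
192.168.1.110 LPAR_srv
192.168.10.111 LPAR1_boot LPAR1
192.168.20.111 LPAR1_standby LPAR1
"""

# The layout proposed in the comment above.
PROPOSED = """\
192.168.10.111 LPAR1
192.168.10.112 LPAR2
192.168.20.111 LPAR1_st
192.168.20.112 LPAR2_st
192.168.1.111 LPAR1_per
192.168.1.112 LPAR2_per
192.168.1.110 LPAR_srv
"""

conflicts = names_with_multiple_addresses(MODIFIED)
clean = names_with_multiple_addresses(PROPOSED)
print(conflicts)
print(clean)
```

On the modified file from the question, LPAR1 is reported with three addresses (192.168.1.111, 192.168.10.111, 192.168.20.111); the proposed layout yields no conflicts.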

crystalwmagic, Systems Engineer, China Zheshang Bank

Check whether you skipped configuring the /usr/es/sbin/cluster/etc/rhosts file during setup.

Banking · 2018-01-16

chunchun2012
It is configured on both sides. On LPAR1:
# more /usr/es/sbin/cluster/etc/rhosts
192.168.10.112
192.168.20.112
On LPAR2:
192.168.10.111
192.168.20.111
2018-01-16
wangmj
Did you restart the clcomd service after adding the IP addresses?
2018-01-16
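crystalwmagic, replying to chunchun2012
I suggest adding every cluster-related address to the rhosts file (boot IPs, persistent IPs, and the service IP) rather than having each LPAR list only the peer's boot addresses. After changing the rhosts and hosts files, restart the clcomd and clcomdES services with a stop followed by a start, not a refresh.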
2018-01-17
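
Editorial sketch: the suggestion above amounts to comparing the rhosts file against the full list of cluster addresses. The Python below shows one way to do that comparison. The function name `missing_from_rhosts` is hypothetical, and the address list is an assumption assembled from the /etc/hosts file in the question (standby addresses are included as well, since the asker already lists them).

```python
def missing_from_rhosts(rhosts_text, cluster_addresses):
    """Return the cluster addresses that do not appear as a line in rhosts."""
    listed = {line.strip() for line in rhosts_text.splitlines() if line.strip()}
    return [addr for addr in cluster_addresses if addr not in listed]

# Assumed from the /etc/hosts file in the question.
CLUSTER_ADDRESSES = [
    "192.168.10.111", "192.168.10.112",  # boot
    "192.168.20.111", "192.168.20.112",  # standby
    "192.168.1.111", "192.168.1.112",    # persistent
    "192.168.1.110",                     # service
]

# rhosts on LPAR1 as posted by the asker.
LPAR1_RHOSTS = """\
192.168.10.112
192.168.20.112
"""

missing = missing_from_rhosts(LPAR1_RHOSTS, CLUSTER_ADDRESSES)
print(missing)
```

For the LPAR1 file as posted, the node's own boot and standby addresses, both persistent addresses, and the service address are all reported as missing.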