The environment is as follows:
Node 1:
LPAR1
The two NICs, en0 and en1, are configured as follows:
en0:
Network INTERFACE en0
NAMESERVER
Internet ADDRESS (dotted decimal) []
DOMAIN Name []
Default Gateway
Address (dotted decimal or symbolic name) [192.168.1.1]
en1:
Network INTERFACE en1
NAMESERVER
Internet ADDRESS (dotted decimal) []
DOMAIN Name []
Default Gateway
Address (dotted decimal or symbolic name) [192.168.1.1]
Cost [0]
Node 2 configuration:
LPAR2
The two NICs, en0 and en1, are configured as:
en0:
Network INTERFACE en0
NAMESERVER
Internet ADDRESS (dotted decimal) []
DOMAIN Name []
Default Gateway
Address (dotted decimal or symbolic name) [192.168.1.1]
Cost [0] #
Do Active Dead Gateway Detection? no +
Your CABLE Type N/A +
START Now no
en1:
Network INTERFACE en1
NAMESERVER
Internet ADDRESS (dotted decimal) []
DOMAIN Name []
Default Gateway
Address (dotted decimal or symbolic name) [192.168.1.1]
Cost [0] #
Do Active Dead Gateway Detection? no +
Your CABLE Type N/A +
START Now no
Contents of /etc/hosts on both nodes:
192.168.10.111 LPAR1_boot
192.168.10.112 LPAR2_boot
192.168.20.111 LPAR1_standby
192.168.20.112 LPAR2_standby
192.168.1.111 LPAR1
192.168.1.112 LPAR2
192.168.1.110 LPAR_srv
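To make the address plan above easier to read, here is a minimal sketch that groups the /etc/hosts labels by subnet. It assumes a /24 mask for every address, matching the netmask 0xffffff00 that ifconfig reports later in the post; the function name is hypothetical. It shows boot, standby and the LPAR1/LPAR2/LPAR_srv addresses each on a separate subnet.

```python
import ipaddress
from collections import defaultdict

# The /etc/hosts entries from the post (labels are the poster's own).
HOSTS = """\
192.168.10.111 LPAR1_boot
192.168.10.112 LPAR2_boot
192.168.20.111 LPAR1_standby
192.168.20.112 LPAR2_standby
192.168.1.111 LPAR1
192.168.1.112 LPAR2
192.168.1.110 LPAR_srv
"""

def group_by_subnet(hosts_text, prefix=24):
    """Map each subnet (assumed /24) to the IP labels that live on it."""
    groups = defaultdict(list)
    for line in hosts_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        addr, *names = line.split()
        net = ipaddress.ip_network(f"{addr}/{prefix}", strict=False)
        groups[str(net)].extend(names)
    return dict(groups)

for net, labels in group_by_subnet(HOSTS).items():
    print(net, labels)
```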
The problems I am running into are:
1. After finishing the HACMP configuration, I ran smit hacmp -> Extended Configuration -> Extended Verification and Synchronization, with the options set as:
[Entry Fields]
Verification has completed normally.
rshexec: cannot connect to node LPAR1
ERROR: Cannot refresh clcomdES subsystem on node LPAR1
rshexec: cannot connect to node LPAR2
ERROR: Cannot refresh clcomdES subsystem on node LPAR2
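Does this error affect the HACMP configuration? How do I fix it?
2. After running the above, I found that /etc/hosts had been changed automatically to the following: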
192.168.10.112 LPAR2_boot
192.168.20.112 LPAR2_standby
192.168.1.111 LPAR1
192.168.1.112 LPAR2
192.168.1.110 LPAR_srv
192.168.10.111 LPAR1_boot LPAR1
192.168.20.111 LPAR1_standby LPAR1
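Aliases were added. What mechanism adds these aliases?
3. After completing the above configuration, I ran smit clstart and chose to start both nodes.
The command result was OK, but the log below it showed: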
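This sketch does not explain why the aliases were added, but it shows their effect: after the change, the name LPAR1 appears on three lines with three different addresses, so it no longer maps to a single IP. The helper name is hypothetical; the input is the modified file exactly as posted.

```python
from collections import defaultdict

# /etc/hosts after verification, as posted.
MODIFIED_HOSTS = """\
192.168.10.112 LPAR2_boot
192.168.20.112 LPAR2_standby
192.168.1.111 LPAR1
192.168.1.112 LPAR2
192.168.1.110 LPAR_srv
192.168.10.111 LPAR1_boot LPAR1
192.168.20.111 LPAR1_standby LPAR1
"""

def names_with_multiple_addresses(hosts_text):
    """Return {name: [ip, ...]} for every name listed with more than one address."""
    seen = defaultdict(list)
    for line in hosts_text.splitlines():
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        addr, *names = line.split()
        for name in names:
            if addr not in seen[name]:
                seen[name].append(addr)
    return {n: ips for n, ips in seen.items() if len(ips) > 1}

print(names_with_multiple_addresses(MODIFIED_HOSTS))
# → {'LPAR1': ['192.168.1.111', '192.168.10.111', '192.168.20.111']}
```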
migcheck[475]: cl_connect() error, nodename=LPAR1, rc=-1
migcheck[475]: cl_connect() error, nodename=LPAR2, rc=-1
WARNING: A communication error was encountered trying to get the VRMF from remote nodes. Please make sure clcomd is running
Following the hint, I checked clcomd:
Subsystem Group PID Status
clcomd caa 4980916 active
Both nodes show it as active. If it is active, why do I get the warning above?
After starting the services as in step 3, I checked the IP addresses.
On node LPAR1:
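One detail worth checking: the verification errors name clcomdES, while the status check above is for clcomd in group caa, and Show Cluster Services later lists clcomdES as a separate subsystem. A minimal sketch that reads lssrc-style tables and looks a subsystem up by its exact name; the set of status words and the function name are assumptions, and it assumes every row has a group column, as in the post.

```python
# Rows taken from the post: the clcomd check and the HACMP subsystem listing.
SAMPLE = """\
Subsystem Group PID Status
clcomd caa 4980916 active
Status of the HACMP subsystems:
Subsystem Group PID Status
clcomdES clcomdES 4063414 active
clstrmgrES cluster 6815944 active
"""

# Assumed set of SRC status words; extend it if lssrc prints others.
STATUSES = {"active", "inoperative", "stopping"}

def parse_lssrc(text):
    """Return {subsystem: (group, status)} from lssrc-style table text.

    Inoperative rows print no PID, so the group is taken from column 2
    and the status from the last column.
    """
    table = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[-1] in STATUSES:
            table[fields[0]] = (fields[1], fields[-1])
    return table

status = parse_lssrc(SAMPLE)
print(status["clcomd"])    # ('caa', 'active')
print(status["clcomdES"])  # ('clcomdES', 'active')
```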
en0: flags=1e084863,c0<UP,BROADCAST,NOTRAILERS,RUNNING,SIMPLEX,MULTICAST,GROUPRT,64BIT,CHECKSUM_OFFLOAD(ACTIVE),LARGESEND,CHAIN>
inet 192.168.10.111 netmask 0xffffff00 broadcast 192.168.10.255
inet 192.168.1.111 netmask 0xffffff00 broadcast 192.168.1.255
tcp_sendspace 131072 tcp_recvspace 65536 rfc1323 0
en1: flags=1e084863,c0<UP,BROADCAST,NOTRAILERS,RUNNING,SIMPLEX,MULTICAST,GROUPRT,64BIT,CHECKSUM_OFFLOAD(ACTIVE),LARGESEND,CHAIN>
inet 192.168.20.111 netmask 0xffffff00 broadcast 192.168.20.255
inet 192.168.1.110 netmask 0xffffff00 broadcast 192.168.1.255
tcp_sendspace 131072 tcp_recvspace 65536 rfc1323 0
lo0: flags=e08084b,c0<UP,BROADCAST,LOOPBACK,RUNNING,SIMPLEX,MULTICAST,GROUPRT,64BIT,LARGESEND,CHAIN>
inet 127.0.0.1 netmask 0xff000000 broadcast 127.255.255.255
inet6 ::1%1/0
tcp_sendspace 131072 tcp_recvspace 131072 rfc1323 1
On node LPAR2:
en0: flags=1e084863,c0<UP,BROADCAST,NOTRAILERS,RUNNING,SIMPLEX,MULTICAST,GROUPRT,64BIT,CHECKSUM_OFFLOAD(ACTIVE),LARGESEND,CHAIN>
inet 192.168.10.112 netmask 0xffffff00 broadcast 192.168.10.255
tcp_sendspace 131072 tcp_recvspace 65536 rfc1323 0
en1: flags=1e084863,c0<UP,BROADCAST,NOTRAILERS,RUNNING,SIMPLEX,MULTICAST,GROUPRT,64BIT,CHECKSUM_OFFLOAD(ACTIVE),LARGESEND,CHAIN>
inet 192.168.20.112 netmask 0xffffff00 broadcast 192.168.20.255
inet 192.168.1.112 netmask 0xffffff00 broadcast 192.168.1.255
tcp_sendspace 131072 tcp_recvspace 65536 rfc1323 0
(...)
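The output above can be checked mechanically for which NIC each 192.168.1.x address landed on: LPAR1 carries 192.168.1.111 on en0 (the 192.168.10.x network) while LPAR2 carries 192.168.1.112 on en1 (the 192.168.20.x network). A minimal sketch, assuming AIX `ifconfig -a` layout as posted; the flag lists are shortened and the helper names are hypothetical.

```python
import re

# Trimmed ifconfig output from the post (flag lists shortened).
LPAR1_IFCONFIG = """\
en0: flags=1e084863,c0<UP,BROADCAST,RUNNING>
        inet 192.168.10.111 netmask 0xffffff00 broadcast 192.168.10.255
        inet 192.168.1.111 netmask 0xffffff00 broadcast 192.168.1.255
en1: flags=1e084863,c0<UP,BROADCAST,RUNNING>
        inet 192.168.20.111 netmask 0xffffff00 broadcast 192.168.20.255
        inet 192.168.1.110 netmask 0xffffff00 broadcast 192.168.1.255
"""
LPAR2_IFCONFIG = """\
en0: flags=1e084863,c0<UP,BROADCAST,RUNNING>
        inet 192.168.10.112 netmask 0xffffff00 broadcast 192.168.10.255
en1: flags=1e084863,c0<UP,BROADCAST,RUNNING>
        inet 192.168.20.112 netmask 0xffffff00 broadcast 192.168.20.255
        inet 192.168.1.112 netmask 0xffffff00 broadcast 192.168.1.255
"""

IFACE_RE = re.compile(r"^(\w+):\s+flags=")
INET_RE = re.compile(r"^\s*inet\s+(\S+)")  # "inet6" lines do not match

def addresses_by_interface(text):
    """Return {interface: [IPv4 address, ...]} from ifconfig -a text."""
    result, current = {}, None
    for line in text.splitlines():
        m = IFACE_RE.match(line)
        if m:
            current = m.group(1)
            result[current] = []
            continue
        m = INET_RE.match(line)
        if m and current is not None:
            result[current].append(m.group(1))
    return result

def where_is(addr, text):
    """Name the interface currently holding addr, or None."""
    for iface, addrs in addresses_by_interface(text).items():
        if addr in addrs:
            return iface
    return None

print(where_is("192.168.1.111", LPAR1_IFCONFIG))  # en0
print(where_is("192.168.1.112", LPAR2_IFCONFIG))  # en1
```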
Nothing abnormal shows up in the IP configuration.
Using smit hacmp -> System Management (C-SPOC) -> HACMP Services -> Show Cluster Services,
the services are shown running as follows:
Status of the RSCT subsystems used by HACMP:
Subsystem Group PID Status
topsvcs topsvcs 9633858 active
grpsvcs grpsvcs 13172936 active
grpglsm grpsvcs inoperative
emsvcs emsvcs 7733330 active
emaixos emsvcs inoperative
ctrmc rsct 5112004 active
Status of the HACMP subsystems:
Subsystem Group PID Status
clcomdES clcomdES 4063414 active
clstrmgrES cluster 6815944 active
Status of the optional HACMP subsystems:
Subsystem Group PID Status
clinfoES cluster 4128932 active
At first glance all of these states look normal, but when I ran stop cluster services on LPAR1, it failed with:
Command: failed stdout: yes stderr: no
cl_clstop: ERROR: Node LPAR1 has 1 event(s) outstanding as reported by command 'lssrc -ls clstrmgrES' and cannot be stopped until all outstanding events have completed. The stop request has been aborted for all nodes. Please wait for all nodes to stabalize before attempting to stop cluster services again.
As the message suggests, I ran lssrc -ls clstrmgrES; the result is:
Current state: ST_RP_FAILED
sccsid = "@(#)36 1.135.6.5 src/43haes/usr/sbin/cluster/hacmprd/main.C, hacmp.pe, 53haes_r610, 1442A_hacmp610 9/11/14 13:15:08"
i_local_nodeid 0, i_local_siteid -1, my_handle 1
ml_idx[1]=0 ml_idx[2]=1
tp is 20459278
Events on event queue:
te_type 4, te_nodeid 1, te_network -1
There are 0 events on the Ibcast queue
There are 0 events on the RM Ibcast queue
CLversion: 11
local node vrmf is 6111
cluster fix level is "1"
The following timer(s) are currently active:
Event error node list: LPAR1
Current DNP values
DNP Values for NodeId - 1 NodeName - LPAR1
PgSpFree = 128613 PvPctBusy = 0 PctTotalTimeIdle = 99.652258
DNP Values for NodeId - 2 NodeName - LPAR2
PgSpFree = 128973 PvPctBusy = 0 PctTotalTimeIdle = 99.790585
What is causing this?
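The facts behind the refused stop are spread across the `lssrc -ls clstrmgrES` output: the `Current state` line (ST_RP_FAILED here), the entry on the event queue, and the `Event error node list` naming LPAR1. A minimal sketch that pulls those three items out of an excerpt of the posted output; the function name is hypothetical and the parsing assumes the line layout shown above.

```python
import re

# Excerpt of the posted `lssrc -ls clstrmgrES` output (lines kept verbatim).
CLSTRMGR_OUTPUT = """\
Current state: ST_RP_FAILED
Events on event queue:
te_type 4, te_nodeid 1, te_network -1
There are 0 events on the Ibcast queue
local node vrmf is 6111
Event error node list: LPAR1
"""

def summarize_clstrmgr(text):
    """Return (state, queued_events, error_nodes) from clstrmgrES long status."""
    state, queued, error_nodes = None, [], []
    in_queue = False
    for raw in text.splitlines():
        line = raw.strip()
        if in_queue and line.startswith("te_type"):
            # "te_type 4, te_nodeid 1, te_network -1" -> {"te_type": 4, ...}
            queued.append({k: int(v) for k, v in re.findall(r"(\w+) (-?\d+)", line)})
            continue
        in_queue = False  # the queue listing ends at the first non-te_type line
        if line.startswith("Current state:"):
            state = line.split(":", 1)[1].strip()
        elif line.startswith("Events on event queue:"):
            in_queue = True
        elif line.startswith("Event error node list:"):
            error_nodes = line.split(":", 1)[1].split()
    return state, queued, error_nodes

print(summarize_clstrmgr(CLSTRMGR_OUTPUT))
# → ('ST_RP_FAILED', [{'te_type': 4, 'te_nodeid': 1, 'te_network': -1}], ['LPAR1'])
```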
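The alias problem in the hosts file is most likely caused by the node names you used when configuring HA.
After you brought the services up, your two persistent IPs ended up on NICs in different VLANs, which also doesn't look quite right.
As for the clcomd service, I suggest checking the official documentation for how this part should be configured; I haven't configured version 6, so I'm not very sure.
Did you skip configuring the /usr/es/sbin/cluster/etc/rhosts file during setup?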
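My understanding is that in HACMP 6.1 clcomdES checks incoming connections against /usr/es/sbin/cluster/etc/rhosts, which lists one address per line; please confirm against the IBM documentation for your level. A minimal sketch that builds that list from the posted /etc/hosts. Which labels to include (here boot, standby and persistent, leaving out LPAR_srv) is an assumption, as is the helper name; the script only prints, it does not write the file.

```python
import ipaddress

# The original /etc/hosts from the post.
HOSTS = """\
192.168.10.111 LPAR1_boot
192.168.10.112 LPAR2_boot
192.168.20.111 LPAR1_standby
192.168.20.112 LPAR2_standby
192.168.1.111 LPAR1
192.168.1.112 LPAR2
192.168.1.110 LPAR_srv
"""

# Hypothetical selection: boot, standby and persistent labels of both nodes.
NODE_LABELS = {"LPAR1_boot", "LPAR2_boot", "LPAR1_standby",
               "LPAR2_standby", "LPAR1", "LPAR2"}

def rhosts_content(hosts_text, labels):
    """Build rhosts text: one IP address per line for every matching label."""
    ips = []
    for line in hosts_text.splitlines():
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        addr, *names = line.split()
        if labels.intersection(names):
            ipaddress.ip_address(addr)  # raises ValueError on a malformed address
            if addr not in ips:
                ips.append(addr)
    return "".join(ip + "\n" for ip in ips)

print(rhosts_content(HOSTS, NODE_LABELS), end="")
```

The same file content would go on both nodes, after which clcomdES typically needs a refresh (for example `refresh -s clcomdES`) to pick it up.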