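While testing Sentinel I ran into a problem, and I have seen similar issues reported by others on GitHub.
The test environment is CentOS 7.3 with firewalld stopped and SELinux disabled. The Redis version is 3.2.9, and the test uses one master and two slaves.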
Node | Host | IP | ServerPort | SentinelPort |
---|---|---|---|---|
Node 1 | node59 | 172.16.60.59 | 6379 | 26379 |
Node 2 | node60 | 172.16.60.60 | 6379 | 26379 |
Node 3 | node61 | 172.16.60.61 | 6379 | 26379 |
172.16.60.59 is the master node. Its info replication output is:
127.0.0.1:6379> info replication
# Replication
role:master
connected_slaves:2
slave0:ip=172.16.60.60,port=6379,state=online,offset=9496017,lag=0
slave1:ip=172.16.60.61,port=6379,state=online,offset=9496017,lag=0
master_repl_offset:9496017
repl_backlog_active:1
repl_backlog_size:1048576
repl_backlog_first_byte_offset:9297179
repl_backlog_histlen:198839
Next, configure sentinel.conf on node59:
port 26379
bind 127.0.0.1 172.16.60.59
dir "/data/redis/redis_6379"
logfile "/data/redis/redis_6379/logs/sentinel.log"
unixsocket "/tmp/sentinel_26379.socket"
daemonize yes
sentinel monitor redist 172.16.60.59 6379 2
sentinel down-after-milliseconds redist 2000
sentinel parallel-syncs redist 1
sentinel failover-timeout redist 10000
sentinel auth-pass redist z5gCkXvn9XHR93MEeZfkF2t9WHk1xwlmQH4GXJxOUBr0Ghe7YtDe5jJbBGHW8jEO
Start Sentinel, and the output shows master0:name=redist,status=ok,address=172.16.60.59:6379,slaves=2,sentinels=1
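Comparing these status lines by eye gets tedious across several nodes. Here is a minimal sketch of a parser, assuming the line always has the `masterN:key=value,...` form shown above. `parse_master_line` is a hypothetical helper written for this post, not part of Redis or any client library.

```python
def parse_master_line(line):
    """Parse one 'masterN:key=value,...' line from INFO sentinel output into a dict."""
    # Split on the first ':' only; the address field contains its own ':'.
    label, _, fields = line.strip().partition(":")
    info = {"label": label}
    for field in fields.split(","):
        key, _, value = field.partition("=")
        info[key] = value
    # Counters are easier to compare as integers.
    for key in ("slaves", "sentinels"):
        if key in info:
            info[key] = int(info[key])
    return info


def looks_healthy(info, expected_slaves):
    """Hypothetical health rule: status is ok and all expected slaves are visible."""
    return info.get("status") == "ok" and info.get("slaves") == expected_slaves
```

For the line above, `looks_healthy(parse_master_line(line), 2)` is true. A Sentinel that reports `status=sdown` or `slaves=0` fails the check.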
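Next, configure Sentinel on node60, changing only the bind line to bind 127.0.0.1 172.16.60.60. After starting Sentinel, info sentinel shows master0:name=redist,status=sdown,address=172.16.60.59:6379,slaves=0,sentinels=1. Strange: the Sentinel on node59 works fine, yet the one on node60 cannot join the cluster. At first I assumed the configuration was wrong, but after checking it repeatedly and even deleting it and starting over, the result was the same. TDDL~~~ Time to clear my head and troubleshoot step by step:
(1) Used arping to check that the MAC address behind the target IP is correct, so network connectivity is fine;
(2) Used redis-cli to confirm that the ports on the master and slave nodes are reachable, which rules out a port problem!
(3) Checked the Sentinel log: it contains only +monitor master redist 172.16.60.59 6379 quorum 2 and +sdown master redist 172.16.60.59 6379, with no error messages;
With all of these ruled out, nothing seemed wrong. What now? I turned on debug logging and saw different log output: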
4402:X 27 Jun 13:54:54.503 # Sentinel ID is 3f78b3beb8f303c4dc63609d880f442be919fe64
4402:X 27 Jun 13:54:54.503 # +monitor master redist 172.16.60.59 6379 quorum 2
4402:X 27 Jun 13:54:54.503 . -cmd-link-reconnection master redist 172.16.60.59 6379 #Invalid argument
4402:X 27 Jun 13:54:54.503 . -pubsub-link-reconnection master redist 172.16.60.59 6379 #Invalid argument
4402:X 27 Jun 13:54:55.575 . -cmd-link-reconnection master redist 172.16.60.59 6379 #Invalid argument
4402:X 27 Jun 13:54:55.575 . -pubsub-link-reconnection master redist 172.16.60.59 6379 #Invalid argument
4402:X 27 Jun 13:54:56.530 # +sdown master redist 172.16.60.59 6379
4402:X 27 Jun 13:54:56.585 . -cmd-link-reconnection master redist 172.16.60.59 6379 #Invalid argument
4402:X 27 Jun 13:54:56.585 . -pubsub-link-reconnection master redist 172.16.60.59 6379 #Invalid argument
Invalid argument??? I could not make sense of it, so I searched the Sentinel source for the code related to -cmd-link-reconnection and -pubsub-link-reconnection:
    /* Commands connection. */
    if (link->cc == NULL) {
        link->cc = redisAsyncConnectBind(ri->addr->ip,ri->addr->port,NET_FIRST_BIND_ADDR);
        if (link->cc->err) {
            sentinelEvent(LL_DEBUG,"-cmd-link-reconnection",ri,"%@ #%s",
                link->cc->errstr);
            instanceLinkCloseConnection(link,link->cc);
        } else {
            link->pending_commands = 0;
            link->cc_conn_time = mstime();
            link->cc->data = link;
            redisAeAttach(server.el,link->cc);
            redisAsyncSetConnectCallback(link->cc,
                    sentinelLinkEstablishedCallback);
            redisAsyncSetDisconnectCallback(link->cc,
                    sentinelDisconnectCallback);
            sentinelSendAuthIfNeeded(ri,link->cc);
            sentinelSetClientName(ri,link->cc,"cmd");

            /* Send a PING ASAP when reconnecting. */
            sentinelSendPing(ri);
        }
    }
    /* Pub / Sub */
    if ((ri->flags & (SRI_MASTER|SRI_SLAVE)) && link->pc == NULL) {
        link->pc = redisAsyncConnectBind(ri->addr->ip,ri->addr->port,NET_FIRST_BIND_ADDR);
        if (link->pc->err) {
            sentinelEvent(LL_DEBUG,"-pubsub-link-reconnection",ri,"%@ #%s",
                link->pc->errstr);
            instanceLinkCloseConnection(link,link->pc);
        } else {
            int retval;

            link->pc_conn_time = mstime();
            link->pc->data = link;
            redisAeAttach(server.el,link->pc);
            redisAsyncSetConnectCallback(link->pc,
                    sentinelLinkEstablishedCallback);
            redisAsyncSetDisconnectCallback(link->pc,
                    sentinelDisconnectCallback);
            sentinelSendAuthIfNeeded(ri,link->pc);
            sentinelSetClientName(ri,link->pc,"pubsub");
            /* Now we subscribe to the Sentinels "Hello" channel. */
            retval = redisAsyncCommand(link->pc,
                sentinelReceiveHelloMessages, ri, "SUBSCRIBE %s",
                SENTINEL_HELLO_CHANNEL);
            if (retval != C_OK) {
                /* If we can't subscribe, the Pub/Sub connection is useless
                 * and we can simply disconnect it and try again. */
                instanceLinkCloseConnection(link,link->pc);
                return;
            }
        }
    }
This code led me to a key point: NET_FIRST_BIND_ADDR. I thought, isn't that just the first argument of bind!!! Then I found this in server.h:
    /* Get the first bind addr or NULL */
    #define NET_FIRST_BIND_ADDR (server.bindaddr_count ? server.bindaddr[0] : NULL)
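To see why the first bind address matters, it helps to reproduce what an outgoing connection with a fixed source address does at the socket level. The sketch below is a rough Python equivalent, not the hiredis implementation: it binds the local end of a TCP socket to a chosen source IP and then connects. `connect_from` and its return convention are assumptions made for this illustration.

```python
import errno
import socket


def connect_from(source_ip, target_ip, target_port, timeout=2.0):
    """Bind the local end to source_ip, then connect.

    Returns None on success, or the symbolic errno name on failure.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        try:
            # Port 0 lets the kernel pick an ephemeral local port.
            sock.bind((source_ip, 0))
            sock.connect((target_ip, target_port))
        except socket.timeout:
            return "ETIMEDOUT"
        except OSError as exc:
            # Map the numeric errno to its name, e.g. 22 -> "EINVAL".
            return errno.errorcode.get(exc.errno, str(exc))
    return None
```

On node60, calling `connect_from("127.0.0.1", "172.16.60.59", 6379)` would be expected to fail in the same way as the debug log (Invalid argument, i.e. EINVAL on Linux), while `connect_from("172.16.60.60", "172.16.60.59", 6379)` should go through. The exact error can differ on other operating systems.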
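Finally found you, good thing I didn't give up~~~
I put the private IP first, restarted Sentinel, and info sentinel now shows status=ok. Problem solved!
Summary: if bind is set in the Sentinel configuration, Sentinel uses the first IP as the source address of the connections it opens to check the master. Because 127.0.0.1 is a loopback address, a connection bound to it cannot reach IPs on other machines, so the first bind IP must not be a loopback address. For every Redis-related bind, I recommend listing the private IP first and the loopback IP second.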
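This rule is easy to check automatically before starting Sentinel. A minimal sketch, assuming the bind line lists literal IP addresses as in the configuration above. `first_bind_is_loopback` is a hypothetical helper written for this post, not a Redis tool.

```python
import ipaddress


def first_bind_is_loopback(conf_text):
    """Inspect the first address on the first 'bind' line of a config.

    Returns True or False for that address, or None if there is no bind line.
    """
    for raw in conf_text.splitlines():
        parts = raw.split()
        # Commented-out lines start with '#', so they never match 'bind'.
        if len(parts) >= 2 and parts[0].lower() == "bind":
            return ipaddress.ip_address(parts[1]).is_loopback
    return None


def check_conf(path):
    """Read a config file and report whether its bind order looks safe."""
    with open(path, encoding="utf-8") as handle:
        result = first_bind_is_loopback(handle.read())
    if result:
        return "first bind address is loopback: put the private IP first"
    return "ok"
```

With `bind 127.0.0.1 172.16.60.60` the helper returns True, and with `bind 172.16.60.60 127.0.0.1` it returns False.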
I struggled with this for a whole night and finally fixed the same problem. Leaving a comment to say thanks.
nice