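Symptom
A tester reported that the API for fetching the product list was very slow in the test environment. I tried it and it was. My first guess was that the database query was slow, so I changed the code and reloaded nginx, and the symptom disappeared. With nothing left to reproduce, I had to dig through the logs, which showed a large number of timeouts from nginx to the upstream server it proxies to: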
2018/01/12 15:09:55 [error] 69#0: *422291 upstream timed out (110: Connection timed out) while connecting to upstream, client: 10.244.1.1, server: , request: "GET XXX", upstream: "http://10.108.67.95/XXX", host: "xxx"
2018/01/12 15:09:55 [error] 69#0: *422291 upstream timed out (110: Connection timed out) while sending to client, client: 10.244.1.1, server: , request: "GET XXX", upstream: "http://10.108.67.95/XXX", host: "xxx"
2018/01/12 15:09:55 [error] 69#0: *422291 upstream timed out (110: Connection timed out) while sending to client, client: 10.244.1.1, server: , request: "GET XXX", upstream: "http://10.108.67.95/XXX", host: "xxx"
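When there are many such entries, it helps to tally them per upstream address and per phase. The following is a minimal Python sketch that assumes the error log format shown above; `count_upstream_timeouts` is a name made up for this illustration, not part of any nginx tooling.

```python
import re
from collections import Counter

# Matches "upstream timed out" entries in the error log format shown above.
TIMEOUT_RE = re.compile(
    r'upstream timed out \(110: Connection timed out\) while (?P<phase>[^,]+), '
    r'.*?upstream: "(?P<upstream>[^"]+)"'
)


def count_upstream_timeouts(lines):
    """Return a Counter keyed by (upstream URL, phase) for timeout entries."""
    counts = Counter()
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            counts[(match.group("upstream"), match.group("phase"))] += 1
    return counts
```

Feeding it the lines of the nginx error log (for example `count_upstream_timeouts(open(path))`, where `path` is wherever your error log lives) shows at a glance which upstream address the timeouts point at.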
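Analysis
The log shows that requests to the upstream server are timing out. All of our services run in containers, and the upstream address in the log is one assigned by kubernetes. I went straight into the nginx container and pinged the upstream server's internal domain name to get its IP: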
# ping xxx.svc.cluster.local
PING xxx.svc.cluster.local (10.101.161.95) 56(84) bytes of data.
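The two IPs clearly differ (10.108.67.95 in the nginx log, 10.101.161.95 from DNS), so I checked the cluster IP of the svc currently in effect: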
[dev@k8s-cloud ~]$ kubectl get svc | grep xxx
xxx ClusterIP 10.101.161.95 <none> 7777/TCP 1d
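The manual comparison above (IP in the nginx log versus IP currently returned by DNS) can be sketched in Python. This is only an illustration: `upstream_is_stale` is a hypothetical helper, and the `resolve` parameter is injectable so the check can be tried without access to cluster DNS.

```python
import socket
from urllib.parse import urlsplit


def upstream_is_stale(logged_upstream, service_name, resolve=socket.gethostbyname):
    """Compare the IP nginx is using with what the service name resolves to now.

    logged_upstream: the upstream URL taken from the nginx error log.
    service_name:    the internal DNS name nginx was configured with.
    resolve:         callable mapping a name to an IPv4 string.
    Returns (is_stale, logged_ip, current_ip).
    """
    logged_ip = urlsplit(logged_upstream).hostname
    current_ip = resolve(service_name)
    return logged_ip != current_ip, logged_ip, current_ip
```

Run inside the nginx container with the default `resolve`, it performs the same lookup as the `ping` above.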
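At this point the problem is clear: nginx's proxy_pass is going to the wrong address. That raises a few questions:
- Why did the service cluster IP assigned by kubernetes change?
- Why did nginx not pick up the new cluster IP for the internal domain name xxx.svc.cluster.local?
- Why did we not hit this problem when releasing to production?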
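Question 1: when a svc is recreated, kubernetes assigns it a new cluster-ip. The svc status above shows that this svc was created one day ago. I asked our ops colleague, who said that traffic is cut off while database scripts are being applied. The steps are:
- delete the svc for the service
- run the sql script
- create a new svc
Thinking back, a sql script had indeed been uploaded and run the day before, so the upstream server's cluster-ip changed.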
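Question 2: why did nginx not proxy_pass to the new cluster-ip? From the documentation, a domain name configured in proxy_pass is resolved only once, when nginx starts or reloads its configuration, and that IP is used for every request afterwards.
Reference: https://www.nginx.com/blog/dns-service-discovery-nginx-plus/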
server {
location / {
proxy_pass http://backends.example.com:8080;
}
}
As NGINX starts up or reloads its configuration, it queries a DNS server to resolve backends.example.com. The DNS server returns the list of three backends discussed above, and NGINX uses the default Round Robin algorithm to load balance requests among them. NGINX chooses the DNS server from the OS configuration file /etc/resolv.conf.
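To make the resolve-once behavior concrete, here is a toy Python model of it. It is not nginx code: `FakeDNS` and `StartupResolvedProxy` are made-up names, and the model only captures the point that the name is looked up at start or reload and never in between.

```python
class FakeDNS:
    """Stand-in for cluster DNS: a mutable name -> IP table."""

    def __init__(self, records):
        self.records = dict(records)

    def resolve(self, name):
        return self.records[name]


class StartupResolvedProxy:
    """Models proxy_pass with a literal hostname: resolved at start or reload only."""

    def __init__(self, dns, name):
        self.dns = dns
        self.name = name
        self.reload()

    def reload(self):
        # The only place where the name is looked up.
        self.upstream_ip = self.dns.resolve(self.name)


dns = FakeDNS({"xxx.svc.cluster.local": "10.108.67.95"})
proxy = StartupResolvedProxy(dns, "xxx.svc.cluster.local")

# The svc is deleted and recreated, so DNS now returns a new cluster IP ...
dns.records["xxx.svc.cluster.local"] = "10.101.161.95"
stale_ip = proxy.upstream_ip   # ... but the proxy still holds the old one.

proxy.reload()
fresh_ip = proxy.upstream_ip   # Only a reload picks up the new address.
```

This also matches what happened at the start: reloading nginx made the symptom disappear because the reload re-resolved the name.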
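Question 3: from questions 1 and 2, the symptom appears when the sql script is run (which recreates the svc) and nginx is not restarted afterwards. A production release usually ships code changes as well, and the order is usually sql script first, then service update, so the newly started nginx always resolves the new cluster-ip. That is why production never showed the problem. The testers confirmed that in the test environment the service had been updated first and the sql script run afterwards, so nginx's proxy_pass resolution was never refreshed.
Solution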
Setting the Domain Name in a Variable
This method is a variant of the first, but enables us to control how often NGINX re‑resolves the domain name:
resolver 10.0.0.2 valid=10s;
server {
location / {
set $backend_servers backends.example.com;
proxy_pass http://$backend_servers:8080;
}
}
When you use a variable to specify the domain name in the proxy_pass directive, NGINX re‑resolves the domain name when its TTL expires. You must include the resolver directive to explicitly specify the name server (NGINX does not refer to /etc/resolv.conf as in the first two methods). By including the valid parameter to the resolver directive, you can tell NGINX to ignore the TTL and re‑resolve names at a specified frequency instead. Here we tell NGINX to re‑resolve names every 10 seconds.
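The effect of `valid=10s` can be pictured as a small cache with a fixed lifetime. The sketch below is a simplified Python model under that assumption, not nginx's actual resolver; `ReResolvingUpstream` and its parameters are hypothetical names, and the clock is injectable so the expiry can be exercised without waiting.

```python
import time


class ReResolvingUpstream:
    """Caches a resolved address and looks it up again after valid_seconds."""

    def __init__(self, resolve, name, valid_seconds=10.0, clock=time.monotonic):
        self._resolve = resolve      # callable: name -> IP string
        self._name = name
        self._valid = valid_seconds  # plays the role of resolver ... valid=10s
        self._clock = clock
        self._ip = None
        self._expires_at = 0.0

    def address(self):
        now = self._clock()
        if self._ip is None or now >= self._expires_at:
            # Cache is empty or expired: ask DNS again.
            self._ip = self._resolve(self._name)
            self._expires_at = now + self._valid
        return self._ip
```

With this behavior a recreated svc is picked up within the validity window instead of only at the next reload, at the cost of one extra DNS lookup per window.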
location /proxy/s/ {
internal;
proxy_pass <S_SERVER>/;
}
The configuration above was changed to:
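location /proxy/s/ {
    internal;
    # A variable in proxy_pass makes nginx re-resolve the name at run time (needs a resolver directive).
    set $service <S_SERVER>;
    # Strip the /proxy/s/ prefix; with "break" the rewritten URI is what gets proxied.
    rewrite ^/proxy/s/(.*) /$1 break;
    # No trailing "/": with variables, a URI part in proxy_pass replaces the request URI as is.
    proxy_pass $service;
}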
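The rewrite line strips the /proxy/s/ prefix before the request is proxied, which the original proxy_pass with a trailing slash used to do implicitly. A small Python sketch of that substitution, for illustration only: nginx uses PCRE with `$1` where Python uses `\1`, and `rewritten_uri` is a hypothetical helper that handles the path without the query string.

```python
import re


def rewritten_uri(request_uri, prefix="/proxy/s/"):
    """Mimic `rewrite ^/proxy/s/(.*) /$1 break;` for a request path."""
    pattern = "^" + re.escape(prefix) + "(.*)"
    # Keep whatever follows the prefix and put a single "/" in front of it.
    return re.sub(pattern, r"/\1", request_uri, count=1)
```

For example, a request for `/proxy/s/products/list` would reach the upstream as `/products/list`, while a path that does not start with the prefix is left unchanged.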