Problem Description
We recently hit a problem in a project: at one point the application produced a large number of ES access timeouts. The main exception in the logs was:
java.io.IOException: Connection reset by peer
at org.elasticsearch.client.RestClient.extractAndWrapCause(RestClient.java:793) ~[elasticsearch-rest-client-7.4.0.jar!/:7.4.0]
at org.elasticsearch.client.RestClient.performRequest(RestClient.java:218) ~[elasticsearch-rest-client-7.4.0.jar!/:7.4.0]
at org.elasticsearch.client.RestClient.performRequest(RestClient.java:205) ~[elasticsearch-rest-client-7.4.0.jar!/:7.4.0]
at org.elasticsearch.client.RestHighLevelClient.internalPerformRequest(RestHighLevelClient.java:1454) ~[elasticsearch-rest-high-level-client-7.4.0.jar!/:7.4.0]
at org.elasticsearch.client.RestHighLevelClient.performRequest(RestHighLevelClient.java:1439) ~[elasticsearch-rest-high-level-client-7.4.0.jar!/:7.4.0]
at org.elasticsearch.client.IndicesClient.exists(IndicesClient.java:785) ~[elasticsearch-rest-high-level-client-7.4.0.jar!/:7.4.0]
at com.free4inno.scheduler.adapter.service.elasticsearch.GenericEsService.isIndexExists(GenericEsService.java:60) ~[classes!/:0.0.1-SNAPSHOT]
at com.free4inno.scheduler.adapter.service.elasticsearch.MetricsToEsService.getTodayIndex(MetricsToEsService.java:56) ~[classes!/:0.0.1-SNAPSHOT]
at com.free4inno.scheduler.adapter.service.elasticsearch.MetricsToEsService.insert(MetricsToEsService.java:49) ~[classes!/:0.0.1-SNAPSHOT]
at com.free4inno.scheduler.adapter.service.Scheduler.sendMetrics(Scheduler.java:64) [classes!/:0.0.1-SNAPSHOT]
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_261]
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_261]
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_261]
at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_261]
at org.springframework.scheduling.support.ScheduledMethodRunnable.run(ScheduledMethodRunnable.java:84) [spring-context-5.2.8.RELEASE.jar!/:5.2.8.RELEASE]
at org.springframework.scheduling.support.DelegatingErrorHandlingRunnable.run(DelegatingErrorHandlingRunnable.java:54) [spring-context-5.2.8.RELEASE.jar!/:5.2.8.RELEASE]
at org.springframework.scheduling.concurrent.ReschedulingRunnable.run(ReschedulingRunnable.java:93) [spring-context-5.2.8.RELEASE.jar!/:5.2.8.RELEASE]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_261]
at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_261]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) [?:1.8.0_261]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [?:1.8.0_261]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_261]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_261]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_261]
Caused by: java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcherImpl.read0(Native Method) ~[?:1.8.0_261]
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39) ~[?:1.8.0_261]
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223) ~[?:1.8.0_261]
at sun.nio.ch.IOUtil.read(IOUtil.java:197) ~[?:1.8.0_261]
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:378) ~[?:1.8.0_261]
at org.apache.http.impl.nio.reactor.SessionInputBufferImpl.fill(SessionInputBufferImpl.java:231) ~[httpcore-nio-4.4.13.jar!/:4.4.13]
at org.apache.http.impl.nio.codecs.AbstractMessageParser.fillBuffer(AbstractMessageParser.java:136) ~[httpcore-nio-4.4.13.jar!/:4.4.13]
at org.apache.http.impl.nio.DefaultNHttpClientConnection.consumeInput(DefaultNHttpClientConnection.java:241) ~[httpcore-nio-4.4.13.jar!/:4.4.13]
at org.apache.http.impl.nio.client.InternalIODispatch.onInputReady(InternalIODispatch.java:81) ~[httpasyncclient-4.1.4.jar!/:4.1.4]
at org.apache.http.impl.nio.client.InternalIODispatch.onInputReady(InternalIODispatch.java:39) ~[httpasyncclient-4.1.4.jar!/:4.1.4]
at org.apache.http.impl.nio.reactor.AbstractIODispatch.inputReady(AbstractIODispatch.java:114) ~[httpcore-nio-4.4.13.jar!/:4.4.13]
at org.apache.http.impl.nio.reactor.BaseIOReactor.readable(BaseIOReactor.java:162) ~[httpcore-nio-4.4.13.jar!/:4.4.13]
at org.apache.http.impl.nio.reactor.AbstractIOReactor.processEvent(AbstractIOReactor.java:337) ~[httpcore-nio-4.4.13.jar!/:4.4.13]
at org.apache.http.impl.nio.reactor.AbstractIOReactor.processEvents(AbstractIOReactor.java:315) ~[httpcore-nio-4.4.13.jar!/:4.4.13]
at org.apache.http.impl.nio.reactor.AbstractIOReactor.execute(AbstractIOReactor.java:276) ~[httpcore-nio-4.4.13.jar!/:4.4.13]
at org.apache.http.impl.nio.reactor.BaseIOReactor.execute(BaseIOReactor.java:104) ~[httpcore-nio-4.4.13.jar!/:4.4.13]
at org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor$Worker.run(AbstractMultiworkerIOReactor.java:591) ~[httpcore-nio-4.4.13.jar!/:4.4.13]
... 1 more
The exception showed the following symptoms:
① The first request on a connection times out, subsequent requests work normally, and the observed timeout equals the configured socket timeout.
② The time periods in which the exception is thrown are not fixed.
Locating the Problem
This problem had already shown up earlier while we were developing a multi-cluster management platform. Because that setup involved ③ clients and clusters on different versions, and the platform had not yet gone into production, we did not pay much attention to it and assumed it was a version-compatibility issue. Only when the project team reported the problem did we investigate it seriously.
The application connects directly through the ES High Level REST Client, and the client and server versions match, so case ③ can be ruled out. The initial suspicion was that heavy query or write traffic was overwhelming the backend. Checking the monitoring data and logs, however, showed that every metric was normal at the times the exception occurred, with no traffic spikes; combined with symptom ②, server-side load can also be ruled out.
That leaves the client as the source of the problem. The connection between the ES client and the server is long-lived: reading the source shows that the client creates a connection pool, each pooled client holds one HTTP connection, and HTTP keep-alive is enabled so connections are reused. It is exactly this reuse-and-keep-alive behavior that triggers the exception, as analyzed below.
Problem Analysis
Illustrative code for creating the RestClient before the fix:
final RestClientBuilder restClientBuilder = RestClient.builder(redisHostList)
        .setRequestConfigCallback(requestConfigBuilder -> {
            requestConfigBuilder.setConnectTimeout(5000);
            requestConfigBuilder.setSocketTimeout(60000);
            requestConfigBuilder.setConnectionRequestTimeout(500);
            return requestConfigBuilder;
        })
        .setHttpClientConfigCallback(httpAsyncClientBuilder -> {
            httpAsyncClientBuilder.disableAuthCaching();
            return httpAsyncClientBuilder;
        });
final RestHighLevelClient restHighLevelClient = new RestHighLevelClient(restClientBuilder);
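For context, the call that fails at the top of the stack trace is an index-existence check (GenericEsService.isIndexExists → IndicesClient.exists). A minimal sketch of such a helper, assuming it simply wraps the high-level client (the method shape here is illustrative, not the project's actual code):

import java.io.IOException;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.client.indices.GetIndexRequest;

// Checks whether an index exists via the same IndicesClient.exists call seen in the trace.
boolean isIndexExists(RestHighLevelClient client, String indexName) throws IOException {
    return client.indices().exists(new GetIndexRequest(indexName), RequestOptions.DEFAULT);
}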
Schematic of the connection between client and server (connection 1 is the stale connection, connection 2 is its replacement):
The high-level REST client implements long-lived connections by setting the keep-alive timeout to -1, which means the client never expires a pooled connection. The server, on the other hand, watches its connections to conserve resources: if a connection has not been used for a long time, the server closes it. The client is unaware of this and still regards the connection as established, so the connection is effectively half-open. When traffic arrives and this connection is picked, the request cannot reach the server, a timeout occurs, and the client then drops the connection (connection 1 in the figure above) and switches to a new connection 2 to talk to the server. The root cause is therefore that the client does not detect the dead connection and close it in time, so the client needs to be configured to actively probe its TCP connections and keep them alive.
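The -1 mentioned above can be illustrated directly. A minimal sketch, assuming Apache HttpClient 4.x is on the classpath: the default keep-alive strategy (the one the Apache async client falls back to when none is configured) returns -1, i.e. "reuse the connection forever", whenever the server response carries no Keep-Alive timeout header.

import org.apache.http.HttpResponse;
import org.apache.http.HttpVersion;
import org.apache.http.impl.client.DefaultConnectionKeepAliveStrategy;
import org.apache.http.message.BasicHttpResponse;
import org.apache.http.protocol.BasicHttpContext;

public class KeepAliveDefaultDemo {
    public static void main(String[] args) {
        // A response without a Keep-Alive timeout header.
        HttpResponse response = new BasicHttpResponse(HttpVersion.HTTP_1_1, 200, "OK");
        long keepAliveMillis = DefaultConnectionKeepAliveStrategy.INSTANCE
                .getKeepAliveDuration(response, new BasicHttpContext());
        // Prints -1: the pooled connection is treated as reusable indefinitely.
        System.out.println(keepAliveMillis);
    }
}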
Solutions
1. Explicitly set a keep-alive strategy on the client via setKeepAliveStrategy
httpAsyncClientBuilder.setKeepAliveStrategy((httpResponse, httpContext) -> TimeUnit.MINUTES.toMillis(3))
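For completeness, a sketch of wiring this strategy into a builder like the one above; the host is a placeholder, and 3 minutes is simply the value from the one-liner above. The chosen value should stay below the server side's idle-connection timeout so that the client always gives up the connection first.

import java.util.concurrent.TimeUnit;
import org.apache.http.HttpHost;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestClientBuilder;
import org.elasticsearch.client.RestHighLevelClient;

RestClientBuilder builder = RestClient.builder(new HttpHost("es-node-1", 9200, "http"))
        .setHttpClientConfigCallback(httpAsyncClientBuilder ->
                // Expire idle pooled connections after 3 minutes instead of the default -1
                // (never), so the client drops them before the server silently does.
                httpAsyncClientBuilder.setKeepAliveStrategy(
                        (httpResponse, httpContext) -> TimeUnit.MINUTES.toMillis(3)));
RestHighLevelClient restHighLevelClient = new RestHighLevelClient(builder);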
2. Explicitly enable TCP keepalive on the client's sockets
httpAsyncClientBuilder.setDefaultIOReactorConfig(IOReactorConfig.custom().setSoKeepAlive(true).build())
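Again as a sketch (host placeholder as before), SO_KEEPALIVE is enabled through the same HTTP client callback. This only asks the OS to send TCP-level probes on the client's sockets; the probe timing is governed by the kernel parameters in option 3, and it can be combined with the strategy from option 1.

import org.apache.http.HttpHost;
import org.apache.http.impl.nio.reactor.IOReactorConfig;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestClientBuilder;

RestClientBuilder builder = RestClient.builder(new HttpHost("es-node-1", 9200, "http"))
        .setHttpClientConfigCallback(httpAsyncClientBuilder -> httpAsyncClientBuilder
                // Turn on SO_KEEPALIVE for the client's sockets so that dead
                // connections are eventually detected by TCP keepalive probes.
                .setDefaultIOReactorConfig(IOReactorConfig.custom()
                        .setSoKeepAlive(true)
                        .build()));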
3. Shorten the TCP keepalive probe time at the OS level
net.ipv4.tcp_keepalive_time = 60
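For reference, a sketch of applying this kernel parameter on a Linux host (assumes root privileges; 60 seconds is the value suggested above):

# Apply immediately on the running system
sysctl -w net.ipv4.tcp_keepalive_time=60
# Persist it by adding the line to /etc/sysctl.conf (or a file under /etc/sysctl.d/) and reloading
sysctl -p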
Source:
https://www.modb.pro/db/388569