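1. Background
The MySQL DBAs have defined a set of rules for MySQL. The server-side idle timeout, wait_timeout, defaults to 8 hours, but the DBAs changed it to 600 seconds. My guess is that the rule is meant to reduce long-lived database connections: once a connection has been idle for more than 600 seconds, the server closes it automatically. Any impact this causes has to be handled by the client application.
1.1 Versions:
mysql: 5.7.17
druid: 1.1.5
mysql-connector-java: 5.1.44
1.2 Key Druid configuration properties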
<bean id="dataSource" class="com.alibaba.druid.pool.DruidDataSource" init-method="init" destroy-method="close">
    <!-- Basic properties: url, user, password -->
    <property name="url" value="${jdbc_url}" />
    <property name="username" value="${jdbc_user}" />
    <property name="password" value="${jdbc_password}" />
    <!-- Initial size, minimum idle, and maximum active connections -->
    <property name="initialSize" value="5" />
    <property name="minIdle" value="10" />
    <property name="maxActive" value="20" />
    <!-- Maximum time to wait when acquiring a connection, in milliseconds -->
    <property name="maxWait" value="60000" />
    <!-- Interval between eviction runs that look for idle connections to close, in milliseconds -->
    <property name="timeBetweenEvictionRunsMillis" value="50000" />
    <!-- Minimum and maximum time a connection may stay idle in the pool, in milliseconds -->
    <property name="minEvictableIdleTimeMillis" value="60000" />
    <property name="maxEvictableIdleTimeMillis" value="500000" />
    <property name="validationQuery" value="select 1" />
    <property name="testWhileIdle" value="true" />
    <property name="testOnBorrow" value="false" />
    <property name="testOnReturn" value="false" />
    <property name="keepAlive" value="true" />
    <property name="phyMaxUseCount" value="1000" />
    <!-- Filters used for monitoring and statistics -->
    <property name="filters" value="stat" />
</bean>
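For the full list of options, see the official documentation: DruidDataSource configuration property list.
2. Symptoms
After the system had been in production for a while, monitoring occasionally reported the following error: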
com.mysql.jdbc.exceptions.jdbc4.CommunicationsException: Communications link failure
The last packet successfully received from the server was 72,557 milliseconds ago. The last packet sent successfully to the server was 0 milliseconds ago.
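From the error log, our first assumption was that the MySQL server had already closed the connection while the client was unaware of it, and that the client then tried to use the dead connection, which triggered this error. But as far as we understand Druid, it has a connection validation feature, so in principle it should not hand out an invalid connection.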
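3. Analysis
3.1 Overall analysis
The figure outlines how Druid obtains a connection from the connection pool. On initialization, Druid creates two daemon threads, one responsible for creating connections (CreateConnectionThread) and one for destroying them (DestroyConnectionThread).
When a user thread is waiting to obtain a connection (and the number of connections in the pool has not reached the maximum number of active connections), the creation thread automatically creates a new connection and puts it into the pool, so a user thread that needs a connection only has to take one from the pool.
Once a user thread has taken a connection from the pool, the user's configuration determines whether the connection is validated. If validation passes, the connection is returned to the caller; if it fails, the connection is closed (DestroyConnectionThread automatically reclaims closed connections).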
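The borrow, validate, and discard flow described above can be sketched in a few lines. This is a deliberately simplified, hypothetical pool and not Druid's implementation: the class BorrowLoopSketch, its put and borrow methods, and the Predicate used as the validator are all names made up for the illustration, and the creator thread is reduced to a plain put call.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.Predicate;

// Hypothetical, simplified pool used only to illustrate the flow; NOT Druid code.
final class BorrowLoopSketch<C> {
    private final Deque<C> idle = new ArrayDeque<>();
    private final Predicate<C> isValid; // stands in for the configured validation step
    private int discarded = 0;

    BorrowLoopSketch(Predicate<C> isValid) {
        this.isValid = isValid;
    }

    // Stands in for the creator thread putting a new connection into the pool.
    void put(C connection) {
        idle.addLast(connection);
    }

    // Take connections until one passes validation; discard the ones that fail.
    // Returns null when the pool runs dry (a real pool would wait for the creator thread).
    C borrow() {
        C candidate;
        while ((candidate = idle.pollLast()) != null) {
            if (isValid.test(candidate)) {
                return candidate;
            }
            discarded++; // a real pool would also close the physical connection here
        }
        return null;
    }

    int discardedCount() {
        return discarded;
    }
}
```

The point of the sketch is that a borrower should only ever see a connection that passed validation, which is why the error in section 2 is surprising.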
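3.2 Connection creation and destruction tasks
When the program starts and creates the data source, it automatically creates two tasks (jobs): CreateConnectionThread and DestroyConnectionThread.
- CreateConnectionThread is fairly simple; it is also a daemon thread. The code is as follows: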
public class CreateConnectionThread extends Thread {
    public CreateConnectionThread(String name){
        super(name);
        this.setDaemon(true);
    }
    public void run() {
        initedLatch.countDown();
        long lastDiscardCount = 0;
        int errorCount = 0;
        for (;;) {
            // addLast
            try {
                lock.lockInterruptibly();
            } catch (InterruptedException e2) {
                break;
            }
            long discardCount = DruidDataSource.this.discardCount;
            boolean discardChanged = discardCount - lastDiscardCount > 0;
            lastDiscardCount = discardCount;
            try {
                boolean emptyWait = true;
                if (createError != null
                        && poolingCount == 0
                        && !discardChanged) {
                    emptyWait = false;
                }
                if (emptyWait
                        && asyncInit && createCount < initialSize) {
                    emptyWait = false;
                }
                if (emptyWait) {
                    // a thread must be waiting before a connection is created
                    if (poolingCount >= notEmptyWaitThreadCount //
                            && (!(keepAlive && activeCount + poolingCount < minIdle))
                            && !isFailContinuous()
                    ) {
                        empty.await();
                    }
                    // prevent creating more than maxActive connections
                    if (activeCount + poolingCount >= maxActive) {
                        empty.await();
                        continue;
                    }
                }
            } catch (InterruptedException e) {
                lastCreateError = e;
                lastErrorTimeMillis = System.currentTimeMillis();
                if (!closing) {
                    LOG.error("create connection Thread Interrupted, url: " + jdbcUrl, e);
                }
                break;
            } finally {
                lock.unlock();
            }
            PhysicalConnectionInfo connection = null;
            try {
                connection = createPhysicalConnection();
            } catch (SQLException e) {
                LOG.error("create connection SQLException, url: " + jdbcUrl + ", errorCode " + e.getErrorCode()
                        + ", state " + e.getSQLState(), e);
                errorCount++;
                if (errorCount > connectionErrorRetryAttempts && timeBetweenConnectErrorMillis > 0) {
                    // fail over retry attempts
                    setFailContinuous(true);
                    if (failFast) {
                        lock.lock();
                        try {
                            notEmpty.signalAll();
                        } finally {
                            lock.unlock();
                        }
                    }
                    if (breakAfterAcquireFailure) {
                        break;
                    }
                    try {
                        Thread.sleep(timeBetweenConnectErrorMillis);
                    } catch (InterruptedException interruptEx) {
                        break;
                    }
                }
            } catch (RuntimeException e) {
                LOG.error("create connection RuntimeException", e);
                setFailContinuous(true);
                continue;
            } catch (Error e) {
                LOG.error("create connection Error", e);
                setFailContinuous(true);
                break;
            }
            if (connection == null) {
                continue;
            }
            boolean result = put(connection);
            if (!result) {
                JdbcUtils.close(connection.getPhysicalConnection());
                LOG.info("put physical connection to pool failed.");
            }
            errorCount = 0; // reset errorCount
        }
    }
}
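Notes:
- The thread runs an infinite loop: for (;;).
- A connection is created only when some thread is waiting for one (or, with keepAlive enabled, while activeCount + poolingCount is still below minIdle).
- The total number of connections can never exceed maxActive.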
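To make the two wait conditions in the notes easier to check by hand, they can be restated as pure functions. This is only a sketch: the class CreateDecisionSketch and its method names are hypothetical, the parameters simply mirror the fields used in the quoted run() loop, and locking and signalling are left out entirely.

```java
// Hypothetical helper that restates the two wait conditions from the quoted
// run() loop as pure functions. It is an illustration, not part of Druid.
final class CreateDecisionSketch {
    private CreateDecisionSketch() {}

    // First check: the creator thread waits when the pool already holds at least
    // as many connections as there are waiting borrowers, unless keepAlive still
    // has to top the pool up to minIdle or creation has been failing continuously.
    static boolean waitsForDemand(int poolingCount, int notEmptyWaitThreadCount,
                                  boolean keepAlive, int activeCount, int minIdle,
                                  boolean failContinuous) {
        boolean belowMinIdle = keepAlive && activeCount + poolingCount < minIdle;
        return poolingCount >= notEmptyWaitThreadCount && !belowMinIdle && !failContinuous;
    }

    // Second check: never create a connection once active + pooled reaches maxActive.
    static boolean atCapacity(int activeCount, int poolingCount, int maxActive) {
        return activeCount + poolingCount >= maxActive;
    }
}
```

With the configuration from section 1.2 (minIdle=10, maxActive=20, keepAlive=true), a pool holding 5 idle connections and no waiting borrowers would not wait on the first check, because keepAlive still needs to fill the pool up to minIdle.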
- DestroyConnectionThread
It runs a task in an infinite loop, once every timeBetweenEvictionRunsMillis.
public class DestroyConnectionThread extends Thread {
    public DestroyConnectionThread(String name){
        super(name);
        this.setDaemon(true);
    }
    public void run() {
        initedLatch.countDown();
        for (;;) {
            // start removing from the front
            try {
                if (closed) {
                    break;
                }
                if (timeBetweenEvictionRunsMillis > 0) {
                    Thread.sleep(timeBetweenEvictionRunsMillis);
                } else {
                    Thread.sleep(1000); //
                }
                if (Thread.interrupted()) {
                    break;
                }
                destroyTask.run();
            } catch (InterruptedException e) {
                break;
            }
        }
    }
}
The actual eviction work is done by the following method.
public void shrink(boolean checkTime, boolean keepAlive) {
    try {
        lock.lockInterruptibly();
    } catch (InterruptedException e) {
        return;
    }
    boolean needFill = false;
    int evictCount = 0;
    int keepAliveCount = 0;
    try {
        if (!inited) {
            return;
        }
        final int checkCount = poolingCount - minIdle;
        final long currentTimeMillis = System.currentTimeMillis();
        for (int i = 0; i < poolingCount; ++i) {
            DruidConnectionHolder connection = connections[i];
            if (checkTime) {
                if (phyTimeoutMillis > 0) {
                    long phyConnectTimeMillis = currentTimeMillis - connection.connectTimeMillis;
                    if (phyConnectTimeMillis > phyTimeoutMillis) {
                        evictConnections[evictCount++] = connection;
                        continue;
                    }
                }
                long idleMillis = currentTimeMillis - connection.lastActiveTimeMillis;
                if (idleMillis < minEvictableIdleTimeMillis
                        && idleMillis < keepAliveBetweenTimeMillis
                ) {
                    break;
                }
                if (idleMillis >= minEvictableIdleTimeMillis) {
                    if (checkTime && i < checkCount) {
                        evictConnections[evictCount++] = connection;
                        continue;
                    } else if (idleMillis > maxEvictableIdleTimeMillis) {
                        evictConnections[evictCount++] = connection;
                        continue;
                    }
                }
                if (keepAlive && idleMillis >= keepAliveBetweenTimeMillis) {
                    keepAliveConnections[keepAliveCount++] = connection;
                }
            } else {
                if (i < checkCount) {
                    evictConnections[evictCount++] = connection;
                } else {
                    break;
                }
            }
        }
        int removeCount = evictCount + keepAliveCount;
        if (removeCount > 0) {
            System.arraycopy(connections, removeCount, connections, 0, poolingCount - removeCount);
            Arrays.fill(connections, poolingCount - removeCount, poolingCount, null);
            poolingCount -= removeCount;
        }
        keepAliveCheckCount += keepAliveCount;
        if (keepAlive && poolingCount + activeCount < minIdle) {
            needFill = true;
        }
    } finally {
        lock.unlock();
    }
    if (evictCount > 0) {
        for (int i = 0; i < evictCount; ++i) {
            DruidConnectionHolder item = evictConnections[i];
            Connection connection = item.getConnection();
            JdbcUtils.close(connection);
            destroyCountUpdater.incrementAndGet(this);
        }
        Arrays.fill(evictConnections, null);
    }
    if (keepAliveCount > 0) {
        // keep order
        for (int i = keepAliveCount - 1; i >= 0; --i) {
            DruidConnectionHolder holer = keepAliveConnections[i];
            Connection connection = holer.getConnection();
            holer.incrementKeepAliveCheckCount();
            boolean validate = false;
            try {
                this.validateConnection(connection);
                validate = true;
            } catch (Throwable error) {
                if (LOG.isDebugEnabled()) {
                    LOG.debug("keepAliveErr", error);
                }
                // skip
            }
            boolean discard = !validate;
            if (validate) {
                holer.lastKeepTimeMillis = System.currentTimeMillis();
                boolean putOk = put(holer, 0L);
                if (!putOk) {
                    discard = true;
                }
            }
            if (discard) {
                try {
                    connection.close();
                } catch (Exception e) {
                    // skip
                }
                lock.lock();
                try {
                    discardCount++;
                    if (activeCount + poolingCount <= minIdle) {
                        emptySignal();
                    }
                } finally {
                    lock.unlock();
                }
            }
        }
        this.getDataSourceStat().addKeepAliveCheckCount(keepAliveCount);
        Arrays.fill(keepAliveConnections, null);
    }
    if (needFill) {
        lock.lock();
        try {
            int fillCount = minIdle - (activeCount + poolingCount + createTaskCount);
            for (int i = 0; i < fillCount; ++i) {
                emptySignal();
            }
        } finally {
            lock.unlock();
        }
    }
}
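Notes:
- A connection whose idle time has not yet reached the minimum idle lifetime (minEvictableIdleTimeMillis) is not evicted.
- Once idle time exceeds the minimum idle lifetime, not every connection is evicted: only the first poolingCount - minIdle connections are, and the remaining minIdle connections are kept for now.
- Once idle time exceeds the maximum idle lifetime (maxEvictableIdleTimeMillis), the client evicts the connection regardless of minIdle. Druid's default maxEvictableIdleTimeMillis is 7 hours.
- Once a connection's idle time on the server exceeds the configured wait_timeout, the server closes it, regardless of the client's state.
- The keepAlive logic is also implemented in this code.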
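The per-connection decision in the checkTime branch of shrink can be condensed into one pure function, which makes the rules in the notes easy to try out with concrete numbers. This is a sketch under assumptions: ShrinkDecisionSketch, Action, and classify are hypothetical names, the phyTimeoutMillis check and the early break are omitted, and the keepAliveBetweenTimeMillis value used in the usage note is made up for the example.

```java
// Hypothetical condensation of the checkTime branch in the quoted shrink() method.
// It ignores phyTimeoutMillis and the early break; it is not Druid code.
final class ShrinkDecisionSketch {
    enum Action { KEEP, EVICT, KEEP_ALIVE }

    private ShrinkDecisionSketch() {}

    // index is the connection's position in the pool array;
    // checkCount corresponds to poolingCount - minIdle.
    static Action classify(int index, int checkCount, long idleMillis,
                           long minEvictableIdleTimeMillis,
                           long maxEvictableIdleTimeMillis,
                           long keepAliveBetweenTimeMillis,
                           boolean keepAlive) {
        if (idleMillis >= minEvictableIdleTimeMillis) {
            if (index < checkCount) {
                return Action.EVICT; // surplus connection beyond minIdle
            }
            if (idleMillis > maxEvictableIdleTimeMillis) {
                return Action.EVICT; // too old, evicted even within minIdle
            }
        }
        if (keepAlive && idleMillis >= keepAliveBetweenTimeMillis) {
            return Action.KEEP_ALIVE; // validate it instead of letting it sit idle
        }
        return Action.KEEP;
    }
}
```

For example, with minEvictableIdleTimeMillis=60000 and maxEvictableIdleTimeMillis=500000 from section 1.2, an assumed keepAliveBetweenTimeMillis of 120000, and checkCount=2, a connection at index 5 that has been idle for 130000 ms is classified as KEEP_ALIVE, while the same connection idle for 500001 ms is classified as EVICT.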
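3.3 Client-side idle validation
Idle validation means that when the client takes a connection from the pool, a connection check is triggered whenever the time since the connection was last active has reached the eviction-run interval (that is, currentTimeMillis - lastActiveTimeMillis >= timeBetweenEvictionRunsMillis). The check runs testConnectionInternal, as the code shows: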
public DruidPooledConnection getConnectionDirect(long maxWaitMillis) {
    // ... other code omitted here, including the enclosing retry loop that the continue below belongs to ...
    if (testWhileIdle) {
        final DruidConnectionHolder holder = poolableConnection.holder;
        long currentTimeMillis = System.currentTimeMillis();
        long lastActiveTimeMillis = holder.lastActiveTimeMillis;
        long lastKeepTimeMillis = holder.lastKeepTimeMillis;
        if (lastKeepTimeMillis > lastActiveTimeMillis) {
            lastActiveTimeMillis = lastKeepTimeMillis;
        }
        long idleMillis = currentTimeMillis - lastActiveTimeMillis;
        long timeBetweenEvictionRunsMillis = this.timeBetweenEvictionRunsMillis;
        if (timeBetweenEvictionRunsMillis <= 0) {
            timeBetweenEvictionRunsMillis = DEFAULT_TIME_BETWEEN_EVICTION_RUNS_MILLIS;
        }
        if (idleMillis >= timeBetweenEvictionRunsMillis
                || idleMillis < 0 // unexcepted branch
        ) {
            boolean validate = testConnectionInternal(poolableConnection.holder, poolableConnection.conn);
            if (!validate) {
                if (LOG.isDebugEnabled()) {
                    LOG.debug("skip not validate connection.");
                }
                discardConnection(realConnection);
                continue;
            }
        }
    }
}
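Note: this idle check applies when testOnBorrow=false. Do not enable testOnBorrow lightly, because validating on every borrow wastes performance.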
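The condition that decides whether a borrowed connection gets validated can be isolated as a small pure function. This is a sketch, not Druid's code: IdleCheckSketch and needsValidation are hypothetical names, and the fallback to a default interval when timeBetweenEvictionRunsMillis is not positive is left out.

```java
// Hypothetical restatement of the testWhileIdle condition from the quoted
// getConnectionDirect excerpt. The default-interval fallback is omitted.
final class IdleCheckSketch {
    private IdleCheckSketch() {}

    static boolean needsValidation(long nowMillis, long lastActiveTimeMillis,
                                   long lastKeepTimeMillis,
                                   long timeBetweenEvictionRunsMillis) {
        // A successful keepAlive check counts as recent activity.
        long lastTouchedMillis = Math.max(lastActiveTimeMillis, lastKeepTimeMillis);
        long idleMillis = nowMillis - lastTouchedMillis;
        // A negative idle time (for example after a clock adjustment) also forces a check.
        return idleMillis >= timeBetweenEvictionRunsMillis || idleMillis < 0;
    }
}
```

With timeBetweenEvictionRunsMillis=50000 from section 1.2, a connection last used 40 seconds ago is handed out without a check, while one last used 50 seconds ago or more is validated first.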
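There are two validation mechanisms for MySQL, and in theory both achieve the same result. Each of them also renews the connection when it runs (that is, after it executes, lastActiveTimeMillis is reset to the time of execution).
- ping
Druid uses the ping mechanism by default, so by default the validationQuery setting has no effect. To change this, set -Ddruid.mysql.usePingMethod=false in the process startup parameters (JVM options).
- SQL query (select 1)
Using a query to check whether a connection is usable is the safer option.
The code is as follows: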
public boolean isValidConnection(Connection conn, String validateQuery, int validationQueryTimeout) throws Exception {
    if (conn.isClosed()) {
        return false;
    }
    if (usePingMethod) {
        if (conn instanceof DruidPooledConnection) {
            conn = ((DruidPooledConnection) conn).getConnection();
        }
        if (conn instanceof ConnectionProxy) {
            conn = ((ConnectionProxy) conn).getRawObject();
        }
        if (clazz.isAssignableFrom(conn.getClass())) {
            if (validationQueryTimeout <= 0) {
                validationQueryTimeout = DEFAULT_VALIDATION_QUERY_TIMEOUT;
            }
            try {
                ping.invoke(conn, true, validationQueryTimeout * 1000);
            } catch (InvocationTargetException e) {
                Throwable cause = e.getCause();
                if (cause instanceof SQLException) {
                    throw (SQLException) cause;
                }
                throw e;
            }
            return true;
        }
    }
    String query = validateQuery;
    if (validateQuery == null || validateQuery.isEmpty()) {
        query = DEFAULT_VALIDATION_QUERY;
    }
    Statement stmt = null;
    ResultSet rs = null;
    try {
        stmt = conn.createStatement();
        if (validationQueryTimeout > 0) {
            stmt.setQueryTimeout(validationQueryTimeout);
        }
        rs = stmt.executeQuery(query);
        return true;
    } finally {
        JdbcUtils.close(rs);
        JdbcUtils.close(stmt);
    }
}
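4. Conclusions
(1) In theory the problem should not occur: any connection that has been idle for at least timeBetweenEvictionRunsMillis is validated before use, and the configured timeBetweenEvictionRunsMillis of 50 seconds is far below MySQL's wait_timeout of 600 seconds. The exception would be some instability in mysql-connector-java's default ping mechanism.
(2) Network jitter may have caused a failure at validation time, or the cause may be something other than a closed connection.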
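5. Solutions
(1) Set maxEvictableIdleTimeMillis=500000 (500 seconds)
By default maxEvictableIdleTimeMillis=25200000L (7 hours). Because the database timeout was changed from 8 hours to 600 seconds, the client should proactively close idle connections to reduce risk, instead of letting the MySQL server close them after the timeout.
A reasonable value should satisfy the formula:
maxEvictableIdleTimeMillis + timeBetweenEvictionRunsMillis < MySQL server wait_timeout
That is: 500000 ms + 50000 ms = 550000 ms < 600000 ms.
This way the client releases idle connections itself before the server times them out.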
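The inequality above is easy to turn into a sanity check that could run at startup or in a test. This is a minimal sketch: EvictionBudgetSketch and clientEvictsFirst are hypothetical names, and the wait_timeout value is passed in by hand rather than read from the server.

```java
// Hypothetical sanity check for the formula above; not part of Druid.
final class EvictionBudgetSketch {
    private EvictionBudgetSketch() {}

    // Returns true when the client is expected to evict an idle connection
    // before the server's wait_timeout (given in seconds) can close it.
    static boolean clientEvictsFirst(long maxEvictableIdleTimeMillis,
                                     long timeBetweenEvictionRunsMillis,
                                     long waitTimeoutSeconds) {
        // Worst case: the connection crosses the limit right after an eviction run
        // and is only noticed one full interval later.
        long worstCaseIdleMillis = maxEvictableIdleTimeMillis + timeBetweenEvictionRunsMillis;
        return worstCaseIdleMillis < waitTimeoutSeconds * 1000L;
    }
}
```

Calling clientEvictsFirst(500000, 50000, 600) returns true for the values in this article, while clientEvictsFirst(25200000, 50000, 600) returns false, which is the situation created by keeping the 7-hour default against a 600-second wait_timeout.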
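(2) Set keepAlive=true
Effects of enabling keepAlive:
- The pool is filled up to minIdle connections at initialization.
- For connections within the minIdle count, a keepAlive check is performed once their idle time exceeds minEvictableIdleTimeMillis.
- After dead connections (caused by network failures and the like, and detected by ExceptionSorter) are removed, the pool is automatically refilled to minIdle connections.
The main benefit of keepAlive is that up to minIdle connections are renewed automatically, so neither the server nor the client ever closes them for being idle (they never time out).
Requirement for keepAlive: Druid 1.1.16 or later is recommended.
For details, see the official documentation: KeepAlive configuration.
(3) Change wait_timeout on the MySQL server back to 28800 (8 hours)
Because the client's default maximum idle lifetime, maxEvictableIdleTimeMillis, is 7 hours, the client releases idle connections before the server closes them (similar to solution 1).
(4) Upgrade mysql-connector-java to 5.1.48.
There may be compatibility issues between the MySQL server and the driver version. Newer driver versions fix many known bugs; the specific fixes are listed in the official documentation.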