文/Bruce.Liu1
文章大綱
- MHA簡介
1.1. mha組件介紹
1.2. 背景和目標- MHA原理
2.1. MHA工作原理
2.2. MHA工具介紹
2.3. 當前高可用方案
2.4. MHA的優勢- MHA最佳實踐
3.1. 背景介紹
3.2. 安裝MySQL實例
3.3. 部署MySQL復制
3.4. 部署MHA軟件
3.5. 故障自動切換與在線切換
1.MHA簡介
MHA是什么?
MHA是由日本Mysql yoshinorim專家(原就職于DeNA現就職于FaceBook)用Perl寫的一套Mysql故障切換方案,來保障數據庫的高可用性,它的功能是能在0-30s之內實現主Mysql故障轉移(failover),MHA故障轉移可以很好的幫我們解決從庫數據的一致性問題,同時最大化挽回故障發生后數據的一致性。
官方網站:https://code.google.com/p/mysql-master-ha/
MHA(Master High Availability)目前在MySQL高可用方面是一個相對成熟的解決方案,它由日本DeNA公司youshimaton(現就職于Facebook公司)開發,是一套優秀的作為MySQL高可用性環境下故障切換和主從提升的高可用軟件。在MySQL故障切換過程中,MHA能做到在0~30秒之內自動完成數據庫的故障切換操作,并且在進行故障切換的過程中,MHA能在較大程度上保證數據的一致性,以達到真正意義上的高可用。
該軟件由兩部分組成:MHA Manager(管理節點)和MHA Node(數據節點)。MHA Manager可以單獨部署在一臺獨立的機器上管理多個master-slave集群,也可以部署在一臺slave節點上。MHA Node運行在每臺MySQL服務器上,MHA Manager會定時探測集群中的master節點,當master出現故障時,它可以自動將數據的slave提升為新的master,然后將所有其他的slave重新指向新的master。整個故障轉移過程對應用程序完全透明。
在MHA自動故障切換過程中,MHA試圖從宕機的主服務器上保存二進制日志,較大程度的保證數據的不丟失,但這并不總是可行的。例如,如果主服務器硬件故障或無法通過ssh訪問,MHA沒法保存二進制日志,只進行故障轉移而丟失了的數據。使用MySQL 5.5的半同步復制,可以大大降低數據丟失的風險。MHA可以與半同步復制結合起來。如果只有一個slave已經收到了的二進制日志,MHA可以將的二進制日志應用于其他所有的slave服務器上,因此可以保證所有節點的數據一致性。
1.1.mha組件介紹
MHA Manager
運行一些工具,比如masterha_manager工具實現自動監控MySQL Master和實現master故障切換,其它工具實現手動實現master故障切換、在線mater轉移、連接檢查等等。一個Manager可以管理多 個master-slave集群MHA Node
部署在所有運行MySQL的服務器上,無論是master還是slave。主要作用有三個。
1.保存二進制日志
如果能夠訪問故障master,會拷貝master的二進制日志
2.應用差異中繼日志
從擁有最新數據的slave上生成差異中繼日志,然后應用差異日志。
3.清除中繼日志
在不停止SQL線程的情況下刪除中繼日志
1.2.背景和目標
在早期的MySQL架構中最主流就就是MySQL復制的主從結構,但伴隨時間的推移以及數據的膨脹會出現一下幾類問題。
以前幾十臺DB服務器,人工登陸服務器就能維護好,也沒有高可用,當master掛了,通知業務將IP切換到slave然后重啟也能基本滿足業務要求,但是業務迅速發展,實例數不斷增加,復制集不斷增加,數據庫架構多樣化,而這種人工維護方式顯然大大增加了DBA工作量,而且效率低下、容易出錯。
DB規模的增大,機器故障、SQL故障、實例故障出現的概率也增加、還有來自業務方的DB變更,比如大表增加字段、增加索引、批量刪除數據等異常維護操作,當然這些在一定條件下可用采用在線變更,比如采用pt-online-schema-change工具,但是當不滿足在線變更條件、或者在線變更復雜的情況下,就需要采用滾動變更的方式,先在各個slave上變更、在線切換后再在master上變更,然后再進行一次切換還原,而這些切換操作如果全部手工敲命令來進行顯然是不可取的。
-
隨著用戶數的不斷增加,業務方對DB這種基礎服務的可用性也就越來越高,在魅族業務對DB的可用性要求是每個月需要達到四個9,也就意味著每個月的故障時間只有不到5分鐘,以前那種通知業務更改IP重啟的方式顯然是達不到這個要求的。
在這些背景和要求下,我們需要擺脫手工操作,需要一套有效的MySQL高可用方案和一個高效的高可用平臺來支撐DB的快速增長。MySQL高可用平臺需要達到的目標有以下幾點:
1.數據一致性保證這個是最基本的同時也是前提,如果主備的數據的不一致,那么切換就無法進行,當然這里的一致性也是一個相對的,但是要做到最終一致性。
2.故障快速切換,當master故障時這里可以是機器故障或者是實例故障,要確保業務能在最短時間切換到備用節點,使得業務受影響時間最短。這里也可以指業務例行維護操作,比如前面提到的無法使用在線進行DDL的DDL操作,很多分表批量的DDL操作,這些操作通過在線切換方式來滾動完成。
3.簡化日常維護,通過高可用平臺來自動完成高可用的部署、維護、監控等任務,能夠最大程度的解放DBA手動操作,提高日常運維效率。
4.統一管理,當復制集很多的情況下,能夠統一管理高可用實例信息、實例信息、監控信息、切換信息等。
高可用的部署要對現有的數據庫架構無影響,如果因為部署高可用,需要更改或者調整數據庫架構則會導致成本增加。
2.MHA原理
2.1.MHA工作原理
當master出現故障時,通過對比slave之間I/O線程讀取masterbinlog的位置,選取最接近的slave做為latestslave。 其它slave通過與latest slave對比生成差異中繼日志。在latest slave上應用從master保存的binlog,同時將latest slave提升為master。最后在其它slave上應用相應的差異中繼日志并開始從新的master開始復制。
在MHA實現Master故障切換過程中,MHA Node會試圖訪問故障的master(通過SSH),如果可以訪問(不是硬件故障,比如InnoDB數據文件損壞等),會保存二進制文件,以最大程度保 證數據不丟失。MHA和半同步復制一起使用會大大降低數據丟失的危險。流程如下:
- 從宕機崩潰的master保存二進制日志事件(binlog events)。
- 識別含有最新更新的slave。
- 應用差異的中繼日志(relay log)到其它slave。
- 應用從master保存的二進制日志事件(binlog events)。
- 提升一個slave為新master并記錄binlog file和position。
- 使其它的slave連接新的master進行復制。
- 完成切換manager主進程OFFLINE
2.2.MHA工具介紹
1.Manager工具:
- masterha_check_ssh : 檢查MHA的SSH配置。
- masterha_check_repl : 檢查MySQL復制。
- masterha_manager : 啟動MHA。
- masterha_check_status : 檢測當前MHA運行狀態。
- masterha_master_monitor : 監測master是否宕機。
- masterha_master_switch : 控制故障轉移(自動或手動)。
- masterha_conf_host : 添加或刪除配置的server信息。
2. Node工具
- save_binary_logs : 保存和復制master的二進制日志。
- apply_diff_relay_logs : 識別差異的中繼日志事件并應用于其它slave。
- filter_mysqlbinlog : 去除不必要的ROLLBACK事件(MHA已不再使用這個工具)。
- purge_relay_logs : 清除中繼日志(不會阻塞SQL線程)。
注意:Node這些工具通常由MHA Manager的腳本觸發,無需人手操作。
2.3.當前高可用方案
keepalived+mysql復制
該結構與MHA類似,但keepalived的優勢在于無狀態組件的故障切換,常用于web前端的故障轉移,應用于數據庫場景中,最致命的問題就是腦裂以后數據亂寫的風險,為企業帶來巨大困擾。MySQL Cluster
MySQL Cluster真正實現了高可用,但是使用的是NDB存儲引擎,并且SQL節點有單點故障問題。半同步復制(5.5+)
半同步復制大大減少了“binlog events只存在故障master上”的問題。在提交時,保證至少一個slave(并不是所有的)接收到binlog,因此一些slave可能沒有接收到binlog。PXC
PXC實現了服務高可用,數據同步時是并發復制。但是僅支持InnoDB引擎,所有的表都要有主鍵。鎖沖突、死鎖問題相對較多等等問題。
2.4.MHA的優勢
障切換快
在 主從復制集群中,只要從庫在復制上沒有延遲,MHA通常可以在數秒內實現故障切換。9-10秒內檢查到master故障,可以選擇在7-10秒關閉 master以避免出現裂腦,幾秒鐘內,將差異中繼日志(relay log)應用到新的master上,因此總的宕機時間通常為10-30秒。恢復新的master后,MHA并行的恢復其余的slave。即使在有數萬臺 slave,也不會影響master的恢復時間。
DeNA在超過150個MySQL(主要5.0/5.1版本)主從環境下使用了MHA。當mater故障后,MHA在4秒內就完成了故障切換。在傳統的主動/被動集群解決方案中,4秒內完成故障切換是不可能的。master故障不會導致數據不一致
當 目前的master出現故障是,MHA自動識別slave之間中繼日志(relay log)的不同,并應用到所有的slave中。這樣所有的salve能夠保持同步,只要所有的slave處于存活狀態。和Semi- Synchronous Replication一起使用,(幾乎)可以保證沒有數據丟失。需修改當前的MySQL設置
MHA的設計的重要原則之一就是盡可能地簡單易用。MHA工作在傳統的MySQL版本5.0和之后版本的主從復制環境中。和其它高可用解決方法比,MHA并不需要改變MySQL的部署環境。MHA適用于異步和半同步的主從復制。
啟動/停止/升級/降級/安裝/卸載MHA不需要改變(包擴啟動/停止)MySQL復制。當需要升級MHA到新的版本,不需要停止MySQL,僅僅替換到新版本的MHA,然后重啟MHA Manager就好了。
MHA運行在MySQL 5.0開始的原生版本上。一些其它的MySQL高可用解決方案需要特定的版本(比如MySQL集群、帶全局事務ID的MySQL等等),但并不僅僅為了 master的高可用才遷移應用的。在大多數情況下,已經部署了比較舊MySQL應用,并且不想僅僅為了實現Master的高可用,花太多的時間遷移到不 同的存儲引擎或更新的前沿發行版。MHA工作的包括5.0/5.1/5.5的原生版本的MySQL上,所以并不需要遷移。無需增加大量的服務器
MHA由MHA Manager和MHA Node組成。MHA Node運行在需要故障切換/恢復的MySQL服務器上,因此并不需要額外增加服務器。MHA Manager運行在特定的服務器上,因此需要增加一臺(實現高可用需要2臺),但是MHA Manager可以監控大量(甚至上百臺)單獨的master,因此,并不需要增加大量的服務器。即使在一臺slave上運行MHA Manager也是可以的。綜上,實現MHA并沒用額外增加大量的服務。無性能下降
MHA適用與異步或半同步的MySQL復制。監控master時,MHA僅僅是每隔幾秒(默認是3秒)發送一個ping包,并不發送重查詢。可以得到像原生MySQL復制一樣快的性能。適用于任何存儲引擎
MHA可以運行在只要MySQL復制運行的存儲引擎上,并不僅限制于InnoDB,即使在不易遷移的傳統的MyISAM引擎環境,一樣可以使用MHA。
3.MHA最佳實踐
3.1.背景介紹
3.1.1.軟件參考文檔
參考文檔:
MHA原理:https://code.google.com/p/mysql-master-ha/wiki/HowMHAWorks
MHA原理PPT:http://www.slideshare.net/matsunobu/automated-master-failover
Linux配置代理方法:http://blog.csdn.net/bojie5744/article/details/42148719
軟件下載:
Centos Base Yum Repository: http://mirrors.163.com/.help/CentOS6-Base-163.repo
epel(RHEL 6)Yum Repository:http://dl.fedoraproject.org/pub/epel/6/x86_64/epel-release-6-8.noarch.rpm
epel(RHEL 7)Yum Repository:
https://mirrors.tuna.tsinghua.edu.cn/epel//7/x86_64/Packages/e/epel-release-7-11.noarch.rpm
MySQL5.7 Yum Repository:https://dev.mysql.com/get/mysql57-community-release-el6-11.noarch.rpm
mysql-master-ha(mgr):https://github.com/linyue515/mysql-master-ha/raw/master/mha4mysql-manager-0.57-0.el7.noarch.rpm
mysql-master-ha(node):https://github.com/linyue515/mysql-master-ha/raw/master/mha4mysql-node-0.57-0.el7.noarch.rpm
3.1.2.系統環境介紹
系統版本
CentOS release 6.7 (Final) x86_64MySQL版本
mysql-5.7.20.-x86_64(RPM)MHA版本
mha4mysql-manager-0.57
mha4mysql-node-0.57
3.1.3.安裝系統要求
- 涉及所有服務器關閉iptables、NetworkManager服務、selinux安全配置
# /etc/init.d/NetworkManager stop
# chkconfig NetworkManager off
# /etc/init.d/iptables stop
# chkconfig iptables off
/etc/selinux/config
改成disable
3.2.安裝MySQL實例
3.2.1.安裝mysql數據庫
- 創建mysql用戶
# useradd mysql
# passwd mysql
- 安裝軟件
# yum -y install mysql-community-server.x86_64
3.2.2.創建配置文件目錄
# mkdir /etc/mysql
3.2.3.創建配置文件
[mysqld]
# GENERAL #
user = mysql
port = 3389
default_storage_engine = InnoDB
socket = /data1/db3389/my3389.sock
pid_file = /data1/db3389/mysql.pid
#read-only =0
tmpdir = /data1/tmp
#key_buffer_size = 128M
max_allowed_packet = 32M
max_connect_errors = 1000000
datadir = /data1/db3389/
log_bin = 2171303389-bin
relay-log= 2171303389-relay-bin
expire_logs_days = 7
#sync_binlog = 0
tmp_table_size = 32M
max_heap_table_size = 32M
max_connections = 5000
thread_cache_size = 512
table_definition_cache = 4096
table_open_cache = 4096
wait_timeout = 28800
interactive_timeout = 28800
transaction-isolation = READ-COMMITTED
binlog-format=row
character-set-server=utf8
skip-name-resolve
back_log=1024
explicit_defaults_for_timestamp=true
server_id=2171303389
# INNODB #
innodb_flush_method = O_DIRECT
#innodb_data_home_dir = /data1/db3389
innodb_data_file_path = ibdata1:100M:autoextend
#redo log
#innodb_log_group_home_dir=./
innodb_log_files_in_group = 3
innodb_log_file_size = 128M
#innodb performance
innodb_flush_log_at_trx_commit = 0
innodb_file_per_table = 1
innodb_buffer_pool_instances = 8
innodb_io_capacity = 2000
innodb_lock_wait_timeout = 30
binlog_error_action = ABORT_SERVER
innodb_buffer_pool_size = 128M
innodb_max_dirty_pages_pct=90
innodb_file_format=Barracuda
innodb_support_xa=0
innodb_buffer_pool_dump_at_shutdown = 1
innodb_buffer_pool_load_at_startup = 1
#innodb undo log
innodb_undo_tablespaces=4
innodb_undo_logs=2048
innodb_purge_rseg_truncate_frequency=512
innodb_max_undo_log_size=2G
innodb_undo_log_truncate=1
log_error = error.log
#log_queries_not_using_indexes = 1
slow_query_log = 1
slow_query_log_file = slow-queries.log
long_query_time=2
gtid_mode=ON
enforce-gtid-consistency
log-slave-updates
master-info-repository=TABLE
relay-log-info-repository=TABLE
sync_master_info = 10000
slave_sql_verify_checksum=1
skip-slave-start
init-connect='SET NAMES utf8'
character-set-server=utf8
skip-character-set-client-handshake
bind-address=0.0.0.0
skip-external-locking
slave-parallel-workers=6
[mysql5.6]
myisam_recover = FORCE,BACKUP
3.2.4.創建授權目錄
# mkdir -p /data1/db3389
# mkdir -p /data1/tmp
# chown -R mysql:mysql /data1/db3389
# chown -R mysql:mysql /data1/tmp
3.2.5.初始化MySQL實例
# mysqld --defaults-file=/etc/mysql/my3389.cnf --initialize --user=mysql
# mysql_ssl_rsa_setup
3.2.6.啟動mysql實例
# mysqld_safe --defaults-file=/etc/mysql/my3389.cnf &
# cat /data1/db3389/error.log | grep temp
# mysql -S /data1/db3389/my3389.sock -p'z&Di4b_oSM*-'
mysql> set password=''; #重置密碼為空
3.3.部署MySQL復制
3.3.1.主庫創建復制用戶
mysql> grant replication slave, replication client on *.* to replica@'192.168.217.%' identified by 'mycatDBA';
3.3.2.主庫創建mha用戶
mysql> grant all privileges on *.* to mha@'192.168.217.132' identified by 'mysqlDBA';
3.3.3.主庫備份數據庫
# mysqldump -S /data1/db3389/my3389.sock --single-transaction --master-data=2 --opt -A | gzip > /data1/tmp/full_3389.tar.gz
3.3.4.主庫傳輸至從庫
# scp /data1/tmp/full_3389.tar.gz 192.168.217.131:/data1/tmp
3.3.5.從庫恢復數據庫
# gunzip < /data1/tmp/full_3389.tar.gz | mysql -S /data1/db3389/my3389.sock
注意:恢復數據庫前,從庫最好reset master;
,否則將出現一下錯誤:
ERROR 1840 (HY000) at line 24: @@GLOBAL.GTID_PURGED can only be set when @@GLOBAL.GTID_EXECUTED is empty.
3.3.6.從庫初始化同步數據
mysql> change master to master_host='192.168.217.130',master_port=3389,master_user='replica',master_password='mycatDBA',master_auto_position=1;
Query OK, 0 rows affected, 2 warnings (0.02 sec)
mysql> start slave;
Query OK, 0 rows affected (0.03 sec)
mysql> show slave status \G
*************************** 1. row ***************************
...... 省略 ......
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
...... 省略 ......
3.4.部署MHA軟件
3.4.1.安裝軟件
- epel yum源安裝方式
# yum -y install perl-DBD-MySQL perl-Config-Tiny perl-Log-Dispatch perl-Parallel-ForkManager perl-Time-HiRes
# #根據MHA角色安裝對應的軟件包即可
# yum -y --nogpgcheck install mha4mysql-node-0.57-0.el7.noarch.rpm
# yum -y install --nogpgcheck mha4mysql-manager-0.57-0.el7.noarch.rpm
- 本地安裝方式
# yum -y --nogpgcheck install perl-DBD-MySQL*
# yum -y --nogpgcheck install perl-Config-Tiny*
# yum -y --nogpgcheck install perl-Parallel-ForkManager*
# yum -y --nogpgcheck install perl-MailTools*
# yum -y --nogpgcheck install perl-Email-Date-Format*
# yum -y --nogpgcheck install perl-Mail-Sender*
# yum -y --nogpgcheck install perl-MIME-Types*
# yum -y --nogpgcheck install perl-MIME-Lite*
# yum -y --nogpgcheck install perl-Mail-Sendmail*
# yum -y --nogpgcheck install perl-Log-Dispatch*
# #根據MHA角色安裝對應的軟件包即可
# yum -y --nogpgcheck install mha4mysql-node-0.57-0.el7.noarch.rpm
# yum -y install --nogpgcheck mha4mysql-manager-0.57-0.el7.noarch.rpm
3.4.2.掛在VIP
- master
# /sbin/ifconfig eth0:1 192.168.217.201 broadcast 192.168.217.255 netmask 255.255.255.0
# /sbin/arping -f -q -c 5 -w 5 -I eth0 -s 192.168.217.201 -U 192.168.217.2
3.4.3.配置SSH互信
在現網環境中幾乎都是禁止root遠程登陸服務器得,所以ssh免密碼登陸要在mysql用戶下進行配置,這是處于安全角度考慮出發。
- master:
# su - mysql
$ ssh-keygen -t rsa
$ rm -rf ~/.ssh/*
- slave:
# su - mysql
$ ssh-keygen -t rsa
$ rm -rf ~/.ssh/*
- manager:
# su - mysql
$ ssh-keygen -t rsa
$ cd ~/.ssh
$ mv id_rsa.pub authorized_keys
$ scp * 192.168.217.130:~/.ssh/
$ scp * 192.168.217.131:~/.ssh/
$ #測試ssh
$ ssh 192.168.217.130 date
Wed Nov 22 05:48:54 PST 2017
$ ssh 192.168.217.131 date
Wed Nov 22 05:47:58 PST 2017
3.4.4.配置mysql用戶sudo權限
- 添加普通用戶登陸tty終端權限
# vim /etc/sudoers
#將以下的參數注釋,意思就是sudo默認需要tty終端。注釋掉就可以在后臺執行了。
#Defaults requiretty
- 開放普通用戶執行sudo命令權限
# cd /etc/sudoers.d/
# vim mysql
User_Alias MYSQL_USERS = ALL
Runas_Alias MYSQL_RUNAS = root
Cmnd_Alias MYSQL_CMNDS = ALL
MYSQL_USERS ALL = (MYSQL_RUNAS) NOPASSWD: MYSQL_CMNDS
3.4.5.創建MHA配置文件
- 創建配置文件目錄
# mkdir /etc/mha
- 創建MHA配置文件
# cat app3389.cnf
[server default]
user=mha
password=mysqlDBA
manager_workdir=/data1/mha/masterha/app3389
manager_log=/data1/mha/masterha/app3389/app3389.log
remote_workdir=/data1/mha/masterha/app3389
ssh_user=mysql
repl_user=replica
repl_password=mycatDBA
ping_interval=3
secondary_check_script="masterha_secondary_check -s 192.168.1.122 -s 192.168.1.122"
master_ip_failover_script="/etc/mha/master_ip_failover.sh 192.168.1.201 1"
master_ip_online_change_script="/etc/mha/master_ip_online_change.sh 192.168.1.201 1"
shutdown_script="/etc/mha/power_manager"
#report_script="/etc/mha/end_report"
[server1]
hostname=192.168.1.120
port=3389
master_binlog_dir=/data1/db3389
candidate_master=1
master_pid_file=/data1/db3389/mysql.pid
[server2]
hostname=192.168.1.121
port=3389
master_binlog_dir=/data1/db3389
candidate_master=1
master_pid_file=/data1/db3389/mysql.pid
[binlog1]
hostname=192.168.1.122
master_binlog_dir=/data1/mha/binlog/3389
no_master=1
ignore_fail=1
3.4.6.上傳MHA切換腳本
master_ip_failover.sh
master_ip_online_change.sh
power_manager
注意:腳本內容中要修改網卡名字
my $vip = shift;
my $interface = 'eth1';
my $key = shift;
- 上傳故障切換腳本并授權
# chmod 755 master_ip_*
# chmod 755 power_manager
3.4.7.創建MHA、BINLOG工作目錄
# mkdir -p /data1/mha/masterha/app3389
# mkdir -p /data1/mha/binlog/3389
# chown -R mysql:mysql /data1/mha/binlog/3389
3.4.8.啟動BINLOG SERVER
# su - mysql
$ cd /data1/mha/binlog/3389;
$ mysqlbinlog -R --host=192.168.217.130 -P3389 --user=mha --password=mysqlDBA --raw --stop-never 2171303389-bin.000003 &
$ ps -ef | grep mysqlbinlog | grep -v grep # 驗證binlog server進程是否存在
root 7008 6233 0 07:00 pts/0 00:00:00 mysqlbinlog -R --host=192.168.217.130 -P3389 --user=mha --password=x xxxxxx --raw --stop-never 2171303389-bin.000003
3.4.9.驗證并啟動manager進程
$ masterha_check_ssh --conf=/etc/mha/app3389.cnf
Wed Nov 22 07:35:07 2017 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping.
Wed Nov 22 07:35:07 2017 - [info] Reading application default configuration from /etc/mha/app3389.cnf..
Wed Nov 22 07:35:07 2017 - [info] Reading server configuration from /etc/mha/app3389.cnf..
Wed Nov 22 07:35:07 2017 - [info] Starting SSH connection tests..
Wed Nov 22 07:35:08 2017 - [debug]
Wed Nov 22 07:35:07 2017 - [debug] Connecting via SSH from root@192.168.217.130(192.168.217.130:22) to root@192.168.217.131(192.168.217.131:22)..
Wed Nov 22 07:35:08 2017 - [debug] ok.
Wed Nov 22 07:35:08 2017 - [debug]
Wed Nov 22 07:35:07 2017 - [debug] Connecting via SSH from root@192.168.217.131(192.168.217.131:22) to root@192.168.217.130(192.168.217.130:22)..
Wed Nov 22 07:35:08 2017 - [debug] ok.
Wed Nov 22 07:35:08 2017 - [info] All SSH connection tests passed successfully.
$ masterha_check_repl --conf=/etc/mha/app3389.cnf
Wed Nov 22 07:47:07 2017 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping.
Wed Nov 22 07:47:07 2017 - [info] Reading application default configuration from /etc/mha/app3389.cnf..
Wed Nov 22 07:47:07 2017 - [info] Reading server configuration from /etc/mha/app3389.cnf..
Wed Nov 22 07:47:07 2017 - [info] MHA::MasterMonitor version 0.57.
Wed Nov 22 07:47:08 2017 - [info] GTID failover mode = 1
Wed Nov 22 07:47:08 2017 - [info] Dead Servers:
Wed Nov 22 07:47:08 2017 - [info] Alive Servers:
Wed Nov 22 07:47:08 2017 - [info] 192.168.217.130(192.168.217.130:3389)
Wed Nov 22 07:47:08 2017 - [info] 192.168.217.131(192.168.217.131:3389)
Wed Nov 22 07:47:08 2017 - [info] Alive Slaves:
Wed Nov 22 07:47:08 2017 - [info] 192.168.217.131(192.168.217.131:3389) Version=5.7.20-log (oldest major version between slaves) log-bin:enabled
Wed Nov 22 07:47:08 2017 - [info] GTID ON
Wed Nov 22 07:47:08 2017 - [info] Replicating from 192.168.217.130(192.168.217.130:3389)
Wed Nov 22 07:47:08 2017 - [info] Primary candidate for the new Master (candidate_master is set)
Wed Nov 22 07:47:08 2017 - [info] Current Alive Master: 192.168.217.130(192.168.217.130:3389)
Wed Nov 22 07:47:08 2017 - [info] Checking slave configurations..
Wed Nov 22 07:47:08 2017 - [info] read_only=1 is not set on slave 192.168.217.131(192.168.217.131:3389).
Wed Nov 22 07:47:08 2017 - [info] Checking replication filtering settings..
Wed Nov 22 07:47:08 2017 - [info] binlog_do_db= , binlog_ignore_db=
Wed Nov 22 07:47:08 2017 - [info] Replication filtering check ok.
Wed Nov 22 07:47:08 2017 - [info] GTID (with auto-pos) is supported. Skipping all SSH and Node package checking.
Warning: Permanently added '192.168.217.132' (RSA) to the list of known hosts.
Wed Nov 22 07:47:08 2017 - [info] HealthCheck: SSH to 192.168.217.132 is reachable.
Wed Nov 22 07:47:14 2017 - [info] Binlog server 192.168.217.132 is reachable.
Wed Nov 22 07:47:14 2017 - [info] Checking recovery script configurations on 192.168.217.132(192.168.217.132:3306)..
Wed Nov 22 07:47:14 2017 - [info] Executing command: save_binary_logs --command=test --start_pos=4 --binlog_dir=/data1/mha/binlog/3389 --output_file=/data1/mha/masterha/app3389/save_binary_logs_test --manager_version=0.57 --start_file=2171303389-bin.000003
Wed Nov 22 07:47:14 2017 - [info] Connecting to root@192.168.217.132(192.168.217.132:22)..
Creating /data1/mha/masterha/app3389 if not exists.. ok.
Checking output directory is accessible or not..
ok.
Binlog found at /data1/mha/binlog/3389, up to 2171303389-bin.000003
Wed Nov 22 07:47:14 2017 - [info] Binlog setting check done.
Wed Nov 22 07:47:14 2017 - [info] Checking SSH publickey authentication settings on the current master..
Wed Nov 22 07:47:15 2017 - [info] HealthCheck: SSH to 192.168.217.130 is reachable.
Wed Nov 22 07:47:15 2017 - [info]
192.168.217.130(192.168.217.130:3389) (current master)
+--192.168.217.131(192.168.217.131:3389)
Wed Nov 22 07:47:15 2017 - [info] Checking replication health on 192.168.217.131..
Wed Nov 22 07:47:15 2017 - [info] ok.
Wed Nov 22 07:47:15 2017 - [info] Checking master_ip_failover_script status:
Wed Nov 22 07:47:15 2017 - [info] /etc/mha/master_ip_failover.sh 192.168.217.201 1 --command=status --ssh_user=root --orig_master_host=192.168.217.130 --orig_master_ip=192.168.217.130 --orig_master_port=3389
Checking the Status of the script.. OK
Wed Nov 22 07:47:15 2017 - [info] OK.
Wed Nov 22 07:47:15 2017 - [info] Checking shutdown script status:
Wed Nov 22 07:47:15 2017 - [info] /etc/mha/power_manager --command=status --ssh_user=root --host=192.168.217.130 --ip=192.168.217.130
Wed Nov 22 07:47:15 2017 - [info] OK.
Wed Nov 22 07:47:15 2017 - [info] Got exit code 0 (Not master dead).
MySQL Replication Health is OK.
$ nohup masterha_manager --conf=/etc/mha/app3389.cnf --ignore_last_failover &
[2] 7307
$ nohup: ignoring input and appending output to `nohup.out'
$ masterha_check_status --conf=/etc/mha/app3389.cnf
app3389 (pid:7307) is running(0:PING_OK), master:192.168.217.130
3.5.故障自動切換與在線切換
3.5.1.故障切換
- 主庫down或者主機down,然后測試切換是否成功。
3.5.2.在線切換
在線切換(Mha manager進程(binlog server進程可選)是關閉的,Mha結構是正常的環境,適用于生產系統硬件、軟件升級維護等場景)
-
--orig_master_is_new_slave
切換時加上此參數是講原master變成slave節點,不加該參數,原master將不啟動 -
--running_updates_limit=10000
切換時選master 如果有延遲的話,mha切換不會成功,加上此參數表示切換在此時間范圍內都可以切換(單位為 s),但是切換的時間長短是由recover時relay日志大小決定
注意:在備庫先執行DDL,一般先stop slave,一般不記錄mysql日志,可以通過set session sql_log_bin=0實現,然后進行一次主備切換操作,再在原來的主庫上執行DDL.這種方法適用于增減索引.
$ masterha_master_switch --master_state=alive --conf=/etc/mha/app3389.conf --orig_master_is_new_slave
Sat Nov 25 11:06:04 2017 - [info] MHA::MasterRotate version 0.57.
Sat Nov 25 11:06:04 2017 - [info] Starting online master switch..
Sat Nov 25 11:06:04 2017 - [info]
Sat Nov 25 11:06:04 2017 - [info] * Phase 1: Configuration Check Phase..
Sat Nov 25 11:06:04 2017 - [info]
Sat Nov 25 11:06:04 2017 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping.
Sat Nov 25 11:06:04 2017 - [info] Reading application default configuration from /etc/mha/app3389.conf..
Sat Nov 25 11:06:04 2017 - [info] Reading server configuration from /etc/mha/app3389.conf..
Sat Nov 25 11:06:04 2017 - [info] GTID failover mode = 1
Sat Nov 25 11:06:04 2017 - [info] Current Alive Master: 192.168.1.121(192.168.1.121:3389)
Sat Nov 25 11:06:04 2017 - [info] Alive Slaves:
Sat Nov 25 11:06:04 2017 - [info] 192.168.1.120(192.168.1.120:3389) Version=5.7.20-log (oldest major version between slaves) log-bin:enabled
Sat Nov 25 11:06:04 2017 - [info] GTID ON
Sat Nov 25 11:06:04 2017 - [info] Replicating from 192.168.1.121(192.168.1.121:3389)
Sat Nov 25 11:06:04 2017 - [info] Primary candidate for the new Master (candidate_master is set)
It is better to execute FLUSH NO_WRITE_TO_BINLOG TABLES on the master before switching. Is it ok to execute on 192.168.1.121(192.168.1.121:3389)? (YES/no): YES
Sat Nov 25 11:06:07 2017 - [info] Executing FLUSH NO_WRITE_TO_BINLOG TABLES. This may take long time..
Sat Nov 25 11:06:07 2017 - [info] ok.
Sat Nov 25 11:06:07 2017 - [info] Checking MHA is not monitoring or doing failover..
Sat Nov 25 11:06:07 2017 - [info] Checking replication health on 192.168.1.120..
Sat Nov 25 11:06:07 2017 - [info] ok.
Sat Nov 25 11:06:07 2017 - [info] Searching new master from slaves..
Sat Nov 25 11:06:07 2017 - [info] Candidate masters from the configuration file:
Sat Nov 25 11:06:07 2017 - [info] 192.168.1.120(192.168.1.120:3389) Version=5.7.20-log (oldest major version between slaves) log-bin:enabled
Sat Nov 25 11:06:07 2017 - [info] GTID ON
Sat Nov 25 11:06:07 2017 - [info] Replicating from 192.168.1.121(192.168.1.121:3389)
Sat Nov 25 11:06:07 2017 - [info] Primary candidate for the new Master (candidate_master is set)
Sat Nov 25 11:06:07 2017 - [info] 192.168.1.121(192.168.1.121:3389) Version=5.7.20-log log-bin:enabled
Sat Nov 25 11:06:07 2017 - [info] GTID ON
Sat Nov 25 11:06:07 2017 - [info] Non-candidate masters:
Sat Nov 25 11:06:07 2017 - [info] Searching from candidate_master slaves which have received the latest relay log events..
Sat Nov 25 11:06:07 2017 - [info]
From:
192.168.1.121(192.168.1.121:3389) (current master)
+--192.168.1.120(192.168.1.120:3389)
To:
192.168.1.120(192.168.1.120:3389) (new master)
+--192.168.1.121(192.168.1.121:3389)
Starting master switch from 192.168.1.121(192.168.1.121:3389) to 192.168.1.120(192.168.1.120:3389)? (yes/NO): YES
Sat Nov 25 11:06:11 2017 - [info] Checking whether 192.168.1.120(192.168.1.120:3389) is ok for the new master..
Sat Nov 25 11:06:11 2017 - [info] ok.
Sat Nov 25 11:06:11 2017 - [info] 192.168.1.121(192.168.1.121:3389): SHOW SLAVE STATUS returned empty result. To check replication filtering rules, temporarily executing CHANGE MASTER to a dummy host.
Sat Nov 25 11:06:11 2017 - [info] 192.168.1.121(192.168.1.121:3389): Resetting slave pointing to the dummy host.
Sat Nov 25 11:06:11 2017 - [info] ** Phase 1: Configuration Check Phase completed.
Sat Nov 25 11:06:11 2017 - [info]
Sat Nov 25 11:06:11 2017 - [info] * Phase 2: Rejecting updates Phase..
Sat Nov 25 11:06:11 2017 - [info]
Sat Nov 25 11:06:11 2017 - [info] Executing master ip online change script to disable write on the current master:
Sat Nov 25 11:06:11 2017 - [info] /etc/mha/master_ip_online_change.sh 192.168.1.200 1 --command=stop --orig_master_host=192.168.1.121 --orig_master_ip=192.168.1.121 --orig_master_port=3389 --orig_master_user='mha' --new_master_host=192.168.1.120 --new_master_ip=192.168.1.120 --new_master_port=3389 --new_master_user='mha' --orig_master_ssh_user=mysql --new_master_ssh_user=mysql --orig_master_is_new_slave --orig_master_password=xxx --new_master_password=xxx
Unknown option: orig_master_ssh_user
Unknown option: new_master_ssh_user
Unknown option: orig_master_is_new_slave
Sat Nov 25 11:06:11 2017 918769 Set read_only on the new master.. ok.
Sat Nov 25 11:06:11 2017 923401 Waiting all running 1 threads are disconnected.. (max 1500 milliseconds)
{'Time' => '78','Command' => 'Binlog Dump GTID','db' => undef,'Id' => '46','Info' => undef,'User' => 'replica','State' => 'Master has sent all binlog to slave; waiting for more updates','Host' => '192.168.1.120:39100'}
Sat Nov 25 11:06:12 2017 426422 Waiting all running 1 threads are disconnected.. (max 1000 milliseconds)
{'Time' => '79','Command' => 'Binlog Dump GTID','db' => undef,'Id' => '46','Info' => undef,'User' => 'replica','State' => 'Master has sent all binlog to slave; waiting for more updates','Host' => '192.168.1.120:39100'}
Sat Nov 25 11:06:12 2017 929834 Waiting all running 1 threads are disconnected.. (max 500 milliseconds)
{'Time' => '79','Command' => 'Binlog Dump GTID','db' => undef,'Id' => '46','Info' => undef,'User' => 'replica','State' => 'Master has sent all binlog to slave; waiting for more updates','Host' => '192.168.1.120:39100'}
Sat Nov 25 11:06:13 2017 433392 Set read_only=1 on the orig master.. ok.
Sat Nov 25 11:06:13 2017 436292 Waiting all running 1 queries are disconnected.. (max 500 milliseconds)
{'Time' => '80','Command' => 'Binlog Dump GTID','db' => undef,'Id' => '46','Info' => undef,'User' => 'replica','State' => 'Master has sent all binlog to slave; waiting for more updates','Host' => '192.168.1.120:39100'}
Disabling the VIP on old master: 192.168.1.121
===========sudo /sbin/ifconfig eth1:1 down===========================
Sat Nov 25 11:06:14 2017 071486 Killing all application threads..
Sat Nov 25 11:06:14 2017 072793 done.
Sat Nov 25 11:06:14 2017 - [info] ok.
Sat Nov 25 11:06:14 2017 - [info] Locking all tables on the orig master to reject updates from everybody (including root):
Sat Nov 25 11:06:14 2017 - [info] Executing FLUSH TABLES WITH READ LOCK..
Sat Nov 25 11:06:14 2017 - [info] ok.
Sat Nov 25 11:06:14 2017 - [info] Orig master binlog:pos is 11213389-bin.000003:194.
Sat Nov 25 11:06:14 2017 - [info] Waiting to execute all relay logs on 192.168.1.120(192.168.1.120:3389)..
Sat Nov 25 11:06:14 2017 - [info] master_pos_wait(11213389-bin.000003:194) completed on 192.168.1.120(192.168.1.120:3389). Executed 0 events.
Sat Nov 25 11:06:14 2017 - [info] done.
Sat Nov 25 11:06:14 2017 - [info] Getting new master's binlog name and position..
Sat Nov 25 11:06:14 2017 - [info] 11203389-bin.000003:346
Sat Nov 25 11:06:14 2017 - [info] All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST='192.168.1.120', MASTER_PORT=3389, MASTER_AUTO_POSITION=1, MASTER_USER='replica', MASTER_PASSWORD='xxx';
Sat Nov 25 11:06:14 2017 - [info] Executing master ip online change script to allow write on the new master:
Sat Nov 25 11:06:14 2017 - [info] /etc/mha/master_ip_online_change.sh 192.168.1.200 1 --command=start --orig_master_host=192.168.1.121 --orig_master_ip=192.168.1.121 --orig_master_port=3389 --orig_master_user='mha' --new_master_host=192.168.1.120 --new_master_ip=192.168.1.120 --new_master_port=3389 --new_master_user='mha' --orig_master_ssh_user=mysql --new_master_ssh_user=mysql --orig_master_is_new_slave --orig_master_password=xxx --new_master_password=xxx
Unknown option: orig_master_ssh_user
Unknown option: new_master_ssh_user
Unknown option: orig_master_is_new_slave
Sat Nov 25 11:06:14 2017 186596 Set read_only=0 on the new master.
Enabling the VIP - 192.168.1.200 on the new master - 192.168.1.120
===========sudo /sbin/ifconfig eth1:1 192.168.1.200 broadcast 192.168.1.255 netmask 255.255.255.0 && sudo /sbin/arping -f -q -c 5 -w 5 -I eth1 -s 192.168.1.200 -U 192.168.1.1===========================
Sat Nov 25 11:06:14 2017 - [info] ok.
Sat Nov 25 11:06:14 2017 - [info]
Sat Nov 25 11:06:14 2017 - [info] * Switching slaves in parallel..
Sat Nov 25 11:06:14 2017 - [info]
Sat Nov 25 11:06:14 2017 - [info] Unlocking all tables on the orig master:
Sat Nov 25 11:06:14 2017 - [info] Executing UNLOCK TABLES..
Sat Nov 25 11:06:14 2017 - [info] ok.
Sat Nov 25 11:06:14 2017 - [info] Starting orig master as a new slave..
Sat Nov 25 11:06:14 2017 - [info] Resetting slave 192.168.1.121(192.168.1.121:3389) and starting replication from the new master 192.168.1.120(192.168.1.120:3389)..
Sat Nov 25 11:06:14 2017 - [info] Executed CHANGE MASTER.
Sat Nov 25 11:06:14 2017 - [info] Slave started.
Sat Nov 25 11:06:14 2017 - [info] All new slave servers switched successfully.
Sat Nov 25 11:06:14 2017 - [info]
Sat Nov 25 11:06:14 2017 - [info] * Phase 5: New master cleanup phase..
Sat Nov 25 11:06:14 2017 - [info]
Sat Nov 25 11:06:14 2017 - [info] 192.168.1.120: Resetting slave info succeeded.
Sat Nov 25 11:06:14 2017 - [info] Switching master to 192.168.1.120(192.168.1.120:3389) completed successfully.
掃描下方二維碼關注本人微信號!歡迎大家交流學習!