Syncing MySQL Data to Elasticsearch in Real Time

Our business needs MySQL data synced to Elasticsearch (ES) in real time, so that the data can be searched in ES with low latency or fed into other analytics pipelines. This article presents an approach based on replicating the MySQL binlog, validates its feasibility in practice, and is offered for reference.

The MySQL binlog

MySQL's binlog is used mainly for master-slave replication and data recovery. It records data modification operations (inserts, updates, and deletes, but not SELECT queries). During replication, the master ships its binlog to the slave, and the slave replays the binlog events to stay consistent with the master.
The binlog comes in three formats:

    ROW: records how each affected row was modified; complete, but the log volume is large
    STATEMENT: records the SQL statement that modified the data, which keeps the log small, but statements using functions or triggers can cause master/slave inconsistency
    MIXED: combines the advantages of ROW and STATEMENT, choosing ROW or STATEMENT per statement depending on the SQL being executed

To sync data to an ES cluster through the binlog, only ROW format works, because only ROW records the actual content of each change to the data.

Taking an UPDATE as an example, a ROW-format binlog entry looks like this:

    SET TIMESTAMP=1527917394/*!*/;
    BEGIN
    /*!*/;
    # at 3751
    #180602 13:29:54 server id 1  end_log_pos 3819 CRC32 0x8dabdf01     Table_map: `webservice`.`building` mapped to number 74
    # at 3819
    #180602 13:29:54 server id 1  end_log_pos 3949 CRC32 0x59a8ed85     Update_rows: table id 74 flags: STMT_END_F
    
    BINLOG '
    UisSWxMBAAAARAAAAOsOAAAAAEoAAAAAAAEACndlYnNlcnZpY2UACGJ1aWxkaW5nAAYIDwEPEREG
    wACAAQAAAAHfq40=
    UisSWx8BAAAAggAAAG0PAAAAAEoAAAAAAAEAAgAG///A1gcAAAAAAAALYnVpbGRpbmctMTAADwB3
    UkRNbjNLYlV5d1k3ajVbD64WWw+uFsDWBwAAAAAAAAtidWlsZGluZy0xMAEPAHdSRE1uM0tiVXl3
    WTdqNVsPrhZbD64Whe2oWQ==
    '/*!*/;
    ### UPDATE `webservice`.`building`
    ### WHERE
    ###   @1=2006 /* LONGINT meta=0 nullable=0 is_null=0 */
    ###   @2='building-10' /* VARSTRING(192) meta=192 nullable=0 is_null=0 */
    ###   @3=0 /* TINYINT meta=0 nullable=0 is_null=0 */
    ###   @4='wRDMn3KbUywY7j5' /* VARSTRING(384) meta=384 nullable=0 is_null=0 */
    ###   @5=1527754262 /* TIMESTAMP(0) meta=0 nullable=0 is_null=0 */
    ###   @6=1527754262 /* TIMESTAMP(0) meta=0 nullable=0 is_null=0 */
    ### SET
    ###   @1=2006 /* LONGINT meta=0 nullable=0 is_null=0 */
    ###   @2='building-10' /* VARSTRING(192) meta=192 nullable=0 is_null=0 */
    ###   @3=1 /* TINYINT meta=0 nullable=0 is_null=0 */
    ###   @4='wRDMn3KbUywY7j5' /* VARSTRING(384) meta=384 nullable=0 is_null=0 */
    ###   @5=1527754262 /* TIMESTAMP(0) meta=0 nullable=0 is_null=0 */
    ###   @6=1527754262 /* TIMESTAMP(0) meta=0 nullable=0 is_null=0 */
    # at 3949
    #180602 13:29:54 server id 1  end_log_pos 3980 CRC32 0x58226b8f     Xid = 182
    COMMIT/*!*/;

In STATEMENT format, a binlog entry for an UPDATE looks like this:

    SET TIMESTAMP=1527919329/*!*/;
    update building set Status=1 where Id=2000
    /*!*/;
    # at 688
    #180602 14:02:09 server id 1  end_log_pos 719 CRC32 0x4c550a7d  Xid = 200
    COMMIT/*!*/;

Comparing the two UPDATE entries, ROW format records the full before and after values of every column of the modified row, while STATEMENT format records only the UPDATE statement itself. To sync MySQL data to ES in real time, we therefore have to use ROW-format binlogs: fetch and parse the binlog events, then call the ES document APIs to write the changes into the ES cluster.
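The last step above, mapping a parsed ROW event onto the ES document APIs, can be sketched as follows. The event layout here (a dict with "action", "before", and "after") is a hypothetical intermediate format produced by some binlog parser, not the raw binlog itself:

```python
# Minimal sketch: turn a decoded ROW binlog event into an Elasticsearch
# bulk-API action. The event dict shape is a hypothetical intermediate
# format; a real pipeline would fill it from parsed binlog events.
import json

def event_to_bulk_action(event, index="building", pk="Id"):
    meta_key = {"insert": "index", "update": "index", "delete": "delete"}[event["action"]]
    row = event.get("after") or event["before"]
    meta = {meta_key: {"_index": index, "_id": str(row[pk])}}
    lines = [json.dumps(meta)]
    if event["action"] != "delete":
        lines.append(json.dumps(row))  # document body for index/update
    return "\n".join(lines)

# The UPDATE from the ROW-format example above: Status flips 0 -> 1
update = {
    "action": "update",
    "before": {"Id": 2006, "BuildingId": "building-10", "Status": 0},
    "after": {"Id": 2006, "BuildingId": "building-10", "Status": 1},
}
print(event_to_bulk_action(update))
```

Using the primary key as the ES `_id` is what makes UPDATE and DELETE idempotent: replaying the same event overwrites or removes the same document.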

The mysqldump tool

mysqldump is a tool for taking a full export of the data in a MySQL database. It is used as follows:

mysqldump -uelastic -p'Elastic_123' --host=172.16.32.5 -F webservice > dump.sql

The command above exports all data of the database webservice from the remote server 172.16.32.5:3306 into the file dump.sql. The -F (--flush-logs) flag makes the server rotate to a fresh binlog file before the dump starts, so every operation performed after the dump is recorded in the new binlog.
The contents of dump.sql look like this:

-- MySQL dump 10.13  Distrib 5.6.40, for Linux (x86_64)
--
-- Host: 172.16.32.5    Database: webservice
-- ------------------------------------------------------
-- Server version   5.5.5-10.1.9-MariaDBV1.0R012D002-20171127-1822

/*!40101 SET @OLD_CHARACTER_SET_CLIENT=@@CHARACTER_SET_CLIENT */;
/*!40101 SET @OLD_CHARACTER_SET_RESULTS=@@CHARACTER_SET_RESULTS */;
/*!40101 SET @OLD_COLLATION_CONNECTION=@@COLLATION_CONNECTION */;
/*!40101 SET NAMES utf8 */;
/*!40103 SET @OLD_TIME_ZONE=@@TIME_ZONE */;
/*!40103 SET TIME_ZONE='+00:00' */;
/*!40014 SET @OLD_UNIQUE_CHECKS=@@UNIQUE_CHECKS, UNIQUE_CHECKS=0 */;
/*!40014 SET @OLD_FOREIGN_KEY_CHECKS=@@FOREIGN_KEY_CHECKS, FOREIGN_KEY_CHECKS=0 */;
/*!40101 SET @OLD_SQL_MODE=@@SQL_MODE, SQL_MODE='NO_AUTO_VALUE_ON_ZERO' */;
/*!40111 SET @OLD_SQL_NOTES=@@SQL_NOTES, SQL_NOTES=0 */;

--
-- Table structure for table `building`
--

DROP TABLE IF EXISTS `building`;
/*!40101 SET @saved_cs_client     = @@character_set_client */;
/*!40101 SET character_set_client = utf8 */;
CREATE TABLE `building` (
  `Id` bigint(20) unsigned NOT NULL AUTO_INCREMENT COMMENT 'ID',
  `BuildingId` varchar(64) NOT NULL COMMENT '虛擬建筑Id',
  `Status` tinyint(4) NOT NULL DEFAULT '0' COMMENT '虛擬建筑狀態(tài):0、處理中;1、正常;-1,停止;-2,銷毀中;-3,已銷毀',
  `BuildingName` varchar(128) NOT NULL DEFAULT '' COMMENT '虛擬建筑名稱',
  `CreateTime` timestamp NOT NULL DEFAULT '2017-12-03 16:00:00' COMMENT '創(chuàng)建時間',
  `UpdateTime` timestamp NOT NULL DEFAULT '2017-12-03 16:00:00' COMMENT '更新時間',
  PRIMARY KEY (`Id`),
  UNIQUE KEY `BuildingId` (`BuildingId`)
) ENGINE=InnoDB AUTO_INCREMENT=2010 DEFAULT CHARSET=utf8 COMMENT='虛擬建筑表';
/*!40101 SET character_set_client = @saved_cs_client */;

--
-- Dumping data for table `building`
--

LOCK TABLES `building` WRITE;
/*!40000 ALTER TABLE `building` DISABLE KEYS */;
INSERT INTO `building` VALUES (2000,'building-2',0,'6YFcmntKrNBIeTA','2018-05-30 13:28:31','2018-05-30 13:28:31'),(2001,'building-4',0,'4rY8PcVUZB1vtrL','2018-05-30 13:28:34','2018-05-30 13:28:34'),(2002,'building-5',0,'uyjHVUYrg9KeGqi','2018-05-30 13:28:37','2018-05-30 13:28:37'),(2003,'building-7',0,'DNhyEBO4XEkXpgW','2018-05-30 13:28:40','2018-05-30 13:28:40'),(2004,'building-1',0,'TmtYX6ZC0RNB4Re','2018-05-30 13:28:43','2018-05-30 13:28:43'),(2005,'building-6',0,'t8YQcjeXefWpcyU','2018-05-30 13:28:49','2018-05-30 13:28:49'),(2006,'building-10',0,'WozgBc2IchNyKyE','2018-05-30 13:28:55','2018-05-30 13:28:55'),(2007,'building-3',0,'yJk27cmLOVQLHf1','2018-05-30 13:28:58','2018-05-30 13:28:58'),(2008,'building-9',0,'RSbjotAh8tymfxs','2018-05-30 13:29:04','2018-05-30 13:29:04'),(2009,'building-8',0,'IBOMlhaXV6k226m','2018-05-30 13:29:31','2018-05-30 13:29:31');
/*!40000 ALTER TABLE `building` ENABLE KEYS */;
UNLOCK TABLES;

/*!40103 SET TIME_ZONE=@OLD_TIME_ZONE */;

/*!40101 SET SQL_MODE=@OLD_SQL_MODE */;
/*!40014 SET FOREIGN_KEY_CHECKS=@OLD_FOREIGN_KEY_CHECKS */;
/*!40014 SET UNIQUE_CHECKS=@OLD_UNIQUE_CHECKS */;
/*!40101 SET CHARACTER_SET_CLIENT=@OLD_CHARACTER_SET_CLIENT */;
/*!40101 SET CHARACTER_SET_RESULTS=@OLD_CHARACTER_SET_RESULTS */;
/*!40101 SET COLLATION_CONNECTION=@OLD_COLLATION_CONNECTION */;
/*!40111 SET SQL_NOTES=@OLD_SQL_NOTES */;

-- Dump completed on 2018-06-02 14:23:51

As the output shows, the SQL file produced by mysqldump contains the CREATE TABLE and DROP TABLE statements and the INSERTs for the data, but no CREATE DATABASE statement (that would require the -B/--databases option).
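Before wiring up incremental sync, the dump can be inspected programmatically. A small sketch that lists the tables covered by a dump file, e.g. to decide which ES indices to create before the initial bulk load (the sample input is an inlined fragment of the dump above):

```python
# Minimal sketch: extract the table names covered by a mysqldump file,
# e.g. to decide which ES indices to create before the initial load.
import re

def dumped_tables(dump_sql: str):
    # mysqldump emits one "CREATE TABLE `name` (" per exported table
    return re.findall(r"CREATE TABLE `([^`]+)`", dump_sql)

sample = """
DROP TABLE IF EXISTS `building`;
CREATE TABLE `building` (
  `Id` bigint(20) unsigned NOT NULL AUTO_INCREMENT COMMENT 'ID',
  PRIMARY KEY (`Id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
"""
print(dumped_tables(sample))  # ['building']
```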

Syncing data to ES with the open-source tool go-mysql-elasticsearch

go-mysql-elasticsearch is an open-source tool for syncing MySQL data into an ES cluster; the project lives on GitHub at https://github.com/siddontang/go-mysql-elasticsearch

go-mysql-elasticsearch works as follows. On first startup it runs mysqldump for a one-off full sync of the source database, writing the data into ES through an Elasticsearch client. It then acts as a MySQL client connecting to the source as a replication slave: the source, acting as master, streams every data change to it as binlog events, and by parsing those events the tool obtains the changed rows and writes them into ES.

The tool also keeps operation statistics: every insert, update, or delete bumps the corresponding counter, and an HTTP server started with the process exposes these counts through an HTTP endpoint.
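As a sketch, those counters could be polled and parsed like this. The `/stat` path and the newline-separated `name: count` response layout are assumptions here, not taken from the project's docs; check the tool's actual output:

```python
# Minimal sketch: poll the tool's status HTTP endpoint and parse the
# counters. The "/stat" path and "name: count" line layout below are
# hypothetical; verify them against the tool's actual response.
import urllib.request

def parse_counters(body: str):
    counters = {}
    for line in body.splitlines():
        if ":" in line:
            name, _, value = line.partition(":")
            counters[name.strip()] = int(value.strip())
    return counters

def fetch_counters(url="http://127.0.0.1:12800/stat"):
    # Network call against the stat_addr configured in river.toml
    with urllib.request.urlopen(url) as resp:
        return parse_counters(resp.read().decode("utf-8"))

sample = "insert_num: 3\nupdate_num: 1\ndelete_num: 0"
print(parse_counters(sample))  # {'insert_num': 3, 'update_num': 1, 'delete_num': 0}
```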

Limitations:

    1. The MySQL binlog must be in ROW format.
    2. Every table to be synced must have a primary key; tables without one are skipped outright, because without a primary key UPDATE and DELETE operations cannot locate the corresponding ES document to sync against.
    3. Changing a table's schema while the program is running is not supported.
    4. The MySQL account used for the connection must be granted the RELOAD, REPLICATION, and SUPER privileges:
       GRANT REPLICATION SLAVE ON *.* TO 'elastic'@'172.16.32.44';
       GRANT RELOAD ON *.* TO 'elastic'@'172.16.32.44';
       UPDATE mysql.user SET Super_Priv='Y' WHERE user='elastic' AND host='172.16.32.44';

Usage:

  1. git clone https://github.com/siddontang/go-mysql-elasticsearch
  2. cd go-mysql-elasticsearch/src/github.com/siddontang/go-mysql-elasticsearch
  3. Edit etc/river.toml to sync the webservice.building table of the MySQL instance at 172.16.0.101:3306 into the building index of the ES cluster at 172.16.32.64:9200 (see the project documentation for a full description of the configuration file):
    # MySQL address, user and password
    # user must have replication privilege in MySQL.
    my_addr = "172.16.0.101:3306"
    my_user = "bellen"
    my_pass = "Elastic_123"
    my_charset = "utf8"
    
    # Set true when elasticsearch use https
    #es_https = false
    # Elasticsearch address
    es_addr = "172.16.32.64:9200"
    # Elasticsearch user and password, maybe set by shield, nginx, or x-pack
    es_user = ""
    es_pass = ""
    
    # Path to store data, like master.info, if not set or empty,
    # we must use this to support breakpoint resume syncing.
    # TODO: support other storage, like etcd.
    data_dir = "./var"
    
    # Inner Http status address
    stat_addr = "127.0.0.1:12800"
    
    # pseudo server id like a slave
    server_id = 1001
    
    # mysql or mariadb
    flavor = "mariadb"
    
    # mysqldump execution path
    # if not set or empty, ignore mysqldump.
    mysqldump = "mysqldump"
    
    # if we have no privilege to use mysqldump with --master-data,
    # we must skip it.
    #skip_master_data = false
    
    # minimal items to be inserted in one bulk
    bulk_size = 128
    
    # force flush the pending requests if we don't have enough items >= bulk_size
    flush_bulk_time = "200ms"
    
    # Ignore table without primary key
    skip_no_pk_table = false
    
    # MySQL data source
    [[source]]
    schema = "webservice"
    tables = ["building"]
    [[rule]]
    schema = "webservice"
    table = "building"
    index = "building"
    type = "buildingtype"
  4. Create the building index in the ES cluster beforehand: the tool does not use ES's auto-create-index feature and reports an error if the index does not exist.

  5. Run: ./bin/go-mysql-elasticsearch -config=./etc/river.toml

  6. Console output:

2018/06/02 16:13:21 INFO  create BinlogSyncer with config {1001 mariadb 172.16.0.101 3306 bellen   utf8 false false <nil> false false 0 0s 0s 0}
2018/06/02 16:13:21 INFO  run status http server 127.0.0.1:12800
2018/06/02 16:13:21 INFO  skip dump, use last binlog replication pos (mysql-bin.000001, 120) or GTID %!s(<nil>)
2018/06/02 16:13:21 INFO  begin to sync binlog from position (mysql-bin.000001, 120)
2018/06/02 16:13:21 INFO  register slave for master server 172.16.0.101:3306
2018/06/02 16:13:21 INFO  start sync binlog at binlog file (mysql-bin.000001, 120)
2018/06/02 16:13:21 INFO  rotate to (mysql-bin.000001, 120)
2018/06/02 16:13:21 INFO  rotate binlog to (mysql-bin.000001, 120)
2018/06/02 16:13:21 INFO  save position (mysql-bin.000001, 120)
  7. Test: inserts, updates, and deletes performed in MySQL are all reflected in ES.
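The test in step 7 can be spot-checked by comparing a row fetched from MySQL with the `_source` of the corresponding ES document. A minimal field-by-field diff helper; the row and document shapes below are illustrative:

```python
# Minimal sketch: compare a MySQL row (as a dict) with the _source of
# the corresponding ES document to verify that a change was synced.
def diff_row(mysql_row: dict, es_source: dict):
    # Returns {field: (mysql_value, es_value)} for every mismatch
    fields = set(mysql_row) | set(es_source)
    return {
        f: (mysql_row.get(f), es_source.get(f))
        for f in fields
        if mysql_row.get(f) != es_source.get(f)
    }

row = {"Id": 2006, "BuildingId": "building-10", "Status": 1}
doc = {"Id": 2006, "BuildingId": "building-10", "Status": 0}
print(diff_row(row, doc))  # {'Status': (1, 0)} -> ES lags behind MySQL
```

An empty result means the document is in sync with the row.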

Impressions

  • go-mysql-elasticsearch covers the basic real-time MySQL-to-ES sync. If the business needs more, such as altering table schemas at runtime, it can be customized further.
  • Error handling is weak: a failure to parse a binlog event simply raises an exception.
  • According to the author, the project has not been used in production, so it is advisable to read through the source before relying on it, to understand its strengths and weaknesses.

Syncing data to an ES cluster with mypipe

mypipe is a MySQL binlog replication tool. It was originally designed to ship binlog events to Kafka, but the current version can be customized to sync the data into any storage backend. The project lives on GitHub at https://github.com/mardambey/mypipe.

Limitations

    1. The MySQL binlog must be in ROW format.
    2. The MySQL account used for the connection must be granted the replication privileges:
       GRANT REPLICATION SLAVE, REPLICATION CLIENT ON *.* TO 'elastic'@'%' IDENTIFIED BY 'Elastic_123'
    3. mypipe only parses the binlog, encodes the entries as Avro, and pushes them to a Kafka broker; it does not write the data to ES. To sync into an ES cluster, consume the messages from Kafka and write them to ES yourself.
    4. Consuming the Kafka messages (MySQL insert/update/delete operations together with the row data) requires Avro-decoding the payload to recover the operation and its content for further processing. mypipe provides a KafkaGenericMutationAvroConsumer class that can be extended directly, or the messages can be parsed by hand.
    5. mypipe syncs only the binlog; it cannot sync pre-existing data, i.e. rows already in MySQL when mypipe starts are never synced.

Usage

  1. git clone https://github.com/mardambey/mypipe.git
  2. ./sbt package
  3. Configure mypipe-runner/src/main/resources/application.conf:
mypipe {

  # Avro schema repository client class name
  schema-repo-client = "mypipe.avro.schema.SchemaRepo"

  # consumers represent sources for mysql binary logs
  consumers {

    localhost {
      # database "host:port:user:pass" array
      source = "172.16.0.101:3306:elastic:Elastic_123"
    }
  }

  # data producers export data out (stdout, other stores, external services, etc.)
  producers {

    kafka-generic {
      class = "mypipe.kafka.producer.KafkaMutationGenericAvroProducer"
    }
  }

  # pipes join consumers and producers
  pipes {

    kafka-generic {
      enabled = true
      consumers = ["localhost"]
      producer {
        kafka-generic {
          metadata-brokers = "172.16.16.22:9092"
        }
      }
      binlog-position-repo {
        # saved to a file, this is the default if unspecified
        class = "mypipe.api.repo.ConfigurableFileBasedBinaryLogPositionRepository"
        config {
          file-prefix = "stdout-00"     # required if binlog-position-repo is specified
          data-dir = "/tmp/mypipe/data" # defaults to mypipe.data-dir if not present
        }
      }
    }
  }
}
  4. Configure mypipe-api/src/main/resources/reference.conf, changing the include-event-condition option to select the database and table to sync:
include-event-condition = """ db == "webservice" && table =="building" """
  5. Create the topic webservice_building_generic on the Kafka broker: by default mypipe names the topic "${db}_${table}_generic" and sends its data there.

  6. Run: ./sbt "project runner" "runMain mypipe.runner.PipeRunner"

  7. Test: insert rows into the building table in MySQL, and write a simple consumer to read the messages mypipe pushes into Kafka.

  8. A consumed message, before any decoding, looks like this:

ConsumerRecord(topic=u'webservice_building_generic', partition=0, offset=2, timestamp=None, timestamp_type=None, key=None,
 value='\x00\x01\x00\x00\x14webservice\x10building\xcc\x01\x02\x91,\xae\xa3fc\x11\xe8\xa1\xaaRT\x00Z\xf9\xab\x00\x00\x04\x18BuildingName\x06xxx\x14BuildingId\nId-10\x00\x02\x04Id\xd4%\x00', 
checksum=128384379, serialized_key_size=-1, serialized_value_size=88)
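The raw record above came from a consumer along these lines. The topic-name helper just follows the "${db}_${table}_generic" convention stated in step 5; the consumer loop assumes the third-party kafka-python package and a reachable broker, and the value it prints is still Avro-encoded:

```python
# Minimal sketch of a consumer for mypipe's Kafka output. The loop
# assumes the third-party kafka-python package; the topic helper just
# follows mypipe's default "${db}_${table}_generic" naming scheme.
def generic_topic(db: str, table: str) -> str:
    return f"{db}_{table}_generic"

def consume(broker="172.16.16.22:9092"):
    from kafka import KafkaConsumer  # third-party, assumed installed
    consumer = KafkaConsumer(generic_topic("webservice", "building"),
                             bootstrap_servers=broker,
                             auto_offset_reset="earliest")
    for record in consumer:
        # record.value is Avro-encoded; decode it (e.g. by extending
        # KafkaGenericMutationAvroConsumer) before further processing
        print(record)

print(generic_topic("webservice", "building"))  # webservice_building_generic
```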

Impressions

  • mypipe is more mature than go-mysql-elasticsearch: it supports ALTER TABLE at runtime, and when parsing a binlog entry fails, the error-handling strategy is configurable.
  • mypipe cannot sync pre-existing data. If that is needed, do a full sync through some other means first, then use mypipe for incremental sync.
  • mypipe only syncs the binlog; pushing the data on into ES requires additional development.
