Druid: Real-Time Data Ingestion from Kafka via Tranquility

Data Ingestion Modes

As introduced earlier, the streaming world has two data-processing patterns: Stream Push and Stream Pull.

  • Stream Pull: if Druid pulls data from an external source on its own and generates Indexing Service tasks from it, we need to set up a Real-Time Node. A Real-Time Node contains two main "factories": the Firehose, which connects to the streaming source and handles data intake (the name, literally a water hose, describes its job vividly), and the Plumber, which handles Segment publishing and hand-off (again, a vivid name for the job). In the Druid source code both are abstract factory methods, so users can create their own Firehose or Plumber implementations as needed. To me, Firehose and Plumber feel a lot like the Kafka Connect framework released with Kafka 0.9.0: a Firehose resembles a Kafka Connect Source, defining where data comes in without caring about the source type, while a Plumber resembles a Kafka Connect Sink, defining where data goes out without caring about the final destination.

  • Stream Push: with the Stream Push strategy, we need a "copy service" that pulls data from the source, generates Indexing Service tasks, and pushes the data into Druid. This is the mode we had been using before Druid 0.9.1. It relies on the external service Tranquility, an HTTP client that sends data streams to Druid. Tranquility can connect to many streaming sources such as Spark Streaming, Storm, and Kafka, which is why external components like Tranquility-Storm and Tranquility-Kafka exist.

Real-Time Stream Ingestion Options

  • Standalone Realtime Node (stream pull)

  • Indexing Service + Tranquility (stream push)

  • Kafka Indexing Service

Characteristics of Tranquility Ingestion

  • Can be seen as a client of Druid

  • Can be pulled in as a JAR dependency; typically it is embedded in stream-processing frameworks such as Flink, Spark Streaming, and Samza

  • Can be deployed as a standalone Java application

  • 管理任務生命周期

  • Submits real-time tasks on a schedule

  • Supports task replicas and a configurable task count

  • Discovers real-time nodes via service discovery

  • Consumes Kafka data and pushes it to real-time nodes over HTTP

  • topicPattern: the name of the topic(s) to consume; regular-expression matching is supported (see the example after this list). Full configuration reference:
    https://github.com/druid-io/tranquility/blob/master/docs/configuration.md

  • Data can be preprocessed with JavaScript code
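To illustrate the regex matching (the topic names here are hypothetical): setting "topicPattern" : "xxad-.*" in a data source's properties block would let a single Tranquility data source consume every Kafka topic whose name starts with xxad-.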

A Tranquility Example

The data we need to parse looks like this:

[192.168.11.11]    [11/Dec/2018:20:59:18 +0800]    [GET /log/xxad?userAgent=Mozilla/5.0%20(%20CPU%20iPhone%20OS%2012_1%20like%20Mac%20OS%20X)%20AppleWebKit/605.1.15%20(KHTML,%20like%20Gecko)%20Version/12.0%20Mobile/15E148%20Safari/604.1&os=2&networkId=17&logType=1&jsCodeId=790431196 HTTP/1.1] [-] [Mozilla/5.0 (iPhone; CPU iPhone OS 12_1 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.0 Mobile/15E148 Safari/604.1]   [https://xx.com/question/72aef851c9b495fe218d579ce4db682b.html] 204

Fields are separated by \t (tab).
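To make the layout concrete, here is a minimal Java sketch of the tab split (variable names are mine; only these first three fields are used by the ingestion spec below):

    // Split one log line on tabs, as the parse function below does.
    String[] infos = logLine.split("\t");
    String ip      = infos[0]; // "[192.168.11.11]"
    String reqTime = infos[1]; // "[11/Dec/2018:20:59:18 +0800]"
    String reqLine = infos[2]; // starts with "[GET /log/xxad?...]", where the query parameters live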

這里給出一個json文件偽代碼,僅供參考

{
    "dataSources" : {
        "xx_ad" : {
            "spec" : {
                "dataSchema" : {
                    "dataSource" : "xx_ad",
                    "parser" : {
                        "type" : "string",
                        "parseSpec" : {
                            "timestampSpec" : {
                                "column" : "req_time",
                                "format" : "yyyy-MM-dd HH:mm:ss"
                            },
                            "dimensionsSpec" : {
                                "dimensions" : ["jsCodeId","userAgent"]
                            },
                         "function" : "function(str) {
                            省略部分js代碼
                        var req=0,resp=0,show=0;if(logType==\"1\"){req=1;return {req_time:req_time,jsCodeId:jsCodeId,userAgent:userAgent,req:req,resp:resp,show:show}}else if(logType==\"2\"){resp=1;return {req_time:req_time,jsCodeId:jsCodeId,userAgent:userAgent,req:req,resp:resp,show:show}}else if(logType==\"3\"){show=1;return {req_time:req_time,jsCodeId:jsCodeId,userAgent:userAgent,req:req,resp:resp,show:show}}}",
"format" : "javascript"
                        }
                    },
                    "granularitySpec" : {
                        "type" : "uniform",
                        "segmentGranularity" : "hour",
                        "queryGranularity" : "hour"
                    },
                    "metricsSpec" : [{
                            "type" : "count",
                            "name" : "count"
                        },{
                            "name" : "req_sum",
                            "type" : "longSum",
                            "fieldName" : "req"
                        },{
                            "name" : "resp_sum",
                            "type" : "longSum",
                            "fieldName" : "resp"
                        },{
                            "name" : "show_sum",
                            "type" : "longSum",
                            "fieldName" : "show"
                        }
                    ]
                },
                "ioConfig" : {
                    "type" : "realtime"
                },
                "tuningConfig" : {
                    "type" : "realtime",
                    "maxRowsInMemory" : "100000",
                    "intermediatePersistPeriod" : "PT15M",
                    "windowPeriod" : "PT4H"
                }
            },
            "properties" : {
                "task.partitions" : "1",
                "task.replicants" : "1",
                "topicPattern" : "xxad"
            }
        }
    },
    "properties" : {
        "zookeeper.connect" : "192.168.11.21:2181",
        "druid.discovery.curator.path" : "/druid/discovery",
        "druid.selectors.indexing.serviceName" : "druid/overlord",
        "commit.periodMillis" : "15000",
        "consumer.numThreads" : "2",
        "kafka.zookeeper.connect" : "192.168.48.11:2181,192.168.48.12:2181,192.168.48.13:2181",
        "kafka.group.id" : "tranquility-xx-ad"
    }
}
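With the stock Tranquility distribution, a config file like this is typically launched via bin/tranquility kafka -configFile xx_ad.json (the file path here is assumed); Tranquility then consumes the matching Kafka topics and pushes parsed rows to the indexing service.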

The JS code computes request, response, and impression counts (req/resp/show) from the logType field, with jsCodeId and userAgent as dimensions.
Since the time column arrives in the form [11/Dec/2018:20:59:18 +0800], we first tried configuring the timestamp format directly:

"timestampSpec" : {
    "column" : "req_time",
    "format" : "dd/MMM/yyyyHH:mm:ss"
    }

發現時間并不能解析,我們通過官網發現
http://druid.io/docs/latest/ingestion/ingestion-spec.html

The timestampSpec format must be a Joda-Time pattern. Joda-Time is a Java library, so patterns can be verified with a short piece of Java:

        DateTime end_date = DateTime.parse("20-12-2018:20:20:20 +0800", DateTimeFormat.forPattern("dd-MM-yyyy:HH:mm:ss +0800"));
        System.out.println("end_date:" + end_date);

This prints:

end_date:2018-12-20T20:20:20.000+08:00

But with an abbreviated month name in the input while the pattern still expects a numeric month, like this:

 DateTime end_date = DateTime.parse("20-Dec-2018:20:20:20 +0800", DateTimeFormat.forPattern("dd-MM-yyyy:HH:mm:ss +0800"));
        System.out.println("end_date:" + end_date);

parsing fails:

Exception in thread "main" java.lang.IllegalArgumentException: Invalid format: "20-Dec-2018:20:20:20 +0800" is malformed at "Dec-2018:20:20:20 +0800"

The MM pattern only matches a numeric month, so the abbreviated name Dec cannot be parsed; month names need MMM (and a matching locale). In short, make sure the time string conforms to a format Joda-Time actually accepts.
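For reference, a minimal sketch showing that the raw log timestamp can be parsed directly once MMM and an English locale are used (this pattern is my assumption, not what we ended up shipping; we converted the time in JS instead, as shown below):

    import java.util.Locale;
    import org.joda.time.DateTime;
    import org.joda.time.format.DateTimeFormat;
    import org.joda.time.format.DateTimeFormatter;

    // Assumed pattern: MMM matches "Dec"; Z matches the "+0800" offset.
    DateTimeFormatter fmt = DateTimeFormat.forPattern("dd/MMM/yyyy:HH:mm:ss Z")
            .withLocale(Locale.ENGLISH)
            .withOffsetParsed(); // keep the parsed offset instead of converting to the JVM default zone
    DateTime logTime = fmt.parseDateTime("11/Dec/2018:20:59:18 +0800");
    System.out.println(logTime); // 2018-12-11T20:59:18.000+08:00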

Joda-Time needs to be added as a dependency:

 <dependency>
      <groupId>joda-time</groupId>
      <artifactId>joda-time</artifactId>
      <version>2.9.7</version>
    </dependency>

js代碼解析并沒有結束,我們還遇到了有一個問題,我們隊[Get ....] 請求做切分才能拿到請求參數

[GET /log/xxad?userAgent=Mozilla/5.0%20(%20CPU%20iPhone%20OS%2012_1%20like%20Mac%20OS%20X)%20AppleWebKit/605.1.15%20(KHTML,%20like%20Gecko)%20Version/12.0%20Mobile/15E148%20Safari/604.1&os=2&networkId=17&logType=1&jsCodeId=790431196 HTTP/1.1] [-] [Mozilla/5.0 (iPhone; CPU iPhone OS 12_1 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.0 Mobile/15E148 Safari/604.1]

開始時,我這里用的是Map,js偽代碼如下:

var queryParam = param.split("&");
var queryMap = new Map();
for (var i = 0; i < queryParam.length; i++) {
    var arr = queryParam[i].split("=");
    queryMap.set(arr[0], arr[1]);
}

然后我們從中取出logType

var logType = queryMap.get("logType");
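Note in passing that values such as userAgent stay percent-encoded (%20 and so on) after this split. If decoded values were wanted in the dimension, the standard decodeURIComponent built-in (part of ECMAScript 3, so available in embedded JS engines) could be applied to the value; we kept the raw encoded strings here.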

之后我們啟動這個json文件,發現數據并未解析成功

2018-12-14 20:15:03,842 [KafkaConsumer-CommitThread] INFO  c.m.tranquility.kafka.KafkaConsumer - Flushed {qbad={receivedCount=4438, sentCount=0, droppedCount=0, unparseableCount=4438}} pending messages in 0ms and committed offsets in 47ms.
2018-12-14 20:15:18,887 [KafkaConsumer-CommitThread] INFO  c.m.tranquility.kafka.KafkaConsumer - Flushed {qbad={receivedCount=4544, sentCount=0, droppedCount=0, unparseableCount=4544}} pending messages in 0ms and committed offsets in 44ms.

unparseableCount is the number of records that failed to parse; receivedCount is the number of messages read from Kafka, sentCount the number successfully pushed to Druid, and droppedCount the number dropped (for example, events falling outside the window period). Here everything received was unparseable.

After some digging we found that Map is not recognized inside Druid's JS block (Druid evaluates these functions with the embedded Mozilla Rhino engine, which does not provide the ES6 Map type), so we switched to plain variables and string comparisons (shown unescaped here; in the JSON file the quotes are escaped):

var logType = 0, jsCodeId = "", userAgent = "";
for (var i = 0; i < queryParam.length; i++) {
    var arr = queryParam[i].split("=");
    if (arr[0] == "logType") {
        logType = arr[1];
    }
    if (arr[0] == "jsCodeId") {
        jsCodeId = arr[1];
    }
    if (arr[0] == "userAgent") {
        userAgent = arr[1];
    }
}

Running the JSON file again:

2018-12-14 21:23:23,136 [KafkaConsumer-CommitThread] INFO  c.m.tranquility.kafka.KafkaConsumer - Flushed {qbad={receivedCount=4180, sentCount=4180, droppedCount=0, unparseableCount=0}} pending messages in 0ms and committed offsets in 35ms.
2018-12-14 21:23:38,168 [KafkaConsumer-CommitThread] INFO  c.m.tranquility.kafka.KafkaConsumer - Flushed {qbad={receivedCount=4261, sentCount=4261, droppedCount=0, unparseableCount=0}} pending messages in 0ms and committed offsets in 31ms.

通過日志發現這次解析沒有問題

Summary:
Although the Tranquility approach lets you plug in JS for data parsing, not every JS feature works inside Druid; these pitfalls have to be discovered by trial and error.
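One way to shorten that trial-and-error loop, as a minimal sketch of my own (not part of the original setup): evaluate the function locally with the JDK's javax.script engine before deploying. Druid embeds Mozilla Rhino while Java 8 ships Nashorn, so this is only an approximation, but it catches gross incompatibilities; neither engine supports the ES6 Map in its default mode.

    import javax.script.Invocable;
    import javax.script.ScriptEngine;
    import javax.script.ScriptEngineManager;

    public class JsSmokeTest {
        public static void main(String[] args) throws Exception {
            // Rough compatibility check: run the parse logic outside Druid.
            // On Java 8 this returns the Nashorn engine.
            ScriptEngine engine = new ScriptEngineManager().getEngineByName("javascript");
            engine.eval("function parse(param) {"
                    + "  var logType = 0;"
                    + "  var queryParam = param.split('&');"
                    + "  for (var i = 0; i < queryParam.length; i++) {"
                    + "    var arr = queryParam[i].split('=');"
                    + "    if (arr[0] == 'logType') { logType = arr[1]; }"
                    + "  }"
                    + "  return logType;"
                    + "}");
            Object logType = ((Invocable) engine).invokeFunction("parse", "os=2&logType=1&jsCodeId=790431196");
            System.out.println(logType); // prints 1
        }
    }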

The complete JSON file is attached below:

xx_ad.json
{
    "dataSources" : {
        "xx_ad" : {
            "spec" : {
                "dataSchema" : {
                    "dataSource" : "xx_ad",
                    "parser" : {
                        "type" : "string",
                        "parseSpec" : {
                            "timestampSpec" : {
                                "column" : "req_time",
                                "format" : "yyyy-MM-dd HH:mm:ss"
                            },
                            "dimensionsSpec" : {
                                "dimensions" : ["jsCodeId","userAgent"]
                            },
                         "function" : "function(str) {var infos = str.split(\"\t\");var time = infos[1].replace(\"[\",\"\").replace(\"]\",\"\");var tmp = time.split(\" \");var tmpTime = tmp[0].split(\":\");
                         var hhmmss= tmpTime[1] + \":\" + tmpTime[2] + \":\" +tmpTime[3];
                         var month = new Array();month[\"Jan\"] = \"01\";month[\"Feb\"] = \"02\";month[\"Mar\"] = \"03\";month[\"Apr\"] = \"04\";month[\"May\"] = \"05\";month[\"Jun\"] = \"06\";month[\"Jul\"] = \"07\";month[\"Aug\"] = \"08\";month[\"Sep\"] = \"09\";month[\"Oct\"] = \"10\";month[\"Nov\"] = \"11\";month[\"Dec\"] = \"12\";var yyyymmdd = tmpTime[0];
                         var yyyymmddStr = yyyymmdd.split(\"/\");var req_time = yyyymmddStr[2] + \"-\" + month[yyyymmddStr[1]] + \"-\" + yyyymmddStr[0] + \" \" + hhmmss;var firstIndex = infos[2].indexOf(\"[\");var firstInfo = infos[2].substring(firstIndex+1,infos[2].length);var lastIndex = firstInfo.indexOf(\"]\");var queryStr = firstInfo.substring(0,lastIndex);var queryPart = queryStr.split(\" \");var queryUrl = queryPart[1];var index = queryUrl.indexOf(\"?\");var param=queryUrl.substring(index + 1);var queryParam = param.split(\"&\");
                            var logType=0,jsCodeId=\"\",userAgent=\"\";
                            for(var i=0;i< queryParam.length;i++){var arr = queryParam[i].split(\"=\");
                                if(arr[0]==\"logType\"){
                                    logType=arr[1];
                                }
                                if(arr[0]==\"jsCodeId\"){
                                    jsCodeId=arr[1];
                                }
                                if(arr[0]==\"userAgent\"){
                                    userAgent=arr[1];
                                }

                        }
                        var req=0,resp=0,show=0;if(logType==\"1\"){req=1;return {req_time:req_time,jsCodeId:jsCodeId,userAgent:userAgent,req:req,resp:resp,show:show}}else if(logType==\"2\"){resp=1;return {req_time:req_time,jsCodeId:jsCodeId,userAgent:userAgent,req:req,resp:resp,show:show}}else if(logType==\"3\"){show=1;return {req_time:req_time,jsCodeId:jsCodeId,userAgent:userAgent,req:req,resp:resp,show:show}}}",
"format" : "javascript"
                        }
                    },
                    "granularitySpec" : {
                        "type" : "uniform",
                        "segmentGranularity" : "hour",
                        "queryGranularity" : "hour"
                    },
                    "metricsSpec" : [{
                            "type" : "count",
                            "name" : "count"
                        },{
                            "name" : "req_sum",
                            "type" : "longSum",
                            "fieldName" : "req"
                        },{
                            "name" : "resp_sum",
                            "type" : "longSum",
                            "fieldName" : "resp"
                        },{
                            "name" : "show_sum",
                            "type" : "longSum",
                            "fieldName" : "show"
                        }
                    ]
                },
                "ioConfig" : {
                    "type" : "realtime"
                },
                "tuningConfig" : {
                    "type" : "realtime",
                    "maxRowsInMemory" : "100000",
                    "intermediatePersistPeriod" : "PT15M",
                    "windowPeriod" : "PT4H"
                }
            },
            "properties" : {
                "task.partitions" : "1",
                "task.replicants" : "1",
                "topicPattern" : "xxad"
            }
        }
    },
    "properties" : {
        "zookeeper.connect" : "192.168.11.21:2181",
        "druid.discovery.curator.path" : "/druid/discovery",
        "druid.selectors.indexing.serviceName" : "druid/overlord",
        "commit.periodMillis" : "15000",
        "consumer.numThreads" : "2",
        "kafka.zookeeper.connect" : "192.168.48.11:2181,192.168.48.12:2181,192.168.48.13:2181",
        "kafka.group.id" : "tranquility-offline-qb-ad"
    }
}
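Once the pipeline is running, a quick way to verify ingestion (a sketch; the interval and broker host are assumptions) is a Druid timeseries query over the stored metrics:

{
    "queryType" : "timeseries",
    "dataSource" : "xx_ad",
    "granularity" : "hour",
    "intervals" : [ "2018-12-14/2018-12-15" ],
    "aggregations" : [
        { "type" : "longSum", "name" : "req_sum", "fieldName" : "req_sum" },
        { "type" : "longSum", "name" : "resp_sum", "fieldName" : "resp_sum" },
        { "type" : "longSum", "name" : "show_sum", "fieldName" : "show_sum" }
    ]
}

POSTing this to the broker (default port 8082), e.g. curl -XPOST http://<broker-host>:8082/druid/v2/?pretty -H 'Content-Type: application/json' -d @query.json, should return hourly sums once segments start landing.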
