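Neil Zhu (Jianshu ID: Not_GOD) is the founder and Chief Scientist of University AI, dedicated to advancing the adoption of artificial intelligence worldwide. He sets and carries out UAI's medium- and long-term growth strategy and goals, and has led the team's rapid growth into a specialized force in the AI field.
As an industry leader, he and UAI founded TASA (the earliest AI society in China), DL Center (a global value network for deep learning knowledge), and AI growth (industry think-tank training) in 2014, contributing to the development of AI talent in China. He has also taken part in or hosted various international AI summits and events, has written some 600,000 characters of technical content on AI, and produced the Chinese translation of the introductory deep learning book "Neural Networks and Deep Learning". His content has been widely reprinted and serialized by specialist public accounts and media. He has been invited to design AI curricula and teach frontier AI courses at leading Chinese universities, to positive feedback from students and teachers.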
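[This article is translated from the elasticsearch-carrot2 usage examples]
Introduction
Carrot2 - Open Source Search Results Clustering Engine is an open source engine for clustering search results. It automatically organizes search results into smaller thematic categories based on their content. This article introduces the carrot2 plugin for Elasticsearch.
Basic concepts
carrot2 is a clustering plugin that automatically groups similar documents together and assigns each group a label that users can reasonably understand. Such clustering can also be seen as a kind of dynamic facet computed for each query and its set of hits. You can try the tool on the Carrot2 demo page.
Each document to be clustered consists of several logical units: a document identifier, the original URL, a title, the main content, and a language code. Only the identifier is mandatory. The other units are optional, but at least one of them must be provided for clustering to make sense.
Documents indexed in Elasticsearch do not have to follow any predefined schema, so the actual fields of a JSON document must be mapped onto the logical units required by the clustering plugin. An example mapping:
[Figure: fields of a JSON document mapped to the logical units of a document to be clustered]
Note that two fields of the document are mapped to TITLE. This is not a mistake: any number of fields can be mapped to TITLE or CONTENT, and their contents are concatenated for clustering.
Logical units can also be filled with generated content, for example the output of highlighting applied to a document's fields. This can greatly reduce the amount of text passed to the clustering algorithm (improving performance) and also makes the clustered content more relevant to the query (improving cluster quality). The REST API section below covers field mapping in detail.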
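To make the mapping idea concrete, here is a minimal sketch of how several fields of a search hit might be concatenated into logical units. It is not the plugin's actual implementation: the function name `resolveLogicalFields`, the " . " separator, and the sample hit are all assumptions made for illustration.

```javascript
// Illustrative only: mimics the idea of mapping hit fields onto logical units.
// `resolveLogicalFields` and the " . " separator are hypothetical, not plugin API.
function resolveLogicalFields(hit, fieldMapping) {
  const resolved = {};
  for (const [logicalField, specs] of Object.entries(fieldMapping)) {
    const parts = [];
    for (const spec of specs) {
      // A spec looks like "_source.title", "fields.subject" or "highlight.main".
      const dot = spec.indexOf(".");
      const source = spec.slice(0, dot);
      const name = spec.slice(dot + 1);
      const value = (hit[source] || {})[name];
      if (value === undefined || value === null) continue;
      // "fields" and "highlight" values arrive as arrays in search responses.
      parts.push(...(Array.isArray(value) ? value : [value]));
    }
    // All values mapped to one logical unit are concatenated.
    resolved[logicalField] = parts.join(" . ");
  }
  return resolved;
}

// Hypothetical hit: two source fields feed the logical title.
const sampleHit = {
  _source: { title: "Data Mining", subtitle: "An Overview", body: "Long text" },
  highlight: { body: ["... <em>mining</em> ..."] }
};
const logical = resolveLogicalFields(sampleHit, {
  title: ["_source.title", "_source.subtitle"],
  content: ["highlight.body"]
});
console.log(logical);
```

Mapping the content unit to highlighter output rather than the full body keeps the text passed to clustering short, which is the trade-off described above.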
Java API
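The Java API for clustering search results is fully featured, and it is what the REST requests described below are built on. See the plugin's source code on GitHub, in particular the unit and integration tests.
HTTP (REST) API
The HTTP REST API provides several methods that mirror the functionality of the Java API. They are described in detail below.
Listing available algorithms
/_algorithms (GET or POST)
This operation lists all available clustering algorithms. The returned identifiers can be used as a parameter of clustering requests.
Request
A simple GET or POST request to the /_algorithms URL.
Response
The response is a JSON object with an algorithms property that holds a list of algorithm identifiers. The example below shows the algorithms available in the plugin installation used for these examples. The default algorithm is the first one in the returned list.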
$.get("/_algorithms", function(response) {
$("#list-of-algorithms").text(
response.algorithms.join("\n"));
});
lingo
stc
kmeans
byurl
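As a small illustration of how the list might be used on the client side, the sketch below picks a preferred algorithm if it is available and otherwise falls back to the first entry, which the plugin treats as the default. The helper name `chooseAlgorithm` is an assumption, not part of the plugin.

```javascript
// Pick the algorithm to use from an /_algorithms response.
// `chooseAlgorithm` is a hypothetical helper, not part of the plugin.
function chooseAlgorithm(response, preferred) {
  const available = response.algorithms || [];
  if (preferred && available.includes(preferred)) return preferred;
  // The first listed algorithm is the default one.
  return available[0];
}

// Response shape as shown in the example output above.
const listed = { algorithms: ["lingo", "stc", "kmeans", "byurl"] };
const chosen = chooseAlgorithm(listed, "stc");
const fallback = chooseAlgorithm(listed, "lingo3g");
console.log(chosen, fallback);
```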
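Searching and clustering results
/_search_with_clusters (POST, GET)
/{index}/_search_with_clusters (POST, GET)
/{index}/{type}/_search_with_clusters (POST, GET)
This operation runs a search query, fetches the matching hits, and clusters them.
The index and type URI components implicitly bind the search request to a given index and document type, as in the search request API.
A clustering request is an HTTP REST request. The full set of parameters is available through an HTTP POST request carrying a JSON body. A subset of the clustering functionality is also available through HTTP GET.
Request (HTTP POST)
The HTTP POST request should contain a JSON object with the following properties:
- search_request (required): the search request that fetches the documents to be clustered. This part follows the search DSL exactly and supports all of its features, such as sorting, filtering, the query DSL, and highlighting.
- query_hint (required): the query terms used to fetch the matching documents. The hint helps the clustering algorithm avoid trivial clusters built around the query terms. Typically it is identical to what the user typed into the search box. If possible, it should be stripped of boolean or search-engine-specific operators, which could otherwise affect the clustering process. The property is mandatory, but it may be an empty string.
- field_mapping (required): defines how the actual fields of the documents matched by search_request are mapped to the logical units of the documents to be clustered. The value is a hash whose keys are logical fields and whose values are arrays of field source specifications (the contents of the fields named by these specifications are concatenated). For example, the following is a valid mapping specification: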
{
  "url": ["_source.urlSource"],
  "title": ["fields.subject"],
  "content": ["_source.abstract", "highlight.main"],
  "language": ["fields.lang"]
}
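  - url: the URL of the document
  - title: the title of the document
  - content: the main body of the document
  - language: an optional language tag for the title and content. Language tags are two-letter ISO 639-1 codes, with the exception of Simplified Chinese (the zh_cn code). Whether a given language is supported depends on the algorithm used. The languages supported by Carrot2 algorithms are defined in the LanguageCode class.
  Field source specifications define where a value is taken from: the fields of a search hit, the content of the stored source document, or the highlighter's output. Their syntax is as follows:
  - fields.{fieldname}: a field of the search hit (a stored field, or one re-parsed from the source document, but in either case returned by the search request).
  - highlight.{fieldname}: a highlighted field of the search hit. Highlighting must also be configured appropriately in the search request (see the examples).
  - _source.{fieldname}: a field of the source document (a top-level property of the JSON document). The source document is re-parsed and the matching value is extracted.
- algorithm (optional): the clustering algorithm to use. All built-in algorithms are loaded at startup and appear in the algorithm list shown in the earlier example. If not specified, the first algorithm in that list is used.
- include_hits (optional): if set to false, the clustering response does not include the search hits and contains only the cluster labels and document references. This is useful for reducing the size of the response when only cluster labels are needed.
- max_hits (optional): if set to a non-negative number, the clustering response is limited to at most that many search hits. Clustering still runs on the full window of hits fetched by the original search. This option can reduce the size of the response when cluster labels are used as facets. Note that clusters may then reference documents that are not among the returned hits.
- attributes (optional): a map of key/value pairs that override the algorithm's default settings for this query. The defaults are typically set through XML configuration files read at initialization time.
Note
Clustering needs at least a moderate number of documents to produce sensible results. The plugin clusters only the results of the query (it does not look into the index or fetch additional documents). Make sure the size of the fetch window is at least 100. If the response does not need that many hits, use the max_hits parameter to trim the hits returned with the clustering response.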
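The constraints described above can be collected in a small request builder. The following is only a sketch under stated assumptions: `buildClusteringRequest` and its option names are hypothetical, and silently raising the window size to 100 is a design choice made for this example, not plugin behavior.

```javascript
// Hypothetical helper that assembles a clustering request body and checks
// the constraints described above. Not part of the plugin's API.
function buildClusteringRequest(options) {
  const { searchRequest, queryHint, fieldMapping, algorithm, maxHits } = options;
  if (!searchRequest) throw new Error("search_request is required");
  if (typeof queryHint !== "string") {
    throw new Error("query_hint is required (it may be an empty string)");
  }
  if (!fieldMapping || Object.keys(fieldMapping).length === 0) {
    throw new Error("field_mapping must map at least one logical field");
  }
  const request = {
    // Fetch a window of at least 100 hits so the algorithm has enough input.
    search_request: { ...searchRequest, size: Math.max(searchRequest.size || 0, 100) },
    query_hint: queryHint,
    field_mapping: fieldMapping
  };
  if (algorithm !== undefined) request.algorithm = algorithm;
  // max_hits trims the hits that are returned, not the hits that are clustered.
  if (maxHits !== undefined) request.max_hits = maxHits;
  return request;
}

const clusteringRequest = buildClusteringRequest({
  searchRequest: { query: { match: { _all: "data mining" } }, size: 10 },
  queryHint: "data mining",
  fieldMapping: { title: ["_source.title"], content: ["_source.content"] },
  maxHits: 0
});
console.log(JSON.stringify(clusteringRequest, null, 2));
```

The resulting object can be serialized with JSON.stringify and sent as the body of an HTTP POST to one of the _search_with_clusters endpoints.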
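Request (HTTP GET)
The HTTP GET clustering request accepts a superset of the URI parameters defined for Elasticsearch's URI search request. The additional parameters correspond to properties that would normally be defined in the body of a clustering POST request. HTTP GET supports the following parameters:
- field_mapping_* (required): a set of parameters, each defining the mapping of one logical field, analogous to field_mapping in the HTTP POST request. For example, field_mapping_title specifies the mapping for the logical title and field_mapping_url specifies the mapping for the logical URL. The value of each parameter is a comma-separated list of mapping specifications, as described for the POST request.
- algorithm (optional): same as algorithm in the HTTP POST request.
- query_hint (optional): same as query_hint in the HTTP POST request. For GET requests query_hint is optional. If it is not specified, the value of the q parameter is used as the default.
Important
HTTP GET requests provide only a subset of the functionality of the full HTTP POST request. For example, a field cannot be mapped to a highlighted field value, and custom algorithm attributes cannot be defined. Using HTTP POST is recommended.
The following is an example of an HTTP GET clustering request.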
var getUrl = "/test/test/_search_with_clusters?"
+ "q=data+mining&"
+ "size=100&"
+ "field_mapping_title=_source.title&"
+ "field_mapping_content=_source.content";
// Run HTTP GET via jquery and render cluster labels.
$.get(getUrl,
function(response) {
$("#cluster-httpget-result").text(
dumpClusters([], response.clusters).join("\n"));
});
Knowledge Discovery [13 documents]
Data Mining Process [12 documents]
Data Mining Applications [11 documents]
Data Mining Tools [9 documents]
Data Mining Conference [8 documents]
Data Mining Solutions [8 documents]
Data Mining Research [7 documents]
Data Mining Technology [7 documents]
Text Mining [7 documents]
Book on Data Mining [5 documents]
Predictive Modeling [5 documents]
Introduction to Data Mining [4 documents]
Machine Learning [4 documents]
Oracle Data Mining [4 documents]
Analysis Techniques [3 documents]
Association [3 documents]
Data Mining Consulting [3 documents]
Data Warehousing [3 documents]
People [3 documents]
Assist Management [2 documents]
Case Studies [2 documents]
Data Mining Institute [2 documents]
Data Mining Project [2 documents]
Data-mining Software [2 documents]
Downloads [2 documents]
Encyclopedia [2 documents]
Microsoft SQL Server [2 documents]
SIAM International Conference on Data Mining [2 documents]
Other Topics [8 documents]
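A GET URL like the one in the example above can also be assembled programmatically. The sketch below uses the standard URLSearchParams class. The helper name `buildGetUrl` and its argument layout are assumptions made for this illustration.

```javascript
// Hypothetical helper: builds a GET clustering URL from a query and a mapping.
function buildGetUrl(path, query, fieldMapping, extra) {
  const params = new URLSearchParams();
  params.append("q", query);
  for (const [key, value] of Object.entries(extra || {})) {
    params.append(key, String(value));
  }
  for (const [logicalField, specs] of Object.entries(fieldMapping)) {
    // Each logical field becomes one field_mapping_* parameter. Multiple
    // source specifications are joined with commas (URLSearchParams
    // percent-encodes the comma when serializing).
    params.append("field_mapping_" + logicalField, specs.join(","));
  }
  return path + "?" + params.toString();
}

const builtUrl = buildGetUrl(
  "/test/test/_search_with_clusters",
  "data mining",
  { title: ["_source.title"], content: ["_source.content"] },
  { size: 100 }
);
console.log(builtUrl);
```

Letting URLSearchParams do the escaping avoids hand-encoding the query string, for example the space in "data mining".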
Response
The response has essentially the same format as a regular search response, with a few additional properties.
{
/* Typical search response fields. */
"hits": { /* ... */ },
/* Clustering response fields. */
"clusters": [
/* Each cluster is defined by the following. */
{
"id": /* identifier */,
"score": /* numeric score */,
"label": /* primary cluster label */,
"other_topics": /* if present, and true, this cluster groups
unrelated documents (no related topics) */,
"phrases": [
/* cluster label array, will include primary. */
],
"documents": [
/* This cluster's document ID references.
May be undefined if this cluster holds sub-clusters only. */
],
"clusters": [
/* This cluster's subclusters (recursive objects of the same
structure). May be undefined if this cluster holds documents only. */
],
},
/* ...more clusters */
],
"info": {
/* Additional information about the clustering: execution times,
the algorithm used, etc. */
}
}
The following function recursively extracts cluster labels:
window.dumpClusters = function(arr, clusters, indent) {
  indent = indent ? indent : "";
  clusters.forEach(function(cluster) {
    arr.push(
      indent + cluster.label
      + (cluster.documents ? " [" + cluster.documents.length + " documents]" : "")
      + (cluster.clusters ? " [" + cluster.clusters.length + " subclusters]" : ""));
    if (cluster.clusters) {
      dumpClusters(arr, cluster.clusters, indent + " ");
    }
  });
  return arr;
};
With it, the following JavaScript lists all cluster labels recursively:
var request = {
"search_request": {
"query": {"match" : { "_all": "data mining" }},
"size": 100
},
"max_hits": 0,
"query_hint": "data mining",
"field_mapping": {
"title": ["_source.title"],
"content": ["_source.content"]
}
};
$.post("/test/test/_search_with_clusters",
JSON.stringify(request),
function(response) {
$("#cluster-list-result").text(
dumpClusters([], response.clusters).join("\n"));
});
Output
Knowledge Discovery [13 documents]
Data Mining Process [12 documents]
Data Mining Applications [11 documents]
Data Mining Tools [9 documents]
Data Mining Conference [8 documents]
Data Mining Solutions [8 documents]
Data Mining Research [7 documents]
Data Mining Technology [7 documents]
Text Mining [7 documents]
Book on Data Mining [5 documents]
Predictive Modeling [5 documents]
Introduction to Data Mining [4 documents]
Machine Learning [4 documents]
Oracle Data Mining [4 documents]
Analysis Techniques [3 documents]
Association [3 documents]
Data Mining Consulting [3 documents]
Data Warehousing [3 documents]
People [3 documents]
Assist Management [2 documents]
Case Studies [2 documents]
Data Mining Institute [2 documents]
Data Mining Project [2 documents]
Data-mining Software [2 documents]
Downloads [2 documents]
Encyclopedia [2 documents]
Microsoft SQL Server [2 documents]
SIAM International Conference on Data Mining [2 documents]
Other Topics [8 documents]
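The output depends on the clustering algorithm used. The following example clusters the results by the logical url field. The individual search hits are not needed here, so they are omitted from the response by setting max_hits to 0.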
var request = {
"search_request": {
"query": {"match" : { "_all": "data mining" }},
"size": 100
},
"max_hits": 0,
"query_hint": "data mining",
"field_mapping": {
"url": ["_source.url"]
},
"algorithm": "byurl"
};
$.post("/test/test/_search_with_clusters",
JSON.stringify(request), function(response) {
$("#cluster-list-result2").text(
dumpClusters([], response.clusters).join("\n"));
});
Output
com [13 subclusters]
microsoft.com [2 subclusters]
research.microsoft.com [2 documents]
Other Sites [2 documents]
yahoo.com [2 subclusters]
answers.yahoo.com [2 documents]
Other Sites [2 documents]
databases.about.com [2 documents]
datamining.typepad.com [2 documents]
dataminingconsultant.com [2 documents]
dmreview.com [2 documents]
oracle.com [2 documents]
spss.com [2 documents]
statsoft.com [2 documents]
the-data-mine.com [2 documents]
thearling.com [2 documents]
twocrows.com [2 documents]
Other Sites [32 documents]
org [3 subclusters]
en.wikipedia.org [2 documents]
siam.org [2 documents]
Other Sites [9 documents]
edu [2 subclusters]
ccsu.edu [2 documents]
Other Sites [10 documents]
ca [2 documents]
gov [2 documents]
net [2 documents]
Other Sites [2 documents]
For comparison, the following request returns the full response, including the search hits:
var request = {
"search_request": {
"fields": [ "title", "content" ],
"query": {"match" : { "_all": "data mining" }},
"size": 100
},
"query_hint": "data mining",
"field_mapping": {
"title": ["fields.title"],
"content": ["fields.content"]
}
};
$.post("/test/test/_search_with_clusters",
JSON.stringify(request),
function(response) {
$("#simple-request-result").text(
JSON.stringify(response, false, " "));
});
Output
{
"took": 12,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 93,
"max_score": 1.1545734,
"hits": [
{
"_index": "test",
"_type": "test",
"_id": "6",
"_score": 1.1545734,
"fields": {
"content": [
"... complete data mining customer ... Data mining applications, on the other hand, embed ... it, our daily lives are influenced by data mining applications. ..."
],
"title": [
"Data Mining Software, Data Mining Applications and Data Mining Solutions"
]
}
},
{
"_index": "test",
"_type": "test",
"_id": "44",
"_score": 1.1312462,
"fields": {
"content": [
"Data mining terms concisely defined. ... Accuracy is an important factor in assessing the success of data mining. ... data mining ..."
],
"title": [
"Two Crows: Data mining glossary"
]
}
},
{
"_index": "test",
"_type": "test",
"_id": "55",
"_score": 1.1312462,
"fields": {
"content": [
""
],
"title": [
"data mining institute"
]
}
},
{
"_index": "test",
"_type": "test",
"_id": "84",
"_score": 1.0323554,
"fields": {
"content": [
"... Walmart, Fundraising Data Mining, Data Mining Activities, Web-based Data Mining, ... in many industries makes us the best choice for your data mining needs. ..."
],
"title": [
"Data Mining, Data Mining Process, Data Mining Techniques, Outsourcing Mining Data Services"
]
}
},
{
"_index": "test",
"_type": "test",
"_id": "35",
"_score": 1.0323384,
"fields": {
"content": [
"... Sapphire-a semiautomated, flexible data-mining software infrastructure. ... Data mining is not a new field. ... scale, scientific data-mining efforts such ..."
],
"title": [
"Data Mining"
]
}
},
{
"_index": "test",
"_type": "test",
"_id": "18",
"_score": 0.9796879,
"fields": {
"content": [
"... high performance networking, internet computing, data mining and related areas. ... Peter Stengard, Oracle Data Mining Technologies. prudsys AG, Chemnitz, ..."
],
"title": [
"Data Mining Group - DMG"
]
}
},
{
"_index": "test",
"_type": "test",
"_id": "22",
"_score": 0.96633387,
"fields": {
"content": [
"Using data mining functionality embedded in ... Oracle Data Mining JDeveloper and SQL Developer ... Oracle Magazine: Using the Oracle Data Mining API ..."
],
"title": [
"Oracle Data Mining"
]
}
},
{
"_index": "test",
"_type": "test",
"_id": "39",
"_score": 0.96633387,
"fields": {
"content": [
"Some example application areas are listed under Applications Of Data Mining ... Crows Introduction - \"Introduction to Data Mining and Knowledge Discovery\"- http: ..."
],
"title": [
"Data Mining - Introduction To Data Mining (Misc)"
]
}
},
{
"_index": "test",
"_type": "test",
"_id": "66",
"_score": 0.96318483,
"fields": {
"content": [
"... business intelligence, data warehousing, data mining, CRM, analytics, ... M2007 Data Mining Conference Hitting 10th Year and Going Strong ..."
],
"title": [
"Data Mining"
]
}
},
{
"_index": "test",
"_type": "test",
"_id": "2",
"_score": 0.96179265,
"fields": {
"content": [
"Newsletter on the data mining and knowledge industries, offering information on data mining, knowledge discovery, text mining, and web mining software, courses, jobs, publications, and meetings."
],
"title": [
"KDnuggets: Data Mining, Web Mining, and Knowledge Discovery"
]
}
},
{
"_index": "test",
"_type": "test",
"_id": "7",
"_score": 0.96179265,
"fields": {
"content": [
"Commentary on text mining, data mining, social media and data visualization. ... Opinion Mining Startups ... in sentiment mining, deriving tuples of ..."
],
"title": [
"Data Mining: Text Mining, Visualization and Social Media"
]
}
},
{
"_index": "test",
"_type": "test",
"_id": "71",
"_score": 0.9561301,
"fields": {
"content": [
"Data Mining is the automated extraction of hidden predictive information from databases. ... The data mining tools can make this leap. ..."
],
"title": [
"Data Mining | NetworkDictionary"
]
}
},
{
"_index": "test",
"_type": "test",
"_id": "65",
"_score": 0.948279,
"fields": {
"content": [
"... Website for Data Mining Methods and ... data mining at Central Connecticut State University, he ... also provides data mining consulting and statistical ..."
],
"title": [
"DataMiningConsultant.com"
]
}
},
{
"_index": "test",
"_type": "test",
"_id": "36",
"_score": 0.9427052,
"fields": {
"content": [
"SQL Server Data Mining Portal ... information about our exciting data mining features. ... CTP of Microsoft SQL Server 2008 Data Mining Add-Ins for Office 2007 ..."
],
"title": [
"SQL Server Data Mining > Home"
]
}
},
{
"_index": "test",
"_type": "test",
"_id": "14",
"_score": 0.9127037,
"fields": {
"content": [
"From data mining tutorials to data warehousing techniques, you will find it all! ... Administration Design Development Data Mining Database Training Careers Reviews ..."
],
"title": [
"Data Mining and Data Warehousing"
]
}
},
{
"_index": "test",
"_type": "test",
"_id": "3",
"_score": 0.9124819,
"fields": {
"content": [
"Data mining is considered a subfield within the Computer Science field of knowledge discovery. ... claim to perform \"data mining\" by automating the creation ..."
],
"title": [
"Data mining - Wikipedia, the free encyclopedia"
]
}
},
{
"_index": "test",
"_type": "test",
"_id": "15",
"_score": 0.9124819,
"fields": {
"content": [
"Oracle Data Mining Product Center ... Using data mining functionality embedded in Oracle Database 10g, you can find ... Mining High-Dimensional Data for ..."
],
"title": [
"Oracle Data Mining"
]
}
}
...
]
},
"clusters": [
{
"id": 0,
"score": 71.07595656014654,
"label": "Knowledge Discovery",
"phrases": [
"Knowledge Discovery"
],
"documents": [
"39",
"2",
"3",
"5",
"61",
"25",
"17",
"34",
"4",
"9",
"43",
"62",
"74"
]
},
{
"id": 1,
"score": 66.46874714157775,
"label": "Data Mining Process",
"phrases": [
"Data Mining Process"
],
"documents": [
"84",
"13",
"63",
"67",
"86",
"34",
"77",
"83",
"8",
"54",
"4",
"87"
]
},
{
"id": 2,
"score": 71.44252901633597,
"label": "Data Mining Applications",
"phrases": [
"Data Mining Applications"
],
"documents": [
"6",
"39",
"85",
"82",
"33",
"76",
"41",
"60",
"43",
"16",
"87"
]
},
{
"id": 3,
"score": 81.34385135697781,
"label": "Data Mining Tools",
"phrases": [
"Data Mining Tools"
],
"documents": [
"71",
"23",
"32",
"56",
"86",
"52",
"77",
"74",
"79"
]
},
{
"id": 4,
"score": 49.66400793807237,
"label": "Data Mining Conference",
"phrases": [
"Data Mining Conference"
],
"documents": [
"66",
"85",
"50",
"33",
"60",
"46",
"29",
"57"
]
},
{
"id": 5,
"score": 64.44592124795795,
"label": "Data Mining Solutions",
"phrases": [
"Data Mining Solutions"
],
"documents": [
"6",
"28",
"37",
"77",
"42",
"54",
"89",
"53"
]
},
...
],
"info": {
"algorithm": "lingo",
"search-millis": "12",
"clustering-millis": "296",
"total-millis": "309",
"include-hits": "true",
"max-hits": ""
}
}
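In the response above a document can be referenced by more than one cluster (document 39, for instance, appears under both "Knowledge Discovery" and "Data Mining Applications"). When rendering hits alongside their topics it can help to invert the structure. The sketch below does that. The name `labelsByDocument` is an assumption, and the sample data is an abbreviated copy of the response above.

```javascript
// Hypothetical helper: maps each document id to the labels of all clusters
// (including nested subclusters) that reference it.
function labelsByDocument(clusters, index) {
  index = index || {};
  (clusters || []).forEach(function (cluster) {
    (cluster.documents || []).forEach(function (id) {
      (index[id] = index[id] || []).push(cluster.label);
    });
    // Subclusters share the same structure, so recurse into them.
    labelsByDocument(cluster.clusters, index);
  });
  return index;
}

// Abbreviated data taken from the response above.
const sampleClusters = [
  { label: "Knowledge Discovery", documents: ["39", "2", "3"] },
  { label: "Data Mining Applications", documents: ["6", "39"] },
  { label: "Data Mining Solutions", documents: ["6", "28"] }
];
const labelIndex = labelsByDocument(sampleClusters);
console.log(labelIndex["39"]);
```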
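Field mapping in depth
Field mapping connects the actual data to the logical data used for clustering. The different field mapping sources (_source.*, highlight.* and fields.*) can be used to tune both the amount of data returned in the response and the amount of text passed to the clustering engine (which ultimately determines the processing cost).
- _source.* mappings take data directly from the source document, provided that _source is available as part of the search hit. The content referenced by this mapping is not returned as part of the response. It is used only internally during clustering. Warning: _source may not be published by Elasticsearch's internal search infrastructure. In particular, when the request is restricted to selected fields, the source is not available. This will probably be fixed in a future release.
- fields.* mappings must be accompanied by a matching fields declaration in the search request. The content of these fields is returned with the response and can be used for display (for example, to show only the title of each document).
- highlight.* mappings must likewise be accompanied by a matching highlight declaration in the search request. The highlighting specification can be used to tune the amount of content passed to the clustering engine (number of fragments, their width, boundaries and so on). This matters most when documents are long (for example when the full content is stored): clustering algorithms typically perform much better on the context surrounding the query terms than on the full content of every document. Any highlighted content is also returned as part of the response.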
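Because fields.* and highlight.* mappings depend on matching declarations in the search request, a quick client-side check can catch mismatches before a request is sent. The following is a sketch only: `checkFieldMapping` is a hypothetical helper and the plugin itself may validate requests differently.

```javascript
// Hypothetical checker: reports mapping specs that the search request does
// not appear to provide. Follows the rules described above; not plugin code.
function checkFieldMapping(request) {
  const search = request.search_request || {};
  const declaredFields = search.fields || [];
  const highlighted = Object.keys((search.highlight || {}).fields || {});
  const problems = [];
  for (const specs of Object.values(request.field_mapping || {})) {
    for (const spec of specs) {
      const match = /^(_source|fields|highlight)\.(.+)$/.exec(spec);
      if (!match) {
        problems.push(spec + ": unknown source, expected _source., fields. or highlight.");
      } else if (match[1] === "fields" && !declaredFields.includes(match[2])) {
        problems.push(spec + ": not listed in search_request.fields");
      } else if (match[1] === "highlight" && !highlighted.includes(match[2])) {
        problems.push(spec + ": not configured in search_request.highlight");
      }
    }
  }
  return problems;
}

// The content mapping points at highlighter output that was never requested.
const mappingProblems = checkFieldMapping({
  search_request: { fields: ["url", "title"] },
  field_mapping: {
    url: ["fields.url"],
    title: ["fields.title"],
    content: ["highlight.content"]
  }
});
console.log(mappingProblems);
```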
Compare the output of the following two requests to see the difference.
Code 1
var request = {
"search_request": {
"fields": ["url", "title", "content"],
"query": {"match" : { "_all": "computer" }},
"size": 100
},
"query_hint": "computer",
"field_mapping": {
"url": ["fields.url"],
"title": ["fields.title"],
"content": ["fields.content"]
}
};
$.post("/test/test/_search_with_clusters", JSON.stringify(request), function(response) {
$("#fields-request").text(JSON.stringify(response, false, " "));
});
Code 2
var request = {
"search_request": {
"fields": ["url", "title"],
"query": {"match" : { "_all": "computer" }},
"size": 100,
"highlight" : {
"pre_tags" : ["", ""],
"post_tags" : ["", ""],
"fields" : {
"content" : { "fragment_size" : 100, "number_of_fragments" : 2 }
}
},
},
"query_hint": "computer",
"field_mapping": {
"url": ["fields.url"],
"title": ["fields.title"],
"content": ["highlight.content"]
}
};
$.post("/test/test/_search_with_clusters", JSON.stringify(request), function(response) {
$("#highlight-request").text(JSON.stringify(response, false, " "));
});
Output 1
{
"took": 17,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 3,
"max_score": 0.685061,
"hits": [
{
"_index": "test",
"_type": "test",
"_id": "62",
"_score": 0.685061,
"fields": {
"content": [
"Technical journal focused on the theory, techniques, and practice for extracting information from large databases."
],
"title": [
"Data Mining and Knowledge Discovery - Data Mining and Knowledge Discovery Journals, Books & Online Media | Springer"
],
"url": [
"http://www.springer.com/computer/database+management+&+information+retrieval/journal/10618"
]
}
},
{
"_index": "test",
"_type": "test",
"_id": "3",
"_score": 0.68239,
"fields": {
"content": [
"Data mining is considered a subfield within the Computer Science field of knowledge discovery. ... claim to perform \"data mining\" by automating the creation ..."
],
"title": [
"Data mining - Wikipedia, the free encyclopedia"
],
"url": [
"http://en.wikipedia.org/wiki/Data-mining"
]
}
},
{
"_index": "test",
"_type": "test",
"_id": "51",
"_score": 0.5480488,
"fields": {
"content": [
"This page describes the term data mining and lists other pages on the Web where you can find additional information. ... Data Mining and Analytic Technologies ..."
],
"title": [
"What is data mining? - A Word Definition From the Webopedia Computer Dictionary"
],
"url": [
"http://www.webopedia.com/TERM/D/data_mining.html"
]
}
}
]
},
"clusters": [
{
"id": 0,
"score": 0.18077730227849886,
"label": "Data Mining and Knowledge Discovery",
"phrases": [
"Data Mining and Knowledge Discovery"
],
"documents": [
"62",
"3"
]
},
{
"id": 1,
"score": 0,
"label": "Other Topics",
"phrases": [
"Other Topics"
],
"other_topics": true,
"documents": [
"51"
]
}
],
"info": {
"algorithm": "lingo",
"search-millis": "22",
"clustering-millis": "25",
"total-millis": "47",
"include-hits": "true",
"max-hits": ""
}
}
Output 2
{
"took": 305,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 3,
"max_score": 0.685061,
"hits": [
{
"_index": "test",
"_type": "test",
"_id": "62",
"_score": 0.685061,
"fields": {
"title": [
"Data Mining and Knowledge Discovery - Data Mining and Knowledge Discovery Journals, Books & Online Media | Springer"
],
"url": [
"http://www.springer.com/computer/database+management+&+information+retrieval/journal/10618"
]
}
},
{
"_index": "test",
"_type": "test",
"_id": "3",
"_score": 0.68239,
"fields": {
"title": [
"Data mining - Wikipedia, the free encyclopedia"
],
"url": [
"http://en.wikipedia.org/wiki/Data-mining"
]
},
"highlight": {
"content": [
"Data mining is considered a subfield within the Computer Science field of knowledge discovery"
]
}
},
{
"_index": "test",
"_type": "test",
"_id": "51",
"_score": 0.5480488,
"fields": {
"title": [
"What is data mining? - A Word Definition From the Webopedia Computer Dictionary"
],
"url": [
"http://www.webopedia.com/TERM/D/data_mining.html"
]
}
}
]
},
"clusters": [
{
"id": 0,
"score": 0.1807764758253202,
"label": "Data Mining and Knowledge Discovery",
"phrases": [
"Data Mining and Knowledge Discovery"
],
"documents": [
"62",
"3"
]
},
{
"id": 1,
"score": 0,
"label": "Other Topics",
"phrases": [
"Other Topics"
],
"other_topics": true,
"documents": [
"51"
]
}
],
"info": {
"algorithm": "lingo",
"search-millis": "305",
"clustering-millis": "11",
"total-millis": "317",
"include-hits": "true",
"max-hits": ""
}
}
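Choosing an algorithm
The clustering plugin includes several open source algorithms from the Carrot2 project and also supports the commercial Lingo3G clustering algorithm.
Which algorithm to choose depends on the expected amount of traffic (STC is faster than Lingo but produces lower-quality clusters, while Lingo3G is the fastest but is neither free nor open source), on the desired result (Lingo3G produces hierarchical clusters, Lingo and STC produce flat clusters), and on the input data (each algorithm clusters the input slightly differently). There is no single definitive answer.
The following examples show the effect of choosing different algorithms.
Lingo algorithm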
var request = {
"search_request": {
"query": {"match" : { "_all": "data mining" }},
"size": 100
},
"query_hint": "data mining",
"field_mapping": {
"title": ["_source.title"],
"content": ["_source.content"]
},
"algorithm": "lingo"
};
$.post("/test/test/_search_with_clusters", JSON.stringify(request), function(response) {
$("#request-algorithm1").text(dumpClusters([], response.clusters).join("\n"));
});
STC algorithm
var request = {
"search_request": {
"query": {"match" : { "_all": "data mining" }},
"size": 100
},
"query_hint": "data mining",
"field_mapping": {
"title": ["_source.title"],
"content": ["_source.content"]
},
"algorithm": "stc"
};
$.post("/test/test/_search_with_clusters", JSON.stringify(request), function(response) {
$("#request-algorithm2").text(dumpClusters([], response.clusters).join("\n"));
});
Lingo algorithm output
Knowledge Discovery [13 documents]
Data Mining Process [12 documents]
Data Mining Applications [11 documents]
Data Mining Tools [9 documents]
Data Mining Conference [8 documents]
Data Mining Solutions [8 documents]
Data Mining Research [7 documents]
Data Mining Technology [7 documents]
Text Mining [7 documents]
Book on Data Mining [5 documents]
Predictive Modeling [5 documents]
Introduction to Data Mining [4 documents]
Machine Learning [4 documents]
Oracle Data Mining [4 documents]
Analysis Techniques [3 documents]
Association [3 documents]
Data Mining Consulting [3 documents]
Data Warehousing [3 documents]
People [3 documents]
Assist Management [2 documents]
Case Studies [2 documents]
Data Mining Institute [2 documents]
Data Mining Project [2 documents]
Data-mining Software [2 documents]
Downloads [2 documents]
Encyclopedia [2 documents]
Microsoft SQL Server [2 documents]
SIAM International Conference on Data Mining [2 documents]
Other Topics [8 documents]
STC algorithm output
Knowledge Discovery [19 documents]
Data Mining Tools [9 documents]
Data Mining Solutions [7 documents]
Data Mining and Knowledge [5 documents]
Machine Learning [5 documents]
Text Mining [7 documents]
SQL, Microsoft SQL Server [4 documents]
Software [13 documents]
Process [11 documents]
Applications [10 documents]
Modeling [10 documents]
Predictive [9 documents]
Techniques [9 documents]
Databases [8 documents]
Developing [8 documents]
Other Topics [26 documents]
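Overriding algorithm attributes
The default algorithm suite contains empty stubs for the initial attributes of each algorithm. These files are named {algorithm-name}-attributes.xml and are resolved through the current resources configuration setting (see Plugin configuration).
For example, to override the default attributes of the lingo algorithm for all requests, create a {es.home}/config/lingo-attributes.xml file and place the overridden attributes in it, as follows: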
<attribute-sets default="overridden-attributes">
<attribute-set id="overridden-attributes">
<value-set>
<label>overridden-attributes</label>
<attribute key="LingoClusteringAlgorithm.desiredClusterCountBase">
<value type="java.lang.Integer" value="5"/>
</attribute>
</value-set>
</attribute-set>
</attribute-sets>
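Probably the most convenient approach is to export the configuration XML directly from Carrot2 Workbench.
Overriding algorithm attributes at runtime
Each clustering algorithm has a number of parameters that change its behavior (Carrot2 Workbench can be used to tune them). Attributes can be customized for each query request. In the example below the desired number of clusters is chosen at random (run the example several times to see different results).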
var request = {
"search_request": {
"query": {"match" : { "_all": "data mining" }},
"size": 100
},
"query_hint": "data mining",
"field_mapping": {
"title": ["_source.title"],
"content": ["_source.content"]
},
"algorithm": "lingo",
"attributes": {
"LingoClusteringAlgorithm.desiredClusterCountBase": Math.round(5 + Math.random() * 5)
}
};
$.post("/test/test/_search_with_clusters", JSON.stringify(request), function(response) {
$("#request-attributes").text(dumpClusters([], response.clusters).join("\n"));
});
Output 1
Knowledge Discovery [13 documents]
Data Mining Process [12 documents]
Data Mining Applications [11 documents]
Data Mining Technology [7 documents]
Microsoft SQL Server [2 documents]
Other Topics [54 documents]
Output 2
Knowledge Discovery [13 documents]
Data Mining Tools [9 documents]
Data Mining Conference [8 documents]
Data Mining Research [7 documents]
Text Mining [7 documents]
Predictive Modeling [5 documents]
Other Topics [50 documents]
Output 3
Knowledge Discovery [13 documents]
Data Mining Applications [11 documents]
Data Mining Techniques [9 documents]
Data Mining Tools [9 documents]
Text Mining [7 documents]
Association [3 documents]
Assist Management [2 documents]
Case Studies [2 documents]
Microsoft SQL Server [2 documents]
Other Topics [43 documents]
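For reproducible results, the random value can be replaced with a fixed one. The sketch below merges per-request attributes into an existing request without modifying it. The helper name `withAttributes` and the value 8 are assumptions chosen for illustration. The attribute key is the one used in the example above.

```javascript
// Hypothetical helper: returns a copy of a clustering request with extra
// per-request attributes merged over any that are already present.
function withAttributes(request, attributes) {
  return {
    ...request,
    attributes: { ...(request.attributes || {}), ...attributes }
  };
}

const baseRequest = {
  search_request: { query: { match: { _all: "data mining" } }, size: 100 },
  query_hint: "data mining",
  field_mapping: { title: ["_source.title"], content: ["_source.content"] },
  algorithm: "lingo"
};
// A fixed value instead of a random one, so repeated runs are comparable.
const tuned = withAttributes(baseRequest, {
  "LingoClusteringAlgorithm.desiredClusterCountBase": 8
});
console.log(JSON.stringify(tuned.attributes));
```

Keeping the base request unchanged makes it easy to send several variants with different attribute values and compare the resulting cluster lists.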
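Multilingual clustering
The field mapping specification can include a language element that gives the ISO 639-1 language code of the title and content. This information can be stored in the index based on prior knowledge (the origin of the documents, or a language detection filter applied at indexing time).
Algorithms in the Carrot2 framework accept the ISO language codes defined in the LanguageCode enum.
The language hint lets the clustering algorithm analyze the content of the documents more accurately and pick the right language resources for clustering. If your search results are multilingual (or in a language other than English), setting this element properly is strongly recommended.
The example below applies a clustering algorithm to all documents. Some documents are in German (they carry the de language code) and some are in English (the en language code). In addition, the language aggregation strategy is set to FLATTEN_NONE so that each top-level cluster represents the language of the documents in its subclusters. Note the names of the top-level clusters in the output.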
var request = {
"search_request": {
"query": {"match_all" : {}},
"size": 100
},
"query_hint": "bundestag",
"field_mapping": {
"title": ["_source.title"],
"content": ["_source.content"],
"language": ["_source.lang"]
},
"attributes": {
"MultilingualClustering.languageAggregationStrategy": "FLATTEN_NONE"
}
};
$.post("/test/test/_search_with_clusters", JSON.stringify(request), function(response) {
$("#language-fieldmapping").text(dumpClusters([], response.clusters).join("\n"));
});
Output
German [23 subclusters]
Parlament [8 documents]
Käfer im Bundestag in Berlin [5 documents]
Mitglieder [5 documents]
Seite [5 documents]
MdB [4 documents]
Bundestag Nachrichten [3 documents]
Restaurant [3 documents]
Tag [3 documents]
Abs.1 [2 documents]
Bundestagsfraktion Bündnis 90 die Grünen [2 documents]
Informationssystem für Parlamentarische Vorgänge [2 documents]
LINKE [2 documents]
Mehr [2 documents]
Nebeneinkünfte der Abgeordneten im Deutschen Bundestag [2 documents]
Petitionen Unterstützen Facebook [2 documents]
Reichstag Bundestag [2 documents]
Schule [2 documents]
Seite Lässt Dies Jedoch [2 documents]
Susanne Wiest [2 documents]
Tiergarten Telefon 030 22629933 Gerne weiter Empfehlen [2 documents]
Virtuelle Wählergedächtnis [2 documents]
Zentrale [2 documents]
Other Topics [15 documents]
English [19 subclusters]
Software [6 documents]
Data Mining Process [5 documents]
Conference [4 documents]
Data Mining Techniques [4 documents]
Knowledge Discovery [4 documents]
Web Mining [3 documents]
Analytic [2 documents]
Association [2 documents]
Business [2 documents]
Data Mining Technology [2 documents]
Data Warehousing [2 documents]
Downloads [2 documents]
Extraction of Hidden Predictive [2 documents]
Oracle Data Mining [2 documents]
Papers [2 documents]
SIAM International Conference on Data Mining [2 documents]
Visualization and Social Media [2 documents]
Website for Data Mining Methods [2 documents]
Other Topics [6 documents]
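Plugin configuration
The plugin ships with default settings that work out of the box. Changing them is recommended only when really necessary.
The following configuration files and properties can be used to modify the plugin's configuration.
{path.conf}/elasticsearch.yml, {path.conf}/elasticsearch.json, {path.conf}/elasticsearch.properties
The main ES configuration files can be used to enable or disable the plugin and to tune the resources assigned to clustering requests.
- carrot2.enable: if set to false, the plugin is disabled even if it is installed.
- threadpool.search.*: clustering requests are executed in ES's internal search thread pool. It may be worth adjusting the thread pool configuration to limit the number of concurrent clustering requests on a node (clustering is CPU-intensive). See the thread pool section of the ES documentation.
{path.conf}/carrot2.yml, {path.conf}/carrot2.json, {path.conf}/carrot2.properties
Optional files containing plugin-specific configuration.
- suite: the algorithm suite XML. The resource is looked up in path.conf and on the classpath. The default suite resource name is carrot2.suite.xml. It contains defaults for all open source algorithms and attempts to load Lingo3G.
- resources: the lookup path for loading Carrot2 lexical resources, Lingo3G's lexical resources, and algorithm descriptor files (including any initial attributes). Relative paths are resolved against ES's path.conf variable (usually the config folder). The value can also be an absolute path. Any resource not found at this location is loaded from the classpath.
- controller.pool-size: the size of the internal pool of algorithm instances. By default the pool size is adjusted automatically based on the configuration of the ES search thread pool. If this consumes too many resources, the pool can be set to a fixed size with this setting.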