COMP9318_WEEK3

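Disclaimer: I am still learning this material myself, so some of my understanding may be shallow or even wrong. Please read critically, and if you spot an error, leave a comment or contact me directly.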

This week's notes draw on parts of Jiawei Han, Micheline Kamber, and Jian Pei, DATA MINING: Concepts and Techniques, Third Edition.

This week's topics: 1) Logical Model; 2) Query Language; 3) Physical Model and Query Processing Technologies; 4) Materialized Cuboids and Computing Cuboids Efficiently

Keywords: Star Schema; Snowflake Schema; Fact Constellation; Normalization; Denormalization; SQL; MDX; ROLAP; MOLAP; Bitmap; Join Index; Arbitrary selections; Coarse-grain Aggregations; Top-down Approach; Bottom-up Approach

Question 1: What is a Logical Model, and how can it be implemented?
What is a Logical Model (Logical Data Model)?
Wikipedia's definition: A logical data model or logical schema is a data model of a specific problem domain expressed independently of a particular database management product or storage technology (physical data model) but in terms of data structures such as relational tables and columns, object-oriented classes, or XML tags.

The Logical Model of a Data Warehouse has two main implementation approaches:
1) Relational DB technology:
1.1) Star schema
1.2) Snowflake schema
1.3) Fact constellation
2) Multidimensional technology:
2.1) The multidimensional data cube

Question 2: What is a Star schema?
Star schema: The most common modeling paradigm is the star schema, in which the data warehouse contains (1) a large central table (fact table) containing the bulk of the data, with no redundancy, and (2) a set of smaller attendant tables (dimension tables), one for each dimension. The schema graph resembles a starburst, with the dimension tables displayed in a radial pattern around the central fact table.(DATA MINING P.114)


Star Schema

Here, SALES is the fact table and the others are dimension tables.

So how is this star schema derived from the universal schema?
It is obtained by partially normalizing (decomposing) the universal schema. Specifically:
1) Each dimension is represented by a dimension table
1.1) e.g. LOCATION (location_key, store, street_address, city, state, country, region)
1.2) The dimension tables themselves are not normalized
2) Transactions are described through a fact table
2.1) Each tuple consists of a pointer (foreign key) to each of the dimension tables and a list of measures (for example, units_sold and amount in the figure above)
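The two kinds of table described above can be sketched with Python's built-in sqlite3 module. This is only a minimal illustration, not the schema from the figure: just the LOCATION dimension is included, the fact table's other foreign keys are left out, and the sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension table: one row per location, deliberately not normalized.
CREATE TABLE location (
    location_key   INTEGER PRIMARY KEY,
    store          TEXT,
    street_address TEXT,
    city           TEXT,
    state          TEXT,
    country        TEXT,
    region         TEXT
);
-- Fact table: a foreign key per dimension (only one shown) plus the measures.
CREATE TABLE sales (
    location_key INTEGER REFERENCES location(location_key),
    units_sold   INTEGER,
    amount       REAL
);
""")

# Hypothetical sample data.
conn.execute(
    "INSERT INTO location VALUES "
    "(1, 'S1', '1 Bark Street', 'Kingsford', 'NSW', 'Australia', 'Asia-Pacific')"
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [(1, 3, 30.0), (1, 2, 20.0)],
)

# A typical query: join the fact table to a dimension and aggregate the measures.
rows = conn.execute("""
    SELECT l.city, SUM(s.units_sold), SUM(s.amount)
    FROM sales s JOIN location l ON s.location_key = l.location_key
    GROUP BY l.city
""").fetchall()
print(rows)
```

Because every fact row carries only keys and measures, descriptive attributes such as city live once per location in the dimension table rather than once per sale.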

What are the benefits of using a star schema?
Facts and dimensions are clearly depicted
1)dimension tables are relatively static, data is loaded (append mostly) into fact table(s)
2)easy to comprehend (and write queries)

Question 3: What is a Snowflake schema?
Snowflake schema: The snowflake schema is a variant of the star schema model, where some dimension tables are normalized, thereby further splitting the data into additional tables. The resulting schema graph forms a shape similar to a snowflake.(DATA MINING P.114)


Snowflake Schema

It follows that denormalizing a snowflake schema turns it back into a star schema.

Question 4: What is a Fact Constellation?
Fact constellation: Sophisticated applications may require multiple fact tables to share dimension tables. This kind of schema can be viewed as a collection of stars, and hence is called a galaxy schema or a fact constellation.(DATA MINING P.116)


Fact Constellation

As the figure above shows, in a fact constellation multiple fact tables share their common dimension tables, which reduces redundancy.

Terminology:
Normalization: Database normalization, or simply normalization, is the process of restructuring a relational database in accordance with a series of so-called normal forms in order to reduce data redundancy and improve data integrity. Normalization entails organizing the columns (attributes) and tables (relations) of a database to ensure that their dependencies are properly enforced by database integrity constraints. It is accomplished by applying some formal rules either by a process of synthesis (creating a new database design) or decomposition (improving an existing database design). (Wikipedia) (Tip: think of the 1NF -> 2NF -> 3NF progression in relational databases.)
Why use normalization in a relational database?
1) It saves space. For example, take Street1: Bark Street, City1: Kingsford, State1: NSW, and Street2: Harry Street, City1: Kingsford, State1: NSW. If we normalize on city, so that a city_key points to the city and state, then different streets can share the same city_key, which reduces redundancy.
2) It makes updates easier. Continuing the example above, if Bark Street is reassigned to Randwick, we only need to point its city_key at the city_key that belongs to Randwick.
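The street/city example above can be sketched in a few lines of Python. The dictionaries stand in for tables, and the key values 1 and 2 (as well as Randwick's row) are assumptions made for illustration.

```python
# Unnormalized: city and state are repeated for every street.
streets_flat = [
    {"street": "Bark Street", "city": "Kingsford", "state": "NSW"},
    {"street": "Harry Street", "city": "Kingsford", "state": "NSW"},
]

# Normalized: city and state are stored once, behind a city_key.
cities = {
    1: {"city": "Kingsford", "state": "NSW"},
    2: {"city": "Randwick", "state": "NSW"},
}
streets = {"Bark Street": 1, "Harry Street": 1}  # street -> city_key


def lookup(street):
    """Follow the city_key to recover the city and state of a street."""
    return cities[streets[street]]


# Update: Bark Street is reassigned to Randwick by changing a single key.
streets["Bark Street"] = 2

print(lookup("Bark Street"))   # now resolves through Randwick's city_key
print(lookup("Harry Street"))  # unaffected by the update
```

The price of the saving is the extra lookup (a join, in database terms) needed to read the city of a street.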

Denormalization: Denormalization is a strategy used on a previously-normalized database to increase performance. In computing, denormalization is the process of trying to improve the read performance of a database, at the expense of losing some write performance, by adding redundant copies of data or by grouping data. It is often motivated by performance or scalability in relational database software needing to carry out very large numbers of read operations. Denormalization should not be confused with Unnormalized form. Databases/tables must first be normalized to efficiently denormalize them. (Wikipedia)
For a more accessible explanation, see: https://medium.com/@katedoesdev/normalized-vs-denormalized-databases-210e1d67927d

Question 5: What query languages are there for a Data Warehouse?
1)Using relational DB technology: SQL (with extensions such as CUBE/PIVOT/UNPIVOT)
2)Using multidimensional technology: MDX


query language

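In the figure above, SQL is on the left and MDX is on the right. Of the two, MDX is simpler here because its statement is more direct and concise. Both express the same thing: Operations: Slice (Loc.Region.Europe) + Pivot (Prod.category, Measures.amnt)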
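To make the two operations concrete, here is a minimal pure-Python sketch of a slice on region = Europe followed by a pivot on product category over the amount measure. The sample rows and the choice of country as the row axis are assumptions, since the original SQL and MDX statements are only available as a figure.

```python
from collections import defaultdict

# Hypothetical, already-joined sales rows.
sales = [
    {"region": "Europe", "country": "France", "category": "Food", "amount": 10.0},
    {"region": "Europe", "country": "France", "category": "Toys", "amount": 5.0},
    {"region": "Europe", "country": "Spain", "category": "Food", "amount": 7.0},
    {"region": "Asia", "country": "Japan", "category": "Food", "amount": 9.0},
]

# Slice: fix one dimension to a single value (Loc.Region = Europe).
europe = [row for row in sales if row["region"] == "Europe"]

# Pivot: one row per country, one column per product category,
# each cell holding the summed amount.
pivot = defaultdict(lambda: defaultdict(float))
for row in europe:
    pivot[row["country"]][row["category"]] += row["amount"]

result = {country: dict(cols) for country, cols in pivot.items()}
print(result)
```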

Question 6: What is the layered architecture of a Data Warehouse?
The Data Warehouse architecture has three tiers: 1) bottom tier, Data Warehouse Server; 2) middle tier, OLAP Server; 3) top tier, Front-end Tools


Layered architecture of a Data Warehouse

We focus on the middle tier, the OLAP Server.
An OLAP Server mainly takes one of two forms:
1) ROLAP (used with relational database technology)
2) MOLAP (used with multidimensional technology)
There is also a hybrid OLAP Server, HOLAP.

Relational OLAP (ROLAP) servers: These are the intermediate servers that stand in between a relational back-end server and client front-end tools. They use a relational or extended-relational DBMS to store and manage warehouse data, and OLAP middleware to support missing pieces. ROLAP servers include optimization for each DBMS back end, implementation of aggregation navigation logic, and additional tools and services. ROLAP technology tends to have greater scalability than MOLAP technology. The DSS server of Microstrategy, for example, adopts the ROLAP approach.(DATA MINING P.135)

Multidimensional OLAP (MOLAP) servers: These servers support multidimensional views of data through array-based multidimensional storage engines. They map multidimensional views directly to data cube array structures. The advantage of using a data cube is that it allows fast indexing to precomputed summarized data. Notice that with multidimensional data stores, the storage utilization may be low if the data set is sparse. In such cases, sparse matrix compression techniques should be explored (Chapter 4). Many MOLAP servers adopt a two-level storage representation to handle dense and sparse data sets: denser subcubes are identified and stored as array structures, whereas sparse subcubes employ compression technology for efficient storage utilization.(DATA MINING P.135)

Hybrid OLAP (HOLAP) servers: The hybrid OLAP approach combines ROLAP and MOLAP technology, benefiting from the greater scalability of ROLAP and the faster computation of MOLAP. For example, a HOLAP server may allow large volumes of detail data to be stored in a relational database, while aggregations are kept in a separate MOLAP store. The Microsoft SQL Server 2000 supports a hybrid OLAP server.(DATA MINING P.136)

image.png

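The figure above, for example, uses HOLAP to store and process the data. The base cuboid can be thought of as the fact table of a star schema; it is stored and processed with ROLAP, because relational database queries are more convenient there. The cuboids below it can be stored and processed with MOLAP, which gives the data a multidimensional layout and makes indexing easier.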

Question 7: Which indexing method should we use to speed up queries for the different data situations in OLAP?
1) For selection on low-cardinality attributes, a Bitmap Index (BI) is a good choice.
BI on dimension tables
1.1) Index on an attribute (column) with few distinct values
1.2) Each distinct value, v, is associated with an n-bit vector (n = #rows)
1.2.1) The i-th bit is set if the i-th row of the table has the value v for the indexed column
1.3) Multiple BIs can be efficiently combined to enable optimized scan of the table

Bitmap Index

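Here region is the indexed column. In the Asia row, C1 and C3 have Region = Asia, so bits 1 and 3 of that bitmap are set to 1, and so on for the other rows. To add new elements C6, C7, C8, ..., Cn, we only need to append n-5 columns to the end of each bitmap. (Remember that a data warehouse is generally append-only; existing data is generally not deleted or modified.)
Advantages of bitmaps: 1) better storage utilization; 2) faster query execution, since selections become bitwise operations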
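The bitmap idea can be sketched with Python integers used as bit vectors. Only "C1 and C3 are in Asia" comes from the example above; the other region values and the second attribute (cust_type) are assumptions added to show how two bitmaps combine.

```python
def build_bitmap_index(values):
    """Map each distinct value to an int whose i-th bit is set
    when row i holds that value."""
    index = {}
    for i, value in enumerate(values):
        index[value] = index.get(value, 0) | (1 << i)
    return index


def rows_of(bitmap):
    """Return the row positions whose bit is set in the bitmap."""
    rows = []
    position = 0
    while bitmap:
        if bitmap & 1:
            rows.append(position)
        bitmap >>= 1
        position += 1
    return rows


customers = ["C1", "C2", "C3", "C4", "C5"]
region = ["Asia", "Europe", "Asia", "America", "Europe"]
cust_type = ["retail", "wholesale", "wholesale", "retail", "retail"]  # hypothetical

region_index = build_bitmap_index(region)
type_index = build_bitmap_index(cust_type)

# Selection "region = Asia AND cust_type = wholesale" is a single bitwise AND.
hits = region_index["Asia"] & type_index["wholesale"]
print([customers[i] for i in rows_of(hits)])
```

Appending a new customer only sets one more bit at the end of the matching bitmap, which fits the append-mostly nature of a warehouse.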

2) For selection on high-cardinality attributes, Join Indices are a good choice.
2.1)Join index relates the values of the dimensions of a star schema to rows in the fact table.
2.1.1)a join index on city maintains for each distinct city a list of ROW-IDs of the tuples recording the sales in the city
2.2)Join indices can span multiple dimensions OR
2.2.1)can be implemented as bitmap indexes (per dimension)
2.2.2)use bit-op for multiple-joins

Join Index

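We can record the row IDs of Kingsford's tuples in the fact table Sales under Kingsford's join index table (the yellow area), which makes Kingsford easy to look up. Likewise, we can build a join index table for year = '2017' in Time. Suppose the Kingsford join index table is [R102, R117, R118, R124] and the '2017' join index table is [R111, R117, R119, R124]. Then, to find the 2017 sales data for Kingsford, we look up both join index tables and find that R117 and R124 satisfy both conditions.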
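The lookup in this example amounts to intersecting two row-ID lists. A minimal sketch, using dictionaries as stand-ins for the two join index tables (the variable names are illustrative):

```python
# Join index: dimension value -> row IDs of matching tuples in the fact table Sales.
city_join_index = {"Kingsford": ["R102", "R117", "R118", "R124"]}
year_join_index = {"2017": ["R111", "R117", "R119", "R124"]}


def lookup_both(city, year):
    """Row IDs of sales that are in the given city and the given year."""
    in_city = set(city_join_index.get(city, []))
    in_year = set(year_join_index.get(year, []))
    return sorted(in_city & in_year)


print(lookup_both("Kingsford", "2017"))
```

Only the rows returned here need to be fetched from the fact table; no scan of Sales is required for the selection itself.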

3)Arbitrary selections on Dimensions

Arbitrary selections

Regular expressions can be used for the filtering.

(The following shows how data retrieval differs between a Relational Database and a Data Warehouse.)


Star Query

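The statement above is the query. As in a traditional relational DB, Sales is joined with the selected Time rows, then with the selected Customer rows, and then with the selected Loc rows. Note that every join here is binary, and always between the fact table and a dimension table (the dimension tables share no common foreign key and so cannot be joined with each other on a key, whereas the fact table stores all the foreign keys). Binary joins against the large fact table are time-consuming.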

Star Join

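In a Data Warehouse we instead use the star join shown above: first cross join the selected Time, Customer, and Loc rows to form their Cartesian product, and only then join with the fact table (the fact table is very large, so it is joined last). In the example given, cross joining the three selected groups yields 8 tuples, from which we pick out the 2 tuples we want; we then scan the fact table for the tuples whose three foreign keys match one of these 2 tuples. Compared with the star query, the fact table is joined only once, at the end, whereas the star query involves the fact table in all three joins. The star join is therefore more efficient.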
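A minimal sketch of the star join idea: build the Cartesian product of the selected dimension keys first, then touch the fact table once. The key values and fact rows are hypothetical, and a plain Python scan stands in for whatever join method a real engine would choose.

```python
from itertools import product

# Keys that survived the selections on each dimension table (hypothetical).
time_keys = [1, 2]
customer_keys = [10, 11]
loc_keys = [100, 101]

# Cross join of the three small selections: 2 * 2 * 2 = 8 key combinations.
combos = set(product(time_keys, customer_keys, loc_keys))

# Fact rows: (time_key, customer_key, loc_key, amount).
fact = [
    (1, 10, 100, 5.0),
    (1, 12, 100, 3.0),  # customer 12 was not selected
    (2, 11, 101, 4.0),
    (3, 10, 100, 8.0),  # time 3 was not selected
]

# A single pass over the large fact table, probing the small key set.
matches = [row for row in fact if row[:3] in combos]
print(matches)
```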

4)Coarse-grain Aggregations

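Although a Join Index helps us eliminate a large amount of irrelevant data, what remains can still be very large, and aggregating it directly would consume a lot of resources. We therefore pre-compute: we use a cuboid to support the final Group By. The statement is as follows: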
image.png

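We aggregate Time to Year, Customer to Type, and Loc to City to form a cuboid, which is convenient for later queries. Of course, which cuboids to build depends on the actual requirements.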
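The pre-computation can be sketched as a one-off aggregation to (year, customer type, city), after which coarser queries read the small cuboid instead of the detailed facts. The rows below are hypothetical and already joined with their dimensions to keep the sketch short.

```python
from collections import defaultdict

# Hypothetical detailed facts, already joined with their dimension attributes.
facts = [
    {"date": "2017-03-01", "cust_type": "retail", "city": "Kingsford", "amount": 10.0},
    {"date": "2017-07-15", "cust_type": "retail", "city": "Kingsford", "amount": 5.0},
    {"date": "2017-07-20", "cust_type": "wholesale", "city": "Randwick", "amount": 20.0},
    {"date": "2018-01-02", "cust_type": "retail", "city": "Kingsford", "amount": 7.0},
]

# Pre-compute the (year, type, city) cuboid once.
cuboid = defaultdict(float)
for fact in facts:
    year = fact["date"][:4]  # Time rolled up to Year
    cuboid[(year, fact["cust_type"], fact["city"])] += fact["amount"]

# A later "total amount per year" query is answered from the cuboid alone.
per_year = defaultdict(float)
for (year, _cust_type, _city), total in cuboid.items():
    per_year[year] += total

print(dict(cuboid))
print(dict(per_year))
```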

Question 8: How to store the materialized cuboids? How to compute the cuboids efficiently?
There are four approaches:
1)ROLAP(Store all data into one single table)


image.png

When certain cuboids are needed to answer a query, they can be retrieved with a selection statement (here (store, product), (store), (product), and () are each independent cuboids).
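One way to picture the single table is to mark aggregated-away dimensions with a placeholder value. The sketch below uses the string "ALL" for that purpose, which is an assumption (the original table is only available as a figure), and the base rows are made up.

```python
from collections import defaultdict
from itertools import combinations

ALL = "ALL"  # placeholder for a dimension that has been aggregated away

# Hypothetical base data: (store, product, units).
base = [("S1", "P1", 10), ("S1", "P2", 5), ("S2", "P1", 7)]


def cube(rows, n_dims):
    """Store every cuboid of the data in one table keyed by dimension values."""
    table = defaultdict(int)
    for row in rows:
        keys, measure = row[:n_dims], row[n_dims]
        # Each subset of kept dimensions corresponds to one cuboid.
        for size in range(n_dims + 1):
            for kept in combinations(range(n_dims), size):
                key = tuple(keys[i] if i in kept else ALL for i in range(n_dims))
                table[key] += measure
    return dict(table)


table = cube(base, 2)

# Selecting the (store) cuboid: store is concrete, product is ALL.
by_store = {k: v for k, v in table.items() if k[0] != ALL and k[1] == ALL}
print(by_store)
```

This naive version touches every cuboid for every base row; the top-down and bottom-up approaches exist precisely to avoid that cost.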

2)Top-down Approach


Top-down Approach

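Here, the top level of the lattice holds the detail of all dimensions, and each step down is a further group by. For example, if A is product and B is quarter, then Cuboid AB represents (product, quarter), Cuboid A represents (product) at the third level, and Cuboid B represents (quarter) at the third level. Cuboid A is grouped by product, so it contains tuple1 (A: 1; B: 7; C: 100) and tuple2 (A: 2; B: 4; C: 50). The group by and the computation thus proceed from the top down.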
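The top-down direction can be sketched as repeatedly rolling up an already computed parent cuboid rather than going back to the base data. The cuboid contents below are hypothetical, and roll_up is an illustrative helper, not a function from the lecture.

```python
from collections import defaultdict


def roll_up(cuboid, keep):
    """Aggregate a cuboid {key tuple: total} down to the key positions in keep."""
    out = defaultdict(int)
    for key, total in cuboid.items():
        out[tuple(key[i] for i in keep)] += total
    return dict(out)


# Hypothetical cuboid AB: (product, quarter) -> units.
ab = {("p1", "q1"): 3, ("p1", "q2"): 4, ("p2", "q1"): 5}

a = roll_up(ab, [0])    # cuboid A, computed from AB
b = roll_up(ab, [1])    # cuboid B, computed from AB
apex = roll_up(a, [])   # cuboid (), computed from the smaller cuboid A

print(a, b, apex)
```

Computing the apex from A instead of AB reflects the usual preference for the smallest available parent.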

3)Bottom-up Approach


Bottom-up Approach

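Bottom-up: start by grouping by A, B, C, and D individually, then by combinations such as AB, and so on until the ABCD aggregation is reached.
The computation method presented here is recursive; see slide 60 of the lecture slides for details.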
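As a rough illustration of a recursive bottom-up computation, the sketch below starts from the fully aggregated cell, partitions the rows on one dimension at a time, and recurses into each partition to reach the more detailed cells. It is written in the spirit of BUC-style algorithms and is an assumption about the general shape; it may differ from the method on the slide, and it omits any pruning.

```python
ALL = "ALL"  # placeholder for a dimension that is not grouped on


def bottom_up(rows, n_dims, start=0, cell=None, out=None):
    """Recursively aggregate rows, from coarse cells to detailed ones."""
    if cell is None:
        cell = (ALL,) * n_dims
    if out is None:
        out = {}
    # Aggregate the current partition into its cell.
    out[cell] = sum(row[n_dims] for row in rows)
    # Refine the cell on each remaining dimension in turn.
    for dim in range(start, n_dims):
        partitions = {}
        for row in rows:
            partitions.setdefault(row[dim], []).append(row)
        for value, part in partitions.items():
            finer = cell[:dim] + (value,) + cell[dim + 1:]
            bottom_up(part, n_dims, dim + 1, finer, out)
    return out


# Hypothetical base data: (A, B, measure).
base = [("a1", "b1", 1), ("a1", "b2", 2), ("a2", "b1", 4)]
cells = bottom_up(base, 2)
print(cells)
```

Because each recursive call only sees the rows of its own partition, the work shrinks as the cells become more detailed.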

4)MOLAP
See slide 69 of the lecture slides for details.
