Disclaimer: I am still learning this material, so some of my understanding may be shallow or even wrong. Please read critically, and if you spot an error, leave a comment or contact me directly.
This week's content follows parts of Jiawei Han, Micheline Kamber, and Jian Pei, DATA MINING: Concepts and Techniques, Third Edition.
This week's topics: 1) Logical Model; 2) Query Language; 3) Physical Model and Query Processing Technologies; 4) Materialized Cuboids and Efficient Computing Cuboids
Keywords: Star Schema; Snowflake Schema; Fact Constellation; Normalization; Denormalization; SQL; MDX; ROLAP; MOLAP; Bitmap; Join Index; Arbitrary selections; Coarse-grain Aggregations; Top-down Approach; Bottom-up Approach
Question 1: What is a Logical Model? How can it be implemented?
What is a Logical Model (Logical Data Model)?
Wikipedia's definition: A logical data model or logical schema is a data model of a specific problem domain expressed independently of a particular database management product or storage technology (physical data model) but in terms of data structures such as relational tables and columns, object-oriented classes, or XML tags.
The Logical Model of a Data Warehouse has two main implementation approaches:
1)relational DB technology:
1.1)Star schema,
1.2)Snowflake schema,
1.3)Fact constellation
2)multidimensional technology:
2.1)Multidimensional data cube
Question 2: What is a Star schema?
Star schema: The most common modeling paradigm is the star schema, in which the data warehouse contains (1) a large central table (fact table) containing the bulk of the data, with no redundancy, and (2) a set of smaller attendant tables (dimension tables), one for each dimension. The schema graph resembles a starburst, with the dimension tables displayed in a radial pattern around the central fact table.(DATA MINING P.114)
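Here, SALES is the fact table and the remaining tables are dimension tables.
So how is this star schema derived from the universal schema?
The star schema is obtained from the universal schema by normalization that separates the facts from the dimensions. Specifically: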
1)Each dimension is represented by a dimension-table
1.1)LOCATION (location_key, store, street_address, city, state, country, region)
1.2)dimension tables are not normalized
2)Transactions are described through a fact-table
2.1)each tuple consists of a pointer to each of the dimension-tables (foreign-key) and a list of measures (for example, units_sold and amount in the figure above)
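The structure described above can be sketched with Python's built-in sqlite3 module. This is only a minimal illustration: the LOCATION columns, units_sold, and amount come from the notes, while the `time_dim` table, its columns, and every data row are assumptions made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension table: one row per location, not normalized.
CREATE TABLE location (
    location_key INTEGER PRIMARY KEY,
    store TEXT, street_address TEXT, city TEXT,
    state TEXT, country TEXT, region TEXT
);
-- Hypothetical second dimension, invented for this sketch.
CREATE TABLE time_dim (
    time_key INTEGER PRIMARY KEY, day INTEGER, month INTEGER, year INTEGER
);
-- Fact table: one foreign key per dimension plus the measures.
CREATE TABLE sales (
    time_key INTEGER REFERENCES time_dim(time_key),
    location_key INTEGER REFERENCES location(location_key),
    units_sold INTEGER,
    amount REAL
);
""")

# Made-up sample rows.
conn.executemany(
    "INSERT INTO location VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        (1, "S1", "1 Bark Street", "Kingsford", "NSW", "Australia", "Oceania"),
        (2, "S2", "9 High Street", "Randwick", "NSW", "Australia", "Oceania"),
    ],
)
conn.executemany(
    "INSERT INTO time_dim VALUES (?, ?, ?, ?)",
    [(1, 5, 3, 2017), (2, 6, 3, 2018)],
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [(1, 1, 3, 30.0), (1, 2, 1, 10.0), (2, 1, 2, 20.0)],
)

# A typical star-schema query: join the fact table to one dimension
# and aggregate the measures per city.
rows = conn.execute("""
    SELECT l.city, SUM(s.units_sold), SUM(s.amount)
    FROM sales AS s
    JOIN location AS l ON l.location_key = s.location_key
    GROUP BY l.city
    ORDER BY l.city
""").fetchall()
print(rows)
```

Every query follows the same pattern: start from the fact table, follow foreign keys out to whichever dimensions are needed, and aggregate the measures.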
What are the benefits of using a Star schema?
Facts and dimensions are clearly depicted
1)dimension tables are relatively static, data is loaded (append mostly) into fact table(s)
2)easy to comprehend (and write queries)
Question 3: What is a Snowflake schema?
Snowflake schema: The snowflake schema is a variant of the star schema model, where some dimension tables are normalized, thereby further splitting the data into additional tables. The resulting schema graph forms a shape similar to a snowflake.(DATA MINING P.114)
It follows that a Snowflake schema can be turned back into a Star schema simply by denormalizing its dimension tables.
Question 4: What is a Fact Constellation?
Fact constellation: Sophisticated applications may require multiple fact tables to share dimension tables. This kind of schema can be viewed as a collection of stars, and hence is called a galaxy schema or a fact constellation.(DATA MINING P.116)
From the figure above, a Fact Constellation lets multiple fact tables share the dimension tables they have in common, which reduces redundancy.
Glossary:
Normalization: Database normalization, or simply normalization, is the process of restructuring a relational database in accordance with a series of so-called normal forms in order to reduce data redundancy and improve data integrity. Normalization entails organizing the columns (attributes) and tables (relations) of a database to ensure that their dependencies are properly enforced by database integrity constraints. It is accomplished by applying some formal rules either by a process of synthesis (creating a new database design) or decomposition (improving an existing database design). (Wikipedia) (Tip: think of the 1NF -> 2NF -> 3NF conversions in a relational database.)
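Why use normalization in a relational database?
1)It saves space. For example, Street1: Bark Street; City1: Kingsford; State1: NSW. Street2: Harry Street; City1: Kingsford; State1: NSW. If we normalize City, so that a city_key points to the city and state, the different streets can share this one city_key, which reduces redundancy.
2)It makes updates easier. In the same example, if Bark Street is now reassigned to Randwick, we only need to repoint its city_key to the city_key that belongs to Randwick.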
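The street and city example can be sketched in a few lines of Python. The dictionaries below stand in for two relational tables; the numeric key values and the `address` helper are hypothetical and exist only for illustration.

```python
# "City table": each city_key points to a (city, state) pair stored once.
cities = {
    1: {"city": "Kingsford", "state": "NSW"},
    2: {"city": "Randwick", "state": "NSW"},
}

# "Street table": each street stores only a city_key, not the city itself.
streets = {"Bark Street": 1, "Harry Street": 1}


def address(street):
    """Rebuild the full address by following the city_key (a join)."""
    city = cities[streets[street]]
    return (street, city["city"], city["state"])


before = address("Bark Street")

# Reassigning Bark Street to Randwick changes one key and nothing else.
streets["Bark Street"] = 2
after = address("Bark Street")

print(before, after)
```

The cost of this layout is the extra lookup inside `address`, which is exactly what denormalization trades away for faster reads.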
Denormalization: Denormalization is a strategy used on a previously-normalized database to increase performance. In computing, denormalization is the process of trying to improve the read performance of a database, at the expense of losing some write performance, by adding redundant copies of data or by grouping data. It is often motivated by performance or scalability in relational database software needing to carry out very large numbers of read operations. Denormalization should not be confused with Unnormalized form. Databases/tables must first be normalized to efficiently denormalize them. (Wikipedia)
For a more accessible explanation, see: https://medium.com/@katedoesdev/normalized-vs-denormalized-databases-210e1d67927d
Question 5: What query languages are available for a Data Warehouse?
1)Using relational DB technology: SQL (with extensions such as CUBE/PIVOT/UNPIVOT)
2)Using multidimensional technology: MDX
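In the figure above, SQL is on the left and MDX is on the right. MDX is comparatively simpler because its statements are more direct and concise. Both express the same operations: Slice (Loc.Region.Europe) + Pivot (Prod.category, Measures.amnt)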
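To make the two operations concrete, here is a rough Python sketch of what the slice and pivot compute. It is neither SQL nor MDX, and the sample rows and the `slice_and_pivot` helper are assumptions invented for the example; only the names region, category, and amnt come from the operations listed above.

```python
from collections import defaultdict

# Made-up sales records with one location attribute, one product
# attribute, and the amnt measure.
sales = [
    {"region": "Europe", "category": "Food", "amnt": 10.0},
    {"region": "Europe", "category": "Toys", "amnt": 5.0},
    {"region": "Asia", "category": "Food", "amnt": 7.0},
    {"region": "Europe", "category": "Food", "amnt": 2.5},
]


def slice_and_pivot(rows, region):
    """Slice on one region, then total amnt per product category."""
    totals = defaultdict(float)
    for row in rows:
        if row["region"] == region:                 # slice
            totals[row["category"]] += row["amnt"]  # pivot and aggregate
    return dict(totals)


europe = slice_and_pivot(sales, "Europe")
print(europe)
```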
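Question 6: What are the tiers of the Data Warehouse architecture?
The Data Warehouse architecture has 3 tiers: 1) bottom tier, Data Warehouse Server; 2) middle tier, OLAP Server; 3) top tier, Front-end Tools
We focus on the middle tier: the OLAP Server.
An OLAP Server mainly takes one of two approaches: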
1)ROLAP(which is used in relational database technology)
2)MOLAP(which is used in multidimensional technology)
There is also a hybrid OLAP Server, HOLAP.
Relational OLAP (ROLAP) servers: These are the intermediate servers that stand in between a relational back-end server and client front-end tools. They use a relational or extended-relational DBMS to store and manage warehouse data, and OLAP middleware to support missing pieces. ROLAP servers include optimization for each DBMS back end, implementation of aggregation navigation logic, and additional tools and services. ROLAP technology tends to have greater scalability than MOLAP technology. The DSS server of Microstrategy, for example, adopts the ROLAP approach.(DATA MINING P.135)
Multidimensional OLAP (MOLAP) servers: These servers support multidimensional views of data through array-based multidimensional storage engines. They map multidimensional views directly to data cube array structures. The advantage of using a data cube is that it allows fast indexing to precomputed summarized data. Notice that with multidimensional data stores, the storage utilization may be low if the data set is sparse. In such cases, sparse matrix compression techniques should be explored (Chapter 4). Many MOLAP servers adopt a two-level storage representation to handle dense and sparse data sets: denser subcubes are identified and stored as array structures, whereas sparse subcubes employ compression technology for efficient storage utilization.(DATA MINING P.135)
Hybrid OLAP (HOLAP) servers: The hybrid OLAP approach combines ROLAP and MOLAP technology, benefiting from the greater scalability of ROLAP and the faster computation of MOLAP. For example, a HOLAP server may allow large volumes of detail data to be stored in a relational database, while aggregations are kept in a separate MOLAP store. The Microsoft SQL Server 2000 supports a hybrid OLAP server.(DATA MINING P.136)
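For example, the figure above uses HOLAP for storage and processing. The base cuboid can be thought of as the fact table of a star schema; it is stored and processed with ROLAP because querying a relational database is more convenient. The cuboids below it can be stored and processed with MOLAP, whose multidimensional array storage makes them easier to index.
Question 7: Which indexing method should we use for each kind of data in OLAP to make queries easier?
1)Selection on low-cardinality attributes: a Bitmap Index (BI) is a good choice.
BI on dimension tables
1.1) Index on an attribute (column) with few distinct values
1.2) Each distinct value, v, is associated with an n-bit vector (n = #rows)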
1.2.1) The i-th bit is set if the i-th row of the table has the value v for the indexed column
1.3) Multiple BIs can be efficiently combined to enable optimized scan of the table
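Here the values of region are the distinct values. In the Asia row, C1 and C3 have Region = Asia, so bits 1 and 3 of that bitmap are set to 1, and likewise for the other regions. If new elements C6, C7, C8, ..., Cn are added, we only need to append n-5 columns to the end of each bitmap. (Remember that a data warehouse is generally append-only: data is usually not deleted or modified.)
Advantages of bitmaps: 1) compact storage when the attribute has few distinct values; 2) faster selection, because predicates become bitwise operations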
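A bitmap index of this kind can be sketched with plain Python integers used as bit vectors. The notes only state that C1 and C3 are in Asia; the other region values, the `ratings` column, and both helper functions are assumptions made up for the example.

```python
def build_bitmap_index(values):
    """Map each distinct value to an integer whose i-th bit is set
    when row i (0-based) holds that value."""
    index = {}
    for i, value in enumerate(values):
        index[value] = index.get(value, 0) | (1 << i)
    return index


def rows_of(bitmap):
    """Return the 1-based row numbers whose bits are set."""
    return [i + 1 for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


# Rows C1..C5. Only "C1 and C3 are Asia" comes from the notes.
regions = ["Asia", "Europe", "Asia", "America", "Europe"]
ratings = ["H", "M", "L", "H", "L"]  # hypothetical second column

region_index = build_bitmap_index(regions)
rating_index = build_bitmap_index(ratings)

asia_rows = rows_of(region_index["Asia"])

# Combine several bitmaps with bitwise operations:
# (region = Asia OR region = Europe) AND rating = L
combined = (region_index["Asia"] | region_index["Europe"]) & rating_index["L"]
combined_rows = rows_of(combined)

print(asia_rows, combined_rows)
```

Appending a row only sets one more high-order bit in one bitmap per indexed column, which matches the append-mostly behavior of a warehouse.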
2)Selection on high-cardinality attributes: Join Indices are a good choice.
2.1)Join index relates the values of the dimensions of a star schema to rows in the fact table.
2.1.1)a join index on city maintains for each distinct city a list of ROW-IDs of the tuples recording the sales in the city
2.2)Join indices can span multiple dimensions OR
2.2.1)can be implemented as bitmap indexes (per dimension)
2.2.2)use bit-op for multiple-joins
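We can record the Row IDs of Kingsford's tuples in the fact table Sales under Kingsford's join index table (the yellow area), which makes Kingsford easy to look up. In the same way, we can build a join index table for year = '2017' in Time. Suppose the Kingsford join index table is [R102, R117, R118, R124] and the '2017' join index table is [R111, R117, R119, R124]. To find the 2017 sales data for Kingsford, we look up both join index tables and find that R117 and R124 satisfy the condition.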
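The Kingsford and 2017 lookup can be sketched as an intersection of two sorted Row-ID lists. The Row IDs are the ones given above; the dictionary layout and the `intersect_sorted` helper are assumptions for illustration, not a description of how any particular DBMS stores its join indices.

```python
# Join index: dimension value -> sorted Row IDs of matching fact tuples.
city_join_index = {"Kingsford": ["R102", "R117", "R118", "R124"]}
year_join_index = {"2017": ["R111", "R117", "R119", "R124"]}


def intersect_sorted(left, right):
    """Merge-style intersection of two sorted Row-ID lists."""
    result = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] == right[j]:
            result.append(left[i])
            i += 1
            j += 1
        elif left[i] < right[j]:
            i += 1
        else:
            j += 1
    return result


matches = intersect_sorted(
    city_join_index["Kingsford"], year_join_index["2017"]
)
print(matches)
```

Only the fact rows named in `matches` need to be fetched, so the fact table itself is never scanned for this query.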
3)Arbitrary selections on Dimensions
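Arbitrary predicates, for example regular-expression matches, can be used to filter the dimension tables.
(The following compares data retrieval in a Relational Database with data retrieval in a Data Warehouse.)
The statement above is the query. As in a traditional relational DB, Sales is joined with the selected Time tuples, then with the selected Customer tuples, and then with the selected Loc tuples. Note that each join here can only be a binary join, and only between the fact table and a dimension table: the dimension tables share no common foreign key, so they cannot be joined to each other, whereas the fact table stores all the foreign keys. (Binary joins are relatively time-consuming.)
In a Data Warehouse we use the star join shown above: first Cross Join the selected Time, Customer, and Loc tuples to form their Cartesian product, then join the result with the fact table (the fact table is so large that it is joined last). In the example, cross joining the 3 selected sets yields 8 tuples, from which we pick out the 2 tuples we want; we then scan the fact table for tuples whose 3 foreign keys match one of those 2 tuples. Compared with the star query, the star join touches the fact table only once, at the end, whereas the star query involves the fact table in all 3 joins, so the star join is more efficient.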
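The difference between the two strategies can be sketched in Python. All key values and fact rows are made up, and both functions are simplified in-memory stand-ins for what a query engine would do; the sketch skips the step of narrowing the 8 combinations down further and simply probes the fact table with all of them.

```python
from itertools import product

# Keys that survived the selections on each dimension (made-up values).
time_keys = ["t1", "t2"]
customer_keys = ["c1", "c2"]
loc_keys = ["l1", "l2"]

# Fact rows: (time_key, customer_key, loc_key, amount).
fact = [
    ("t1", "c1", "l1", 100),
    ("t1", "c3", "l1", 40),
    ("t2", "c2", "l2", 70),
    ("t3", "c1", "l1", 10),
    ("t2", "c1", "l3", 5),
]


def star_join(fact_rows, *dim_key_lists):
    """Cross join the small dimension selections first, then make a
    single pass over the fact table."""
    combos = set(product(*dim_key_lists))
    width = len(dim_key_lists)
    return [row for row in fact_rows if row[:width] in combos]


def pairwise_join(fact_rows, *dim_key_lists):
    """Join the fact table with one dimension at a time; every step
    carries fact-sized intermediate results."""
    result = fact_rows
    for position, keys in enumerate(dim_key_lists):
        allowed = set(keys)
        result = [row for row in result if row[position] in allowed]
    return result


combos = set(product(time_keys, customer_keys, loc_keys))
star_result = star_join(fact, time_keys, customer_keys, loc_keys)
pairwise_result = pairwise_join(fact, time_keys, customer_keys, loc_keys)
print(len(combos), star_result)
```

Both functions return the same rows. The star join pays off when the cross product of the dimension selections is small compared with the fact table.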
4)Coarse-grain Aggregations
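We aggregate Time to Year, Customer to Type, and Loc to City to form a cuboid, which makes coarse-grained queries convenient. Which cuboids to build depends on the actual requirements.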
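Such a roll-up can be sketched as follows. The date format, the two lookup dictionaries, and the sample rows are assumptions made up for the example; only the three target levels (Year, Type, City) come from the notes.

```python
from collections import defaultdict

# Hypothetical dimension hierarchies: detailed key -> coarser level.
customer_type = {"c1": "retail", "c2": "wholesale", "c3": "retail"}
store_city = {"s1": "Kingsford", "s2": "Randwick"}

# Detailed fact rows: (date, customer, store, amount).
fact = [
    ("2017-03-05", "c1", "s1", 10),
    ("2017-07-19", "c3", "s1", 5),
    ("2017-07-19", "c2", "s2", 20),
    ("2018-01-02", "c1", "s1", 7),
]


def roll_up(rows):
    """Aggregate amount to the (year, customer type, city) level."""
    cuboid = defaultdict(int)
    for date, customer, store, amount in rows:
        key = (date[:4], customer_type[customer], store_city[store])
        cuboid[key] += amount
    return dict(cuboid)


cuboid = roll_up(fact)
print(cuboid)
```

Once this cuboid is materialized, a query at the (Year, Type, City) level reads 3 rows here instead of scanning the 4 detailed rows, and the gap grows with the size of the fact table.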
問題八,How to store the materialized cuboids? How to compute the cuboids efficiently?
There are 4 approaches:
1)ROLAP(Store all data into one single table)
When certain cuboids are needed to answer a query, they can be retrieved with a selection statement (here (store, product), (store), (product), and () are each independent cuboids).
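One way to picture the single-table layout is to mark every aggregated-away dimension with a special ALL value, so that all four cuboids live in one table and a selection picks out the one that is needed. The sketch below assumes that convention; the ALL marker, the helper names, and the sample rows are illustrative only.

```python
from collections import defaultdict
from itertools import combinations

ALL = "ALL"  # marker for a dimension that has been aggregated away


def cube_table(rows, n_dims):
    """Store every cuboid of the rows in one table keyed by dimension
    values, with ALL standing in for aggregated dimensions."""
    table = defaultdict(int)
    for row in rows:
        dims, measure = row[:n_dims], row[n_dims]
        for size in range(n_dims + 1):
            for kept in combinations(range(n_dims), size):
                key = tuple(
                    dims[i] if i in kept else ALL for i in range(n_dims)
                )
                table[key] += measure
    return dict(table)


def select_cuboid(table, kept):
    """Selection: keep the rows whose non-ALL positions are exactly kept."""
    return {
        key: value
        for key, value in table.items()
        if all((v != ALL) == (i in kept) for i, v in enumerate(key))
    }


# Made-up (store, product, amount) rows.
sales = [("s1", "p1", 10), ("s1", "p2", 5), ("s2", "p1", 20)]
table = cube_table(sales, 2)

store_cuboid = select_cuboid(table, {0})  # the (store) cuboid
apex_cuboid = select_cuboid(table, set())  # the () cuboid
print(store_cuboid, apex_cuboid)
```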
2)Top-down Approach
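Here, the top level of the lattice holds the detail of all dimensions, and each step down is one more group by. For example, if A is product and B is quarter, then Cuboid AB denotes (product, quarter), Cuboid A denotes (product) on the third level, and Cuboid B denotes (quarter) on the third level. Cuboid A is grouped by product, hence tuple1 (A: 1; B: 7; C: 100) and tuple2 (A: 2; B: 4; C: 50). Grouping and computation therefore proceed from the top down.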
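The key saving of the top-down direction is that each cuboid can be computed from an already computed parent instead of from the raw data. A minimal sketch, assuming a SUM measure, with made-up cell values and a hypothetical `aggregate` helper:

```python
from collections import defaultdict


def aggregate(parent, parent_dims, child_dims):
    """Compute a child cuboid by grouping a parent cuboid on a subset
    of its dimensions and summing the measure."""
    positions = [parent_dims.index(d) for d in child_dims]
    child = defaultdict(int)
    for key, measure in parent.items():
        child[tuple(key[p] for p in positions)] += measure
    return dict(child)


# Base cuboid ABC: (a, b, c) -> measure. Values are made up.
abc = {
    (1, 7, 100): 3,
    (1, 7, 200): 2,
    (2, 4, 50): 4,
    (1, 8, 100): 1,
}

# Walk down the lattice, reusing each result for the next level.
ab = aggregate(abc, "ABC", "AB")
a = aggregate(ab, "AB", "A")
apex = aggregate(a, "A", "")
print(ab, a, apex)
```

Computing A from AB rather than from ABC reads fewer cells, which is why the choice of parent matters in practice.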
3)Bottom-up Approach
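Bottom-up: start by grouping on A, B, C, and D individually, then on AB and so on, until the full ABCD aggregation is reached.
The computation method introduced here is recursive; see slide 60 of the PPT for details.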
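Since the slides are not reproduced here, the following is only an assumed, simplified recursive sketch of the bottom-up idea: aggregate the current partition, then partition it on each remaining dimension and recurse, so that single-dimension groups are produced before their multi-dimension refinements. It applies no pruning, and the function name and sample rows are hypothetical.

```python
from collections import defaultdict


def bottom_up_cube(rows, n_dims, start=0, prefix=(), out=None):
    """Recursively compute SUM for every group-by cell.

    prefix is a tuple of (dimension index, value) pairs that
    identifies the current cell."""
    if out is None:
        out = {}
    # Aggregate the current partition.
    out[prefix] = sum(row[n_dims] for row in rows)
    # Refine the partition on each remaining dimension.
    for dim in range(start, n_dims):
        partitions = defaultdict(list)
        for row in rows:
            partitions[row[dim]].append(row)
        for value, part in partitions.items():
            bottom_up_cube(
                part, n_dims, dim + 1, prefix + ((dim, value),), out
            )
    return out


# Made-up (A, B, measure) rows.
rows = [("a1", "b1", 10), ("a1", "b2", 5), ("a2", "b1", 20)]
cells = bottom_up_cube(rows, 2)
print(cells)
```

Because each recursive call only sees the rows of its own partition, a real implementation can stop refining a partition as soon as it becomes too small to be of interest.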
4)MOLAP
See slide 69 of the PPT for details.