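Preface
In the previous section, we showed how to draw Venn diagrams to display the overlaps between sets.
As the number of sets grows, however, a Venn diagram becomes increasingly complicated, and the information in it is hard to read at a glance.
Today we look at how to plot the overlaps when there are many sets.
We will use the UpSetR package to draw an UpSet plot.
The plot consists of three panels:
- a bar chart of intersection sizes (top)
- a bar chart of set sizes (bottom left)
- a matrix showing which sets take part in each intersection (bottom right). Each column of the matrix is one intersection and lines up with the x-axis of the top bar chart; each row is one set and lines up with the y-axis of the set-size bar chart
A single plot of this kind shows the overlaps among many sets, and the intersections are easy to read from it.
So how do we draw one?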
Basics
1. Installing and loading
install.packages("UpSetR")
library(UpSetR)
We use the example data that ships with the package
movies <- read.csv(system.file("extdata", "movies.csv", package = "UpSetR"),
header = T, sep = ";")
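2. Data
Before plotting, we need to know the format of the input data. upset() takes a data frame in which each set is a column of 0/1 values marking membership, as in movies.
UpSetR provides two conversion functions, fromList and fromExpression, to bring other inputs into this format.
- fromList takes a list (each element is one set) and converts it to a data frame, for example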
listInput <- list(
one = c(1, 2, 3, 5, 7, 8, 11, 12, 13),
two = c(1, 2, 4, 5, 10),
three = c(1, 5, 6, 7, 8, 9, 10, 12, 13))
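- fromExpression takes a named vector. Each name is a set or a combination of sets joined with &, and each value is the number of elements that belong to exactly that combination and to no other set, for example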
expressionInput <- c(
one = 2, two = 1, three = 2,
`one&two` = 1, `one&three` = 4,
`two&three` = 1, `one&two&three` = 2)
Both inputs describe the same data and can be plotted as follows
upset(fromList(listInput), order.by = "freq")
# upset(fromExpression(expressionInput), order.by = "freq")
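To make the relationship between the two input formats concrete, the following base-R sketch rebuilds the counts in expressionInput from listInput. It is an illustration only, not how fromList or fromExpression are implemented; the names elements, membership, combo and counts are introduced here for the example.

```r
listInput <- list(
  one = c(1, 2, 3, 5, 7, 8, 11, 12, 13),
  two = c(1, 2, 4, 5, 10),
  three = c(1, 5, 6, 7, 8, 9, 10, 12, 13))

# All distinct elements that occur in any set
elements <- sort(unique(unlist(listInput)))

# 0/1 membership matrix: one row per element, one column per set
membership <- sapply(listInput, function(s) as.integer(elements %in% s))
rownames(membership) <- elements

# Label every element with the exact combination of sets it belongs to
combo <- apply(membership, 1, function(r) {
  paste(names(listInput)[r == 1], collapse = "&")
})

# Number of elements per exclusive combination
counts <- table(combo)
print(counts)
```

The membership matrix has one 0/1 column per set, which is the layout upset() works with. The counts agree with expressionInput: for example, four elements (7, 8, 12 and 13) belong to one and three but not to two.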
3. Plotting a subset of the sets
Here we set nsets = 6 to restrict the plot to the six largest sets
upset(movies, nsets = 6,
number.angles = 30,
point.size = 3.5,
line.size = 2,
mainbar.y.label = "Genre Intersections",
sets.x.label = "Movies Per Genre",
text.scale = c(1.3, 1.3, 1, 1, 2, 0.75))
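Additional arguments adjust the appearance of the plot. number.angles sets the angle of the numbers above the bars; point.size and line.size set the size of the points and lines in the matrix; mainbar.y.label and sets.x.label set the axis labels of the intersection-size and set-size bar charts; text.scale takes six values that scale the text labels in the plot.
The values of text.scale are given in this order:
- intersection-size bar chart: axis title, then tick labels (two values)
- set-size bar chart: axis title, then tick labels (two values)
- set names
- numbers above the bars showing the intersection sizes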
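Because the six positions of text.scale are easy to mix up, one option is to build the vector with descriptive names and strip them before the call. This is only a sketch: the element names below are labels made up for readability, not names that UpSetR defines, and the call is guarded so that it runs only when UpSetR is installed.

```r
# Hypothetical labels for the six text.scale positions, in order
text_scale <- c(
  intersection_title = 1.3,   # title of the intersection-size axis
  intersection_ticks = 1.3,   # tick labels of the intersection-size axis
  set_size_title     = 1,     # title of the set-size axis
  set_size_ticks     = 1,     # tick labels of the set-size axis
  set_names          = 2,     # set names next to the matrix
  bar_numbers        = 0.75   # numbers above the bars
)

if (requireNamespace("UpSetR", quietly = TRUE)) {
  library(UpSetR)
  movies <- read.csv(system.file("extdata", "movies.csv", package = "UpSetR"),
                     header = TRUE, sep = ";")
  # unname() passes the plain numeric vector used in the call above
  upset(movies, nsets = 6, text.scale = unname(text_scale))
}
```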
We can also specify exactly which sets to show
upset(movies,
sets = c("Action", "Comedy", "Drama",
"Mystery", "Thriller", "Romance", "War"),
mb.ratio = c(0.55, 0.45)
)
mb.ratio controls the relative heights of the upper bar chart and the lower matrix
4. Sorting
The order.by argument sorts the intersections.
upset(movies,
sets = c("Action", "Comedy", "Drama",
"Mystery", "Thriller", "Romance", "War"),
mb.ratio = c(0.55, 0.45),
order.by = "freq",
decreasing = TRUE
)
With order.by = "freq", decreasing = TRUE sorts the intersections from largest to smallest
upset(movies,
sets = c("Action", "Comedy", "Drama",
"Mystery", "Thriller", "Romance", "War"),
mb.ratio = c(0.55, 0.45),
order.by = "degree",
decreasing = FALSE
)
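With order.by = "degree", decreasing = FALSE sorts the intersections from lowest to highest degree, where the degree of an intersection is the number of sets it involves
Both keys can also be given together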
upset(movies,
sets = c("Action", "Comedy", "Drama",
"Mystery", "Thriller", "Romance", "War"),
mb.ratio = c(0.55, 0.45),
order.by = c("degree", "freq"),
decreasing = c(TRUE, FALSE)
)
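To illustrate what sorting by "freq" and by "degree" means, the sketch below orders a small table of intersections in base R. It uses the counts from the earlier expressionInput example and is not UpSetR's internal sorting code; the data frame freqs and its columns are names chosen for this example.

```r
# Intersections and their sizes, taken from expressionInput
freqs <- data.frame(
  combo = c("one", "two", "three", "one&two",
            "one&three", "two&three", "one&two&three"),
  freq = c(2, 1, 2, 1, 4, 1, 2),
  stringsAsFactors = FALSE
)

# Degree = number of sets that take part in the intersection
freqs$degree <- lengths(strsplit(freqs$combo, "&", fixed = TRUE))

# Sort by size only, largest first
by_freq <- freqs[order(freqs$freq, decreasing = TRUE), ]

# Sort by degree (descending), then by size (ascending) within each degree
by_degree_then_freq <- freqs[order(-freqs$degree, freqs$freq), ]

print(by_freq)
print(by_degree_then_freq)
```

In the second ordering the three-set intersection comes first, followed by the two-set intersections from smallest to largest, and then the single sets.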
To keep the sets in the order in which they are listed in the sets argument, set keep.order = TRUE
upset(movies,
sets = c("Action", "Comedy", "Drama",
"Mystery", "Thriller", "Romance", "War"),
mb.ratio = c(0.55, 0.45),
order.by = c("degree", "freq"),
decreasing = c(TRUE, FALSE),
keep.order = TRUE
)
To also show the combinations whose intersection is empty, set the empty.intersections argument
upset(movies,
sets = c("Action", "Comedy", "Drama",
"Mystery", "Thriller", "Romance", "War"),
empty.intersections = "on"
)
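Queries
Queries are run through the queries argument, which takes a nested list describing one or more queries. Each query has four fields:
- query: the query function to run
- params: the list of parameters passed to the query
- color: the color used in the plot for the elements that satisfy the query
- active: if TRUE, the matching bar is overlaid with the query color; if FALSE, jittered points are placed on the bar instead
For example
1. Built-in intersection query
The built-in intersects query finds a specific intersection and colors it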
upset(movies, queries = list(
list(
query = intersects,
params = list("Drama", "Comedy", "Action"),
color = "orange",
active = T),
list(
query = intersects,
params = list("Drama"),
color = "red",
active = F),
list(
query = intersects,
params = list("Action", "Drama"),
active = T)
)
)
2. Built-in element query
We use elements to run an element query, which shows how the selected elements are distributed across the intersections
upset(movies,
queries = list(
list(
query = elements,
params = list("AvgRating", 3.5, 4.1),
color = "blue",
active = T),
list(
query = elements,
params = list("ReleaseDate", 1980, 1990, 2000),
color = "red",
active = F)
)
)
3. Using an expression
The expression argument takes a filter expression that restricts the query results to a subset.
upset(movies,
queries = list(
list(
query = intersects,
params = list("Action", "Drama"),
active = T),
list(
query = elements,
params = list("ReleaseDate", 1980, 1990, 2000),
color = "red",
active = F)),
expression = "AvgRating > 3 & Watches > 100"
)
4. Custom queries
A query function is applied to every row of the data. We can define one as follows
Myfunc <- function(row, release, rating) {
data <- (row["ReleaseDate"] %in% release) & (row["AvgRating"] > rating)
}
This selects the movies whose release year is in release and whose average rating is above a given value
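To check what such a row-wise function returns before handing it to upset(), it can be tried on a few rows by hand. The sketch below is an assumption-laden illustration: the toy data frame is made up, my_query is a copy of the function above under a different name, and apply() over a numeric matrix is used here only to mimic row-by-row evaluation, not to reproduce how UpSetR calls the function internally.

```r
# Made-up rows with the two columns the query uses
toy <- data.frame(
  ReleaseDate = c(1970, 1985, 1999, 2000),
  AvgRating   = c(3.1, 4.0, 2.0, 2.6)
)

# Same logic as Myfunc: TRUE when the year is listed and the rating is high enough
my_query <- function(row, release, rating) {
  (row["ReleaseDate"] %in% release) & (row["AvgRating"] > rating)
}

# Evaluate the query once per row
keep <- apply(as.matrix(toy), 1, my_query,
              release = c(1970, 1980, 1990, 1999, 2000), rating = 2.5)
print(unname(keep))
```

Only the first and last rows pass: 1985 is not among the listed years, and the 1999 row has a rating below 2.5.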
Run the query
upset(movies,
queries = list(
list(
query = Myfunc,
params = list(c(1970, 1980, 1990, 1999, 2000), 2.5),
color = "blue",
active = T)
)
)
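5. Adding a query legend
The query.legend argument sets the position of the query legend, either top or bottom.
Within a query, query.name sets the name of the query; if it is not set, a name is generated automatically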
upset(movies,
query.legend = "top",
queries = list(
list(
query = intersects,
params = list("Drama", "Comedy", "Action"),
color = "orange", active = T,
query.name = "Funny action"),
list(
query = intersects,
params = list("Drama"),
color = "red", active = F),
list(
query = intersects,
params = list("Action", "Drama"),
active = T,
query.name = "Emotional action")
)
)
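Attribute plots
The attribute.plots argument draws attribute plots alongside the UpSet plot. It has three fields:
- gridrows: the amount of space given to the attribute plots. The UpSet plot is laid out on a 100 X 100 grid by default; with gridrows = 50 the whole figure becomes 150 X 100
- plots: a list of plots, each of which takes four parameters:
  - plot: a function that returns a ggplot object
  - x: the variable for the x-axis
  - y: the variable for the y-axis
  - queries: whether to overlay the plot with the data from the existing queries
- ncols: the number of columns in which the attribute plots are arranged
1. Built-in plot functions
We use the histogram function that ships with the package to draw histograms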
upset(movies,
main.bar.color = "black",
queries = list(
list(
query = intersects,
params = list("Drama"),
active = T)
),
attribute.plots = list(
gridrows = 50,
plots = list(
list(
plot = histogram,
x = "ReleaseDate",
queries = F),
list(
plot = histogram,
x = "AvgRating",
queries = T)
),
ncols = 2
)
)
The scatter_plot function draws scatter plots
upset(movies,
main.bar.color = "black",
queries = list(
list(
query = intersects,
params = list("Drama"),
color = "red",
active = F),
list(
query = intersects,
params = list("Drama", "Comedy", "Action"),
color = "orange",
active = T)
),
attribute.plots = list(
gridrows = 45,
plots = list(
list(
plot = scatter_plot,
x = "ReleaseDate",
y = "AvgRating",
queries = T),
list(plot = scatter_plot,
x = "AvgRating",
y = "Watches",
queries = F)
),
ncols = 2),
query.legend = "bottom"
)
2. Custom plot functions
We first define two ggplot2-based functions, one for a scatter plot and one for a density plot
library(ggplot2)  # needed for ggplot(), aes_string() and the other plotting calls below
my_scatter <- function(data, x, y) {
p <- ggplot(data, aes_string(x, y, colour = "color")) +
geom_point() +
scale_colour_identity() +
theme(
plot.margin = unit(c(0, 0, 0, 0), "cm")
)
p
}
my_density <- function(data, x, y) {
data$decades <- data[, y] %/% 10 * 10
data <- data[which(data$decades >= 1970), ]
p <- ggplot(data, aes_string(x)) +
geom_density(aes(fill = factor(decades)), alpha = 0.3) +
theme(
plot.margin = unit(c(0, 0, 0, 0), "cm"),
legend.key.size = unit(0.4, "cm")
)
p
}
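aes_string() is deprecated in recent ggplot2 releases, so a variant of my_scatter that uses the .data pronoun may be preferable. The sketch below assumes ggplot2 3.0.0 or later; my_scatter_tidy and the toy data frame are names and values invented for this example, and the color column mirrors the one the original function relies on.

```r
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)

  # Variant of my_scatter that looks up columns by name via .data
  my_scatter_tidy <- function(data, x, y) {
    ggplot(data, aes(x = .data[[x]], y = .data[[y]],
                     colour = .data[["color"]])) +
      geom_point() +
      scale_colour_identity() +
      theme(plot.margin = unit(c(0, 0, 0, 0), "cm"))
  }

  # Made-up rows, including the color column used for the points
  toy <- data.frame(
    ReleaseDate = c(1985, 1994, 2001),
    AvgRating   = c(3.2, 4.1, 2.8),
    color       = c("red", "orange", "gray23"),
    stringsAsFactors = FALSE
  )
  p <- my_scatter_tidy(toy, "ReleaseDate", "AvgRating")
}
```

The function takes the same three arguments as my_scatter, so it could be passed as plot = my_scatter_tidy in the same way.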
Then use my_scatter and my_density as attribute plots
upset(movies,
main.bar.color = "black",
queries = list(
list(
query = intersects,
params = list("Drama"),
color = "red", active = F),
list(
query = intersects,
params = list("Action", "Drama"),
active = T),
list(
query = intersects,
params = list("Drama", "Comedy", "Action"),
color = "orange", active = T)
),
attribute.plots = list(
gridrows = 45,
plots = list(
list(
plot = my_scatter,
x = "ReleaseDate",
y = "AvgRating",
queries = T),
list(
plot = my_density,
x = "AvgRating",
y = "ReleaseDate",
queries = F)
),
ncols = 2)
)
3. Box plots
Box plots can be drawn with the boxplot.summary argument, for at most two variables at a time.
upset(movies, boxplot.summary = c("AvgRating", "ReleaseDate"))
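Of course, the same can be done with a custom plot function
Set metadata
The set.metadata argument attaches metadata to the sets. It has three fields:
- data: a data frame whose first column holds the set names and whose remaining columns hold attributes of the sets
- ncols: the number of columns
- plots: also a list; each element has four fields, column, type, assign and colors:
  - column: the name of the column in data to plot
  - type: the kind of plot to draw. For a numeric column it can be hist or heat; for a Boolean column, bool draws a Boolean heat map; for a categorical (character) column it can be heat or text; to draw inside the matrix, use matrix_rows
  - assign: the number of grid columns given to this metadata plot. If two metadata plots are assigned 20 and 10, the UpSet plot becomes 100 X 130
  - colors: the colors of the metadata plot. For a bar chart, the color applies to the whole plot; for heat or bool, a vector of colors can be given; for a factor column there is no colors parameter and a gradient is drawn; for text, one color can be set per unique string, and colors are assigned automatically if none are given
1. Bar chart
We attach a metadata attribute to every set by giving each genre a random average Rotten Tomatoes score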
sets <- names(movies[3:19])
avgRottenTomatoesScore <- round(runif(17, min = 0, max = 90))
metadata <- as.data.frame(cbind(sets, avgRottenTomatoesScore))
names(metadata) <- c("sets", "avgRottenTomatoesScore")
To draw a bar chart, the corresponding column must be numeric
> str(metadata)
'data.frame': 17 obs. of 2 variables:
$ sets : Factor w/ 17 levels "Action","Adventure",..: 1 2 3 4 5 6 7 8 12 9 ...
$ avgRottenTomatoesScore: Factor w/ 12 levels "13","16","21",..: 6 10 12 5 1 1 3 2 11 11 ...
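The score column is a factor (under R 4.0 or later it shows up as chr instead, because cbind() builds a character matrix and stringsAsFactors now defaults to FALSE), so it has to be converted first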
metadata$avgRottenTomatoesScore <- as.numeric(as.character(metadata$avgRottenTomatoesScore))
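The conversion is needed only because cbind() first builds a character matrix. As a sketch of an alternative, constructing the data frame directly with data.frame() keeps the score numeric from the start. The set names below are placeholders standing in for names(movies[3:19]), and the column name score is shortened for the example.

```r
set.seed(1)  # make the random scores reproducible

# Placeholder set names; in the tutorial these come from names(movies[3:19])
sets <- paste0("Genre", 1:17)
score <- round(runif(17, min = 0, max = 90))

# cbind() coerces both columns to character before the data frame is built
via_cbind <- as.data.frame(cbind(sets, score))

# data.frame() keeps each column's own type
via_df <- data.frame(sets = sets, score = score, stringsAsFactors = FALSE)

str(via_cbind)
str(via_df)
```

In via_df the score column is already numeric, so no as.numeric(as.character(...)) step is required.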
Now we can draw the metadata plot
upset(movies,
set.metadata = list(
data = metadata,
plots = list(
list(
type = "hist",
column = "avgRottenTomatoesScore",
assign = 20)
)
)
)
2. Heat map
We extend the set metadata with a city attribute for each set, making sure the column is character rather than factor
Cities <- sample(c("Boston", "NYC", "LA"), 17, replace = T)
metadata <- cbind(metadata, Cities)
metadata$Cities <- as.character(metadata$Cities)
We draw two heat maps, one with colors specified and one without
upset(movies,
set.metadata = list(
data = metadata,
plots = list(
list(
type = "heat",
column = "Cities",
assign = 10,
colors = c(
Boston = "green",
NYC = "navy",
LA = "purple")
),
list(
type = "heat",
column = "avgRottenTomatoesScore",
assign = 10)
)
)
)
The heat map without specified colors is drawn as a grey gradient
Boolean heat map
We add a column accepted to the set metadata, with values 0 and 1
accepted <- round(runif(17, min = 0, max = 1))
metadata <- cbind(metadata, accepted)
It is set up in the same way as above
upset(movies,
set.metadata = list(
data = metadata,
plots = list(
list(
type = "bool",
column = "accepted",
assign = 5,
colors = c("#FF3333", "#006400")
)
)
)
)
If bool is replaced with heat
upset(movies,
set.metadata = list(
data = metadata,
plots = list(
list(
type = "heat",
column = "accepted",
assign = 5,
colors = c("#FF3333", "#006400")
)
)
)
)
the 0/1 Boolean data are treated as numeric and drawn as a color gradient
3. Text
For the city metadata, text may be a better fit than a heat map
upset(movies,
set.metadata = list(
data = metadata,
plots = list(
list(
type = "text",
column = "Cities",
assign = 10,
colors = c(
Boston = "green",
NYC = "navy",
LA = "purple")
)
)
)
)
4. Applying metadata in the matrix
Sometimes we want the metadata to show up directly in the UpSet plot. Setting type = "matrix_rows" colors the rows of the matrix by city
upset(movies,
set.metadata = list(
data = metadata,
plots = list(
list(
type = "hist",
column = "avgRottenTomatoesScore",
assign = 20),
list(
type = "matrix_rows",
column = "Cities",
colors = c(
Boston = "green",
NYC = "navy",
LA = "purple"),
alpha = 0.5)
)
)
)
Putting it all together
Finally, we combine all of these plots into one
upset(movies,
# queries
queries = list(
list(
query = intersects,
params = list("Drama"),
color = "red",
active = F),
list(
query = intersects,
params = list("Action", "Drama"),
active = T),
list(
query = intersects,
params = list("Drama", "Comedy", "Action"),
color = "orange",
active = T)),
# set metadata plots
set.metadata = list(
data = metadata,
plots = list(
list(
type = "hist",
column = "avgRottenTomatoesScore",
assign = 20),
list(
type = "bool",
column = "accepted",
assign = 5,
colors = c("#FF3333", "#006400")),
list(
type = "text",
column = "Cities",
assign = 5,
colors = c(
Boston = "green",
NYC = "navy",
LA = "purple")),
list(
type = "matrix_rows",
column = "Cities",
colors = c(
Boston = "green",
NYC = "navy",
LA = "purple"),
alpha = 0.5)
)
),
# attribute plots
attribute.plots = list(
gridrows = 45,
plots = list(
list(
plot = my_scatter,
x = "ReleaseDate",
y = "AvgRating",
queries = T),
list(plot = my_density,
x = "AvgRating",
y = "ReleaseDate",
queries = F)),
ncols = 2),
query.legend = "bottom"
)
Code:
https://github.com/dxsbiocc/learn/blob/main/R/plot/upset_plot.R
Parameter details