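This post is a follow-up to "How to analyze and display gene correlations in RNA-seq gene expression data". Last time I drew the following figure:
It shows that only one pair of genes (ETV3-ELK4) has a reasonably strong expression correlation.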
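Textbooks usually describe the strength of a correlation in terms of the correlation coefficient r:
· |r| > 0.95: a marked correlation;
· |r| ≥ 0.8: high correlation;
· 0.5 ≤ |r| < 0.8: moderate correlation;
· 0.3 ≤ |r| < 0.5: low correlation;
· |r| < 0.3: extremely weak, regarded as uncorrelated.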
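The coefficient is computed as:
$$ r = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^2}} $$
where $x_i$ and $y_i$ are the expression values of the two genes in sample $i$, and $\bar{x}$ and $\bar{y}$ are their means over the $n$ samples.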
So these two genes are moderately correlated.
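The cut-offs above can be sketched as a small R helper. The function name `cor_strength` and the label strings are my own choices for illustration, not part of the original analysis; the "|r| > 0.95" tier is folded into "high" because it overlaps with "|r| ≥ 0.8".

```r
# Hypothetical helper: label |r| with the textbook cut-offs.
cor_strength <- function(r) {
  stopifnot(is.numeric(r), all(abs(r) <= 1, na.rm = TRUE))
  cut(abs(r),
      breaks = c(0, 0.3, 0.5, 0.8, 1),
      labels = c("very weak", "low", "moderate", "high"),
      right = FALSE,          # intervals are [a, b), e.g. 0.5 <= |r| < 0.8
      include.lowest = TRUE)  # lets |r| = 1 fall into the last interval
}

# The sign is ignored; only the magnitude is classified.
cor_strength(c(0.1, -0.45, 0.62, -0.9))
```

The labels only describe the size of r; they say nothing about whether the correlation is statistically significant.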
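However, the Pearson correlation coefficient only measures how strongly two sets of data are linearly related. What if that relationship is not statistically significant? What would the coefficient mean then? When judging a correlation, we therefore need to look at both the p-value and the size of the correlation coefficient r.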
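One blogger puts it this way:
Whether two variables count as correlated depends on two things: the significance level and the correlation coefficient.
(1) The significance level, i.e. the p-value. This comes first, because if the result is not significant, a high correlation coefficient is of no use; it may simply be due to chance. How small counts as significant? Generally, p < 0.05 is significant and p < 0.01 even more so; p = 0.001, for example, is a very high level of significance. Once the result is significant, you can conclude: reject the null hypothesis of no correlation; the two sets of data are significantly correlated, that is, there really is a clear relationship between them. Usually p needs to be below 0.1, preferably below 0.05 or even 0.01, before you can conclude that the two sets of data are clearly related. If p = 0.5, far above 0.1, you can only say that the correlation is not evident or that there is none, at least not a linear one.
(2) The correlation coefficient, i.e. the Pearson correlation, usually called the R value. Look at it only after confirming that the result above is significant; in general, the higher the coefficient, the closer the relationship between the two.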
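While searching for these concepts, I found a Baidu Wenku document, "Some basic concepts of the correlation coefficient and the p-value", which gives a detailed description and worked examples. Have a look if you are interested.
From the above, it is worth checking significance before looking at the correlation coefficient r (or R²). That said, if you can already see an obvious linear relationship between the two variables, skipping the p-value does little harm, because it will almost certainly be below 0.05 unless the sample is very small.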
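To see why r alone is not enough, here is a minimal sketch with made-up numbers (not expression data). Repeating the same three points 20 times is artificial and is done only to hold r fixed while the sample size grows; `cor.test()` from base R reports both r and the p-value.

```r
# Toy data: three samples with r = 0.5
x <- c(1, 2, 3)
y <- c(1, 3, 2)
small <- cor.test(x, y)                     # Pearson is the default method

# The same three points repeated 20 times: r is unchanged, n goes from 3 to 60
large <- cor.test(rep(x, 20), rep(y, 20))

unname(small$estimate)  # 0.5
small$p.value           # about 0.67, far from significant
unname(large$estimate)  # 0.5
large$p.value < 0.001   # TRUE
```

The same "moderate" r of 0.5 is indistinguishable from chance with 3 samples and clearly significant with 60, which is why the p-value is checked first.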
Following the documentation of the R package corrplot, I improved last time's function. The code is:
gene_exp.corr <- function(gene.list, project_code, project.clinical, project.exp, outdir, ID_transform=TRUE, conf.level=0.95){
  # Arguments:
  #   gene.list: the genes whose expression correlation you want to analyze
  #   project_code: data project name, or any name you want to give this analysis
  #   project.clinical: clinical information about the samples, data.frame format
  #   project.exp: normalized gene expression (RNA-seq) of the samples, data.frame format;
  #                gene names are expected in a column called "sample", taken here to be the first column
  #   outdir: output directory; it is pasted directly in front of the file name, so it should end with "/"
  #   ID_transform: clinical information sometimes uses "-" as the separator in sample IDs,
  #                 and the IDs need to match those in project.exp.
  #                 Example: a sample marked "TCGA-3N-A9WB-06" in the clinical information
  #                 is "TCGA.3N.A9WB.06" in the RNA-seq data set. If the IDs already match, set ID_transform=FALSE.
  #   conf.level: confidence level passed on to cor.test()
  # note: you need to install the "corrplot" package before using this function

  # keep the rows of the requested genes, then transpose to a samples x genes matrix
  gene_exp.list <- subset(project.exp, sample%in%gene.list)
  rownames(gene_exp.list) <- gene_exp.list[,1]
  gene_exp.list <- gene_exp.list[,-1]
  gene_exp.list <- t(gene_exp.list)
  # gene_exp.list <- gene_exp.list[,c(5,1,2,3,4,6,7,8,9,10)]
  library(corrplot)
  # combine with significance test: pairwise p-values and confidence-interval bounds
  cor.mtest <- function(mat, conf.level = 0.95){
    mat <- as.matrix(mat)
    n <- ncol(mat)
    p.mat <- lowCI.mat <- uppCI.mat <- matrix(NA, n, n)
    diag(p.mat) <- 0
    diag(lowCI.mat) <- diag(uppCI.mat) <- 1
    # seq_len() avoids the 1:0 trap when there is only one column
    for(i in seq_len(n-1)){
      for(j in (i+1):n){
        tmp <- cor.test(mat[,i], mat[,j], conf.level = conf.level)
        p.mat[i,j] <- p.mat[j,i] <- tmp$p.value
        # cor.test() only returns a confidence interval with at least 4 complete pairs
        if(!is.null(tmp$conf.int)){
          lowCI.mat[i,j] <- lowCI.mat[j,i] <- tmp$conf.int[1]
          uppCI.mat[i,j] <- uppCI.mat[j,i] <- tmp$conf.int[2]
        }
      }
    }
    return(list(p.mat, lowCI.mat, uppCI.mat))
  }
  if(ID_transform){
    project.clinical$sampleID <- gsub("-",".",project.clinical$sampleID, fixed = TRUE)
  }
  n.gene <- ncol(gene_exp.list)
  # all samples
  M1 <- cor(gene_exp.list)
  res1 <- cor.mtest(gene_exp.list, conf.level)
  pdf(paste(outdir,project_code,"_all_sample_genelist_expression_corrgram.pdf", sep=""))
  # cells with p > 0.05 (corrplot's default sig.level) are labelled with their p-value
  corrplot(M1, order = "AOE", tl.pos = "d", p.mat = res1[[1]], insig = "p-value")
  title(paste("Corrgram of ", n.gene," Genes Expression in ", project_code, sep = ""))
  dev.off()
  # choose tumor sample
  # table(project.clinical$sample_type)
  primary_tumor <- "Primary Tumor"
  Metast_tumor <- "Metastatic"
  primary_tumor.id <- project.clinical[project.clinical$sample_type==primary_tumor,]$sampleID
  Metast_tumor.id <- project.clinical[project.clinical$sample_type==Metast_tumor,]$sampleID
  # fixed: the "<2" comparison belongs outside length(); "&&" is the scalar operator for if()
  if(length(primary_tumor.id)<2 && length(Metast_tumor.id)<2){
    stop("Something may be wrong with your data. Please check it!")
  }else{
    if(length(primary_tumor.id)<2){
      stop("I don't think it's reasonable that there are fewer than 2 primary tumor samples.")
    }
    # primary tumor samples
    gene_exp.list.primary <- subset(gene_exp.list, rownames(gene_exp.list)%in%primary_tumor.id)
    M2 <- cor(gene_exp.list.primary)
    res2 <- cor.mtest(gene_exp.list.primary, conf.level)
    pdf(paste(outdir,project_code,"_primary_tumor_sample_genelist_expression_corrgram.pdf", sep=""))
    corrplot(M2, order = "AOE", tl.pos = "d", p.mat = res2[[1]], insig = "p-value")
    title(paste("Corrgram of ", n.gene," Genes Expression in ", project_code, sep = ""))
    dev.off()
    if(length(Metast_tumor.id)<2){
      cat("There seem to be no metastatic samples in this analysis. \n")
      return(0)
    }
    # metastatic samples
    gene_exp.list.Metast <- subset(gene_exp.list, rownames(gene_exp.list)%in%Metast_tumor.id)
    M3 <- cor(gene_exp.list.Metast)
    res3 <- cor.mtest(gene_exp.list.Metast, conf.level)
    pdf(paste(outdir,project_code,"_Metastatic_sample_genelist_expression_corrgram.pdf", sep=""))
    corrplot(M3, order = "AOE", tl.pos = "d", p.mat = res3[[1]], insig = "p-value")
    title(paste("Corrgram of ", n.gene," Genes Expression in ", project_code, sep = ""))
    dev.off()
  }
}
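Compared with last time, I added the significance test and changed the plotting function. If you want to use this function to draw more customized plots, you can adapt it by following the package documentation.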
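As a starting point for such customization, here is a minimal standalone sketch of the same idea outside the function. It uses four columns of the built-in mtcars data set as stand-in data, and `pairwise_p` is a hypothetical helper name of my own; the corrplot arguments `type`, `sig.level` and `insig = "blank"` are taken from the package documentation, and the plot is skipped if corrplot is not installed.

```r
# Stand-in data: four columns of the built-in mtcars data set, not gene expression
mat <- as.matrix(mtcars[, c("mpg", "disp", "hp", "qsec")])

# Pairwise p-values from cor.test(), the same idea as cor.mtest() in the function above
pairwise_p <- function(mat) {
  n <- ncol(mat)
  p <- matrix(0, n, n, dimnames = list(colnames(mat), colnames(mat)))
  for (i in seq_len(n - 1)) {
    for (j in (i + 1):n) {
      p[i, j] <- p[j, i] <- cor.test(mat[, i], mat[, j])$p.value
    }
  }
  p
}

M <- cor(mat)
p.mat <- pairwise_p(mat)

if (requireNamespace("corrplot", quietly = TRUE)) {
  # upper triangle only; cells with p > sig.level are left blank
  corrplot::corrplot(M, type = "upper", order = "AOE",
                     p.mat = p.mat, sig.level = 0.01, insig = "blank")
}
```

Recent versions of corrplot also export their own `cor.mtest()`, so it may be possible to drop the hand-written helper altogether; check the documentation of your installed version.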
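The result is similar to last time's figure. Bluish colors indicate positive correlation and reddish colors negative correlation. Circle size and color intensity, together with the color legend, make it easy to pick out the strong correlations. The lower-left and upper-right halves of the corrgram are symmetric. Cells that carry a number show the p-value: I keep the default threshold of 0.05 here, so every p-value above 0.05 is printed, and those values indicate that the correlation between the two corresponding genes is not statistically significant.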
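If you would rather have those pairs as a table than read them off the plot, something like the following sketch may help. The p-value matrix here is made up, the gene labels are placeholders, and `insignificant_pairs` is a hypothetical helper name; in practice you would pass in the first element returned by `cor.mtest()`, with gene names set as row and column names.

```r
# Toy symmetric p-value matrix (made-up numbers, placeholder gene labels)
genes <- c("geneA", "geneB", "geneC")
p.mat <- matrix(c(0,     0.001, 0.40,
                  0.001, 0,     0.03,
                  0.40,  0.03,  0),
                nrow = 3, byrow = TRUE, dimnames = list(genes, genes))

# List each pair once (upper triangle) whose p-value exceeds the threshold
insignificant_pairs <- function(p.mat, sig.level = 0.05) {
  idx <- which(upper.tri(p.mat) & p.mat > sig.level, arr.ind = TRUE)
  data.frame(gene1 = rownames(p.mat)[idx[, "row"]],
             gene2 = colnames(p.mat)[idx[, "col"]],
             p = p.mat[idx],
             stringsAsFactors = FALSE)
}

insignificant_pairs(p.mat)  # one row: geneA, geneC, p = 0.4
```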