【r<-Advanced|Recipes】How to add p-values to a correlation matrix plot

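This post is essentially a follow-up to "How to analyze and display gene correlations in RNA-seq gene expression data". Last time we drew a corrgram of the selected genes.

In that figure, only one pair of genes shows a fairly strong correlation in expression (ETV3-ELK4).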
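Textbooks usually describe the strength of a correlation as follows:

Correlation coefficient r:
· |r| > 0.95: significant correlation;
· |r| ≥ 0.8: high correlation;
· 0.5 ≤ |r| < 0.8: moderate correlation;
· 0.3 ≤ |r| < 0.5: low correlation;
· |r| < 0.3: very weak relationship, treated as uncorrelated.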
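The Pearson coefficient is computed as:

$$ r = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^2}} $$

where $x_i$ and $y_i$ are the paired observations and $\bar{x}$, $\bar{y}$ are their means.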

By these cut-offs, the two genes are moderately correlated.
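The textbook cut-offs above can be written as a small lookup. This is only an illustrative sketch: `classify_r` is a hypothetical helper name, not part of the original function, and the labels simply restate the list above.

```r
# Hypothetical helper: map a correlation coefficient to the textbook
# categories listed above. Not part of gene_exp.corr().
classify_r <- function(r) {
    a <- abs(r)
    labels <- c("very weak", "low", "moderate", "high")
    # findInterval() uses left-closed bins:
    # [0, 0.3), [0.3, 0.5), [0.5, 0.8), [0.8, Inf)
    out <- labels[findInterval(a, c(0.3, 0.5, 0.8)) + 1]
    # the list treats |r| > 0.95 as a separate, strongest category
    out[which(a > 0.95)] <- "significant"
    out
}

classify_r(c(0.1, -0.45, 0.62, 0.8, -0.97))
# "very weak" "low" "moderate" "high" "significant"
```

Note that these labels describe only the size of r; they say nothing about whether the correlation is statistically significant.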

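However, the Pearson correlation coefficient only expresses the degree of linear correlation between two sets of data. What if the correlation is not statistically significant? What does the coefficient mean then? So when judging a correlation we need to consider both the p-value and the size of r.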

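One blogger puts it this way:

Whether two variables count as correlated depends on two things: the significance level and the correlation coefficient.
(1) Significance level, i.e. the p-value. This comes first, because if the result is not significant, a high correlation coefficient is of no use; it may be due to chance alone. How small counts as significant? Generally p < 0.05 is significant and p < 0.01 even more so; p = 0.001, for example, is a very high level of significance. Once the result is significant you can conclude: reject the null hypothesis of no correlation; the two sets of data are significantly correlated, that is, there really is a clear relationship between them. Usually p needs to be below 0.1, preferably below 0.05 or even 0.01, before concluding that the two sets of data are clearly related. If p = 0.5, far above 0.1, it only indicates that the correlation is not evident or is absent, or at least is not linear.
(2) Correlation coefficient, i.e. the Pearson correlation, often called the R value. Once the indicator above is confirmed to be significant, look at this one; in general, the higher the coefficient, the closer the relationship between the two.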

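While searching for related concepts, I found a Baidu Wenku document, "Some basic concepts of correlation coefficients and p-values", which gives detailed descriptions and examples. Have a look if you are interested.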

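From the above, it is necessary to check significance before looking at the correlation coefficient r (or R²) in a correlation analysis. That said, if you can already see a very clear linear relationship between two variables, skipping the p-value does little harm: as long as the sample is not tiny, the p-value will almost certainly be below 0.05.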
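To see why r and the p-value need to be read together, here is a small sketch with made-up numbers (the vectors `x` and `y` are illustrative, not gene expression data). With only three observations, r is very high, yet the test is not significant at the 0.05 level.

```r
# Made-up toy data: three paired observations
x <- c(1, 2, 3)
y <- c(1, 2, 4)

res <- cor.test(x, y)        # Pearson correlation test by default
r <- unname(res$estimate)    # correlation coefficient
p <- res$p.value             # two-sided p-value

r   # about 0.98: "significant" by size alone under the textbook cut-offs
p   # about 0.12: not statistically significant at 0.05
```

With so few samples, a large r can easily arise by chance, which is exactly the case the p-value is meant to flag.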

Following the documentation of the R package corrplot, I improved the function from last time. The code is:

gene_exp.corr <- function(gene.list, project_code, project.clinical, project.exp, outdir, ID_transform=TRUE, conf.level=0.95){
    # Arguments:
    # gene.list: the genes whose expression correlation you want to analyze
    # project_code: name of the data project, or any name you want to give this analysis
    # project.clinical: clinical information about the samples, data.frame format
    # project.exp: normalized gene expression (RNA-seq) of the samples, data.frame format;
    #              the `sample` column (expected to be the first column) holds the gene names
    # outdir: output location; it is pasted directly in front of the file name,
    #         so include the trailing "/"
    # ID_transform: clinical information sometimes uses "-" as the separator in sample IDs,
    #               and the IDs need to match those in project.exp
    # sample ID example: in the clinical information a sample may be labeled "TCGA-3N-A9WB-06",
    #                    while in the RNA-seq data set it is "TCGA.3N.A9WB.06". If your IDs already match, set ID_transform=FALSE.
    # conf.level: confidence level passed to cor.test()
    
    # note: you need to install the "corrplot" package before using this function
    
    gene_exp.list <- subset(project.exp, sample%in%gene.list)
    rownames(gene_exp.list) <- gene_exp.list[,1]
    gene_exp.list <- gene_exp.list[,-1]
    gene_exp.list <- t(gene_exp.list)
    # gene_exp.list <- gene_exp.list[,c(5,1,2,3,4,6,7,8,9,10)]
    library(corrplot)
    # combine with significance test
    cor.mtest <- function(mat, conf.level = 0.95){
        mat <- as.matrix(mat)
        n <- ncol(mat)
        p.mat <- lowCI.mat <- uppCI.mat <- matrix(NA, n, n)
        diag(p.mat) <- 0
        diag(lowCI.mat) <- diag(uppCI.mat) <- 1
        for(i in 1:(n-1)){
            for(j in (i+1):n){
                tmp <- cor.test(mat[,i], mat[,j], conf.level = conf.level)
                p.mat[i,j] <- p.mat[j,i] <- tmp$p.value
                lowCI.mat[i,j] <- lowCI.mat[j,i] <- tmp$conf.int[1]
                uppCI.mat[i,j] <- uppCI.mat[j,i] <- tmp$conf.int[2]
            }
        }
        return(list(p.mat, lowCI.mat, uppCI.mat))
    }
    
    if(ID_transform){
        project.clinical$sampleID = gsub("-",".",project.clinical$sampleID, fixed = TRUE)
    }
    n.gene <- ncol(gene_exp.list)
    # all samples
    M1 <- cor(gene_exp.list)
    res1 <- cor.mtest(gene_exp.list, conf.level)
    
    pdf(paste(outdir,project_code,"_all_sample_genelist_expression_corrgram.pdf", sep=""))
    corrplot(M1, order = "AOE", tl.pos = "d", p.mat = res1[[1]], insig = "p-value")
    title(paste("Corrgram of ", n.gene," Genes Expression in ", project_code, sep = ""))
    dev.off()
    
    # choose tumor sample
    # table(project.clinical$sample_type)
    primary_tumor <- "Primary Tumor"
    Metast_tumor  <- "Metastatic"
    primary_tumor.id <- project.clinical[project.clinical$sample_type==primary_tumor,]$sampleID
    Metast_tumor.id  <- project.clinical[project.clinical$sample_type==Metast_tumor,]$sampleID
    
    if(length(primary_tumor.id)<2 && length(Metast_tumor.id)<2){
        stop("Maybe your data have something wrong. Please check it!")
    }else{
        if(length(primary_tumor.id)<2){
            stop("I don't think it's reasonable that there are less than 2 primary tumor samples.")}
        
        gene_exp.list.primary <- subset(gene_exp.list, rownames(gene_exp.list)%in%primary_tumor.id)
        M2 <- cor(gene_exp.list.primary)
        res2 <- cor.mtest(gene_exp.list.primary, conf.level)
        
        pdf(paste(outdir,project_code,"_primary_tumor_sample_genelist_expression_corrgram.pdf", sep=""))
        corrplot(M2, order = "AOE", tl.pos = "d", p.mat = res2[[1]], insig = "p-value")
        title(paste("Corrgram of ", n.gene," Genes Expression in ", project_code, sep = ""))
        dev.off()
        
        if(length(Metast_tumor.id)<2){
            cat("It seems has no Metastatic sample in this analysis. \n")
            return(0)
        }
        
        gene_exp.list.Metast  <- subset(gene_exp.list, rownames(gene_exp.list)%in%Metast_tumor.id)
        M3 <- cor(gene_exp.list.Metast)
        res3 <- cor.mtest(gene_exp.list.Metast, conf.level)
        pdf(paste(outdir,project_code,"_Metastatic_sample_genelist_expression_corrgram.pdf", sep=""))
        corrplot(M3, order = "AOE", tl.pos = "d", p.mat = res3[[1]], insig = "p-value")
        title(paste("Corrgram of ", n.gene," Genes Expression in ", project_code, sep = ""))
        dev.off()}
}

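Compared with last time, this version adds the significance test and switches the plotting function. If you want to use this function to draw more customized plots, refer to the package documentation and modify it.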
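For readers who want to try the function, the sketch below shows the input shape that its first lines appear to expect. The table is a made-up toy: the gene names, values, and the second sample ID are invented, and only the "TCGA-3N-A9WB-06" ID format comes from the comments in the function.

```r
# Toy expression table: the first column `sample` holds gene names,
# the remaining columns are samples (IDs use "." as the separator).
project.exp <- data.frame(
    sample = c("GENE_A", "GENE_B", "GENE_C"),
    TCGA.3N.A9WB.06 = c(5.1, 2.3, 7.7),
    TCGA.XX.0001.01 = c(4.8, 2.9, 6.5),   # invented sample ID
    stringsAsFactors = FALSE
)
gene.list <- c("GENE_A", "GENE_C")

# Same steps as the start of gene_exp.corr(): keep the genes of interest,
# move gene names to the row names, then transpose to samples x genes.
gene_exp.list <- subset(project.exp, sample %in% gene.list)
rownames(gene_exp.list) <- gene_exp.list[, 1]
gene_exp.list <- t(gene_exp.list[, -1])
dim(gene_exp.list)        # 2 samples x 2 genes
colnames(gene_exp.list)   # "GENE_A" "GENE_C"

# Clinical tables often use "-" in sample IDs; ID_transform = TRUE
# applies this replacement so that both tables use the same IDs.
gsub("-", ".", "TCGA-3N-A9WB-06", fixed = TRUE)   # "TCGA.3N.A9WB.06"
```

After the transpose, each column is one gene, which is the layout `cor()` needs to produce a gene-by-gene correlation matrix.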

[Figure: corrplot_demo.png]

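This is similar to the figure from last time. Bluish colors indicate positive correlation and reddish colors negative correlation. Circle size and color intensity, together with the color legend, make it easy to pick out the strongly correlated pairs. The lower-left and upper-right halves of the corrgram are symmetric. Cells marked with a number show the p-value: I use the default threshold of 0.05, so every p-value greater than 0.05 is displayed, meaning the correlation between the corresponding two genes is not statistically significant.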
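The 0.05 threshold corresponds to corrplot's `sig.level` argument, and `insig` decides what is drawn in cells above it. Below is a minimal sketch that uses the built-in `mtcars` data instead of gene expression (an assumption made only to keep the example self-contained) and guards the plotting call in case corrplot is not installed.

```r
# Stand-in data: three numeric columns from the built-in mtcars data set
mat <- as.matrix(mtcars[, c("mpg", "wt", "qsec")])
M <- cor(mat)

# Pairwise p-values, same idea as cor.mtest() in the function above
n <- ncol(mat)
p.mat <- matrix(0, n, n, dimnames = dimnames(M))
for (i in seq_len(n - 1)) {
    for (j in (i + 1):n) {
        p.mat[i, j] <- p.mat[j, i] <- cor.test(mat[, i], mat[, j])$p.value
    }
}

if (requireNamespace("corrplot", quietly = TRUE)) {
    out <- tempfile(fileext = ".pdf")
    pdf(out)
    # print the p-value in every cell with p > sig.level
    corrplot::corrplot(M, p.mat = p.mat, sig.level = 0.05, insig = "p-value")
    # alternative: leave those cells empty instead
    corrplot::corrplot(M, p.mat = p.mat, sig.level = 0.05, insig = "blank")
    dev.off()
}
```

Lowering `sig.level` (for example to 0.01) marks more cells as not significant, so more p-values appear on the plot.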

