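This post was written after reading the article "Reproducing an experimental paper through data mining in seven steps of pure R code (steps 1 to 6)" from the 生信技能樹 WeChat public account (vx: biotrainee). The work done by its outstanding apprentices was so appealing that I reproduced it myself.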
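Step00. Problem overview
The task of this post is to reproduce a paper entirely in code. The paper is titled: Co-expression networks revealed potential core lncRNAs in the triple-negative breast cancer. PMID: 27380926
ref: 生信技能樹 -- Reproducing an experimental paper through data mining in seven steps of pure R code (steps 1 to 6)
The paper was based on transcriptome sequencing and analysis of 8 breast cancer patients. Reproducing the sequencing itself is hardly realistic, but the paper's data-analysis workflow can be reproduced with tumor data from the TCGA database.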
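The analysis workflow in this post consists of:
Downloading the data
Data cleaning
Quality control
Differential analysis
Annotating mRNA and lncRNA
Enrichment analysis
The WGCNA analysis is not reproduced in this post; interested readers can consult the 生信技能樹 article "Reproducing an experimental paper through data mining in seven steps of pure R code (step 7: WGCNA)".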
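Step01. Data download
- TCGA database
For downloading data from the TCGA database, see the related 生信技能樹 article "A TCGA data-mining article for you". The download procedure is also briefly recapped here.
First, go to UCSC Xena
Select the TCGA breast cancer data
Download the RNAseq expression matrix and the clinical information
P.S. Note that 生信技能樹 used the GDC breast cancer dataset, whereas this post uses the TCGA one. The results obtained from the two datasets differ considerably.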
- Ensembl GTF file -- annotation information
On the Ensembl FTP download page (http://asia.ensembl.org/info/data/ftp/index.html), select the human GTF file:
Then download the file "Homo_sapiens.GRCh38.98.chr.gtf.gz".
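Step02. Data cleaning
This step extracts the clinical information and expression matrix of the triple-negative breast cancer samples from the clinical data, and pairs the tumor samples with normal samples.
Triple-negative breast cancer (TNBC): breast cancer that expresses none of the following three receptors:
- estrogen receptor (ER);
- progesterone receptor (PR);
- human epidermal growth factor receptor 2 (HER2/neu)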
rm(list = ls())
#selecting triple-negative breast cancer samples from phenotype data
#extracting clinical information
p <- read.table('.../data/BRCA_clinicalMatrix',header = T,
sep = '\t',quote = '')
colnames(p)[grep("receptor_status", colnames(p))]
## [1] "breast_carcinoma_estrogen_receptor_status"
## [2] "breast_carcinoma_progesterone_receptor_status"
## [3] "lab_proc_her2_neu_immunohistochemistry_receptor_status"
## [4] "metastatic_breast_carcinoma_estrogen_receptor_status"
## [5] "metastatic_breast_carcinoma_progesterone_receptor_status"
# counting how many samples are negative for all three receptors
table(p$breast_carcinoma_estrogen_receptor_status == 'Negative' &
p$breast_carcinoma_progesterone_receptor_status == 'Negative' &
p$lab_proc_her2_neu_immunohistochemistry_receptor_status == 'Negative')
## FALSE TRUE
## 1117 130
# extracting tnbc samples
tnbc_samples <- p[p$breast_carcinoma_estrogen_receptor_status == 'Negative' &
p$breast_carcinoma_progesterone_receptor_status == 'Negative' &
p$lab_proc_her2_neu_immunohistochemistry_receptor_status == 'Negative', ]
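In the TCGA naming scheme, characters 14 and 15 of a sample name form a two-digit code: 01-09 denote tumor samples and 10-19 denote normal control samples. The full mapping is listed on the help page: https://gdc.cancer.gov/resources-tcga-users/tcga-code-tables/sample-type-codes
Therefore, in the following analysis, samples with 01 at this position go to the tumor group and samples with 11 go to the normal group. The code splits on whether the code is below 10, and the one 06 sample that ends up in the tumor group is removed after pairing.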
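To make the barcode rule concrete, here is a minimal base-R sketch on made-up barcodes (the `TCGA-XX-...` IDs are hypothetical and do not come from the dataset). It converts the two-digit code to an integer before comparing, and treats every code of 10 or above as normal, which is a simplification for this sketch.

```r
# Toy barcodes (hypothetical, for illustration only)
sample_id <- c("TCGA-XX-0001-01", "TCGA-XX-0001-11", "TCGA-XX-0002-06")

# Characters 14-15 hold the sample-type code; convert to integer before comparing
type_code <- as.integer(substr(sample_id, 14, 15))

# Codes below 10 are tumor; 10 and above are treated as normal in this sketch
sample_group <- ifelse(type_code < 10, "tumor", "normal")

# Characters 1-12 identify the patient, which is what a tumor/normal pair shares
patient_id <- substr(sample_id, 1, 12)

data.frame(sample_id, type_code, sample_group, patient_id)
```

Comparing integers rather than strings avoids relying on how R orders character values such as "06" and "10".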
#pairing tumor samples with normal samples
library(stringr)
tab1 <- tnbc_samples[1:2] # includes 'sampleID' & "AJCC_Stage_nature2012"
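# convert the sample-type code to integer so the comparison is numeric, not string-based
tumor <- tab1[as.integer(substr(tab1$sampleID,14,15)) < 10,]
tumor$TCGAID <- str_sub(tumor$sampleID,1,12)
normal <- tab1[as.integer(substr(tab1$sampleID,14,15)) >= 10,]
normal$TCGAID <- str_sub(normal$sampleID,1,12)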
dim(tumor)
## [1] 117 3
dim(normal)
## [1] 13 3
tnbc_samples_paired <- merge(tumor,normal,by = 'TCGAID') # keeps only patients that have both a tumor and a normal sample
# patient 'TCGA-BH-A18V' has two tumor samples,
# so we remove the duplicated one, 'TCGA-BH-A18V-06' (row 6 of the merged table)
tnbc_samples_paired <- tnbc_samples_paired[-6, ]
save(tnbc_samples_paired, file = "data/tnbc_samples_paired.Rdata")
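The pairing step can also be written without a hard-coded row index. The sketch below is an assumption-laden illustration on hypothetical barcodes, not the code used in this post: it inner-joins tumor and normal samples on the patient ID and, when a patient has several tumor samples, keeps the one with the lowest sample-type code.

```r
# Toy sample table (hypothetical barcodes)
samples <- data.frame(
  sampleID = c("TCGA-XX-0001-01", "TCGA-XX-0001-11",
               "TCGA-XX-0002-01", "TCGA-XX-0002-06", "TCGA-XX-0002-11",
               "TCGA-XX-0003-01"),
  stringsAsFactors = FALSE
)
samples$TCGAID <- substr(samples$sampleID, 1, 12)
samples$code   <- as.integer(substr(samples$sampleID, 14, 15))

tumor  <- samples[samples$code < 10, ]
normal <- samples[samples$code >= 10, ]

# Inner join: patient 0003 has no normal sample and drops out
paired <- merge(tumor, normal, by = "TCGAID", suffixes = c(".tumor", ".normal"))

# Patient 0002 appears twice (codes 01 and 06); keep the lowest tumor code per patient
paired <- paired[order(paired$TCGAID, paired$code.tumor), ]
paired <- paired[!duplicated(paired$TCGAID), ]
paired
```

Selecting by rule rather than by position keeps working if the clinical table is re-downloaded and the row order changes.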
#gene expression matrix
rawdata <- read.csv("data/HiSeqV2", sep = '\t', header = T)
rawdata <- as.data.frame(rawdata)
rawdata[1:3,1:3]
## sample TCGA.AR.A5QQ.01 TCGA.D8.A1JA.01
## 1 ARHGEF10L 9.5074 7.4346
## 2 HIF3A 1.5787 3.6607
## 3 RNF17 0.0000 0.6245
tnbc_samples_paired[1,"sampleID.x"]
## [1] TCGA-A7-A4SE-01
#make sampleid suitable for comparing with the id in rawdata
t_idfordata <- tnbc_samples_paired$sampleID.x
t_idfordata <- gsub('-','.',t_idfordata)
tnbc_samples_paired$t_dataid <- t_idfordata
n_idfordata <- tnbc_samples_paired$sampleID.y
n_idfordata <- gsub('-','.',n_idfordata)
tnbc_samples_paired$n_dataid <- n_idfordata
table(colnames(rawdata) %in% tnbc_samples_paired$t_dataid)
## FALSE TRUE
## 1205 14
tab2 <- rawdata[ ,colnames(rawdata) %in% tnbc_samples_paired$t_dataid]
tab3 <- rawdata[ ,colnames(rawdata) %in% tnbc_samples_paired$n_dataid]
tab2 <- tab2[, str_sub(colnames(tab2),1,12) %in% str_sub(colnames(tab3),1,12)]
expr <- cbind(tab2, tab3)
rownames(expr) <- rawdata[ ,1]
expr <- t(expr)
expr[1:3,1:3]
## ARHGEF10L HIF3A RNF17
## TCGA.E2.A1L7.01 9.8265 1.7767 0.0000
## TCGA.BH.A1FC.01 9.6724 2.2705 0.6677
## TCGA.E2.A1LS.01 9.3743 7.8902 0.0000
save(expr,file = "data/TNBC_pair_expr.Rdata")
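Step03. Quality control
After extracting the expression matrix, we need to check the quality of the data, for example whether the samples group as expected. Here both PCA and clustering are applied to the expression matrix. In general either one suffices as a visual quality check; both are shown here to illustrate the different approaches.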
# using PCA to examine the data quality
library(factoextra)
library(FactoMineR)
group <- c(rep('tumor',11), rep('normal',11))
expr.pca <- PCA(expr,graph = F)
fviz_pca_ind(expr.pca,
geom.ind = "point",
col.ind = group,
addEllipses = TRUE,
legend.title = "Groups")
# hierarchical clustering to examine the data quality
plot(hclust(dist(expr)))
Both analyses clearly separate the tumor and normal groups, indicating that the expression matrix is of good quality.
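For readers without FactoMineR and factoextra installed, the same two checks can be sketched with base R only. This is a toy illustration on simulated data, with `prcomp` swapped in for `PCA`; the matrix, the shift of 5 and the group sizes are all assumptions made for the example.

```r
set.seed(1)
# Simulated matrix: 6 samples (rows) x 50 genes (columns), shaped like the transposed `expr`
toy <- matrix(rnorm(6 * 50), nrow = 6,
              dimnames = list(paste0(rep(c("tumor", "normal"), each = 3), 1:3), NULL))
# Shift the first 10 genes upward in the tumor samples so the two groups differ
toy[1:3, 1:10] <- toy[1:3, 1:10] + 5
group <- factor(rep(c("tumor", "normal"), each = 3))

# PCA on samples; genes are centred but not scaled in this sketch
pca <- prcomp(toy, center = TRUE, scale. = FALSE)
plot(pca$x[, 1:2], col = as.integer(group), pch = 19, xlab = "PC1", ylab = "PC2")
legend("topright", legend = levels(group), col = seq_along(levels(group)), pch = 19)

# Hierarchical clustering on Euclidean distances between samples
hc <- hclust(dist(toy))
plot(hc)

# Cutting the tree into two clusters and cross-tabulating against the known groups
clusters <- cutree(hc, k = 2)
table(clusters, group)
```

The cross-table at the end gives a numeric counterpart to eyeballing the dendrogram: if the groups separate, each cluster contains samples from only one group.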
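Step04. Differential expression analysis
The differential analysis is performed with DESeq2. DESeq2 expects un-normalized values as input, whereas the values in the TCGA expression matrix have been transformed as log2(norm_count+1), so the transform has to be reversed before the differential analysis.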
library(DESeq2)
# transpose back to genes x samples
dat <- as.data.frame(t(expr))
# un-normalization: invert log2(norm_count + 1)
dat <- 2^dat - 1
dat <- ceiling(dat)
dat[1:3,1:3]
## TCGA.E2.A1L7.01 TCGA.BH.A1FC.01 TCGA.E2.A1LS.01
## ARHGEF10L 907 815 663
## HIF3A 3 4 237
## RNF17 0 1 0
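As a worked check of the back-transformation, the sketch below applies it to the three values shown earlier for sample TCGA.E2.A1L7.01. The comparison with `round()` is my own addition to show that the choice of integer conversion matters for small values; the post itself uses `ceiling()`.

```r
# Values as stored in the Xena matrix: x = log2(norm_count + 1)
x <- c(ARHGEF10L = 9.8265, HIF3A = 1.7767, RNF17 = 0)

# Inverting the transform: norm_count = 2^x - 1
norm_count <- 2^x - 1

# Integer conversion: ceiling() as used in this post, round() as a possible alternative
data.frame(log2_value = x,
           back_transformed = norm_count,
           ceiling = ceiling(norm_count),
           rounded = round(norm_count))
```

The `ceiling` column reproduces the first column of the output above (907, 3, 0), while `round()` would turn the HIF3A value into 2 rather than 3.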
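During the analysis, DESeq2 stores the expression matrix in a dds object, which also holds intermediate values and is used for part of the computation. Building a dds object requires the following:
- un-normalized expression matrix (or count matrix)
- colData: stores the sample information
- design formula: specifies the variables in the model and is used to estimate the dispersions and log2 fold changes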
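One detail behind the design formula is which factor level acts as the reference, since that determines the direction of the fold change and the name of the coefficient (`group_list_tumor_vs_normal` in this post). The base-R sketch below only inspects the factor and the model matrix; it does not call DESeq2.

```r
# Same construction as the group factor used in this post
group_list <- factor(rep(c("tumor", "normal"), each = 11))

# Levels are sorted alphabetically, so "normal" becomes the reference level
levels(group_list)

# Making the reference explicit changes nothing here,
# but protects against surprises if the labels are ever renamed
group_list <- relevel(group_list, ref = "normal")

# The model matrix for ~group_list has an intercept and a tumor-vs-normal column
head(model.matrix(~ group_list), 3)
```

With "normal" as the reference, a positive log2 fold change means higher expression in tumor.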
# Transforming to dds object
group_list <- factor(rep(c('tumor','normal'), each = 11))
colData <- data.frame(row.names=colnames(dat),
group_list=group_list)
dds <- DESeqDataSetFromMatrix(countData = dat,
colData = colData,
design = ~group_list,
tidy = F)
dim(dds)
## [1] 20530 22
# filtering very low-expression data
table(rowSums(counts(dds)==0))
# keep genes with zero counts in fewer than 16 of the 22 samples (i.e. detected in at least 7 samples, about 30%)
keep <- rowSums(counts(dds)==0)< 16
dds <- dds[keep, ]
counts(dds)[1:10,1:3]
dim(dds) # more than 2,000 genes removed
## [1] 18423 22
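To see exactly what the `< 16` rule keeps, here is a small sketch on a made-up count matrix with the same number of samples (22). The gene names and counts are hypothetical.

```r
# Toy count matrix: 3 genes x 22 samples (values made up)
counts_toy <- rbind(
  geneA = rep(5, 22),                 # detected in every sample
  geneB = c(rep(0, 15), rep(3, 7)),   # zero in 15 samples, detected in 7
  geneC = c(rep(0, 16), rep(3, 6))    # zero in 16 samples, detected in 6
)

n_zero <- rowSums(counts_toy == 0)
keep   <- n_zero < 16   # same rule as above

data.frame(n_zero, n_detected = ncol(counts_toy) - n_zero, keep)
```

So the boundary sits between 7 detected samples (kept) and 6 detected samples (dropped).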
# Performing differential expression analysis
dds <- DESeq(dds)
# Extracting transformed values
vsd <- vst(dds, blind = F)
# specifying the contrast used to estimate the fold change and p-value (tumor vs normal)
contrast <- c("group_list","tumor","normal")
dd1 <- results(dds, contrast=contrast, alpha = 0.05)
plotMA(dd1, ylim=c(-2,2))
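The MA-plot visualizes the relationship between fold change and mean gene counts. Points whose adjusted p-value falls below the significance level are highlighted (the default level is 0.1; results() was called with alpha = 0.05 here), and points beyond the y-axis range are drawn as triangles.
lfcShrink shrinks the log fold changes to reduce the noise contributed by low-expression genes.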
# lfcShrink
dd3 <- lfcShrink(dds, coef = "group_list_tumor_vs_normal", res=dd1, type='apeglm')
dd3
plotMA(dd3, ylim=c(-2,2))
summary(dd3, alpha = 0.05)
## out of 18423 with nonzero total read count
## adjusted p-value < 0.05
## LFC > 0 (up) : 4054, 22%
## LFC < 0 (down) : 2837, 15%
## outliers [1] : 0, 0%
## low counts [2] : 0, 0%
## (mean count < 0)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results
# genes with fold change > 2 or < 0.5 and adjusted p < 0.05 are considered significantly differentially expressed
# which() drops any NA padj values instead of carrying them into the index
sig <- which(abs(dd3$log2FoldChange)>1 & dd3$padj<0.05)
res_sig <- dd3[sig,]
summary(res_sig)
## out of 4215 with nonzero total read count
## adjusted p-value < 0.05
## LFC > 0 (up) : 2255, 53%
## LFC < 0 (down) : 1960, 47%
## outliers [1] : 0, 0%
## low counts [2] : 0, 0%
save(dd3,res_sig,vsd, file = '.../data/TCGA_TNBC_DE.Rdata')
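In this run no adjusted p-values were missing, but DESeq2 can return `NA` for `padj`, and a logical index containing `NA` does not subset cleanly. The sketch below shows one way to guard against that on a made-up results table; the gene names and numbers are hypothetical.

```r
# Toy results table (values made up); the NA padj mimics a gene without an adjusted p-value
res_toy <- data.frame(
  log2FoldChange = c(2.3, -1.8, 0.4, 3.1, -2.2),
  padj           = c(0.001, 0.01, 0.0001, NA, 0.2),
  row.names      = paste0("gene", 1:5)
)

# A logical index containing NA would yield NA rows; which() drops them
is_sig  <- abs(res_toy$log2FoldChange) > 1 & res_toy$padj < 0.05
sig_toy <- res_toy[which(is_sig), ]

# Label the direction of change for the significant genes
sig_toy$direction <- ifelse(sig_toy$log2FoldChange > 0, "Up", "Down")
sig_toy
```

Only gene1 and gene2 pass both cutoffs; gene4 has a large fold change but no adjusted p-value, so it is left out rather than producing an `NA` row.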
Visualizing the differential analysis results
# visualization
library(ggplot2)
library(ggthemes)
res <- as.data.frame(dd3)
res$threshold <- as.factor(ifelse(res$padj < 0.05 & abs(res$log2FoldChange) >= log2(2),
                                  ifelse(res$log2FoldChange > log2(2), 'Up', 'Down'),
                                  'Not'))
plot2 <- ggplot(data=res, aes(x=log2FoldChange, y =-log10(padj), colour=threshold,fill=threshold)) +
scale_color_manual(values=c("blue", "grey","red"))+
geom_point(alpha=0.4, size=1.2) +
theme_bw(base_size = 12, base_family = "Times") +
geom_vline(xintercept=c(-1,1),lty=4,col="grey",lwd=0.6)+ # matches the |log2FC| >= 1 threshold used above
geom_hline(yintercept = -log10(0.05),lty=4,col="grey",lwd=0.6)+
theme(legend.position="right",
panel.grid=element_blank(),
legend.title = element_blank(),
legend.text= element_text(face="bold", color="black",family = "Times", size=8),
plot.title = element_text(hjust = 0.5),
axis.text.x = element_text(face="bold", color="black", size=12),
axis.text.y = element_text(face="bold", color="black", size=12),
axis.title.x = element_text(face="bold", color="black", size=12),
axis.title.y = element_text(face="bold",color="black", size=12)) +
labs( x="log2 (Fold Change)",y="-log10 (adjusted p-value)")
plot2
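Step05. Annotation
After annotation, the paper split the data into mRNA and lncRNA and analyzed the two separately. Here we do the same, using the Ensembl GTF for annotation.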
library(rtracklayer)
library(tidyr)
library(dplyr)
library(pheatmap)
require(org.Hs.eg.db)
gtf1 <- import('data/Homo_sapiens.GRCh38.98.chr.gtf')
gtf_df <- as.data.frame(gtf1)
colnames(gtf_df)
# extracting "gene_id" ,"gene_biotype"
gtf <- gtf_df[,c("gene_id","gene_biotype")] # columns 10 and 14; selecting by name is robust to column order
head(gtf)
save(gtf,file = "data/Homo_sapiens.GRCh38.98.chr.Rdata")
keytypes(org.Hs.eg.db)
res_sig$gene_names <- rownames(res_sig)
# ID transformation
res_id <- clusterProfiler::bitr(res_sig$gene_names,
fromType = 'SYMBOL',
toType = "ENSEMBL",
OrgDb = 'org.Hs.eg.db')
k <- res_id[res_id$ENSEMBL %in% gtf$gene_id, 2] %>%
match(gtf$gene_id)
id_keep <- gtf[k,]
colnames(res_id) <- c("gene_names",'gene_id')
id_keep <- merge(id_keep, res_id, by='gene_id')
table(id_keep$gene_biotype)
## lncRNA polymorphic_pseudogene
## 30 2
## processed_pseudogene transcribed_processed_pseudogene
## 1 6
## protein_coding snoRNA
## 3650 3
## transcribed_unitary_pseudogene transcribed_unprocessed_pseudogene
## 6 15
res_ord <- as.data.frame(res_sig[order(res_sig$padj),])
# extracting mRNA and lncRNA results respectively
res_mrna <- id_keep[id_keep$gene_biotype=='protein_coding',] %>%
merge(as.data.frame(res_ord), by = "gene_names")
res_lncrna <- id_keep[id_keep$gene_biotype=='lncRNA',] %>%
merge(as.data.frame(res_ord), by = "gene_names")
save(res_ord,res_mrna,res_lncrna, file = '.../data/TCGA_annotation_results.Rdata')
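The annotation logic above boils down to a join between a symbol-to-Ensembl map and a gene-to-biotype table. The sketch below reproduces that idea on tiny made-up tables; all gene names and `ENSG_TOY_*` IDs are hypothetical placeholders, and collapsing the GTF to one row per gene before merging is my own simplification.

```r
# Toy symbol -> Ensembl mapping (hypothetical IDs)
id_map <- data.frame(
  gene_names = c("GENE1", "GENE2", "GENE3"),
  gene_id    = c("ENSG_TOY_1", "ENSG_TOY_2", "ENSG_TOY_3"),
  stringsAsFactors = FALSE
)

# Toy GTF-derived table: one row per feature, so gene IDs repeat
gtf_toy <- data.frame(
  gene_id      = c("ENSG_TOY_1", "ENSG_TOY_1", "ENSG_TOY_2", "ENSG_TOY_4"),
  gene_biotype = c("protein_coding", "protein_coding", "lncRNA", "snoRNA"),
  stringsAsFactors = FALSE
)

# Collapse to one row per gene before joining
gene_biotype <- unique(gtf_toy[, c("gene_id", "gene_biotype")])

# Inner join keeps only genes present in both tables (GENE3 drops out)
annotated <- merge(id_map, gene_biotype, by = "gene_id")

# Split by biotype, as done for mRNA and lncRNA in this step
mrna   <- annotated[annotated$gene_biotype == "protein_coding", ]
lncrna <- annotated[annotated$gene_biotype == "lncRNA", ]
table(annotated$gene_biotype)
```

Genes that fail the symbol conversion or are absent from the GTF silently disappear in the join, which is one place where lncRNAs can be lost.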
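Step06. Enrichment analysis
Enrichment analysis and its visualization are done with clusterProfiler. Because KEGG recognizes "ENTREZID" identifiers, another ID conversion is performed before the analysis. The conversion produced "ENSEMBL"--"ENTREZID" multi-mappings, so the redundant IDs were removed.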
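The redundancy removal mentioned here can be illustrated on a small made-up mapping in the two-column shape that `bitr()` returns. The IDs are hypothetical, and keeping the first Entrez ID per Ensembl ID is a simple but arbitrary choice.

```r
# Toy mapping in the shape returned by bitr(); IDs are hypothetical
deid_toy <- data.frame(
  ENSEMBL  = c("ENSG_TOY_1", "ENSG_TOY_2", "ENSG_TOY_2", "ENSG_TOY_3"),
  ENTREZID = c("101", "202", "203", "304"),
  stringsAsFactors = FALSE
)

# Which Ensembl IDs map to more than one Entrez ID?
multi <- names(which(table(deid_toy$ENSEMBL) > 1))
multi

# Keep the first Entrez ID per Ensembl ID
deid_unique <- deid_toy[!duplicated(deid_toy$ENSEMBL), ]
deid_unique
```

Listing the multi-mapped IDs first makes it easy to check whether any gene of particular interest is affected by the choice.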
library(clusterProfiler)
library(org.Hs.eg.db)
library(ggplot2)
library(RColorBrewer)
library(gridExtra)
library(enrichplot)
deid <- bitr(res_mrna$gene_id,
fromType = "ENSEMBL",
toType = "ENTREZID",
OrgDb = 'org.Hs.eg.db')
## 'select()' returned 1:many mapping between keys and columns
deid <- deid[!duplicated(deid$ENSEMBL),]
# CC and MF not shown
ego_BP <- enrichGO(gene = deid$ENTREZID,
OrgDb = org.Hs.eg.db,
keyType = "ENTREZID",
ont = "BP",
pvalueCutoff = 0.05,
qvalueCutoff = 0.05,
readable = TRUE)
dotplot(ego_BP, showCategory = 20,font.size = 8)
ego_KEGG <- enrichKEGG(gene = deid$ENTREZID, organism = "hsa",
keyType = 'kegg',
pvalueCutoff = 0.05,
pAdjustMethod = "BH",
minGSSize = 10, maxGSSize = 500,
qvalueCutoff = 0.05,
use_internal_data = FALSE)
dotplot(ego_KEGG, showCategory = 20,font.size = 8)
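As I understand it, the over-representation test behind `enrichGO` and `enrichKEGG` is based on the hypergeometric distribution. The sketch below works through one term by hand with hypothetical numbers (they are not taken from this analysis) and checks the result against a one-sided Fisher's exact test.

```r
# Hypothetical numbers, chosen only for illustration
N <- 20000  # background genes
M <- 200    # background genes annotated to the term
n <- 1000   # differentially expressed genes submitted
k <- 30     # submitted genes annotated to the term

# P(X >= k) under the hypergeometric distribution
p_hyper <- phyper(k - 1, M, N - M, n, lower.tail = FALSE)

# The same test written as a one-sided Fisher's exact test on the 2x2 table
tab <- matrix(c(k, M - k, n - k, N - M - n + k), nrow = 2)
p_fisher <- fisher.test(tab, alternative = "greater")$p.value

c(hypergeometric = p_hyper, fisher = p_fisher)
```

With 1000 of 20000 genes submitted, about 10 of the 200 term genes would be expected by chance, so an overlap of 30 gives a small p-value. The real functions additionally adjust for testing many terms (the `pAdjustMethod` argument above).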
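That concludes this reproduction for now. Possibly because of changes in the dataset or in the analysis code, only 30 significantly differentially expressed lncRNAs were annotated, far fewer than the 1,211 reported in the paper. In principle, the differential analysis results should be annotated first, and the cutoff applied afterwards to identify the significantly differentially expressed genes. But even when annotating first, only 143 lncRNAs can be found in this dataset, which makes me wonder: are the differences between datasets really this large? Or is this caused by differences between the samples?
# dd3 holds the differential analysis results after lfcShrink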
dd3$gene_names <- rownames(dd3)
res_id2 <- clusterProfiler::bitr(dd3$gene_names,
fromType = 'SYMBOL',
toType = "ENSEMBL",
OrgDb = 'org.Hs.eg.db')
dim(res_id2)
## [1] 17830 2
k2 <- res_id2[res_id2$ENSEMBL %in% gtf$gene_id, 2] %>%
match(gtf$gene_id)
id_keep2 <- gtf[k2,]
colnames(res_id2) <- c("gene_names",'gene_id')
id_keep2 <- merge(id_keep2, res_id2, by='gene_id')
table(id_keep2$gene_biotype)
## lncRNA misc_RNA
## 143 1
## polymorphic_pseudogene processed_pseudogene
## 12 12
## protein_coding ribozyme
## 15534 1
## scaRNA snoRNA
## 5 26
## TEC TR_C_gene
## 3 1
## transcribed_processed_pseudogene transcribed_unitary_pseudogene
## 33 14
## transcribed_unprocessed_pseudogene unitary_pseudogene
## 79 1
## unprocessed_pseudogene
## 3
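Whatever the outcome, this analysis workflow is well worth learning. I will come back to fill in the open questions once they are resolved! Finally, thanks again to 生信技能樹 (vx: biotrainee) for sharing; go and follow them!
Follow-up
After consulting jimmy from 生信技能樹, some of the questions in this post were answered. The main reason so few lncRNAs were annotated is that the expression matrix used here is the RSEM normalized count matrix, which contains few non-coding genes to begin with, so few lncRNAs can be annotated. However, I have not yet found a whole-transcriptome expression matrix in the TCGA Breast Cancer (BRCA) dataset; its RNA-seq data were mostly generated with polyA+ IlluminaHiSeq, meaning that essentially only mRNA was sequenced. miRNA data are available, but I have not found whole-transcriptome data; if anyone finds it, please let me know.
Added on 2019/10/13
End.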