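Looping DESeq2 for differential expression analysis across multiple groups
When an RNA-seq experiment has several groups, I sometimes need differential expression results for many pairs of groups. For example, with the four groups CIM0/CIM7/CIM14/CIM28, I need the differential expression for each pair: CIM7:CIM0, CIM14:CIM0, CIM14:CIM7, and so on. An ANOVA-style test can also compare multiple groups, but it is set up against a single control (CK) and does not report which specific group is differentially expressed relative to CK, so it does not meet my need precisely. I therefore run DESeq2 in a loop, once for each pair of groups.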
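1. R script
The script currently selects DEGs (differentially expressed genes) with |log2FoldChange| >= 1 (that is, log2FoldChange >= 1 or <= -1), pvalue < 0.05, and qvalue (padj) < 0.05. Change the thresholds as needed.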
### load packages (GenomicFeatures and dplyr are not used anywhere in this script)
library(DESeq2)
### input files and output paths
file_in <- "counts.txt"
file_design <- "design.txt"
file_compare <- "compare.txt"
file_deg_num = paste("data/", "DE_", file_in, sep = "")  ## number of DEGs in each comparison
file_final_csv = paste("data/", "DE_", file_in, "_Final_Out.csv", sep = "")  ## all genes, all comparisons
file_final_genelist = paste("data/", "DEG_geneid_allcomapre.txt", sep = "")  ## DEG ids, all comparisons
################################################################
########### get DEGs for each comparison
# counts file: genes in rows, samples in columns
data_in = read.table(file_in, header = TRUE, row.names = 1, check.names = FALSE)
mycompare = read.table(file_compare, header = TRUE)
mydesign = read.table(file_design, header = TRUE)
##filter counts
countData = as.data.frame(data_in)
dim(countData)
mycounts_filter <- countData[rowSums(countData) != 0,]
dim(mycounts_filter)
##check group
head(mycompare)
head(mydesign)
total_num = nrow(mycompare)  ## number of comparisons
tracking = 0
gene_num_out = c()
pvalue_cut = 0.05  ## documented cutoff; note that the filter in the loop hard-codes 0.05
condition_name = c()
for (index_num in seq_len(total_num)){
  tracking = tracking + 1
  test_name = as.character(mycompare[index_num, 1])
  group1 = as.character(mycompare[index_num, 2])  ## denominator (control group)
  group2 = as.character(mycompare[index_num, 3])  ## numerator (treatment group)
  sample1 = mydesign[mydesign$Group == group1,]
  sample2 = mydesign[mydesign$Group == group2,]
  allsample = rbind(sample1, sample2)
  counts_sample = as.character(allsample$counts_id)
  groupreal = as.character(allsample$Group)
  ## keep only the samples of the two groups being compared
  countData = mycounts_filter[, counts_sample]
  colData = data.frame(sample = counts_sample,
                       condition = factor(groupreal, levels = c(group1, group2)),
                       row.names = counts_sample)
  dds <- DESeqDataSetFromMatrix(countData = countData,
                                colData = colData,
                                design = ~ condition)
  dds <- DESeq(dds)
  ## log2FoldChange is group2 (numerator) relative to group1 (denominator)
  resSFtreatment <- results(dds, cooksCutoff = FALSE, contrast = c("condition", group2, group1))
  out_test = as.data.frame(resSFtreatment)
  final_each = cbind(out_test$log2FoldChange, out_test$pvalue, out_test$padj)
  rownames(final_each) = rownames(out_test)
  suffixes = c('(logFC)','(pvalue)','(Qvalue)')
  final_name = paste(test_name, '_', suffixes, sep = "")
  colnames(final_each) = final_name
  ## append this comparison's three columns to the summary table
  if (tracking == 1){
    final_table = final_each
  }else{
    final_table = cbind(final_table, final_each)
  }
  ## DEGs: pvalue < 0.05, padj < 0.05 and |log2FoldChange| >= 1; which() skips rows where the test is NA
  gene_sel = out_test[which(out_test$pvalue < 0.05 & out_test$padj < 0.05 & abs(out_test$log2FoldChange) >= 1), ]
  gene_sel <- na.omit(gene_sel)  ## drop any remaining rows that contain NA
  gene_sel_up = gene_sel[gene_sel$log2FoldChange > 0,]
  gene_sel_do = gene_sel[gene_sel$log2FoldChange < 0,]
  file_out_up = paste("data/DEGsid/", "UP_", test_name, ".txt", sep = "")
  file_out_do = paste("data/DEGsid/", "DOWN_", test_name, ".txt", sep = "")
  gene_list_up = rownames(gene_sel_up)
  gene_list_do = rownames(gene_sel_do)
  all_ub_down = list(gene_list_up, gene_list_do)
  nameup <- paste(test_name, "_up", sep = "")
  namedown <- paste(test_name, "_down", sep = "")
  names(all_ub_down) = c(nameup, namedown)
  ## collect the up/down gene id lists of every comparison
  if (tracking == 1){
    final_genelist = all_ub_down
  }else{
    final_genelist = c(final_genelist, all_ub_down)
  }
  gene_num_up = length(gene_list_up)
  gene_num_do = length(gene_list_do)
  write.table(gene_sel_up, file = file_out_up, row.names = TRUE, col.names = TRUE)
  write.table(gene_sel_do, file = file_out_do, row.names = TRUE, col.names = TRUE)
  condition_name = c(condition_name, paste("UP_", test_name, sep = ""), paste("DO_", test_name, sep = ""))
  gene_num_out = c(gene_num_out, gene_num_up, gene_num_do)
}
## pad every gene list with NA up to the longest one so the lists can be bound as columns
final_DEGs_list <- do.call(cbind, lapply(lapply(final_genelist, unlist), `length<-`, max(lengths(final_genelist))))
final_DEGs_list
out_final2 = cbind(condition_name, gene_num_out)
colnames(out_final2) = c("Tests", "DEG number")
write.table(final_DEGs_list, file = file_final_genelist, row.names = FALSE, sep = "\t", na = "")  ## DEG ids for all comparisons
write.table(out_final2, file = file_deg_num, row.names = FALSE, sep = "\t")  ## DEG counts per comparison
write.csv(final_table, file = file_final_csv, row.names = TRUE, quote = TRUE)  ## all genes in all comparisons
2. Input files
2.1. counts.txt
DESeq2 requires integer counts as input. If the counts come from RSEM, round the expected counts to integers.
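A minimal sketch of the rounding step, assuming the RSEM expected counts have already been merged into one gene-by-sample table. The matrix `rsem_counts`, its gene ids, sample names, and values are all invented for illustration, and the file is written to a temporary folder rather than the working directory.

```r
# Hypothetical gene-by-sample matrix of RSEM expected counts (values invented).
rsem_counts <- matrix(c(10.4, 0, 3.2, 250.51, 0, 7.49),
                      nrow = 3,
                      dimnames = list(c("gene1", "gene2", "gene3"),
                                      c("CIM0_1", "CIM7_1")))

# round() keeps the dimnames; storage.mode<- then converts doubles to integers.
int_counts <- round(rsem_counts)
storage.mode(int_counts) <- "integer"

# Write a tab-delimited table with gene ids as row names, which is the layout
# that read.table(header = TRUE, row.names = 1) in the script can read back.
out_file <- file.path(tempdir(), "counts.txt")
write.table(int_counts, file = out_file, sep = "\t", quote = FALSE)
```

Rounding is one common choice; whether it suits a given data set is left to the reader.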
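2.2. design.txt
The file has four columns, in the following format:
sample_id: the sample id;
counts_id: the sample id used as the column name in the counts file;
Group: the group the sample belongs to;
rep: the replicate number.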
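2.3. compare.txt
Specifies the groups to compare:
Test: the name of the comparison;
denominator: the control group;
numerator: the treatment group.
Note: the control group comes first and the treatment group second. For example, to find the genes differentially expressed in CIM7 relative to CIM0, denominator is CIM0 and numerator is CIM7.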
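The two tables described above can also be generated rather than typed by hand. The sketch below builds every pairwise comparison of the four example groups with `combn()` and checks it against a design table. The replicate count, the sample ids, and the `_vs_` naming of the Test column are assumptions made for illustration, not part of the original files.

```r
# Group names from the example in the text, ordered from earliest to latest time point.
groups <- c("CIM0", "CIM7", "CIM14", "CIM28")

# combn() returns one column per unordered pair; row 1 holds the earlier group.
pairs <- combn(groups, 2)

# Earlier group as denominator (control), later group as numerator (treatment).
compare <- data.frame(Test = paste(pairs[2, ], pairs[1, ], sep = "_vs_"),
                      denominator = pairs[1, ],
                      numerator = pairs[2, ],
                      stringsAsFactors = FALSE)

# Hypothetical design table with two replicates per group (sample ids invented).
design <- data.frame(sample_id = paste0("S", 1:8),
                     counts_id = paste0(rep(groups, each = 2), "_", 1:2),
                     Group = rep(groups, each = 2),
                     rep = rep(1:2, times = 4),
                     stringsAsFactors = FALSE)

# Every group named in the comparison table must exist in the design table.
missing_groups <- setdiff(c(compare$denominator, compare$numerator), design$Group)
stopifnot(length(missing_groups) == 0)

# Write both tables with a header row, as read.table(header = TRUE) expects.
write.table(compare, file.path(tempdir(), "compare.txt"),
            sep = "\t", quote = FALSE, row.names = FALSE)
write.table(design, file.path(tempdir(), "design.txt"),
            sep = "\t", quote = FALSE, row.names = FALSE)
```

Four groups give six comparisons. The script reads the comparison table by column position, so the column order Test, denominator, numerator matters more than the header names.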
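3. Output files
Before running the script, create a data folder in the current directory and a DEGsid folder inside data to hold the outputs. When the script finishes, the data folder contains three summary files (3.1 to 3.3) and the per-comparison files in DEGsid (3.4):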
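The folders can be created from R instead of by hand. A minimal sketch follows; `base_dir` is a hypothetical variable that points at a temporary folder for the demonstration, and would be "." to match the script's relative paths.

```r
# Demo location; set base_dir <- "." to create the folders in the working directory.
base_dir <- tempdir()
deg_dir <- file.path(base_dir, "data", "DEGsid")

# recursive = TRUE also creates data/; showWarnings = FALSE keeps reruns quiet
# when the folders already exist.
dir.create(deg_dir, recursive = TRUE, showWarnings = FALSE)
```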
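3.1. DE_counts.txt_Final_Out.csv
The summary table, in the following format:
The first column is the gene id and covers all genes that passed the zero-count filter; the remaining columns give the logFC, pvalue, and qvalue for each comparison.
3.2. DE_counts.txt
The number of differentially expressed genes in each comparison.
3.3. DEG_geneid_allcomapre.txt
The gene ids of the differentially expressed genes in every comparison. This file is used to find genes shared between comparisons. "_up" marks the genes up-regulated in a comparison, and "_down" marks the genes down-regulated.
3.4. DEGsid
The DEGsid folder holds a pair of files for each comparison with the DESeq2 results for its DEGs. The "UP_" and "DOWN_" prefixes mark the up-regulated and down-regulated genes.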
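4. Finding overlapping DEGs between comparisons
With these outputs, I also want to find the DEGs shared between comparisons:
names(mydegs) returns the names of the comparisons; after choosing mycom1 and mycom2, use intersect() to get their intersection.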
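The code for this step is not shown, so the following is only a sketch of how it might look, assuming `mydegs` is a named list of gene id vectors read from DEG_geneid_allcomapre.txt. The stand-in file, its gene ids, and its comparison names are invented so that the example runs on its own.

```r
# Hypothetical stand-in for data/DEG_geneid_allcomapre.txt: one column per gene
# list, shorter columns padded with empty cells (ids and names invented).
deg_file <- file.path(tempdir(), "DEG_geneid_allcomapre.txt")
writeLines(c("CIM7_up\tCIM7_down\tCIM14_up",
             "geneA\tgeneD\tgeneA",
             "geneB\tgeneE\tgeneC",
             "geneC\t\t"), deg_file)

# Empty padding cells are read as NA and dropped from each column.
deg_table <- read.table(deg_file, header = TRUE, sep = "\t",
                        na.strings = "", stringsAsFactors = FALSE,
                        check.names = FALSE)
mydegs <- lapply(deg_table, function(ids) ids[!is.na(ids)])

names(mydegs)  # names of the available gene lists

# Pick two lists by name and intersect them.
mycom1 <- "CIM7_up"
mycom2 <- "CIM14_up"
overlap <- intersect(mydegs[[mycom1]], mydegs[[mycom2]])
overlap
```

For more than two lists, `Reduce(intersect, mydegs[c(...)])` applies the same idea across any selection of names.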