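Analyzing strand-specific data with deeptools and a custom R script

Without accumulating small steps, one cannot cover a thousand miles.

deeptools is one of the most popular toolkits for analyzing deep-sequencing data, as its GitHub star count and conda install numbers show, and many readers will already know it well.

What is easy to overlook is that deeptools is essentially a tool for non-strand-specific data, which its own description hints at: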

deepTools addresses the challenge of handling the large amounts of data that are now routinely generated from DNA sequencing centers.

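Sometimes, though, we really do need deeptools to handle strand-specific data (nascent RNA sequencing data, for example). What can we do then?

A very simple scenario: we want to profile how nascent RNA-seq signal is distributed over genes. Here we clearly must not count transcripts from a minus-strand gene towards a plus-strand gene that occupies the same region. One straightforward approach is:

  • First, split the BAM file by strand into a plus-strand and a minus-strand BAM file;

  • Then run computeMatrix with the plus-strand signal (the coverage track built from the plus-strand BAM file) against a BED file of plus-strand genes, and do the same for the minus strand;

  • Finally, merge the plus- and minus-strand computeMatrix results.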
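
The walkthrough uses a per-strand annotation file (genes.p.bed) without showing how it is made. Here is a minimal sketch, assuming a tab-separated BED6 file with the strand in column 6. The function name split_bed_by_strand and the .p.bed/.m.bed suffixes are my own naming, chosen to match genes.p.bed:

```shell
# Split a BED6 annotation into plus- and minus-strand files.
# Assumption: tab-separated BED with the strand ("+" or "-") in column 6.
split_bed_by_strand() {
    bed="$1"       # input annotation, e.g. ref/genes.bed
    prefix="$2"    # output prefix, e.g. genes
    awk -F'\t' -v p="${prefix}.p.bed" -v m="${prefix}.m.bed" '
        $6 == "+" { print > p }    # plus-strand genes
        $6 == "-" { print > m }    # minus-strand genes
    ' "$bed"
}

# Example call (hypothetical paths):
# split_bed_by_strand ref/genes.bed genes
```

Lines without a valid strand (track lines, "." strands) end up in neither file, which is usually what you want for a strand-aware profile.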

Example
  • Data source: GSE38140
  • Data type: GRO-seq

First, download the data and run quality control:

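#-- data download
prefetch SRR828695
# single-end run: fastq-dump writes SRR828695.fastq to the current directory
fastq-dump --split-3 SRR828695/SRR828695.sra
#-- FastQC on the raw reads
mkdir -p fastqc
fastqc SRR828695.fastq -o fastqc/
#-- Quality Control (with --basename the trimmed reads are hct116_trimmed.fq)
trim_galore --fastqc \
  --fastqc_args "-o fastqc" \
  --small_rna \
  --basename hct116 \
  SRR828695.fastq

Then align the reads with bowtie2:

#-- index
index="ref/bowtie2/hg38"
#-- mapping: keep MAPQ >= 20 and coordinate-sort (markdup needs sorted input)
bowtie2 --local -U hct116_trimmed.fq -x "$index" \
  | samtools view -q 20 -b - \
  | samtools sort -o hct116.bam -
#-- remove duplicates (optional)
samtools markdup -r --output-fmt BAM hct116.bam hct116.flt.bam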

Check the strand specificity:

infer_experiment.py -i hct116.flt.bam -r ref/genes.bed
This is SingleEnd Data
Fraction of reads failed to determine: 0.0881
Fraction of reads explained by "++,--": 0.7555
Fraction of reads explained by "+-,-+": 0.1565

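So this is stranded data: most reads fall in the "++,--" class, meaning they map to the same strand as the gene they overlap.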
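
If you run this check on many samples, reading the fractions by eye gets tedious. Below is a small sketch that classifies the single-end output of infer_experiment.py. The function name classify_strandedness and the 0.7 cutoff are my own assumptions, not something RSeQC defines, so adjust the threshold to your data:

```shell
# Classify single-end infer_experiment.py output saved in a file.
# $1: file holding the output; $2: cutoff fraction (assumed default 0.7).
classify_strandedness() {
    awk -F': ' -v thr="${2:-0.7}" '
        /"\+\+,--"/ { same = $2 }   # reads on the same strand as the gene
        /"\+-,-\+"/ { opp  = $2 }   # reads on the opposite strand
        END {
            if (same + 0 >= thr + 0)     { print "sense" }
            else if (opp + 0 >= thr + 0) { print "antisense" }
            else                         { print "unstranded" }
        }
    ' "$1"
}

# Example call (hypothetical file name):
# infer_experiment.py -i hct116.flt.bam -r ref/genes.bed > strand.txt
# classify_strandedness strand.txt
```

With the fractions shown above (0.7555 for "++,--") this prints sense, which is why plus-strand reads are paired with plus-strand genes in the following steps.
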
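Split the reads by strand and build one coverage track per strand:

# plus strand: reads without FLAG bit 16; minus strand: reads with it
samtools view -F 16 -b -o hct116.p.bam hct116.flt.bam
samtools view -f 16 -b -o hct116.m.bam hct116.flt.bam
# bamCoverage needs indexed BAM files
samtools index hct116.p.bam
samtools index hct116.m.bam
bamCoverage --bam hct116.p.bam --outFileName hct116.p.bw --binSize 1 --numberOfProcessors 5 --normalizeUsing CPM
bamCoverage --bam hct116.m.bam --outFileName hct116.m.bw --binSize 1 --numberOfProcessors 5 --normalizeUsing CPM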
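
The split relies on FLAG bit 16 (0x10), which marks a reverse-strand alignment. As a quick sanity check, the sketch below counts reads per strand straight from SAM text. count_strands is a hypothetical helper of mine; the two numbers should add up to the number of reads in hct116.flt.bam:

```shell
# Count reads per strand from SAM records on stdin.
# Bit 16 of the FLAG column (column 2) marks a reverse-strand alignment,
# which is what `-f 16` keeps and `-F 16` discards.
count_strands() {
    awk -F'\t' '
        /^@/ { next }                            # skip SAM header lines
        int($2 / 16) % 2 == 1 { minus++; next }  # bit 16 set
        { plus++ }                               # bit 16 not set
        END { printf "plus=%d minus=%d\n", plus, minus }
    '
}

# Example call (needs samtools):
# samtools view hct116.flt.bam | count_strands
```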

Finally, run computeMatrix for each strand separately:

computeMatrix scale-regions -R genes.p.bed \
  -S hct116.p.bw \
  -m 10000 \
  -a 5000 \
  -b 5000 \
  -p 3 \
  --binSize 100 \
  --skipZeros \
  --outFileName tmp.gz \
  --outFileNameMatrix p.txt

The minus strand is handled the same way.
Do not forget --outFileNameMatrix here: this text file is the input for the R script that follows.
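
To avoid copy-paste slips between the two strands, the same command can be generated from a strand tag. This is only a sketch: genes.m.bed is assumed to follow the genes.p.bed naming, and I write tmp.<tag>.gz instead of tmp.gz so that the second run does not overwrite the first. matrix_cmd is a hypothetical helper that prints the command rather than running it:

```shell
# Build the computeMatrix command for one strand tag ("p" or "m").
# File names follow the pattern genes.<tag>.bed and hct116.<tag>.bw.
matrix_cmd() {
    tag="$1"
    echo "computeMatrix scale-regions" \
        "-R genes.${tag}.bed -S hct116.${tag}.bw" \
        "-m 10000 -a 5000 -b 5000 -p 3 --binSize 100 --skipZeros" \
        "--outFileName tmp.${tag}.gz --outFileNameMatrix ${tag}.txt"
}

for tag in p m; do
    matrix_cmd "$tag"            # dry run: print the command only
    # matrix_cmd "$tag" | sh     # uncomment to run it (needs deeptools)
done
```

This yields p.txt and m.txt, the two files passed to the R script in the usage example.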

Merging the results

To merge the resulting profiles I wrote an R command-line tool. Feel free to use it, and to help me debug it. The source code is as follows:

#!/usr/bin/env Rscript

#--
#@Author: Kun-Ming Shui, School of Life Sciences, Nanjing University (NJU).
#@Contribution list: ...

suppressMessages(library(argparse))
suppressMessages(library(ggplot2))
suppressMessages(library(patchwork))
suppressMessages(library(pheatmap))
suppressMessages(library(tidyr))
suppressMessages(library(dplyr))
suppressMessages(library(purrr))
suppressMessages(library(forcats))

parser <- ArgumentParser(prog = 'deeptools2r.R',
             description = 'This tool can help you visualize deeptools computeMatrix output in R.',
             epilog = 'Kun-Ming Shui, skm@smail.nju.edu.cn')

parser$add_argument('--version', '-v', action = 'version', version = '%(prog)s 1.0.0')
parser$add_argument('--input', '-i', nargs = '+', help = 'the computeMatrix --outFileNameMatrix output file(s), multiple files should be separated by spaces.', required = TRUE)
parser$add_argument('--output', '-o', help = 'the output file name, "deeptools2r.out.pdf" by default.', default = 'deeptools2r.out.pdf')
parser$add_argument('--averageType', '-t', help = 'the type of statistic used for the profile, "mean" by default.', default = 'mean', choices = c('mean', 'max', 'min', 'median', 'sum'))
parser$add_argument('--plotType', help = 'the plot type for profile, "line" by default.', default = 'line', choices = c('line', 'heatmap', 'both'))
parser$add_argument('--colors', nargs = '+', help = 'the colors used for plot lines, multiple colors should be separated by spaces and their number should equal the number of groups, "None" by default.', default = NULL, required = FALSE)
parser$add_argument('--group', '-g', nargs = '+', help = 'group information for INPUT FILE, an important function of this tool is to combine profile data from forward and reverse strand. For example, if you have the file list: r1.fwd.tab r1.rev.tab r2.tab, you should pass "-g r1_f r1_r r2" to this argument. All in all, profile data from one sample but different strands should be tagged with the same group name and a different strand suffix (_f or _r).', required = TRUE)
parser$add_argument('--startLabel', help = '[Only for scale-regions mode] Label shown in the plot for the start of the region, "TSS" by default.', default = 'TSS', required = FALSE)
parser$add_argument('--endLabel', help = '[Only for scale-regions mode] Label shown in the plot for the end of the region, "TES" by default.', default = 'TES', required = FALSE)
parser$add_argument('--refPointLabel', help = '[Only for reference-point mode] Label shown in the plot for the center of the region', default = 'center', required = FALSE)
parser$add_argument('--yMax', help = 'Maximum value for Y-axis, "None" by default.', type = 'double', default = NULL, required = FALSE)
parser$add_argument('--yMin', help = 'Minimum value for Y-axis, "None" by default.', type = 'double', default = NULL, required = FALSE)
parser$add_argument('--width', help = 'Width value for line plot, 0.7 by default', type = 'double', default = 0.7, required = FALSE)
parser$add_argument('--plotHeight', help = 'Plot height in inch, 5 by default.', default = 5, type = 'double', required = FALSE)
parser$add_argument('--plotWidth', help = 'Plot width in inch, 7 by default.', default = 7, type = 'double', required = FALSE)

args <- parser$parse_args()

groups <- args$group
FILES <- args$input

#--only the line plot is implemented so far
if(args$plotType != 'line') warning('Only "--plotType line" is supported at the moment, drawing a line plot.')

#--group information
if(length(groups) != length(FILES)) stop('The number of groups does not match the number of input files.')
#-strip the strand suffix (_f or _r) so that both strands of a sample share one group
strip_strand <- function(group) sub(pattern = '_[fr]$', replacement = '', x = group)
#-group level
gp.level <- unique(strip_strand(groups))

#--load data
data <- lapply(seq_along(FILES), FUN = function(i){
           #-skip the 3 header lines and treat "nan" entries as missing values
           tmp <- read.table(file = FILES[i], header = FALSE, sep = "\t", skip = 3, na.strings = c('NA', 'nan'))
           tmp %>% mutate(group = strip_strand(groups[i]))
})
#-stack the rows of all files; both strands of a sample now carry the same group
data <- purrr::reduce(data, rbind)

#--tidy data
data <- data %>% 
  group_by(group) %>% 
  summarise_all(args$averageType, na.rm = TRUE) %>% 
  pivot_longer(cols = starts_with('V'), names_to = 'index', values_to = 'signal')

#--label
#-the second header line holds key:value fields (downstream, upstream, body, bin size)
label.info <- read.table(file = FILES[1], comment.char = "", nrows = 2, fill = TRUE, stringsAsFactors = FALSE)
header.fields <- as.character(unlist(label.info[2, ]))
#-return the numeric value of the key:value field that matches a pattern
get_size <- function(key){
    header.fields %>%
        grep(pattern = key, value = TRUE) %>%
        gsub(pattern = '#', replacement = "") %>%
        strsplit(split = ':', fixed = TRUE) %>%
        sapply('[[', 2) %>%
        as.numeric()
}
#-downstream
dw.size <- get_size('downstream')
#-upstream
up.size <- get_size('upstream')
#-body
bd.size <- get_size('body')
#-bin ("bin size:N" is split at the space, so match on "size")
binSize <- get_size('size')

#--plot
line_plot <- data %>%
    mutate(index = fct_relevel(index, paste0('V', 1:((up.size + bd.size + dw.size)/binSize))),
           group = fct_relevel(group, gp.level)) %>%
    ggplot(., aes(x = index, y = signal)) +
    geom_line(aes(group = group, color = group), linewidth = args$width) +
    xlab(label = 'Position') +
    ylab(label = 'Signal') +
    theme_classic() +
    theme(axis.text = element_text(family = 'sans', color = 'black'),
          axis.ticks = element_line(color = 'black'),
          axis.title = element_text(family = 'sans', face = 'bold'))

#-label
if(bd.size != 0){
    startLabel <- ifelse(is.null(args$startLabel), 'TSS', args$startLabel)
    endLabel <- ifelse(is.null(args$endLabel), 'TES', args$endLabel)
    breakPoints <- c('V1', paste0('V', c(up.size/binSize, (up.size + bd.size)/binSize)), paste0('V', (up.size + bd.size + dw.size)/binSize))
    line_plot <- line_plot + 
        scale_x_discrete(breaks = breakPoints, labels = c(paste0("-", up.size/1000, ' kb'), startLabel, endLabel, paste0(dw.size/1000, ' kb')))
}else{
    refPointLabel <- ifelse(is.null(args$refPointLabel), 'center', args$refPointLabel)
    breakPoints <- c('V1', paste0('V', up.size/binSize), paste0('V', (up.size + dw.size)/binSize))
    line_plot <- line_plot +
        scale_x_discrete(breaks = breakPoints, labels = c(paste0("-", up.size/1000, ' kb'), refPointLabel, paste0(dw.size/1000, ' kb')))
}

#--Y-region
if(!is.null(args$yMin) && !is.null(args$yMax)){
    line_plot <- line_plot + 
        coord_cartesian(ylim = c(args$yMin, args$yMax))
}

#--colors
if(!is.null(args$colors) && length(args$colors) == length(gp.level)){
    line_plot <- line_plot + 
        scale_color_manual(values = args$colors)
}else if(!is.null(args$colors) && length(args$colors) != length(gp.level)){
    warning("The number of colors does not match the number of groups, using the default palette instead.")
}

plot <- line_plot
ggsave(filename = args$output, plot = plot, width = args$plotWidth, height = args$plotHeight, units = 'in')

cat(paste0('Finished at ', date(), '!\n'))
q(save = 'no')
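
One thing to keep in mind: the script takes the region layout (upstream, body, downstream, bin size) from the header of the first input file only, so it quietly assumes that every input was produced with the same computeMatrix settings. A minimal pre-flight check, assuming 3 header lines followed by tab-separated rows (the same assumption as skip = 3 in the script). check_matrices is a hypothetical helper, not part of deeptools or of the script:

```shell
# Check that computeMatrix text matrices share one layout.
# Assumption: 3 header lines, then tab-separated rows with one column per bin.
check_matrices() {
    awk -F'\t' '
        # remember the column count of the first data row of each file
        FNR > 3 && !(FILENAME in ncol) { ncol[FILENAME] = NF }
        END {
            for (f in ncol) {
                if (first == "") first = ncol[f]
                if (ncol[f] != first) bad = 1
                printf "%s\t%d bins\n", f, ncol[f]
            }
            exit bad + 0     # non-zero exit status if the files disagree
        }
    ' "$@"
}

# Example call:
# check_matrices p.txt m.txt && echo "layouts match"
```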

Please install the dependencies first; they are all common packages:

argparse
ggplot2
patchwork
pheatmap
tidyr
dplyr
purrr
forcats

Note that the heatmap is still under development and not supported yet; I will post the update here as soon as it is ready.
Usage:

deeptools2r.R --help
usage: deeptools2r.R [-h] [--version] --input INPUT [INPUT ...]
                     [--output OUTPUT]
                     [--averageType {mean,max,min,median,sum}]
                     [--plotType {line,heatmap,both}]
                     [--colors COLORS [COLORS ...]] --group GROUP [GROUP ...]
                     [--startLabel STARTLABEL] [--endLabel ENDLABEL]
                     [--refPointLabel REFPOINTLABEL] [--yMax YMAX]
                     [--yMin YMIN] [--width WIDTH] [--plotHeight PLOTHEIGHT]
                     [--plotWidth PLOTWIDTH]

This tool can help you visualize deeptools computeMatrix output in R.

optional arguments:
  -h, --help            show this help message and exit
  --version, -v         show program's version number and exit
  --input INPUT [INPUT ...], -i INPUT [INPUT ...]
                        the computeMatrix --outFileNameMatrix output file(s),
                        multiple files should be separated by spaces.
  --output OUTPUT, -o OUTPUT
                        the output file name, "deeptools2r.out.pdf" by
                        default.
  --averageType {mean,max,min,median,sum}, -t {mean,max,min,median,sum}
                        the type of statistic used for the profile, "mean" by
                        default.
  --plotType {line,heatmap,both}
                        the plot type for profile, "line" by default.
  --colors COLORS [COLORS ...]
                        the colors used for plot lines, multiple colors should
                        be separated by spaces and their number should equal
                        the number of groups, "None" by default.
  --group GROUP [GROUP ...], -g GROUP [GROUP ...]
                        group information for INPUT FILE, an important
                        function of this tool is to combine profile data from
                        forward and reverse strand. For example, if you have
                        the file list: r1.fwd.tab r1.rev.tab r2.tab, you
                        should pass "-g r1_f r1_r r2" to this argument. All in
                        all, profile data from one sample but different
                        strands should be tagged with the same group name and
                        a different strand suffix (_f or _r).
  --startLabel STARTLABEL
                        [Only for scale-regions mode] Label shown in the plot
                        for the start of the region, "TSS" by default.
  --endLabel ENDLABEL   [Only for scale-regions mode] Label shown in the plot
                        for the end of the region, "TES" by default.
  --refPointLabel REFPOINTLABEL
                        [Only for reference-point mode] Label shown in the
                        plot for the center of the region
  --yMax YMAX           Maximum value for Y-axis, "None" by default.
  --yMin YMIN           Minimum value for Y-axis, "None" by default.
  --width WIDTH         Width value for line plot, 0.7 by default
  --plotHeight PLOTHEIGHT
                        Plot height in inch, 5 by default.
  --plotWidth PLOTWIDTH
                        Plot width in inch, 7 by default.

Kun-Ming Shui, skm@smail.nju.edu.cn
Let's give the tool a quick try:
deeptools2r.R --input m.txt p.txt \
    --output deeptools.pdf \
    --group hct116_r hct116_f \
    --plotHeight 2 \
    --plotWidth 3 \
    --colors '#045a8d'


The resulting profile matches the classic GRO-seq signal distribution.

P.S.

I never hold back on sharing code when I write. Partly I want us all to improve together, and partly I hope you will help me debug. You are welcome to try out these little tools.
