[ATAC-Seq in Practice] Part 4: Computing insert fragment length, FRiP, and IDR, plus deepTools visualization

Hi, Jia'ao here!

It's the last day of 2022, so let's keep going with ATAC-Seq!

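1 Computing insert fragment length

Common ATAC-Seq QC metrics: non-redundant, non-mitochondrial aligned fragments; alignment rate; NRF; PBC1; PBC2; number of peaks; nucleosome-free region (NFR); TSS enrichment; FRiP; and IDR (consistency between replicates).

Column 9 of each BAM record holds the signed template length (TLEN). Extract it, then tabulate and plot it in R: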

samtools view 2-ce11-2.last.bam | cut -f 9 >1.txt

apt install r-base-core

$ R
> a=read.table('1.txt')
> dim(a)
[1] 7292144       1
> png('hist.png')
> hist(as.numeric(a[,1]))
> dev.off()
> q()
hist.png

Column 9 is signed, so taking absolute values (with more bins) gives a cleaner plot:
> hist(abs(as.numeric(a[,1])), breaks=100)
hist2.png
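Before plotting, it can help to check how many fragments fall into the nucleosome-free and mono-nucleosome size ranges. The following is a minimal sketch, not part of the original workflow: the function name size_classes and the cutoffs (under 100 bp for NFR, 180 to 247 bp for mono-nucleosome) are illustrative assumptions, and the toy input stands in for 1.txt.

```shell
#!/usr/bin/env bash
# Summarize fragment sizes from a one-column file of TLEN values
# (e.g. the output of: samtools view x.bam | cut -f 9).
# The cutoffs below are illustrative assumptions, not fixed standards:
#   < 100 bp    -> nucleosome-free (NFR)
#   180-247 bp  -> mono-nucleosome
size_classes() {
  awk '
    {
      len = ($1 < 0) ? -$1 : $1      # TLEN is signed; use the absolute value
      if (len == 0) next             # 0 means no template length is available
      total++
      if (len < 100) nfr++
      else if (len >= 180 && len <= 247) mono++
    }
    END {
      printf "total=%d nfr=%d mono=%d\n", total, nfr, mono
    }' "$1"
}

# Toy input standing in for 1.txt (made-up values)
toy=$(mktemp)
printf '%s\n' 50 -50 75 -200 200 0 400 > "$toy"
size_classes "$toy"
rm -f "$toy"
```

For properly paired reads both mates carry the same length with opposite signs, so each fragment is counted twice; the proportions between classes are unaffected.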

Batch script

## Create a file named config.last.bam listing each BAM file and its sample name
2-cell-1.last.bam 2-cell-1.last
2-cell-2.last.bam 2-cell-2.last
2-cell-4.last.bam 2-cell-4.last
2-cell-5.last.bam 2-cell-5.last

## Extract column 9 (the insert size, TLEN; not indels, despite the file names) from each BAM file
cat config.last.bam | while read id;
do
arr=($id)
sample=${arr[0]}
sample_name=${arr[1]}
samtools view $sample | awk '{print $9}'  > ${sample_name}.length.txt
done

## Prepare config.indel.length.distribution, the input list used to plot the insert-size distributions in batch with R
2-cell-1.last.length.txt 2-cell-1.last.length
2-cell-2.last.length.txt 2-cell-2.last.length
2-cell-4.last.length.txt 2-cell-4.last.length
2-cell-5.last.length.txt 2-cell-5.last.length

## With the file above, plots can be produced for all BAM files. The batch shell script:
cat config.indel.length.distribution  | while read id;
do
arr=($id)
input=${arr[0]}
output=${arr[1]}
Rscript indel.length.distribution.R $input $output
done
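The two loops above split each config line into an array. A slightly tidier variant, sketched here as an assumption rather than the author's method, lets read assign both fields directly. The helper name plan_lengths is hypothetical, and it only prints the commands (a dry run) so it can be tried without samtools installed.

```shell
#!/usr/bin/env bash
# Read "bam_file sample_name" pairs, one per line, and print the command
# that would be run for each pair (dry run; remove the echo to execute).
plan_lengths() {
  local config=$1 bam name
  while read -r bam name; do
    [ -z "$bam" ] && continue            # skip blank lines
    echo "samtools view $bam | cut -f 9 > $name.length.txt"
  done < "$config"
}

# Toy config with the same layout as config.last.bam
cfg=$(mktemp)
printf '%s\n' '2-cell-1.last.bam 2-cell-1.last' \
              '2-cell-2.last.bam 2-cell-2.last' > "$cfg"
plan_lengths "$cfg"
rm -f "$cfg"
```

Reading the fields with read -r avoids the unquoted array expansion, which would otherwise be subject to glob expansion if a line contained wildcard characters.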

##indel.length.distribution.R
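# Usage: Rscript indel.length.distribution.R <input.length.txt> <output prefix>
cmd=commandArgs(trailingOnly=TRUE)
input=cmd[1]; output=cmd[2]
# Column 9 is signed, so take absolute values
a=abs(as.numeric(read.table(input)[,1]))
# Append .png so the output file has a proper extension
png(file=paste0(output,".png"))
hist(a,
main="Insert Size distribution",
ylab="Read Count",xlab="Insert Size",
xaxt="n",
# +10 so that the last bin always covers max(a)
breaks=seq(0,max(a)+10,by=10)
)

axis(side=1,
at=seq(0,max(a),by=100),
labels=seq(0,max(a),by=100)
)

dev.off()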

2 Computing FRiP

fraction of reads in called peak regions

Fraction of reads in peaks (FRiP) - Fraction of all mapped reads that fall into the called peak regions, i.e. usable reads in significantly enriched peaks divided by all usable reads. In general, FRiP scores correlate positively with the number of regions. (Landt et al, Genome Research Sept. 2012, 22(9): 1813–1831)

bedtools intersect -a ../align/2-ceLL-1.bed -b 2-ceLL-1_peaks.narrowPeak |wc -l
148210

wc ../align/2-ceLL-1.bed
5105844
wc ../align/2-ceLL-1.raw.bed
5105844

ls *narrowPeak|while  read id;
do 
echo $id
bed=../align/$(basename $id "_peaks.narrowPeak").raw.bed
#ls -lh $bed 
Reads=$(bedtools intersect -a $bed -b $id |wc -l|awk '{print $1}')
totalReads=$(wc -l $bed|awk '{print $1}')
echo $Reads  $totalReads 
echo '==> FRiP value:' $(bc <<< "scale=2;100*$Reads/$totalReads")'%'
done 

2-ce11-2_peaks.narrowPeak
3420904 95149325
==> FRiP value: 3.59%
2-ce11-4_peaks.narrowPeak
1126859 29866961
==> FRiP value: 3.77%
2-ce11-5_peaks.narrowPeak
4259835 103697403
==> FRiP value: 4.10%
2-ceLL-1_peaks.narrowPeak
2488167 62365958
==> FRiP value: 3.98%
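The arithmetic behind these percentages can be sketched on its own. This is a minimal illustration under stated assumptions: the function name frip is hypothetical, and it uses awk instead of bc. Note that bc with scale=2 truncates (3.59 for the first sample above), whereas printf rounds (3.60 for the same counts). If a read can overlap more than one peak, adding -u to bedtools intersect reports each read at most once.

```shell
#!/usr/bin/env bash
# Print FRiP as a percentage with two decimals (rounded, not truncated).
# Arguments: reads_in_peaks total_reads
frip() {
  awk -v r="$1" -v t="$2" 'BEGIN {
    if (t <= 0) { print "NA"; exit }   # guard against an empty BED file
    printf "%.2f\n", 100 * r / t
  }'
}

# Counts reported above for 2-ce11-2_peaks.narrowPeak
frip 3420904 95149325
```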

To list only the matching .bam files and nothing else:

$ ls 2-ce11-?.raw.bam

2-ce11-2.raw.bam  2-ce11-4.raw.bam  2-ce11-5.raw.bam

R packages can be used to examine how much different peak files overlap:


if(F){
  options(BioC_mirror="https://mirrors.ustc.edu.cn/bioc/") 
  options("repos" = c(CRAN="https://mirrors.tuna.tsinghua.edu.cn/CRAN/"))
  if (!requireNamespace("BiocManager", quietly=TRUE)) install.packages("BiocManager")

  BiocManager::install('ChIPseeker')
  BiocManager::install('ChIPpeakAnno')

}

library(ChIPseeker)
library(ChIPpeakAnno)
list.files('D:/ATAC-Seq/數(shù)據(jù)/',"*.narrowPeak")
tmp=lapply(list.files('D:/ATAC-Seq/數(shù)據(jù)/',"*.narrowPeak"),function(x){
  return(readPeakFile(file.path('D:/ATAC-Seq/數(shù)據(jù)/', x))) 
})

ol <- findOverlapsOfPeaks(tmp[[1]],tmp[[4]])
png('overlapVenn.png')
makeVennDiagram(ol)
dev.off()

3 Computing IDR

Alternatively, use the dedicated tool IDR, which takes into account both the overlap between peaks and the consistency of their enrichment.

A detailed tutorial:

http://www.lxweimin.com/p/d8a7056b4294
source activate atac
# Optionally search for the package first
conda search idr
source  deactivate
## Make sure everything is installed in the py3 environment
conda  create -n py3 -y python=3 idr
conda activate py3
conda install -c bioconda idr

idr -h 
idr --samples  2-ceLL-1_peaks.narrowPeak 2-ce11-2_peaks.narrowPeak --plot

idr --samples 2-ceLL-1_peaks.narrowPeak 2-ce11-2_peaks.narrowPeak \
--input-file-type narrowPeak \
--rank p.value \
--output-file sample-idr \
--plot \
--log-output-file sample.idr.log
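Once sample-idr has been written, a common follow-up is to count the peaks that pass an IDR threshold. The sketch below rests on an assumption about the output format that should be checked against the IDR documentation: that column 5 holds the scaled score min(int(-125 * log2(IDR)), 1000), so a score of 540 or more corresponds to IDR <= 0.05. The function name idr_pass and the toy rows are made up for illustration.

```shell
#!/usr/bin/env bash
# Keep peaks whose scaled IDR score (column 5) is at least a cutoff.
# Assumption: column 5 = min(int(-125 * log2(IDR)), 1000),
# so the default cutoff of 540 corresponds to IDR <= 0.05.
idr_pass() {
  local file=$1 cutoff=${2:-540}
  awk -v c="$cutoff" '$5 >= c' "$file"
}

# Toy rows (made-up values; only the first five columns are shown)
toy=$(mktemp)
printf 'chrI\t100\t200\t.\t1000\nchrI\t300\t400\t.\t320\nchrI\t500\t600\t.\t540\n' > "$toy"
idr_pass "$toy" | wc -l
rm -f "$toy"
```

On real output the call would be idr_pass sample-idr | wc -l, with a second argument to change the cutoff.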

4 deepTools visualization

The .bam files first need to be converted to .bw (bigWig):

http://www.bio-info-trainee.com/1815.html
cd  ~/project/atac/align
source activate atac
# ls  *.bam  |xargs -i samtools index {} 
ls *last.bam |while read id;do
nohup bamCoverage -p 5 --normalizeUsing CPM -b $id -o ${id%%.*}.last.bw & 
done 
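The output name in the loop above relies on bash parameter expansion, which is easy to misread. A short sketch of the difference between the forms follows; the helper name bw_name is hypothetical and only illustrates an alternative way to build the name.

```shell
#!/usr/bin/env bash
id=2-cell-1.last.bam

echo "${id%%.*}"     # removes the longest suffix matching ".*"
echo "${id%.*}"      # removes the shortest suffix matching ".*"

# Alternative: strip only the .bam extension, then append .bw
bw_name() {
  echo "${1%.bam}.bw"
}
bw_name "$id"
```

With %%.* everything after the first dot is dropped, so a sample name that itself contains a dot would be cut short; stripping only the .bam suffix avoids that.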

cd dup 
ls  *.bam  |xargs -i samtools index {} 
ls *.bam |while read id;do
nohup bamCoverage --normalizeUsing CPM -b $id -o ${id%%.*}.rm.bw & 
done 

The .bw files can then be loaded into IGV for visual inspection.

Signal intensity around the TSS

## both -R and -S can accept multiple files 
mkdir -p  ~/project/atac/tss
cd   ~/project/atac/tss 
source activate atac

computeMatrix reference-point  --referencePoint TSS  -p 15  \
-b 10000 -a 10000    \
-R /home/kaoku/refer/mm10/ucsc.refseq.bed  \
-S /home/kaoku/project/atac/align/*.bw  \
--skipZeros  -o matrix1_test_TSS.gz  \
--outFileSortedRegions regions1_test_genes.bed

## both plotHeatmap and plotProfile will use the output from   computeMatrix
plotHeatmap -m matrix1_test_TSS.gz  -out test_Heatmap.png
plotHeatmap -m matrix1_test_TSS.gz  -out test_Heatmap.pdf --plotFileFormat pdf  --dpi 720  
plotProfile -m matrix1_test_TSS.gz  -out test_Profile.png
plotProfile -m matrix1_test_TSS.gz  -out test_Profile.pdf --plotFileFormat pdf --perGroup --dpi 720 

Downloading the reference .bed

http://genome.ucsc.edu/cgi-bin/hgTables

## Detailed conversion steps
http://www.lxweimin.com/p/5d078d517770

The resulting heatmap:
test_Heatmap.png

Signal intensity across gene bodies

source activate atac
computeMatrix scale-regions  -p 15  \
-R /home/kaoku/refer/mm10/ucsc.refseq.bed  \
-S /home/kaoku/project/atac/align/*.bw  \
-b 10000 -a 10000  \
--skipZeros -o matrix1_test_body.gz
plotHeatmap -m matrix1_test_body.gz  -out ExampleHeatmap1.png 

plotHeatmap -m matrix1_test_body.gz  -out test_body_Heatmap.png
plotProfile -m matrix1_test_body.gz  -out test_body_Profile.png

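The resulting heatmap:

test_body_Heatmap.png

ngsplot would also work.

The commands above summarize how the genome-wide signal is distributed over gene features: computeMatrix does the calculation, plotHeatmap visualizes the coverage as a heatmap, and plotProfile shows it as a line plot.

computeMatrix has two modes: scale-regions and reference-point. The former shows how the signal is distributed across a region; the latter shows how it is distributed relative to a single point. In either mode two arguments are required: -S supplies the bigWig files and -R supplies the gene annotation.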

## Official deepTools documentation
https://deeptools.readthedocs.io/en/develop/content/tools/computeMatrix.html#id10

Addendum:

View running processes:
top

The same with a colored interface:
htop

The next step is peak annotation.

See you in the next post!
