ATAC-seq analysis in practice (生信技能樹 tutorial by Jianming)

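Source article for the data:

The landscape of accessible chromatin in mammalian preimplantation embryos. Nature 2016 Jun 30;534(7609):652-7. PMID: 27309802

GEO accession: GSE66581

Link: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE66581
Jianming's note: the raw sequencing data can be downloaded from the SRA database. The project ID given in the article is at https://www.ncbi.nlm.nih.gov/sra?term=SRP055881. Save the run accessions listed further down to a file named srr.list, and the prefetch command can then be used to download them.

Installing the software: the recommended approach is to use conda to create an environment dedicated to this project.

The following block can be copied and pasted as is: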

# https://mirrors.tuna.tsinghua.edu.cn/help/anaconda/
# https://mirrors.tuna.tsinghua.edu.cn/anaconda/miniconda/ 
wget https://mirrors.tuna.tsinghua.edu.cn/anaconda/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
source ~/.bashrc 
## after installing conda, configure the mirror channels
conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/free
conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/bioconda
conda config --set show_channel_urls yes

conda  create -n atac -y   python=2 bwa
conda info --envs
source activate atac
# packages can be looked up first with search
conda search trim-galore
## make sure all the software is installed inside the atac environment
conda install -y sra-tools  
conda install -y trim-galore  samtools bedtools
conda install -y deeptools homer  meme
conda install -y macs2 bowtie bowtie2 
conda install -y  multiqc 
conda install -y  sambamba

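Downloading the data: the original study has too many samples, so four are selected here for practice

Create a file config.sra with the following content (it is used at the fastq-dump step so that the fastq files converted from sra carry the sample names we want):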

2-cell-1 SRR2927015
2-cell-2 SRR2927016
2-cell-5 SRR3545580
2-cell-4 SRR2927018

Create the file srr.list (it holds the SRR accessions of the data to download)

SRR2927015
SRR2927016
SRR3545580
SRR2927018

Running the download

source activate atac 
mkdir -p  ~/project/atac/
cd ~/project/atac/
mkdir {sra,raw,clean,align,peaks,motif,qc}
cd sra 
## srr.list must be in this directory
cat srr.list |while read id;do ( nohup  prefetch $id & );done
## default download directory: ~/ncbi/public/sra/

File sizes after the download completes:

-rw-r--r-- 1 4.2G Nov 20  2015 SRR2927015.sra
-rw-r--r-- 1 5.5G Nov 20  2015 SRR2927016.sra
-rw-r--r-- 1 2.0G Nov 20  2015 SRR2927018.sra
-rw-r--r-- 1 7.0G May 20  2016 SRR3545580.sra

Step 1: convert the sra files to fastq files

mv ~/ncbi/public/sra/SRR* sra/

Create a shell script 01_fastq-dump.sh with the following content:

## the conversion is done in a loop
## cd ~/project/atac/
source activate atac
dump=fastq-dump
analysis_dir=raw
# mkdir -p $analysis_dir # already created earlier, so not needed here
## config.sra, used below, is the file prepared by hand above

# single-sample form: $dump sra/SRR2927015.sra  --gzip --split-3  -A 2-cell-1 -O raw/
cat config.sra |while read id;
do echo $id
arr=($id) # split the line on whitespace into an array
srr=${arr[1]} # second field: the SRR accession
sample=${arr[0]} # first field: the sample name
#  convert the sra file to fastq
nohup $dump -A  $sample -O $analysis_dir  --gzip --split-3  sra/$srr.sra &
done
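The script above splits each config line with a bash array. As a minimal sketch of the same idea, `read` can assign the two columns to named variables directly, which also works in plain POSIX sh. The function name `parse_config` is hypothetical and only for illustration; it prints the fields instead of calling fastq-dump.

```shell
# Hypothetical helper: print "sample<TAB>accession" for each config line on stdin.
parse_config() {
  # read splits each line on whitespace into the two named variables
  while read -r sample srr; do
    # skip blank lines
    [ -z "$sample" ] && continue
    printf '%s\t%s\n' "$sample" "$srr"
  done
}

# Two lines in the same layout as config.sra
printf '2-cell-1 SRR2927015\n2-cell-2 SRR2927016\n' | parse_config
```

In a real run the `printf` inside the loop would be replaced by the fastq-dump call, with `$sample` and `$srr` used in place of `${arr[0]}` and `${arr[1]}`.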

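Then run it. On a cluster, never run this on the login node; switch to a compute node. (For reasons I have not worked out, I could not run it as a job submitted with qsub.) The script uses bash arrays, so run it with bash:

bash 01_fastq-dump.sh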

Step 2: filtering the sequencing reads

Manually create a text file config.raw listing the fastq paths. The first column is an arbitrary placeholder, the second column is the path to fastq1, and the third column is the path to fastq2. The paths are written relative to ~/project/atac/, because a leading ~ read from a file is not expanded by the shell.

1 raw/2-cell-1_1.fastq.gz raw/2-cell-1_2.fastq.gz
2 raw/2-cell-2_1.fastq.gz raw/2-cell-2_2.fastq.gz
3 raw/2-cell-4_1.fastq.gz raw/2-cell-4_2.fastq.gz
4 raw/2-cell-5_1.fastq.gz raw/2-cell-5_2.fastq.gz

Create 02_trim_galore.sh with the following content:

cd ~/project/atac/
# mkdir -p clean 
source activate atac
# trim_galore -q 25 --phred33 --length 35 -e 0.1 --stringency 4 --paired -o clean/ raw/2-cell-1_1.fastq.gz raw/2-cell-1_2.fastq.gz
cat config.raw  |while read id;
do echo $id
arr=($id)
fq2=${arr[2]}
fq1=${arr[1]}
sample=${arr[0]}
nohup  trim_galore -q 25 --phred33 --length 35 -e 0.1 --stringency 4 --paired -o  clean  $fq1   $fq2  &
done
ps -ef |grep trim
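One pitfall with path lists such as config.raw: the shell performs tilde expansion before it substitutes variables, so a `~/...` path that arrives through a variable is passed to the program literally. The sketch below shows one way to expand such a path by hand; the function name `expand_tilde` is hypothetical and not part of the original scripts.

```shell
# Hypothetical helper: expand a leading "~/" using $HOME, leave other paths untouched.
expand_tilde() {
  case $1 in
    '~/'*)
      rest=${1#'~/'}            # drop the literal "~/" prefix
      printf '%s\n' "$HOME/$rest"
      ;;
    *)
      printf '%s\n' "$1"
      ;;
  esac
}

expand_tilde '~/project/atac/raw/2-cell-1_1.fastq.gz'
expand_tilde 'raw/2-cell-1_1.fastq.gz'
```

Writing relative or absolute paths in the config file avoids the problem altogether.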

Step 3: quality control of the data

Create a text file 03_fastqc_multiqc.sh with the following content:

mkdir -p ~/project/atac/qc
cd ~/project/atac/qc
mkdir -p clean
fastqc -t 5  ../clean/*gz -o clean/
mkdir -p raw
fastqc -t 5  ../raw/*gz -o raw/

Use multiqc to merge the QC reports:

cd raw/
multiqc ./*zip

cd ../clean/
multiqc ./*zip

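Step 4: alignment

Jianming's note: the index needed for alignment must match the species and the aligner. Here bowtie2 is used both to align and to report the alignment rate. Either download the reference genome beforehand and build the index yourself, or download a prebuilt index directly. The mouse index and annotation used here are for the commonly used mm10 build.
Downloading the index

# the index is about 3.2 GB; rather than downloading the genome and building it yourself, download the prebuilt index:
cd ~/project/atac/
mkdir referece && cd referece   # directory name spelled as in the later paths
wget -4 -q ftp://ftp.ccb.jhu.edu/pub/data/bowtie2_indexes/mm10.zip
unzip mm10.zip

File sizes after unzipping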

-rw-r--r-- 1  848M May  3  2012 mm10.rev.1.bt2
-rw-r--r-- 1  633M May  3  2012 mm10.rev.2.bt2
-rw-r--r-- 1  848M May  2  2012 mm10.1.bt2
-rw-r--r-- 1  633M May  2  2012 mm10.2.bt2
-rw-r--r-- 1  6.0K May  2  2012 mm10.3.bt2
-rw-r--r-- 1  633M May  2  2012 mm10.4.bt2

Align a small subset of the data to test the pipeline

mkdir -p test_file
zcat clean/2-cell-1_1_val_1.fq.gz |head -10000 > test_file/test1.fq
zcat clean/2-cell-1_2_val_2.fq.gz |head -10000 > test_file/test2.fq

cd test_file
bowtie2 -x ~/project/atac/referece/mm10 -1 test1.fq  -2 test2.fq | samtools sort -@ 5 -O bam -o test.bam
# there are three different tools for removing duplicates;
# sambamba is used here
sambamba markdup -r test.bam  test.sambamba.rmdup.bam

samtools flagstat test.sambamba.rmdup.bam
samtools flagstat test.bam

## next keep only pairs where both reads map to the same chromosome (properly paired, -f 2) with high mapping quality (MAPQ >= 30)
## and filter out mitochondrial reads at the same time
samtools view -h -f 2 -q 30  test.sambamba.rmdup.bam |grep -v chrM| samtools sort  -O bam  -@ 5 -o - > test.last.bam
bedtools bamtobed -i test.last.bam  > test.bed

The test ran without errors, so the full analysis can proceed. Create 04_bowtie2_align_sambamba_markdup.sh; it is essentially a copy of Jianming's script with the file paths changed:

cd ~/project/atac/align

ls ~/project/atac/clean/*_1.fq.gz > 1
ls ~/project/atac/clean/*_2.fq.gz > 2
ls ~/project/atac/clean/*_2.fq.gz |xargs -n 1 basename |cut -d"_" -f 1  > 0 ## run these steps one at a time and check that file 0 really holds your sample names
paste 0 1 2  > config.clean ## configuration file used for mapping

## make sure you understand the relative paths
bowtie2_index=~/project/atac/referece/mm10
## be absolutely clear about where your bowtie2 is installed and where your index files are!!!
#source activate atac 
cat config.clean |while read id;
do echo $id
arr=($id)
fq2=${arr[2]}
fq1=${arr[1]}
sample=${arr[0]}
## alignment takes about 15 minutes per sample
bowtie2  -p 5  --very-sensitive -X 2000 -x  $bowtie2_index -1 $fq1 -2 $fq2 |samtools sort  -O bam  -@ 5 -o - > ${sample}.raw.bam
samtools index ${sample}.raw.bam
bedtools bamtobed -i ${sample}.raw.bam  > ${sample}.raw.bed
samtools flagstat ${sample}.raw.bam  > ${sample}.raw.stat
# https://github.com/biod/sambamba/issues/177
sambamba markdup --overflow-list-size 600000  --tmpdir='./'  -r ${sample}.raw.bam  ${sample}.rmdup.bam
samtools index   ${sample}.rmdup.bam

## ref:https://www.biostars.org/p/170294/ 
## Calculate %mtDNA:
mtReads=$(samtools idxstats  ${sample}.rmdup.bam | grep 'chrM' | cut -f 3)
totalReads=$(samtools idxstats  ${sample}.rmdup.bam | awk '{SUM += $3} END {print SUM}')
echo '==> mtDNA Content:' $(bc <<< "scale=2;100*$mtReads/$totalReads")'%'

samtools flagstat  ${sample}.rmdup.bam > ${sample}.rmdup.stat
samtools view  -h  -f 2 -q 30    ${sample}.rmdup.bam   |grep -v chrM |samtools sort  -O bam  -@ 5 -o - > ${sample}.last.bam
samtools index   ${sample}.last.bam
samtools flagstat  ${sample}.last.bam > ${sample}.last.stat
bedtools bamtobed -i ${sample}.last.bam  > ${sample}.bed
done
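The %mtDNA step in the script above selects the mitochondrial row with grep and divides with bc. A minimal alternative sketch does both in a single awk pass and matches the chromosome name exactly. The function name `mt_percent` and the read counts in the example are made up for illustration; the input is assumed to have the four `samtools idxstats` columns (name, length, mapped reads, unmapped reads).

```shell
# Hypothetical helper: read idxstats-style rows on stdin and print the
# percentage of mapped reads on the mitochondrial chromosome (default chrM).
mt_percent() {
  awk -v mt="${1:-chrM}" '
    { total += $3 }              # column 3 = mapped reads per reference
    $1 == mt { mito += $3 }      # exact match on the reference name
    END {
      if (total > 0) printf "%.2f\n", 100 * mito / total
      else print "NA"
    }
  '
}

# Made-up counts, for illustration only
printf 'chr1\t195471971\t900\t0\nchrM\t16299\t100\t0\n*\t0\t0\t50\n' | mt_percent
```

With real data the input would come from `samtools idxstats sample.rmdup.bam`. Note that `printf "%.2f"` rounds, whereas `bc` with `scale=2` truncates, so the last digit can differ.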

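The bowtie2 call sets -X 2000, the maximum fragment length accepted for a valid paired-end alignment (the bowtie2 default is 500). ATAC-seq fragments span a wide range of lengths (roughly 10-1000 bp), so a generous upper limit is needed.
The resulting bam files: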

-rw-r--r-- 1  523M Oct 12 07:15 ./2-cell-5.last.bam
-rw-r--r-- 1  899M Oct 12 07:14 ./2-cell-5.rmdup.bam
-rw-r--r-- 1  5.5G Oct 12 06:51 ./2-cell-5.raw.bam
-rw-r--r-- 1  427M Oct 12 03:23 ./2-cell-4.last.bam
-rw-r--r-- 1  586M Oct 12 03:23 ./2-cell-4.rmdup.bam
-rw-r--r-- 1  1.8G Oct 12 03:17 ./2-cell-4.raw.bam
-rw-r--r-- 1  678M Oct 12 02:22 ./2-cell-2.last.bam
-rw-r--r-- 1  1.1G Oct 12 02:20 ./2-cell-2.rmdup.bam
-rw-r--r-- 1  4.6G Oct 12 02:00 ./2-cell-2.raw.bam
-rw-r--r-- 1  490M Oct 11 23:02 ./2-cell-1.last.bam
-rw-r--r-- 1  776M Oct 11 23:01 ./2-cell-1.rmdup.bam
-rw-r--r-- 1  3.7G Oct 11 22:48 ./2-cell-1.raw.bam

Each step of the script above can also be run separately, for example indexing the bam files or converting them to bed:

ls *.last.bam|xargs -i samtools index {} 
ls *.last.bam|while read id;do (bedtools bamtobed -i $id >${id%%.*}.bed) ;done
ls *.raw.bam|while read id;do (nohup bedtools bamtobed -i $id >${id%%.*}.raw.bed & ) ;done
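These one-liners rely on the shell parameter expansion `${id%%.*}` to derive the sample name from a file name. The short sketch below contrasts it with `${id%.*}`; the two function names are hypothetical and exist only to make the difference easy to try.

```shell
# %%.* removes the longest suffix that starts with a dot (everything after the first dot)
strip_all_ext() { printf '%s\n' "${1%%.*}"; }
# %.* removes the shortest such suffix (only the last extension)
strip_last_ext() { printf '%s\n' "${1%.*}"; }

strip_all_ext  2-cell-1.last.bam          # prints 2-cell-1
strip_last_ext 2-cell-1.last.bam          # prints 2-cell-1.last
strip_all_ext  2-cell-1_peaks.narrowPeak  # prints 2-cell-1_peaks
```

This is why `2-cell-1.last.bam` becomes `2-cell-1.bed` in the loop above. It also means sample names must not contain a dot themselves.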

The final bed files are

-rw-r--r-- 1  254M Oct 12 07:16 ./2-cell-5.bed
-rw-r--r-- 1  203M Oct 12 03:24 ./2-cell-4.bed
-rw-r--r-- 1  338M Oct 12 02:23 ./2-cell-2.bed
-rw-r--r-- 1  237M Oct 11 23:03 ./2-cell-1.bed

Step 5: peak calling with macs2

# macs2 callpeak -t 2-cell-1.bed  -g mm --nomodel --shift -100 --extsize 200  -n 2-cell-1 --outdir ../peaks/
cd ~/project/atac/peaks/
# the *.bed files produced in align/ are assumed to have been copied or linked into this directory
ls *.bed | while read id ;do (macs2 callpeak -t $id  -g mm --nomodel --shift  -100 --extsize 200  -n ${id%%.*} --outdir ./) ;done

A write-up of the macs2 manual: http://www.lxweimin.com/p/21e8c51fca23
This gives the following results

-rw-r--r-- 1  1.1M Oct 12 14:35 2-cell-5_peaks.narrowPeak
-rw-r--r-- 1  690K Oct 12 14:35 2-cell-5_summits.bed
-rw-r--r-- 1  1.2M Oct 12 14:35 2-cell-5_peaks.xls
-rw-r--r-- 1  368K Oct 12 14:35 2-cell-4_peaks.narrowPeak
-rw-r--r-- 1  418K Oct 12 14:35 2-cell-4_peaks.xls
-rw-r--r-- 1  247K Oct 12 14:35 2-cell-4_summits.bed
-rw-r--r-- 1  1.2M Oct 12 14:35 2-cell-2_peaks.narrowPeak
-rw-r--r-- 1  1.4M Oct 12 14:35 2-cell-2_peaks.xls
-rw-r--r-- 1  805K Oct 12 14:35 2-cell-2_summits.bed
-rw-r--r-- 1  634K Oct 12 14:34 2-cell-1_peaks.narrowPeak
-rw-r--r-- 1  720K Oct 12 14:34 2-cell-1_peaks.xls
-rw-r--r-- 1  425K Oct 12 14:34 2-cell-1_summits.bed

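Step 6: insert-size distribution, FRiP, and replicate consistency with IDR

Metrics to look at: non-redundant, non-mitochondrial aligned fragments, alignment rate, NRF, PBC1, PBC2, number of peaks, nucleosome-free region (NFR), TSS enrichment, FRiP, and IDR consistency between replicates.
Glossary: https://www.encodeproject.org/data-standards/terms/
Reference: https://www.encodeproject.org/atac-seq/

Distribution of insert sizes

The insert size is column 9 of the bam file; it is tabulated and plotted in R. The file names below say "indel", but the quantity is the insert size, i.e. the fragment length.

To extract column 9, create a file config.last_bam that lists each bam file name and an output prefix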

2-cell-1.last.bam 2-cell-1.last
2-cell-2.last.bam 2-cell-2.last
2-cell-4.last.bam 2-cell-4.last
2-cell-5.last.bam 2-cell-5.last

Then create a shell script indel_length.sh that extracts column 9 (the insert size) from each bam file:

cat config.last_bam |while read id;
do
arr=($id)
sample=${arr[0]}
sample_name=${arr[1]}
samtools view $sample |awk '{print $9}'  > ${sample_name}_length.txt
done
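Column 9 (TLEN) is signed: in a properly paired record the leftmost mate carries the positive value and the other mate the same value negated, so the loop above writes every fragment twice, once with each sign. As a sketch of an alternative, keeping only positive values yields one length per fragment. The function name `fragment_lengths` and the two SAM records are made up for illustration.

```shell
# Hypothetical helper: read SAM text on stdin, print one fragment length per pair.
fragment_lengths() {
  # skip header lines, keep only records whose TLEN (column 9) is positive
  awk '$1 !~ /^@/ && $9 > 0 { print $9 }'
}

# Two made-up mate records of one pair, for illustration only
printf '@SQ\tSN:chr1\tLN:195471971\n' > demo.sam
printf 'r1\t99\tchr1\t100\t42\t50M\t=\t250\t200\tACGT\tFFFF\n' >> demo.sam
printf 'r1\t147\tchr1\t250\t42\t50M\t=\t100\t-200\tACGT\tFFFF\n' >> demo.sam
fragment_lengths < demo.sam
rm -f demo.sam
```

With real data the input would be `samtools view 2-cell-1.last.bam | fragment_lengths`. The shape of the histogram is the same either way; only the counts are halved.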

This gives the insert-size values for the four bam files

-rw-r--r-- 1   24M Oct 12 15:27 2-cell-5.last_length.txt
-rw-r--r-- 1   19M Oct 12 15:27 2-cell-4.last_length.txt
-rw-r--r-- 1   32M Oct 12 15:26 2-cell-2.last_length.txt
-rw-r--r-- 1   22M Oct 12 15:26 2-cell-1.last_length.txt

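Save the following R script as indel_length_distribution.R:

cmd=commandArgs(trailingOnly=TRUE); 
input=cmd[1]; output=cmd[2]; 
# column 9 is signed, so take the absolute value
a=abs(as.numeric(read.table(input)[,1])); 
png(file=paste0(output, ".png"));
hist(a,
main="Insert Size distribution",
ylab="Read Count",xlab="Insert Size",
xaxt="n",
breaks=seq(0,max(a)+10,by=10) # +10 so that the last bin covers max(a)
); 

axis(side=1,
at=seq(0,max(a),by=100),
labels=seq(0,max(a),by=100)
);

dev.off()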

Prepare a text input file config.indel_length_distribution for plotting the insert-size distributions in batch with R

2-cell-1.last_length.txt 2-cell-1.last_length
2-cell-2.last_length.txt 2-cell-2.last_length
2-cell-4.last_length.txt 2-cell-4.last_length
2-cell-5.last_length.txt 2-cell-5.last_length

With the file above, the plots for all bam files can be produced in batch. Create the batch shell script indel_length_distribution.sh

cat config.indel_length_distribution  |while read id;
do
arr=($id)
input=${arr[0]}
output=${arr[1]}
Rscript indel_length_distribution.R $input $output
done

FRiP: fraction of reads in called peak regions

A single example

bedtools intersect -a 2-cell-1.bed -b 2-cell-1_peaks.narrowPeak |wc -l
148210
wc  -l 2-cell-1.bed
5105850 
# so the FRiP of 2-cell-1 is
148210/5105850 = 0.0290
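The division in the example above can be checked on the command line. This is a minimal sketch with awk; the function name `frip` is hypothetical, and the two numbers are the counts reported for 2-cell-1 above.

```shell
# Hypothetical helper: FRiP = reads overlapping peaks / all reads
frip() {
  # $1 = number of reads in peaks, $2 = total number of reads
  awk -v in_peaks="$1" -v total="$2" 'BEGIN { printf "%.4f\n", in_peaks / total }'
}

frip 148210 5105850   # prints 0.0290
```

One caveat, stated as an assumption about the data rather than something observed here: `bedtools intersect` writes a read once per overlapping peak, so if peak intervals overlap each other a read can be counted more than once. Adding the `-u` option reports each read at most once.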

Batch FRiP calculation: create the shell script 01_FRiP.sh

cd ~/project/atac/peaks
ls *narrowPeak|while  read id;
do
echo $id
bed=$(basename $id "_peaks.narrowPeak").bed
#ls  -lh $bed 
Reads=$(bedtools intersect -a $bed -b $id |wc -l|awk '{print $1}')
totalReads=$(wc -l $bed|awk '{print $1}')
echo $Reads  $totalReads 
echo '==> FRiP value:' $(bc <<< "scale=2;100*$Reads/$totalReads")'%'
done

Output

$ bash 01_FRiP.sh 
2-cell-1_peaks.narrowPeak
148210 5105850
==> FRiP value: 2.90%
2-cell-2_peaks.narrowPeak
320407 7292154
==> FRiP value: 4.39%
2-cell-4_peaks.narrowPeak
90850 4399720
==> FRiP value: 2.06%
2-cell-5_peaks.narrowPeak
258988 5466482
==> FRiP value: 4.73%

Jianming's notes:
- Fraction of reads in peaks (FRiP) - Fraction of all mapped reads that fall into the called peak regions, i.e. usable reads in significantly enriched peaks divided by all usable reads. In general, FRiP scores correlate positively with the number of regions. (Landt et al, Genome Research Sept. 2012, 22(9): 1813–1831)
- Other metrics from the article: https://www.nature.com/articles/sdata2016109/tables/4

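An R package can be used to look at the overlap between peak files

Note: the R version here is 3.5.1, which needs GCC >= 4.8. The GCC on this server is 4.7, so a newer GCC was loaded with the module command.
Package installation on the server still failed, so this was set aside for now.

source /public/home/software/.bashrc
module load GCC/5.4.0-2.26
source activate atac

Instead, the narrowPeak files were copied to a local machine and visualized with a local R installation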

options(BioC_mirror="https://mirrors.ustc.edu.cn/bioc/") 
options("repos" = c(CRAN="https://mirrors.tuna.tsinghua.edu.cn/CRAN/"))
# biocLite is the installer of the Bioconductor release used here; Bioconductor 3.8 and later use BiocManager::install() instead
source("http://bioconductor.org/biocLite.R") 
library('BiocInstaller')
# biocLite("ChIPpeakAnno")
# biocLite("ChIPseeker")
library(ChIPseeker)
library(ChIPpeakAnno)
setwd("E://desktop/sept/ATAC-seq_practice/find_peaks_overlaping/")
list.files('./',"\\.narrowPeak$")
tmp = lapply(list.files('./',"\\.narrowPeak$"),function(x){
  return(readPeakFile(file.path('./', x)))
  })
tmp
ol <- findOverlapsOfPeaks(tmp[[1]],tmp[[2]]) # compares the first and second files, i.e. the 2-cell-1 and 2-cell-2 peaks
png('overlapVenn.png')
makeVennDiagram(ol)
dev.off()

A dedicated tool, IDR, can also be used for this; it takes into account both the overlap between peaks and the consistency of their enrichment

conda  create -n py3 -y   python=3 idr
conda activate py3
idr -h 
idr --samples  2-cell-1_peaks.narrowPeak 2-cell-2_peaks.narrowPeak  --plot

Step 7: visualization with deeptools

For details see https://mp.weixin.qq.com/s/a4qAcKE1DoukpLVV_ybobA, from the ChIP-seq tutorial.
First convert the bam files to bw files; details: http://www.bio-info-trainee.com/1815.html

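# the *.last.bam files and their indexes are assumed to be in this directory
cd  ~/project/atac/deeptools_result
#source activate atac # deeptools was already installed on this machine, so activation was not needed
#ls  *.bam  |xargs -i samtools index {} 
# deepTools 3.0 and later spell the normalization option as: --normalizeUsing RPKM
ls *last.bam |while read id;do
nohup bamCoverage -p 5 --normalizeUsingRPKM -b $id -o ${id%%.*}.last.bw &
done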

# cd dup 
# ls  *.bam  |xargs -i samtools index {} 
# ls *.bam |while read id;do
# nohup bamCoverage --normalizeUsing CPM -b $id -o ${id%%.*}.rm.bw & 
# done

Downloading the mm10 RefGene file
- Option 1, following the "ChIP-seq basics" tutorial:

curl 'http://genome.ucsc.edu/cgi-bin/hgTables?hgsid=646311755_P0RcOBvAQnWZSzQz2fQfBiPPSBen&boolshad.hgta_printCustomTrackHeaders=0&hgta_ctName=tb_ncbiRefSeq&hgta_ctDesc=table+browser+query+on+ncbiRefSeq&hgta_ctVis=pack&hgta_ctUrl=&fbQual=whole&fbUpBases=200&fbExonBases=0&fbIntronBases=0&fbDownBases=200&hgta_doGetBed=get+BED' >mm10.refseq.bed

Option 2: download from http://genome.ucsc.edu/cgi-bin/hgTables

The bed file obtained with option 2 is used for the visualization below

Signal intensity around the TSS: create 07_deeptools_TSS.sh

deeptools usage

## both -R and -S can accept multiple files 
mkdir -p  ~/project/atac/tss
cd   ~/project/atac/tss 
# source activate atac # not needed here because deeptools is installed system-wide
computeMatrix reference-point  --referencePoint TSS  -p 15  \
-b 10000 -a 10000    \
-R ~/project/atac/mm10_Refgene/Refseq.bed  \
-S ~/project/atac/deeptools_result/*.bw  \
--skipZeros  -o matrix1_test_TSS.gz  \
--outFileSortedRegions regions1_test_genes.bed

##     both plotHeatmap and plotProfile will use the output from   computeMatrix
plotHeatmap -m matrix1_test_TSS.gz  -out test_Heatmap.png
plotHeatmap -m matrix1_test_TSS.gz  -out test_Heatmap.pdf --plotFileFormat pdf  --dpi 720  
plotProfile -m matrix1_test_TSS.gz  -out test_Profile.png
plotProfile -m matrix1_test_TSS.gz  -out test_Profile.pdf --plotFileFormat pdf --perGroup --dpi 720

Resulting plots

Signal intensity over gene bodies: create 07_deeptools_Body.sh

#source activate atac
mkdir -p ~/project/atac/Body
cd ~/project/atac/Body
computeMatrix scale-regions  -p 15  \
-R ~/project/atac/mm10_Refgene/Refseq.bed  \
-S ~/project/atac/deeptools_result/*.bw  \
-b 10000 -a 10000  \
--skipZeros -o matrix1_test_body.gz
# plotHeatmap -m matrix1_test_body.gz  -out ExampleHeatmap1.png
plotHeatmap -m matrix1_test_body.gz  -out test_body_Heatmap.png
plotProfile -m matrix1_test_body.gz  -out test_body_Profile.png
plotProfile -m matrix1_test_body.gz -out test_Body_Profile.pdf --plotFileFormat pdf --perGroup --dpi 720

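Resulting plots. Note that the gene body appears clearly shorter than the two flanking regions; to change its width, adjust the following parameters
- --regionBodyLength
- --binSize
- Parameter reference in the official documentation: https://deeptools.readthedocs.io/en/develop/content/tools/computeMatrix.html

ngsplot is another handy tool for drawing profile plots.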

Step 8: peak annotation

Based on the older tutorial "CS3: peak annotation"
Distribution of peaks across genomic features

options(BioC_mirror="https://mirrors.ustc.edu.cn/bioc/") 
options("repos" = c(CRAN="https://mirrors.tuna.tsinghua.edu.cn/CRAN/"))
source("http://bioconductor.org/biocLite.R") 
library('BiocInstaller')
biocLite("ChIPpeakAnno")
biocLite("TxDb.Mmusculus.UCSC.mm10.knownGene")
biocLite("org.Mm.eg.db")
library(ChIPpeakAnno)
library(ChIPseeker) # provides getPromoters, annotatePeak and the plotting functions used below
library(TxDb.Mmusculus.UCSC.mm10.knownGene)
setwd("E://desktop/sept/ATAC-seq_practice/peaks_annotaion/")
txdb <- TxDb.Mmusculus.UCSC.mm10.knownGene
promoter <- getPromoters(TxDb=txdb, 
                         upstream=3000, downstream=3000)
files = list(cell_1_summits = "2-cell-1_summits.bed", cell_2_summits = "2-cell-2_summits.bed",
                   cell_4_summits = "2-cell-4_summits.bed", cell_5_summits = "2-cell-5_summits.bed")
peakAnno <- annotatePeak(files[[1]], # change the index to 2, 3 or 4 for the other three files
                         tssRegion=c(-3000, 3000),
                         TxDb=txdb, annoDb="org.Mm.eg.db") # mouse annotation database
plotAnnoPie(peakAnno)

Display with plotAnnoBar

plotAnnoBar(peakAnno)

Display with vennpie

vennpie(peakAnno)

Display with upsetplot

upsetplot(peakAnno)
upsetplot(peakAnno, vennpie=TRUE)

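ChIPseeker also annotates the nearest gene. What does the distribution of distances from peaks to the nearest gene look like? ChIPseeker provides the plotDistToTSS function to draw it: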

plotDistToTSS(peakAnno,
              title="Distribution of transcription factor-binding loci\nrelative to TSS")

Both bar plots, plotAnnoBar and plotDistToTSS, accept several data sets at once, which makes comparison easy. For example:

peakAnnoList <- lapply(files, annotatePeak, 
                       TxDb=txdb,tssRegion=c(-3000, 3000))
plotAnnoBar(peakAnnoList)

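Batch plot of the percentage of peaks at each distance from the TSS

plotDistToTSS(peakAnnoList)

ChIPseeker also provides a vennplot function, for example to see how the annotated nearest genes overlap between samples: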

genes <- lapply(peakAnnoList, function(i) 
    as.data.frame(i)$geneId)
vennplot(genes[2:4], by='Vennerable')

Visualization with the newer version of ChIPpeakAnno will be added later
- Reference: The ChIPpeakAnno user's guide

Annotation with HOMER

perl ~/miniconda3/envs/atac/share/homer-4.9.1-6/configureHomer.pl  -install mm10  # this cannot be run on a compute node; run it on the login node first to download the data
## check that the database was downloaded correctly (adjust the homer-4.9.1-* directory to your installation)
# ls -lh  ~/miniconda3/envs/atac/share/homer-4.9.1-6/data/genomes  
# source activate atac
cd ~/project/atac/
mkdir -p hommer_anno
cp peaks/*narrowPeak hommer_anno/
cd   ~/project/atac/hommer_anno  
ls *.narrowPeak |while read id;
do 
echo $id
awk '{print $4"\t"$1"\t"$2"\t"$3"\t+"}' $id >${id%%.*}.homer_peaks.tmp
annotatePeaks.pl  ${id%%.*}.homer_peaks.tmp mm10  1>${id%%.*}.peakAnn.xls 2>${id%%.*}.annLog.txt
done
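The awk step in the loop reorders the narrowPeak columns into the five-column peak layout that HOMER reads: peak ID, chromosome, start, end, strand. The sketch below runs the same reordering on a single made-up narrowPeak row so the column mapping is visible; the function name `narrowpeak_to_homer` is hypothetical.

```shell
# Hypothetical helper: narrowPeak (chr, start, end, name, ...) on stdin
# -> HOMER peak layout (ID, chr, start, end, strand) on stdout.
narrowpeak_to_homer() {
  # narrowPeak carries no strand information, so "+" is used for every peak
  awk 'BEGIN { OFS = "\t" } { print $4, $1, $2, $3, "+" }'
}

# One made-up narrowPeak row, for illustration only
printf 'chr1\t3000\t3200\t2-cell-1_peak_1\t50\t.\t4.1\t7.2\t5.0\t100\n' | narrowpeak_to_homer
```

The output row starts with the peak name `2-cell-1_peak_1`, followed by `chr1`, `3000`, `3200` and `+`.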

Then copy the peakAnn.xls files to a local machine and summarize them with Excel pivot tables
Functions used: COUNTIF(), SUM()
An online page for uploading gif files and generating links was also used: http://thyrsi.com/

Step 9: motif discovery and annotation

Create the file 08_hommer_motif.sh

# mkdir -p  ~/project/atac/motif
cd   ~/project/atac/motif
# source activate atac
ls ../peaks/*.narrowPeak |while read id;
do
file=$(basename $id )
sample=${file%%.*}
echo $sample 
awk '{print $4"\t"$1"\t"$2"\t"$3"\t+"}' $id > ${sample}.homer_peaks.tmp
nohup findMotifsGenome.pl ${sample}.homer_peaks.tmp  mm10 ${sample}_motifDir -len 8,10,12  &
done

Each ${sample}_motifDir output directory contains homerResults (de novo motifs) and knownResults (known motifs).

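Step 10: differential peak analysis

DiffBind
DESeq2 and similar tools can be used for the follow-up analysis

A closing note: working through this pipeline taught me many new concepts and scripting techniques, in particular:

Running commands in batch and visualizing the results
Writing simple R scripts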
