[Repost] Introduction to Cufflinks (2018-05-29)

Original: http://blog.sina.com.cn/s/blog_751bd9440102v72b.html

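I. Overview

The Cufflinks suite consists mainly of the programs cufflinks, cuffmerge, cuffcompare, and cuffdiff. It is used chiefly to quantify gene expression and to find differentially expressed genes.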

II. Installation

See the Cufflinks download page.

1. Building Cufflinks requires the Boost C++ libraries. Download and install Boost; by default it installs under /usr/local.

$ tar jxvf boost_1_53_0.tar.bz2

$ cd boost_1_53_0

$ ./bootstrap.sh

$ sudo ./b2 install

2. Install SAM tools.

Download SAM tools.

$ tar jxvf samtools-0.1.18.tar.bz2

$ cd samtools-0.1.18

$ make

$ sudo su

# mkdir /usr/local/include/bam

# cp libbam.a /usr/local/lib

# cp *.h /usr/local/include/bam/

# cp samtools /usr/bin/

3. Install the Eigen libraries.

Download Eigen.

$ tar jxvf 3.1.2.tar.bz2

$ cd eigen-eigen-5097c01bcdc4

$ sudo cp -r Eigen/ /usr/local/include/

4. Install Cufflinks.

$ tar zxvf cufflinks-2.0.2.tar.gz

$ cd cufflinks-2.0.2

$ ./configure --prefix=/path/to/cufflinks/install --with-boost=/usr/local/ --with-eigen=/usr/local/include//Eigen/

$ make

$ make install

5. Alternatively, download the Linux x86_64 binary directly. This skips the tedious steps above; the unpacked programs are ready to use. (Recommended)

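III. Using Cufflinks

1. Introduction to Cufflinks

Starting from TopHat alignments, with or without a GTF annotation of the reference genome, cufflinks computes the FPKM of the isoforms (of each gene) and writes the annotation file transcripts.gtf (the assembled transcriptome).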

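Notes:

1. Fragment length estimation: for paired-end sequencing, cufflinks estimates the fragment length distribution with its own algorithm. For single-end sequencing, cufflinks assumes a Gaussian distribution by default, or you can supply the parameters yourself.

2. How cufflinks counts multi-mapped reads: a read mapping to 10 positions will count as 10% of a read at each position.

3. Assembling bacterial transcriptomes with cufflinks is generally not recommended; Glimmer is recommended instead. If an annotation file is available, however, cufflinks and cuffdiff can be used to measure gene expression and differential expression.

4. cufflinks/cuffdiff cannot compute FPKM for exons or splicing events.

5. To process time-series data with cuffdiff, use the -T option.

6. When a cufflinks/cuffdiff run reaches 99% and then appears to hang, it is because cuffdiff needs more CPU time for a few loci that have very many reads, and those loci are processed only after all other loci are done. Use the -M option to supply a mask file that filters out rRNA or mitochondrial RNA.

7. If cufflinks or cuffdiff crashes with a 'bad_alloc' error, or takes a very long time to finish, the machine ran out of memory while assembling or quantifying a highly expressed gene. Fix: lower the "--max-bundle-frags" option. Try 500000 first, and keep lowering it if the error persists.

8. If every gene and transcript in the cuffdiff output has FPKM = 0, the chromosome names in the GTF do not match those in the BAM.

9. A limitation of cuffdiff and cufflinks: some reported genes and transcripts are false (causes: sequencing depth, sequencing quality, the number of times each sample was sequenced, and annotation errors).

10. A large fold change in expression does not by itself imply significance (such genes often have many isoforms or few sequenced reads, i.e. low overall expression). Genes that cuffdiff reports with large fold changes therefore carry uncertainty.

11. In the transcripts.gtf file produced by cufflinks, transcripts labeled CUFF are novel transcripts. Likewise, the CUFF label in the output of the other modules denotes novel transcripts.

12. If the following error appears:

You are using Cufflinks v2.2.1, which is the most recent release.

open: No such file or directory

File 30 doesn't appear to be a valid BAM file, trying SAM...

Error: cannot open alignment file 30 for reading

it means an option is malformed. For example, "--min-intron-length" was typed as "-min-intron-length", so its value (30 here) was taken to be the alignment file.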
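Note 8 (all FPKM values equal to 0) can be checked before a long run. The following is only a minimal sketch, assuming a headered SAM file whose @SQ lines carry SN: tags; the function names gtf_names, sam_names, and shared_names are hypothetical helpers, not part of Cufflinks. For a BAM file the header would first have to be dumped to text, for example with samtools view -H.

```shell
# gtf_names FILE: print the distinct sequence names (column 1) of a GTF file.
gtf_names() {
    awk -F '\t' '!/^#/ && NF > 0 { print $1 }' "$1" | sort -u
}

# sam_names FILE: print the reference names declared in the @SQ header lines.
sam_names() {
    awk -F '\t' '$1 == "@SQ" {
        for (i = 2; i <= NF; i++)
            if ($i ~ /^SN:/) print substr($i, 4)
    }' "$1" | sort -u
}

# shared_names GTF SAM: print the names that occur in both files.
# An empty result would explain FPKM = 0 for every gene and transcript.
shared_names() {
    gtf_tmp=$(mktemp) && sam_tmp=$(mktemp) || return 1
    gtf_names "$1" > "$gtf_tmp"
    sam_names "$2" > "$sam_tmp"
    comm -12 "$gtf_tmp" "$sam_tmp"
    rm -f "$gtf_tmp" "$sam_tmp"
}
```

A typical mismatch is "chr1" in the GTF against "1" in the alignment header, in which case shared_names prints nothing.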

2. Usage

$ cufflinks [options]*

A typical example:

$ cufflinks -p 8 -G transcript.gtf --library-type fr-unstranded -o cufflinks_output tophat_out/accepted_hits.bam

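3. General options

-h | --help

-o | --output-dir (default: ./): name of the output directory.

-p | --num-threads (default: 1): number of CPU threads to use.

-G | --GTF: supply a GFF file of reference annotation against which isoform expression is estimated. No novel transcripts are assembled, and alignments incompatible with the reference transcripts are ignored.

-g | --GTF-guide: supply a GFF file to guide transcript assembly (RABT assembly). The output then contains the reference transcripts as well as novel genes and isoforms.

-M | --mask-file: supply a GFF file. Cufflinks ignores all reads that align within the transcripts in this file. The file usually holds rRNA annotation, and may also include mitochondrial and any other transcripts you want ignored. Removing these unwanted RNAs improves the estimation of mRNA abundance.

-b | --frag-bias-correct: supply a FASTA file so that Cufflinks runs its bias detection and correction algorithm. This can markedly improve the accuracy of transcript abundance estimates.

-u | --multi-read-correct: have Cufflinks run an initial estimation step so that reads mapping to multiple genomic locations are weighted more accurately.

--library-type (default: fr-unstranded): set this when the reads are strand-specific; the alignments then carry an XS tag. Standard Illumina data is usually fr-unstranded.

--library-norm-method: see the official site for details. Three methods: classic-fpkm, the default; geometric, as used by DESeq; quartile, which uses the 75th percentile (upper quartile) of the fragment counts in place of the total mapped count.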

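4. Abundance estimation options

-m | --frag-len-mean (default: 200): mean fragment length. Cufflinks now learns the fragment length mean itself, so setting this value manually is not recommended.

-s | --frag-len-std-dev (default: 80): standard deviation of the fragment length. Cufflinks now learns the fragment length standard deviation itself, so setting this value manually is not recommended.

-N | --upper-quartile-norm: normalize by the 75th percentile (upper quartile) of the number of fragments mapping to individual loci instead of the total. This helps when looking for differential expression among low-abundance genes and transcripts.

--total-hits-norm (default: TRUE): when computing FPKM, Cufflinks counts all fragments and aligned reads. Mutually exclusive with the next option. Active by default.

--compatible-hits-norm: when computing FPKM, Cufflinks counts only the fragments and aligned reads compatible with the reference transcripts. Inactive by default. It takes effect only together with --GTF and has no effect in RABT or ab initio mode.

--max-mle-iterations: number of iterations allowed during maximum likelihood estimation. Default: 5000.

--max-bundle-frags: maximum number of fragments a locus may have before it is skipped. Default: 1000000.

--no-effective-length-correction: Cufflinks will not employ its "effective" length normalization to transcript FPKM.

--no-length-correction: Cufflinks will not normalize fragment counts by transcript length at all. Use this when the fragment count is independent of the size of the features being quantified (e.g. for small RNA libraries, where no fragmentation takes place, or 3 prime end sequencing, where sampled RNA fragments are all essentially the same length). Use with caution.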

5. Common assembly options

-L | --label (default: CUFF): Cufflinks reports transfrags in GTF format; this option sets their prefix.

-F/--min-isoform-fraction <0.0-1.0>: after computing the isoform abundances of a gene, Cufflinks filters out transcripts with very low abundance because they cannot be trusted. It can also filter out exons supported by very few reads. The default is 0.1, or 10% of the most abundant isoform (the major isoform) of the gene.

-j/--pre-mrna-fraction <0.0-1.0>: minimum coverage for intronic alignments. The minimum depth of coverage in the intronic region covered by the alignment is divided by the number of spliced reads, and if the result is lower than this parameter value, the intronic alignments are ignored. The default is 15%.

-I/--max-intron-length: maximum intron length. Cufflinks will not report transcripts with introns longer than this, and will ignore SAM alignments with REF_SKIP CIGAR operations longer than this. The default is 300,000.

-a/--junc-alpha <0.0-1.0>: the alpha value of the binomial test used to filter false-positive spliced alignments. Default: 0.001.

-A/--small-anchor-fraction <0.0-1.0>: spliced reads with less than this fraction of their length on one side of a junction are considered suspicious and may be filtered out before assembly. Default: 0.09.

--min-frags-per-transfrag (default: 10): assembled transfrags supported by fewer RNA-seq fragments than this are not reported.

--overhang-tolerance: when deciding whether a read or another transcript is compatible with a transcript, the number of bp by which it may extend beyond that transcript's exon. The default is 8 bp, matching the bowtie/TopHat defaults.

--max-bundle-length: Maximum genomic length allowed for a given bundle. The default is 3,500,000 bp.

--min-intron-length (default: 50): minimum intron size.

--trim-3-avgcov-thresh: Minimum average coverage required to attempt 3' trimming. The default is 10.

--trim-3-dropoff-frac: The fraction of average coverage below which to trim the 3' end of an assembled transcript. The default is 0.1.

--max-multiread-fraction <0.0-1.0>: The fraction of a transfrag's supporting reads that may be multiply mapped to the genome. A transcript composed of more than this fraction will not be reported by the assembler. Default: 0.75 (75% multireads or more is suppressed).

--overlap-radius (default: 50): transfrags separated by less than this distance are merged.

Advanced Reference Annotation Based Transcript (RABT) Assembly Options: options to consider when you use -g/--GTF-guide.

--3-overhang-tolerance: The number of bp allowed to overhang the 3' end of a reference transcript when determining if an assembled transcript should be merged with it (i.e., the assembled transcript is not novel). The default is 600 bp.

--intron-overhang-tolerance: The number of bp allowed to enter the intron of a reference transcript when determining if an assembled transcript should be merged with it (i.e., the assembled transcript is not novel). The default is 50 bp.

--no-faux-reads: This option disables tiling of the reference transcripts with faux reads. Use this if you only want to use sequencing reads in assembly but do not want to output assembled transcripts that lay within reference transcripts. All reference transcripts in the input annotation will also be included in the output.

Other options (minor)

-v/--verbose: print version information and other run details.

-q/--quiet: print nothing except warnings and errors.

--no-update-check: turn off Cufflinks' automatic update check.

6. Cufflinks output

The input to cufflinks is a SAM or BAM file, and it must be sorted (the SAM file supplied to Cufflinks must be sorted by reference position). The SAM/BAM output of TopHat is already sorted. Other, unsorted SAM files can be sorted as follows:

sort -k 3,3 -k 4,4n hits.sam > hits.sam.sorted
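Before re-sorting a large file it can be worth checking whether it is already in that order. The sketch below reuses the same sort keys with sort -c; it assumes a headerless SAM file, as the command above does, and the function names are hypothetical. LC_ALL=C is an added assumption to make the byte ordering of reference names predictable.

```shell
# is_position_sorted FILE: succeed if FILE is already in the order that
# "sort -k 3,3 -k 4,4n" would produce (reference name, then numeric position).
is_position_sorted() {
    LC_ALL=C sort -c -k 3,3 -k 4,4n "$1" 2>/dev/null
}

# sort_if_needed IN OUT: copy IN to OUT, sorting only when it is required.
sort_if_needed() {
    if is_position_sorted "$1"; then
        cp "$1" "$2"
    else
        LC_ALL=C sort -k 3,3 -k 4,4n "$1" > "$2"
    fi
}
```

For example, sort_if_needed hits.sam hits.sam.sorted gives the same result file as the command above but skips the sort when the input is already ordered.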

1. transcripts.gtf

This file contains the isoforms assembled by Cufflinks. The first 7 columns are standard GTF, and the last column holds the attributes. The columns are:

Column  Name        Example    Description
1       seqname     chrX       Chromosome or contig name
2       source      Cufflinks  Name of the program that generated the file
3       feature     exon       Record type, usually transcript or exon
4       start       1          Start position (1-based)
5       end         1000       End position
6       score       1000
7       strand      +          Cufflinks' guess for which strand of the reference the isoform came from; usually '+', '-' or '.'
8       frame       .          Cufflinks does not predict the frame of start or stop codons
9       attributes  ...        See below

Each GTF record carries the following attributes:

Attribute          Example   Description
gene_id            CUFF.1    Cufflinks gene id
transcript_id      CUFF.1.1  Cufflinks transcript id
FPKM               101.267   Isoform-level abundance in Fragments Per Kilobase of exon model per Million mapped fragments
frac               0.7647    Reserved; can be ignored and may be removed in the future
conf_lo            0.07      Lower bound of the 95% confidence interval on the isoform abundance: lower bound = FPKM * (1.0 - conf_lo)
conf_hi            0.1102    Upper bound of the 95% confidence interval on the isoform abundance: upper bound = FPKM * (1.0 + conf_hi)
cov                100.765   Estimated read coverage across the whole transcript
full_read_support  yes       With RABT assembly, reports whether all introns and exons are fully covered by reads
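The two confidence-bound formulas can be applied directly to a transcripts.gtf file. This is a minimal sketch that takes the formulas in the attribute table at face value (conf_lo and conf_hi read as fractions of FPKM); the function name fpkm_bounds is hypothetical, and the file is assumed to be tab-separated with the attributes in column 9.

```shell
# fpkm_bounds FILE: for each "transcript" record, print transcript_id, FPKM,
# and the lower and upper bounds computed as FPKM * (1.0 - conf_lo) and
# FPKM * (1.0 + conf_hi).
fpkm_bounds() {
    awk -F '\t' '
    # attr(s, key): the quoted value that follows key in attribute string s.
    function attr(s, key,    re, m) {
        re = key " \"[^\"]*\""
        if (match(s, re) == 0) return ""
        m = substr(s, RSTART, RLENGTH)
        sub(/^[^"]*"/, "", m)   # drop everything up to the opening quote
        sub(/"$/, "", m)        # drop the closing quote
        return m
    }
    $3 == "transcript" {
        fpkm = attr($9, "FPKM")
        lo = attr($9, "conf_lo")
        hi = attr($9, "conf_hi")
        printf "%s\t%.3f\t%.3f\t%.3f\n", attr($9, "transcript_id"),
               fpkm, fpkm * (1.0 - lo), fpkm * (1.0 + hi)
    }' "$1"
}
```

With the example values from the table (FPKM 101.267, conf_lo 0.07, conf_hi 0.1102) this gives bounds of 94.178 and 112.427.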

2. isoforms.fpkm_tracking

FPKM values for the isoforms (the individual transcript variants of each gene).

3. genes.fpkm_tracking

FPKM values for the genes.

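IV. Using Cuffmerge

1. Introduction to Cuffmerge

Cuffmerge merges the transcripts.gtf files produced by individual Cufflinks runs into a single, more comprehensive transcript annotation file, merged.gtf, which is then convenient for differential expression analysis with Cuffdiff.

2. Usage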

$ cuffmerge [options]*

The input is a text file that lists the paths of the GTF files. A typical example:

$ cuffmerge -o ./merged_asm -p 8 assembly_list.txt
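The list file can be generated rather than typed by hand. A minimal sketch, assuming each Cufflinks run wrote its transcripts.gtf into its own output directory; the function name write_assembly_list and the directory names in the usage line are hypothetical.

```shell
# write_assembly_list OUT DIR...: write one transcripts.gtf path per line to
# OUT, one for each Cufflinks output directory given. Directories without a
# transcripts.gtf are reported on stderr and skipped.
write_assembly_list() {
    list=$1
    shift
    : > "$list"
    for dir in "$@"; do
        if [ -f "$dir/transcripts.gtf" ]; then
            printf '%s\n' "$dir/transcripts.gtf" >> "$list"
        else
            printf 'skipping %s: no transcripts.gtf\n' "$dir" >&2
        fi
    done
}
```

Calling write_assembly_list assembly_list.txt sample1_out sample2_out would produce a file that can be passed to the cuffmerge command above.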

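3. Options

-h | --help

-o (default: ./merged_asm): directory to write the results to.

-g | --ref-gtf: a reference GTF to merge into the final result as well.

-p | --num-threads (default: 1): number of CPU threads to use.

-s | --ref-sequence: path to the genomic DNA sequence. If it is a directory, it should contain one FASTA file per contig; if it is a FASTA file, it must contain all the contigs. Cuffmerge uses this sequence to help classify transfrags and exclude repeats. For example, transcripts that contain lowercase bases are classified as repeats.

4. Cuffmerge output

By default the result file is merged.gtf inside the output directory (./merged_asm/merged.gtf).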

V. Using Cuffcompare

1. Introduction to Cuffcompare

Cuffcompare compares the GTF files produced by Cufflinks, both with each other and with a reference GTF, for example to find novel transcripts.

2. Usage

$ cuffcompare [options]* [cuff2.gtf] ... [cuffN.gtf]

Example:

$ cuffcompare -o cuffcmp cuff1.gtf cuff2.gtf

3. Options

-h: help.

-V: show processing progress.

-C (default): "contained" transcripts are also written to the .combined.gtf file.

-o (default: cuffcmp): prefix of the output files.

-r: a reference GFF file, used to assess the accuracy of the gene models in the input GTF files. The isoforms of each input GTF are compared against this reference and tagged as overlapping, matching, or novel.

-R: together with -r, ignore the reference transcripts that are not overlapped by any transcript in the input GTF files.

-s: path to the genomic DNA sequence. If it is a directory, it should contain one FASTA file per contig; if it is a FASTA file, it must contain all the contigs. Lowercase bases are used to classify the corresponding transcripts as repeats.

4. Output

Three files are written to the current directory:

.stats: reports various accuracy statistics from the comparison against the reference annotation. The Sn and Sp columns show sensitivity and specificity values; the fSn and fSp columns show "fuzzy" variants of these same accuracy calculations, which allow some variation. (If -o is not set, cuffcmp is used as the file prefix.)

.combined.gtf: reports all transfrags from every sample. A transfrag present in several samples is reported only once.

.tracking: This file matches transcripts up between samples. Each row contains a transcript structure that is present in one or more input GTF files. Because the transcripts will generally have different IDs (unless you assembled your RNA-Seq reads against a reference transcriptome), cuffcompare examines the structure of each of the transcripts, matching transcripts that agree on the coordinates and order of all of their introns, as well as strand. Matching transcripts are allowed to differ on the length of the first and last exons, since these lengths will naturally vary from sample to sample due to the random nature of sequencing.

Example:

TCONS_00000045 XLOC_000023 Tcea|uc007afj.1 j \
q1:exp.115|exp.115.0|100|3.061355|0.350242|0.350207 \
q2:60hr.292|60hr.292.0|100|4.094084|0.000000|0.000000

In this example, a transcript present in the two input files, called exp.115.0 in the first and 60hr.292.0 in the second, doesn't match any reference transcript exactly, but shares exons with uc007afj.1, an isoform of the gene Tcea, as indicated by the class code j. The leading columns are as follows:

Column  Name                     Example         Description
1       Cufflinks transfrag id   TCONS_00000045  Internal transfrag id
2       Cufflinks locus id       XLOC_000023     Internal locus id
3       Reference gene id        Tcea            Gene id in the reference annotation, or "-" if no reference transcript matched
4       Reference transcript id  uc007afj.1      Transcript id in the reference annotation, or "-" if no reference transcript matched
5       Class code               c               Type of match between the transcript and the reference transcript

Each column after these has the following form:

qJ: | | | | | | |

The .refmap and .tmap files are written to the same directory as each input GTF.

The .refmap file contains:

Column  Name                     Description
1       Reference gene name      Gene name in the reference GTF annotation
2       Reference transcript id  Transcript id in the reference
3       Class code               How the transcript assembled by cufflinks matches the reference transcript: c means a partial match; = means a full match
4       Cufflinks matches        Ids of the cufflinks-assembled transcripts that match the reference transcript

The .tmap file contains:

4 Cufflinks gene id; 5 Cufflinks transcript id; 6 Fraction of major isoform (FMI); 7 FPKM; 8 FPKM_conf_lo; 9 FPKM_conf_hi; 10 Coverage; 11 Length; 12 Major isoform ID

Class codes:

Priority  Code  Description
1         =     Complete match of intron chain
2         c     Contained
3         j     Potentially novel isoform (fragment): at least one splice junction is shared with a reference transcript
4         e     Single exon transfrag overlapping a reference exon and at least 10 bp of a reference intron, indicating a possible pre-mRNA fragment
5         i     A transfrag falling entirely within a reference intron
6         o     Generic exonic overlap with a reference transcript
7         p     Possible polymerase run-on fragment (within 2Kbases of a reference transcript)
8         r     Repeat. Currently determined by looking at the soft-masked reference sequence and applied to transcripts where at least 50% of the bases are lower case
9         u     Unknown, intergenic transcript
10        x     Exonic overlap with reference on the opposite strand
11        s     An intron of the transfrag overlaps a reference intron on the opposite strand (likely due to read mapping errors)
12        .     (.tracking file only, indicates multiple classifications)
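A quick summary of the class codes in a .tracking file can help when looking for novel transcripts. The following is a sketch under one assumption taken from the example line earlier in this section: the class code is the fourth whitespace-separated field of each row. The function names are hypothetical.

```shell
# class_code_counts FILE: count the transfrags per class code (field 4 of a
# cuffcompare .tracking file), printed as "code<TAB>count", sorted by code.
class_code_counts() {
    awk '{ n[$4]++ } END { for (c in n) printf "%s\t%d\n", c, n[c] }' "$1" |
        LC_ALL=C sort
}

# novel_candidates FILE: print the transfrag ids whose class code is j
# (potentially novel isoform) or u (unknown, intergenic transcript).
novel_candidates() {
    awk '$4 == "j" || $4 == "u" { print $1 }' "$1"
}
```

For example, class_code_counts cuffcmp.tracking would summarize a run that used the default cuffcmp prefix.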

VI. Using Cuffdiff

1. Introduction to Cuffdiff

Cuffdiff finds significant differences in transcript expression.

2. Usage

cuffdiff finds significant changes in transcript expression, splicing, and promoter use.

$ cuffdiff [options]* ...[sampleN_1.sam[,...,sampleN_M.sam]]

Here transcripts.gtf is a file produced by cufflinks, cuffcompare, or cuffmerge, or by another program. The replicates of one sample are separated by commas. With more than one sample, cuffdiff tests for differences in gene expression between the samples. A typical example:

$ cuffdiff --labels label1,label2 -p 8 --time-series --multi-read-correct --library-type fr-unstranded --poisson-dispersion transcripts.gtf sample1.sam sample2.sam

cuffdiff accepts BAM/SAM files or the CXB files produced by cuffquant. BAM and SAM files can be mixed, but BAM/SAM files cannot be mixed with CXB files.

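3. Options

-h | --help

-o | --output-dir (default: ./): output directory.

-L | --labels (default: q1,q2,...qN): a label for each sample or condition.

-p | --num-threads (default: 1): number of CPU threads to use.

-T | --time-series: have Cuffdiff compare the samples in the order given instead of testing every pair of samples. That is, the second SAM is compared with the first, the third with the second, the fourth with the third, and so on.

-N | --upper-quartile-norm: normalize by the 75th percentile (upper quartile) of the number of fragments mapping to individual loci instead of the total. This helps when looking for differential expression among low-abundance genes and transcripts.

--total-hits-norm: when computing FPKM, count all fragments and aligned reads. Mutually exclusive with the next option. Inactive by default.

--compatible-hits-norm: when computing FPKM, count only the fragments and aligned reads compatible with the reference transcripts. Active by default. It reduces the interference of ribosomal RNA reads with the expression estimates.

-b | --frag-bias-correct (usually genome.fa): supply a FASTA file so that the bias detection and correction algorithm is run. This can markedly improve the accuracy of transcript abundance estimates.

-u | --multi-read-correct: run an initial estimation step so that reads mapping to multiple genomic locations are weighted more accurately.

-c | --min-alignment-count (default: 10): if fewer fragments than this align to a locus, no significance test is performed for that locus, and its expression is treated as not significantly different.

-M | --mask-file: supply a GFF file. All reads that align within the transcripts in this file are ignored. The file usually holds rRNA annotation, and may also include mitochondrial and any other transcripts you want ignored. Removing these unwanted RNAs improves the estimation of mRNA abundance.

--FDR (default: 0.05): the allowed false discovery rate.

--library-type (default: fr-unstranded): set this when the reads are strand-specific; the alignments then carry an XS tag. Standard Illumina data is usually fr-unstranded.

--dispersion-method

Other advanced options:

-m | --frag-len-mean (default: 200): mean fragment length. Cufflinks now learns the fragment length mean itself, so setting this value manually is not recommended.

-s | --frag-len-std-dev (default: 80): standard deviation of the fragment length. Cufflinks now learns the fragment length standard deviation itself, so setting this value manually is not recommended.

-v/--verbose: print version information and other run details.

-q/--quiet: print nothing except warnings and errors.

--no-update-check: turn off the automatic update check.

-F/--min-isoform-fraction <0.0-1.0>: changing this is not recommended. Alternative isoforms whose abundance is below this fraction of the major isoform are rounded down to 0. Default: 1e-5.

--max-bundle-frags: maximum number of fragments a locus may have before it is skipped. Default: 1000000.

--max-frag-count-draws (default: 100) and --max-frag-assign-draws (default: 50)

--min-reps-for-js-test: Cuffdiff won't test genes for differential regulation unless the conditions in question have at least this many replicates. Default: 3.

--no-effective-length-correction: Cuffdiff will not employ its "effective" length normalization to transcript FPKM.

--no-length-correction: Cuffdiff will not normalize fragment counts by transcript length at all. Use this when the fragment count is independent of the size of the features being quantified (e.g. for small RNA libraries, where no fragmentation takes place, or 3 prime end sequencing, where sampled RNA fragments are all essentially the same length). Use with caution.

--max-mle-iterations: number of iterations allowed during maximum likelihood estimation. Default: 5000.

--poisson-dispersion: Use the Poisson fragment dispersion model instead of learning one in each condition.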

4. Cuffdiff output

1. FPKM tracking files: cuffdiff computes the FPKM of each transcript, primary transcript, and gene in each sample. The FPKM of a gene or primary transcript is the sum of the FPKMs of the transcripts in that gene group or primary transcript group.

isoforms.fpkm_tracking: Transcript FPKMs

genes.fpkm_tracking: Gene FPKMs. Tracks the summed FPKM of transcripts sharing each gene_id

cds.fpkm_tracking: Coding sequence FPKMs. Tracks the summed FPKM of transcripts sharing each p_id, independent of tss_id

tss_groups.fpkm_tracking: Primary transcript FPKMs. Tracks the summed FPKM of transcripts sharing each tss_id

2. Count tracking files: estimates of the number of fragments that came from each transcript, primary transcript, and gene in each sample. The count for a primary transcript or gene is the sum of the counts of the transcripts in that primary transcript group or gene group.

isoforms.count_tracking: Transcript counts

genes.count_tracking: Gene counts. Tracks the summed counts of transcripts sharing each gene_id

cds.count_tracking: Coding sequence counts. Tracks the summed counts of transcripts sharing each p_id, independent of tss_id

tss_groups.count_tracking: Primary transcript counts. Tracks the summed counts of transcripts sharing each tss_id

3. Read group tracking files: the expression and fragment count of each transcript, primary transcript, and gene in each replicate.

isoforms.read_group_tracking: Transcript read group tracking

genes.read_group_tracking: Gene read group tracking. Tracks the summed expression and counts of transcripts sharing each gene_id in each replicate

cds.read_group_tracking: Coding sequence FPKMs. Tracks the summed expression and counts of transcripts sharing each p_id, independent of tss_id in each replicate

tss_groups.read_group_tracking: Primary transcript FPKMs. Tracks the summed expression and counts of transcripts sharing each tss_id in each replicate

4. Differential expression tests: tests for differences in expression between samples, for spliced transcripts, primary transcripts, genes, and coding sequences. For each pair of samples x and y, the following four files are produced:

isoform_exp.diff: Transcript differential FPKM.

gene_exp.diff: Gene differential FPKM. Tests differences in the summed FPKM of transcripts sharing each gene_id

tss_group_exp.diff: Primary transcript differential FPKM. Tests differences in the summed FPKM of transcripts sharing each tss_id

cds_exp.diff: Coding sequence differential FPKM. Tests differences in the summed FPKM of transcripts sharing each p_id independent of tss_id

Each file has the following layout:

Column  Name               Example               Description
1       Tested id          XLOC_000001           A unique identifier describing the transcript, gene, primary transcript, or CDS being tested
2       gene               Lypla1                The gene_name(s) or gene_id(s) being tested
3       locus              chr1:4797771-4835363  Genomic coordinates for easy browsing to the genes or transcripts being tested
4       sample 1           Liver                 Label (or number if no labels provided) of the first sample being tested
5       sample 2           Brain                 Label (or number if no labels provided) of the second sample being tested
6       Test status        NOTEST                Can be one of OK (test successful), NOTEST (not enough alignments for testing), LOWDATA (too complex or shallowly sequenced), HIDATA (too many fragments in locus), or FAIL, when an ill-conditioned covariance matrix or other numerical exception prevents testing
7       FPKMx              8.01089               FPKM of the gene in sample x
8       FPKMy              8.551545              FPKM of the gene in sample y
9       log2(FPKMy/FPKMx)  0.06531               The (base 2) log of the fold change y/x
10      test stat          0.860902              The value of the test statistic used to compute significance of the observed change in FPKM
11      p value            0.389292              The uncorrected p-value of the test statistic
12      q value            0.985216              The FDR-adjusted p-value of the test statistic
13      significant        no                    Can be either "yes" or "no", depending on whether p is greater than the FDR after Benjamini-Hochberg correction for multiple-testing
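Rows that passed testing and were called significant can be pulled out of a *_exp.diff file with a one-line filter. This is only a sketch: it assumes a tab-separated file with one header line and the 13-column layout of the table above, and the function name is hypothetical. The column positions should be checked against the header line of your own file, since the layout may differ between Cufflinks versions.

```shell
# significant_rows FILE: print the header line plus every row whose test
# status (column 6) is OK and whose "significant" flag (column 13) is yes.
significant_rows() {
    awk -F '\t' 'NR == 1 || ($6 == "OK" && $13 == "yes")' "$1"
}
```

For example, significant_rows gene_exp.diff > gene_exp.significant.diff keeps only the significant genes. As note 10 in section III points out, a large fold change alone is not evidence of significance, which is why the filter relies on the status and significance columns instead.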

5. Differential splicing tests (splicing.diff): for each primary transcript, tests for differences between its isoforms. Only primary transcripts with two or more isoforms are listed.

Column  Name         Example                  Description
1       Tested id    TSS10015                 A unique identifier describing the primary transcript being tested
2       gene name    Rtkn                     The gene_name or gene_id that the primary transcript being tested belongs to
3       locus        chr6:83087311-83102572   Genomic coordinates for easy browsing to the genes or transcripts being tested
4       sample 1     Liver                    Label (or number if no labels provided) of the first sample being tested
5       sample 2     Brain                    Label (or number if no labels provided) of the second sample being tested
6       Test status  OK                       Can be one of OK (test successful), NOTEST (not enough alignments for testing), LOWDATA (too complex or shallowly sequenced), HIDATA (too many fragments in locus), or FAIL, when an ill-conditioned covariance matrix or other numerical exception prevents testing
7       Reserved     0
8       Reserved     0
9       √JS(x,y)     0.22115                  The splice overloading of the primary transcript, as measured by the square root of the Jensen-Shannon divergence computed on the relative abundances of the splice variants
10      test stat    0.22115                  The value of the test statistic used to compute significance of the observed overloading, equal to √JS(x,y)
11      p value      0.000174982              The uncorrected p-value of the test statistic
12      q value      0.985216                 The FDR-adjusted p-value of the test statistic
13      significant  yes                      Can be either "yes" or "no", depending on whether p is greater than the FDR after Benjamini-Hochberg correction for multiple-testing

6. Differential coding output (cds.diff): for each gene, tests for differences between samples in the output of its coding sequences. Only genes with two or more CDS (multi-protein genes) are listed.

Column  Name         Example                               Description
1       Tested id    XLOC_000002-[chr1:5073200-5152501]    A unique identifier describing the gene being tested
2       gene name    Atp6v1h                               The gene_name or gene_id
3       locus        chr1:5073200-5152501                  Genomic coordinates for easy browsing to the genes or transcripts being tested
4       sample 1     Liver                                 Label (or number if no labels provided) of the first sample being tested
5       sample 2     Brain                                 Label (or number if no labels provided) of the second sample being tested
6       Test status  OK                                    Can be one of OK (test successful), NOTEST (not enough alignments for testing), LOWDATA (too complex or shallowly sequenced), HIDATA (too many fragments in locus), or FAIL, when an ill-conditioned covariance matrix or other numerical exception prevents testing
7       Reserved     0
8       Reserved     0
9       √JS(x,y)     0.0686517                             The CDS overloading of the gene, as measured by the square root of the Jensen-Shannon divergence computed on the relative abundances of the coding sequences
10      test stat    0.0686517                             The value of the test statistic used to compute significance of the observed overloading, equal to √JS(x,y)
11      p value      0.00546783                            The uncorrected p-value of the test statistic
12      q value      0.985216                              The FDR-adjusted p-value of the test statistic
13      significant  yes                                   Can be either "yes" or "no", depending on whether p is greater than the FDR after Benjamini-Hochberg correction for multiple-testing

7. Differential promoter use (promoters.diff): differences in promoter use between samples. Only genes that express two or more isoforms are listed here.

8. Read group info (read_groups.info): for each replicate, lists the key properties cuffdiff used during quantification.

Column  Name            Example                           Description
1       file            mCherry_rep_A/accepted_hits.bam   BAM or SAM file containing the data for the read group
2       condition       mCherry                           Condition to which the read group belongs
3       replicate_num   0                                 Replicate number of the read group
4       total_mass      4.72517e+06                       Total number of fragments for the read group
5       norm_mass       4.72517e+06                       Fragment normalization constant used during calculation of FPKMs
6       internal_scale  1.23916                           Internal scaling factor, used to transform replicates of a single condition onto the "internal" common count scale
7       external_scale  0.96                              External scaling factor, used to transform counts from different conditions onto an internal common count scale

9. Run info (run.info): information about the run.

The FPKM tracking files have the following format:

Column  Name             Example               Description
1       tracking_id      TCONS_00000001        A unique identifier describing the object (gene, transcript, CDS, primary transcript)
2       class_code       =                     The class_code attribute for the object, or "-" if not a transcript, or if class_code isn't present
3       nearest_ref_id   NM_008866.1           The reference transcript to which the class code refers, if any
4       gene_id          NM_008866             The gene_id(s) associated with the object
5       gene_short_name  Lypla1                The gene_short_name(s) associated with the object
6       tss_id           TSS1                  The tss_id associated with the object, or "-" if not a transcript/primary transcript, or if tss_id isn't present
7       locus            chr1:4797771-4835363  Genomic coordinates for easy browsing to the object
8       length           2447                  The number of base pairs in the transcript, or '-' if not a transcript/primary transcript
9       coverage         43.4279               Estimate for the absolute depth of read coverage across the object
10      q0_FPKM          8.01089               FPKM of the object in sample 0
11      q0_FPKM_lo       7.03583               The lower bound of the 95% confidence interval on the FPKM of the object in sample 0
12      q0_FPKM_hi       8.98595               The upper bound of the 95% confidence interval on the FPKM of the object in sample 0
13      q0_status        OK                    Quantification status for the object in sample 0. Can be one of OK (deconvolution successful), LOWDATA (too complex or shallowly sequenced), HIDATA (too many fragments in locus), or FAIL, when an ill-conditioned covariance matrix or other numerical exception prevents deconvolution

The count tracking files have the following format:

Column  Name                      Example         Description
1       tracking_id               TCONS_00000001  A unique identifier describing the object (gene, transcript, CDS, primary transcript)
2       q0_count                  201.334         Estimated (externally scaled) number of fragments generated by the object in sample 0
3       q0_count_variance         5988.24         Estimated variance in the number of fragments generated by the object in sample 0
4       q0_count_uncertainty_var  170.21          Estimated variance in the number of fragments generated by the object in sample 0 due to fragment assignment uncertainty
5       q0_count_dispersion_var   4905.63         Estimated variance in the number of fragments generated by the object in sample 0 due to cross-replicate variability
6       q0_status                 OK              Quantification status for the object in sample 0. Can be one of OK (deconvolution successful), LOWDATA (too complex or shallowly sequenced), HIDATA (too many fragments in locus), or FAIL, when an ill-conditioned covariance matrix or other numerical exception prevents deconvolution

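VII. Problems encountered when using cufflinks

Why does cuffdiff, in the latest version, report no differentially expressed genes when comparing RNA-seq samples that have no replicates?

From v2.0.1 onward, cuffdiff apparently no longer supports RNA-seq data without replicates. Using an earlier version works around this.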

VIII. Cuffquant

cuffquant quantifies the expression levels of genes and transcripts for a single BAM file. It produces a CXB file, abundances.cxb, which can be used as input to cuffdiff (this speeds cuffdiff up) and also as input to Cuffnorm.

Usage: cuffquant [options]*

Its options (with the same meanings as described earlier):

-h/--help; -o/--output-dir; -p/--num-threads; -M/--mask-file; -b/--frag-bias-correct; -u/--multi-read-correct; --library-type; -m/--frag-len-mean; -s/--frag-len-std-dev; --max-mle-iterations; --max-bundle-frags; --no-effective-length-correction; --no-length-correction; -v/--verbose; -q/--quiet; --no-update-check

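IX. Cuffnorm

cuffnorm takes the output files of cuffquant as input and simply computes normalized expression levels for genes and transcripts. Use cuffnorm when what you want is a set of comparable expression values for genes, transcripts, CDS groups, and TSS groups, for example when you only want to draw a heat map or scatter plot of the expression values of individual genes.

cuffnorm [options]* ...[sampleN.sam_replicate1.sam[,...,sample2_replicateM.sam]]

Options: similar to those described earlier; see the corresponding descriptions above.

-h/--help; -o/--output-dir; -L/--labels; -p/--num-threads; --total-hits-norm (inactive by default); --compatible-hits-norm (active by default); --library-type; --library-norm-method; --output-format; -v/--verbose; -q/--quiet; --no-update-check

cuffnorm outputs the normalized expression level of each gene, transcript, TSS group, and CDS group in the experiment. It performs no differential expression analysis. By default the output files are in "simple-table" format, which differs from the cuffdiff output format. To get files in cuffdiff format, pass: --output-format cuffdiff

cuffnorm reports FPKM values and normalized estimates for the number of fragments that originate from each gene, transcript, TSS group, and CDS group. These values are already normalized, so they are not suitable as input for downstream software that requires raw counts.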

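You can create a file, for example sample_sheet.txt, that lists the paths of the SAM files, and use it as input to cuffdiff or cuffnorm. It has two columns, in the following format:

sample_id	group_label
C1_R1.sam	C1
C1_R2.sam	C1
C2_R1.sam	C2
C2_R2.sam	C2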
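A sample sheet like this can be written by a small helper. The sketch below assumes the two columns are tab-separated and takes each replicate as a LABEL:FILE argument; both the function name write_sample_sheet and that argument convention are hypothetical.

```shell
# write_sample_sheet OUT LABEL:FILE...: write a two-column sample sheet with
# the header "sample_id<TAB>group_label" and one row per LABEL:FILE argument.
write_sample_sheet() {
    sheet=$1
    shift
    printf 'sample_id\tgroup_label\n' > "$sheet"
    for pair in "$@"; do
        label=${pair%%:*}   # text before the first colon
        file=${pair#*:}     # text after the first colon
        printf '%s\t%s\n' "$file" "$label" >> "$sheet"
    done
}
```

Calling write_sample_sheet sample_sheet.txt C1:C1_R1.sam C1:C1_R2.sam C2:C2_R1.sam C2:C2_R2.sam would reproduce the example sheet.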

The output files are:

FPKM tracking files: estimated expression levels of the genes.

Count tracking files: estimated fragment count values of the genes.

Read group tracking files: per-replicate expression and count data.

For each of genes, transcripts, TSS groups, and CDS groups, cuffnorm reports two kinds of file: *.fpkm_table files and *.count_table files.
