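After downloading RNA-seq data that has been processed into a read counts matrix, the counts generally need to be normalized by gene length (TPM, RPKM, FPKM, etc.) before expression levels are compared. TPM is currently the most commonly used of the three. Plenty of material online already walks through how each measure is computed, so I won't repeat that here; this post focuses on the formulas and on how to convert between them.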
Prerequisites
Differences between RPKM, FPKM, and TPM (www.plob.org)
Formulas
- FPKM, RPKM
RPKM: Reads Per Kilobase of exon model per Million mapped reads, i.e. the number of reads per kilobase of transcript per million mapped reads.
FPKM: Fragments Per Kilobase of exon model per Million mapped fragments, i.e. the number of fragments per kilobase of transcript per million mapped fragments. It is computed the same way as RPKM, with one difference: RPKM counts reads, FPKM counts fragments. A read count can be computed from both single-end and paired-end data, whereas a fragment count differs from the read count only for paired-end data. With paired-end data, when the two reads of a pair map to the same region in opposite orientations, they are counted as one fragment; if only one read of the pair maps to the region, that single read is also counted as one fragment. The read count is therefore close to, and at most, 2 * the fragment count.
Reference: http://www.lxweimin.com/p/c25e84383ae3
- TPM
TPM: Transcripts Per Million, i.e. the number of transcripts of a given gene per million transcripts in the sample.
In the formula, N_i is the number of reads mapped to the i-th exon, L_i is the length of the i-th exon, and sum() is the sum, over all n exons, of the length-normalized values N_i / L_i.
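- CPM
RPM/CPM: Reads/Counts Per Million mapped reads. It is mostly used to compare the same gene across samples. Because it does not correct for gene length, it is not suitable for comparing expression between different genes within one sample.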
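The definitions above can be written out compactly. This is a sketch under assumed notation: N_i is the number of reads (or fragments, for FPKM) mapped to feature i, L_i is its length in bases, and N is the total number of mapped reads in the sample, taken here as the sum of all N_j. The symbol N is not in the original text. The last line follows from dividing the first by its sum over all features, which cancels the factor 10^9 / N.

```latex
\begin{aligned}
\mathrm{RPKM}_i \ (\text{or } \mathrm{FPKM}_i) &= \frac{N_i \cdot 10^{9}}{L_i \cdot N} \\
\mathrm{TPM}_i &= \frac{N_i / L_i}{\sum_{j=1}^{n} N_j / L_j} \cdot 10^{6} \\
\mathrm{CPM}_i &= \frac{N_i}{N} \cdot 10^{6} \\
\mathrm{TPM}_i &= \frac{\mathrm{FPKM}_i}{\sum_{j=1}^{n} \mathrm{FPKM}_j} \cdot 10^{6}
\end{aligned}
```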
Conversion code
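# Convert raw counts to TPM. The arithmetic is done in log space.
countToTpm <- function(counts, effLen)
{
  rate <- log(counts) - log(effLen)     # log of reads per base for each gene
  denom <- log(sum(exp(rate)))          # log of the summed per-base rates
  exp(rate - denom + log(1e6))          # rescale so the sample sums to 1e6
}
# Convert raw counts to FPKM (1e9 = 1e3 for kilobases * 1e6 for millions).
countToFpkm <- function(counts, effLen)
{
  N <- sum(counts)                      # total counts in the sample
  exp( log(counts) + log(1e9) - log(effLen) - log(N) )
}
# Convert FPKM to TPM by rescaling one sample's FPKM values to sum to 1e6.
fpkmToTpm <- function(fpkm)
{
  exp(log(fpkm) - log(sum(fpkm)) + log(1e6))
}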
################################################################################
# An example
################################################################################
# count convert
cnts <- c(4250, 3300, 200, 1750, 50, 0)
lens <- c(900, 1020, 2000, 770, 3000, 1777)
countDf <- data.frame(count = cnts, length = lens)
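## With a mean fragment length, mean(FLD), of 203.7, the effective length would be:
# countDf$effLength <- countDf$length - 203.7 + 1
## For simplicity, the raw length is used as the effective length here.
countDf$effLength <- countDf$length
countDf$tpm <- with(countDf, countToTpm(count, effLength))
countDf$fpkm <- with(countDf, countToFpkm(count, effLength))
# Print the table: the input counts and lengths plus the computed TPM and FPKM
countDf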
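The log-space functions above are algebraically the same as plain divisions and multiplications. The sketch below restates them in direct form and checks that going from counts to TPM gives the same values as going from counts to FPKM and then to TPM. The helper names ending in `Direct` are illustrative and not part of the original code; the counts and lengths are the example values used above.

```r
# Direct (non-log) forms of the three conversions; helper names are illustrative.
countToTpmDirect <- function(counts, effLen) {
  rate <- counts / effLen                 # reads per base for each gene
  rate / sum(rate) * 1e6                  # rescale so the sample sums to 1e6
}

countToFpkmDirect <- function(counts, effLen) {
  counts * 1e9 / (effLen * sum(counts))   # 1e9 = 1e3 (kilobase) * 1e6 (million)
}

fpkmToTpmDirect <- function(fpkm) {
  fpkm / sum(fpkm) * 1e6
}

# Same example data as in the post
cnts <- c(4250, 3300, 200, 1750, 50, 0)
lens <- c(900, 1020, 2000, 770, 3000, 1777)

tpmFromCounts <- countToTpmDirect(cnts, lens)
tpmFromFpkm <- fpkmToTpmDirect(countToFpkmDirect(cnts, lens))

# Both routes should agree, and the TPM values should sum to one million.
print(all.equal(tpmFromCounts, tpmFromFpkm))
print(sum(tpmFromCounts))
```

The log-space form in the original functions is a common way to keep intermediate values in a moderate range; for data of this size either form should give the same result up to floating-point error.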
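Normally there would also be an effective length calculation, i.e. the sequence length after correcting for experimental bias, and effective counts would then be derived from the effective length. To keep things easy to follow, the raw data are treated as the effective values here and used in the subsequent calculations; see the English-language article referenced below for details.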
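As a rough illustration of the effective length step, the sketch below applies the formula from the commented-out line in the example (length minus the mean fragment length, plus 1) and compares the resulting TPM with the TPM from raw lengths. The value 203.7 is the assumed mean fragment length from that comment, and the variable names are illustrative. Effective counts are not modeled here.

```r
# Example data from the post
cnts <- c(4250, 3300, 200, 1750, 50, 0)
lens <- c(900, 1020, 2000, 770, 3000, 1777)

# Assumed mean of the fragment length distribution (taken from the comment in the example)
meanFld <- 203.7

# Effective length: the number of positions at which an average fragment can start
effLen <- lens - meanFld + 1

# The formula only makes sense for genes longer than the mean fragment length
stopifnot(all(effLen > 0))

# TPM computed with raw lengths and with effective lengths
tpmRaw <- (cnts / lens) / sum(cnts / lens) * 1e6
tpmEff <- (cnts / effLen) / sum(cnts / effLen) * 1e6

print(data.frame(length = lens, effLength = effLen, tpmRaw = tpmRaw, tpmEff = tpmEff))
```

Because the same constant is subtracted from every length, shorter genes lose a larger share of their length, so their TPM rises relative to longer genes when effective lengths are used.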
Output
# FPKM to TPM
expMatrix <- read.table("fpkm_expr.txt", header = TRUE, row.names = 1)
tpms <- apply(expMatrix, 2, fpkmToTpm)
tpms[1:3, ]
# Finally, sanity-check the result using a property of TPM: every column (sample) should sum to the same value
colSums(tpms)
fpkm_expr.txt: rows are genes, columns are samples, and the values are the computed FPKM.
TPM after conversion
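To try the column-wise conversion without the input file, here is a minimal sketch on a small in-memory matrix. The FPKM values, gene names, and sample names are made up purely for illustration, and `fpkmToTpmSketch` is a hypothetical direct-form stand-in for the `fpkmToTpm` function above.

```r
# Direct-form FPKM to TPM conversion for one sample (illustrative helper name)
fpkmToTpmSketch <- function(fpkm) fpkm / sum(fpkm) * 1e6

# Hypothetical FPKM matrix: rows are genes, columns are samples.
# matrix() fills column by column, so the first four values belong to sampleA.
expMatrix <- matrix(
  c(10.5, 0.0, 230.1, 4.2,
    8.0, 1.5, 310.7, 0.0),
  nrow = 4,
  dimnames = list(paste0("gene", 1:4), c("sampleA", "sampleB"))
)

# Convert each column (sample) independently
tpms <- apply(expMatrix, 2, fpkmToTpmSketch)
print(tpms)

# Every sample should sum to one million after conversion
print(colSums(tpms))
```

The conversion is applied per column because the rescaling factor depends on each sample's own FPKM total.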
A very thorough English-language reference:
https://haroldpimentel.wordpress.com/2014/05/08/what-the-fpkm-a-review-rna-seq-expression-units/
Reposted from https://zhuanlan.zhihu.com/p/150300801