首先附上參考文獻:《Spatially informed cell-type deconvolution for spatial transcriptomics》
原理分析
截至目前,大部分的反卷積求細胞組分的基本邏輯都是回歸,CARD的基本邏輯同樣采用的是回歸策略:
其中:
- B: as the G-by-K cell-type-specific expression matrix for the informative genes(單細胞表達矩陣)
- X: as the G-by-N gene expression matrix for the same set of informative genes measured on N spatial locations(空轉表達矩陣)
- V: as the N-by-K cell-type composition matrix(空間每個location的細胞組分矩陣)
- E: 服從
- 以上矩陣均為非負矩陣
同時作者也考慮到了相鄰位置的兩個location細胞成分可能會很相近,因此考慮了矯正,這里的矯正作者利用了回歸到思想,建立了 Vik 與 Vjk 之間的線性關系,表明第 i 個位置的細胞組分首臨近位置細胞組分的影響(第 i 個位置的細胞組分為非自由項):
其中:
- Vik:represents the proportion of cell type k on the ith location,即第 i 個位置 cell type k 的比例
- bk: is the kth cell-type-specific intercept that represents the average cell-type composition across locations,類似于回歸問題中的截距,表示 cell type k 在所有位置比例的均值,為一列向量
- W: an N-by-N non-negative weight matrix with each element Wij specifying the weight used for inferring the cell-type composition on the ith location based on the cell-type composition information on the jth location,類似于回歸問題中的回歸系數
- ?: is a spatial autocorrelation parameter that determines the strength of the spatial correlation in cell-type composition,表征為決定系數
- Vjk:kth cell-type compositions on all other locations
而我們的目的就是要推斷出 Vik 矩陣,即每個location的細胞組分
統計推斷部分:
根據引理:
假設下面式子中的εik服從正態分布
而整個統計推斷將轉變為一個最優化問題,即尋找一個合適的 Vik 矩陣,使得 Vik 與
將公式1化簡以后:
這里巧用矩陣乘法法則,將加和(∑)改變成為了兩個矩陣乘積形式,其中:
提出公因式后
接下來構造似然函數,利用極大似然法求解參數:
首先,作者先定義協方差矩陣如下:
第二步將所有函數的似然值相乘構造似然函數:( k 為 k 個cell type,i 為 i 個location)
這里的似然函數要估計 4 個參數 V,λk,σe2,bk,并且在這個似然函數中要優化兩個回歸:
于是聯合起來得到的似然函數,這個似然函數的作用是使得誤差 Egi 和 εik 以及σe2在 0 處的似然值最大,λk 感覺像是一個正則項:
這里求解的是似然函數的最小值,作者把求解最小值的問題轉換為求解最大值:
最后利用極大似然法求出最優 V矩陣 的參數解,這里的G代表基因數目,N為 location 數目
代碼分析
首先下載相關數據:
- sc_count.RData為單細胞表達矩陣
- sc_meta.RData為單細胞基本信息
- spatial_count.RData為空間轉錄組表達矩陣
行代表基因,列代表位置信息,矩陣元素代表基因在不同 location 的表達量spatial_count- spatial_location.RData為空間轉錄組位置信息
x,y相當于二維坐標,空間轉錄組相當于在一個平面上按照矩形形式標記不同的像素點,10×10代表平面第10行第10列的像素點,每一個像素點相當于一個小的 bulk-seqspatial_location每一個小圓點相當于一個像素點,稱之為一個 location
示例數據給的流程如下:
# 載入數據
load("C:/Users/lenovo/Downloads/spatial_count.RData")
load("C:/Users/lenovo/Downloads/spatial_location.RData")
load("C:/Users/lenovo/Downloads/sc_count.RData")
load("C:/Users/lenovo/Downloads/sc_meta.RData")
library(CARD)
CARD_obj = createCARDObject(
sc_count = sc_count,
sc_meta = sc_meta,
spatial_count = spatial_count,
spatial_location = spatial_location,
ct.varname = "cellType",
ct.select = unique(sc_meta$cellType),
sample.varname = "sampleInfo",
minCountGene = 100,
minCountSpot = 5)
CARD_obj = CARD_deconvolution(CARD_object = CARD_obj)
# 細胞組分矩陣
CARD_obj@Proportion_CARD
細胞組分矩陣:
CARD_obj@Proportion_CARD
1.解析createCARDObject()函數
# 載入數據
sc_count = sc_count
sc_meta = sc_meta
spatial_count = spatial_count
spatial_location = spatial_location
ct.varname = "cellType"
ct.select = unique(sc_meta$cellType)
sample.varname = "sampleInfo"
minCountGene = 100
minCountSpot = 5
# step 1 對單細胞的數據進行質控
sc_countMat <- sc_count
ct.select <- as.character(ct.select[!is.na(ct.select)])
sc_eset = sc_QC(sc_countMat,sc_meta,ct.varname,ct.select,sample.varname)
#### Check the spatial count dataset
#### QC on spatial dataset
spatial_countMat <- spatial_count
commonGene = intersect(rownames(spatial_countMat),rownames(assays(sc_eset)$counts))
# step2 對空轉的數據進行過濾
#### QC on spatial dataset
spatial_countMat = spatial_countMat[rowSums(spatial_countMat > 0) > minCountSpot,]
spatial_countMat = spatial_countMat[,(colSums(spatial_countMat) >= minCountGene & colSums(spatial_countMat) <= 1e6)]
spatial_location = spatial_location[rownames(spatial_location) %in% colnames(spatial_countMat),]
spatial_location = spatial_location[match(colnames(spatial_countMat),rownames(spatial_location)),]
object <- new(
Class = "CARD",
# 質控后的單細胞不同 cell type 的基因表達矩陣
sc_eset = sc_eset,
spatial_countMat = spatial_countMat,
spatial_location = spatial_location,
project = "Deconvolution",
info_parameters = list(ct.varname = ct.varname,ct.select = ct.select,sample.varname = sample.varname)
)
return(object)
## sc_QC 函數的作用是過濾一些低表達的基因和低質量的細胞
sc_QC <- function(counts_in,metaData,ct.varname,ct.select,sample.varname = NULL, min.cells = 0,min.genes = 0){
# Filter based on min.features
coldf = metaData
counts = counts_in
if (min.genes >= 0) {
nfeatures <- colSums(x = counts )
counts <- counts[, which(x = nfeatures > min.genes)]
coldf <- coldf[which(x = nfeatures > min.genes),]
}
# filter genes on the number of cells expressing
if (min.cells >= 0) {
num.cells <- rowSums(x = counts > 0)
counts <- counts[which(x = num.cells > min.cells), ]
}
fdata = as.data.frame(rownames(counts))
rownames(fdata) = rownames(counts)
keepCell = as.character(coldf[,ct.varname]) %in% ct.select
counts = counts[,keepCell]
coldf = coldf[keepCell,]
keepGene = rowSums(counts) > 0
fdata = as.data.frame(fdata[keepGene,])
counts = counts[keepGene,]
sce <- SingleCellExperiment(list(counts=counts),
colData=as.data.frame(coldf),
rowData=as.data.frame(fdata))
return(sce)
}
2.解析CARD_deconvolution()函數
# 讀取createCARDObject()的結果文件
CARD_object = CARD_obj
# 獲取不同的 cellType 名稱, ct.select 為不同 cellType 的名稱
ct.select = CARD_object@info_parameters$ct.select
# ct.varname 為字符串 "cellType"
ct.varname = CARD_object@info_parameters$ct.varname
sample.varname = CARD_object@info_parameters$sample.varname
# sc_eset 為單細胞表達矩陣, 利用 counts(sc_eset) 查看表達矩陣
sc_eset = CARD_object@sc_eset
# 對單細胞表達矩陣進行標準化
Basis_ref = createscRef(sc_eset, ct.select, ct.varname, sample.varname)
Basis = Basis_ref$basis
Basis = Basis[,colnames(Basis) %in% ct.select]
Basis = Basis[,match(ct.select,colnames(Basis))]
# 獲得空間轉錄組表達矩陣 spatial_count
spatial_count = CARD_object@spatial_countMat
commonGene = intersect(rownames(spatial_count),rownames(Basis))
#### remove mitochondrial and ribosomal genes
#### 去除 mt DNA
commonGene = commonGene[!(commonGene %in% commonGene[grep("mt-",commonGene)])]
common = selectInfo(Basis,sc_eset,commonGene,ct.select,ct.varname)
# 空轉表達矩陣 Xinput
Xinput = spatial_count
# 單細胞表達矩陣 B
B = Basis
##### match the common gene names
##### 對空間表達矩陣選擇 common 的 gene 進行后續分析
Xinput = Xinput[order(rownames(Xinput)),]
B = B[order(rownames(B)),]
B = B[rownames(B) %in% common,]
Xinput = Xinput[rownames(Xinput) %in% common,]
##### filter out non expressed genes or cells again
##### 對空間表達矩陣過濾掉沒有表達的細胞和基因
Xinput = Xinput[rowSums(Xinput) > 0,]
Xinput = Xinput[,colSums(Xinput) > 0]
##### normalize count data
##### 對空轉表達矩陣進行標準化
colsumvec = colSums(Xinput)
### 相當于每個基因對相應位置的總測序深度做標準化
Xinput_norm = sweep(Xinput,2,colsumvec,"/")
B = B[rownames(B) %in% rownames(Xinput_norm),]
B = B[match(rownames(Xinput_norm),rownames(B)),]
#### spatial location
#### 獲取空轉的位置信息
spatial_location = CARD_object@spatial_location
spatial_location = spatial_location[rownames(spatial_location) %in% colnames(Xinput_norm),]
spatial_location = spatial_location[match(colnames(Xinput_norm),rownames(spatial_location)),]
##### normalize the coordinates without changing the shape and relative position
### 對空轉的位置進行標準化,轉換為相對位置
norm_cords = spatial_location[ ,c("x","y")]
norm_cords$x = norm_cords$x - min(norm_cords$x)
norm_cords$y = norm_cords$y - min(norm_cords$y)
scaleFactor = max(norm_cords$x,norm_cords$y)
norm_cords$x = norm_cords$x / scaleFactor
norm_cords$y = norm_cords$y / scaleFactor
##### initialize the proportion matrix
### 計算空轉位置間的歐式距離
ED <- rdist::rdist(as.matrix(norm_cords))##Euclidean distance matrix
set.seed(20200107)
Vint1 = as.matrix(gtools::rdirichlet(ncol(Xinput_norm), rep(10,ncol(B))))
colnames(Vint1) = colnames(B)
rownames(Vint1) = colnames(Xinput_norm)
b = rep(0,length(ct.select))
###### parameters that need to be set
isigma = 0.1 ####construct Gaussian kernel with the default scale /length parameter to be 0.1
epsilon = 1e-04 #### convergence epsion
phi = c(0.01,0.1,0.3,0.5,0.7,0.9,0.99) #### grided values for phi
## 隨機生成 W 矩陣
kernel_mat <- exp(-ED^2 / (2 * isigma^2))
diag(kernel_mat) <- 0
###### scale the Xinput_norm and B to speed up the convergence.
mean_X = mean(Xinput_norm)
mean_B = mean(B)
Xinput_norm = Xinput_norm * 1e-01 / mean_X
B = B * 1e-01 / mean_B
ResList = list()
Obj = c()
## 利用不同的參數 phi 來估計模型
for(iphi in 1:length(phi)){
res = CARDref(
XinputIn = as.matrix(Xinput_norm),
UIn = as.matrix(B),
WIn = kernel_mat,
phiIn = phi[iphi],
max_iterIn =1000,
epsilonIn = epsilon,
initV = Vint1,
initb = rep(0,ncol(B)),
initSigma_e2 = 0.1,
initLambda = rep(10,length(ct.select)))
rownames(res$V) = colnames(Xinput_norm)
colnames(res$V) = colnames(B)
ResList[[iphi]] = res
Obj = c(Obj,res$Obj)
}
## 選擇最優的參數下的模型
Optimal = which(Obj == max(Obj))
Optimal = Optimal[length(Optimal)] #### just in case if there are two equal objective function values
OptimalPhi = phi[Optimal]
OptimalRes = ResList[[Optimal]]
cat(paste0("## Deconvolution Finish! ...\n"))
CARD_object@info_parameters$phi = OptimalPhi
### 獲得細胞組分矩陣 Proportion_CARD
CARD_object@Proportion_CARD = sweep(OptimalRes$V,1,rowSums(OptimalRes$V),"/")
CARD_object@algorithm_matrix = list(B = B * mean_B / 1e-01, Xinput_norm = Xinput_norm * mean_X / 1e-01, Res = OptimalRes)
CARD_object@spatial_location = spatial_location
################################## 其中 createscRef() 函數
# 讀取數據
x = sc_eset # 單細胞表達矩陣
ct.select = ct.select # 獲取不同的 cellType 名稱, ct.select 為不同 cellType 的名稱
ct.varname = ct.varname # ct.varname 為字符串 "cellType"
sample.varname = sample.varname
# 其中 createscRef() 函數
createscRef <- function(x, ct.select = NULL, ct.varname, sample.varname = NULL){
library(MuSiC)
if (is.null(ct.select)) {
ct.select <- unique(colData(x)[, ct.varname])
}
# 去除 cellType 為 NA 的 cell Type
ct.select <- ct.select[!is.na(ct.select)]
# countMat <- as.matrix(assays(x)$counts)
# 將單細胞表達矩陣取出來賦予 countMat
countMat <- as(SummarizedExperiment::assays(x)$counts,"sparseMatrix")
# ct.id 的作用相當于將每個 cell 賦予對應的 cell type
ct.id <- droplevels(as.factor(SummarizedExperiment::colData(x)[, ct.varname]))
#if(length(unique(colData(x)[,sample.varname])) > 1){
if(is.null(sample.varname)){
SummarizedExperiment::colData(x)$sampleID = "Sample"
sample.varname = "sampleID"
}
sample.id <- as.character(SummarizedExperiment::colData(x)[, sample.varname])
ct_sample.id <- paste(ct.id, sample.id, sep = "$*$")
colSums_countMat <- colSums(countMat)
colSums_countMat_Ct = aggregate(colSums_countMat ~ ct.id + sample.id, FUN = 'sum')
colSums_countMat_Ct_wide = reshape(colSums_countMat_Ct, idvar = "sample.id", timevar = "ct.id", direction = "wide")
colnames(colSums_countMat_Ct_wide) = gsub("colSums_countMat.","",colnames(colSums_countMat_Ct_wide))
rownames(colSums_countMat_Ct_wide) = colSums_countMat_Ct_wide$sample.id
colSums_countMat_Ct_wide$sample.id <- NULL
tbl <- table(sample.id,ct.id)
colSums_countMat_Ct_wide = colSums_countMat_Ct_wide[,match(colnames(tbl),colnames(colSums_countMat_Ct_wide))]
colSums_countMat_Ct_wide = colSums_countMat_Ct_wide[match(rownames(tbl),rownames(colSums_countMat_Ct_wide)),]
S_JK <- colSums_countMat_Ct_wide / tbl
S_JK <- as.matrix(S_JK)
S_JK[S_JK == 0] = NA
S_JK[!is.finite(S_JK)] = NA
S = colMeans(S_JK, na.rm = TRUE)
S = S[match(unique(ct.id),names(S))]
library("wrMisc")
if(nrow(countMat) > 10000 & ncol(countMat) > 50000){ ### to save memory
seqID = seq(1,nrow(countMat),by = 10000)
Theta_S_rowMean = NULL
for(igs in seqID){
if(igs != seqID[length(seqID)]){
Theta_S_rowMean_Tmp <- rowGrpMeans(as.matrix(countMat[(igs:(igs+9999)),]), grp = ct_sample.id, na.rm = TRUE)
}else{
Theta_S_rowMean_Tmp <- rowGrpMeans(as.matrix(countMat[igs:nrow(countMat),]), grp = ct_sample.id, na.rm = TRUE)
}
Theta_S_rowMean <- rbind(Theta_S_rowMean,Theta_S_rowMean_Tmp)
}
}else{
Theta_S_rowMean <- rowGrpMeans(as.matrix(countMat), grp = ct_sample.id, na.rm = TRUE)
}
tbl_sample = table(ct_sample.id)
tbl_sample = tbl_sample[match(colnames(Theta_S_rowMean),names(tbl_sample))]
Theta_S_rowSums <- sweep(Theta_S_rowMean,2,tbl_sample,"*")
Theta_S <- sweep(Theta_S_rowSums,2,colSums(Theta_S_rowSums),"/")
grp <- sapply(strsplit(colnames(Theta_S),split="$*$",fixed = TRUE),"[",1)
Theta = rowGrpMeans(Theta_S, grp = grp, na.rm = TRUE)
Theta = Theta[,match(unique(ct.id),colnames(Theta))]
S = S[match(colnames(Theta),names(S))]
basis = sweep(Theta,2,S,"*")
colnames(basis) = colnames(Theta)
rownames(basis) = rownames(Theta)
return(list(basis = basis))
}
################################## 其中 selectInfo() 函數
selectInfo <- function(Basis,sc_eset,commonGene,ct.select,ct.varname){
#### log2 mean fold change >0.5
gene1 = lapply(ct.select,function(ict){
rest = rowMeans(Basis[,colnames(Basis) != ict])
FC = log((Basis[,ict] + 1e-06)) - log((rest + 1e-06))
rownames(Basis)[FC > 1.25 & Basis[,ict] > 0]
})
gene1 = unique(unlist(gene1))
gene1 = intersect(gene1,commonGene)
counts = assays(sc_eset)$counts
counts = counts[rownames(counts) %in% gene1,]
##### only check the cell type that contains at least 2 cells
ct.select = names(table(colData(sc_eset)[,ct.varname]))[table(colData(sc_eset)[,ct.varname]) > 1]
sd_within = sapply(ct.select,function(ict){
temp = counts[,colData(sc_eset)[,ct.varname] == ict]
apply(temp,1,var) / apply(temp,1,mean)
})
##### remove the outliers that have high dispersion across cell types
gene2 = rownames(sd_within)[apply(sd_within,1,mean,na.rm = T) < quantile(apply(sd_within,1,mean,na.rm = T),prob = 0.99,na.rm = T)]
return(gene2)
}
關于 CARD_deconvolution()中的變量
- 有關
Basis_ref$basis
:Basis_ref$basis- 有關 spatial_count
spatial_count- 有關標準化后的位置信息 norm_cords:
norm_cords
關于 createscRef()中的變量
- 有關 ct.id:
ct.id- 有關 ct.select:
ct.select
3. Cpp 函數 CARDref()
#include <iostream>
#include <fstream>
#define ARMA_64BIT_WORD 1
#include <RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo)]]
#include <R.h>
#include <Rmath.h>
#include <cmath>
#include <stdio.h>
#include <stdlib.h>
#include <cstring>
#include <ctime>
#include <Rcpp.h>
// Enable C++11 via this plugin (Rcpp 0.10.3 or later)
// [[Rcpp::plugins(cpp11)]]
using namespace std;
using namespace arma;
using namespace Rcpp;
#define ARMA_DONT_PRINT_ERRORS
//*******************************************************************//
// spatially informed deconvolution:CARD //
//*******************************************************************//
//' SpatialDeconv function based on Conditional Autoregressive model
//' @param XinputIn The input of normalized spatial data
//' @param UIn The input of cell type specific basis matrix B
//' @param WIn The constructed W weight matrix from Gaussian kernel
//' @param phiIn The phi value
//' @param max_iterIn Maximum iterations
//' @param epsilonIn epsilon for convergence
//' @param initV Initial matrix of cell type compositions V
//' @param initb Initial vector of cell type specific intercept
//' @param initSigma_e2 Initial value of residual variance
//' @param initLambda Initial vector of cell type sepcific scalar.
//'
//' @return A list
//'
//' @export
// [[Rcpp::export]]
SEXP CARDref(SEXP XinputIn, SEXP UIn, SEXP WIn, SEXP phiIn, SEXP max_iterIn, SEXP epsilonIn, SEXP initV, SEXP initb, SEXP initSigma_e2, SEXP initLambda)
{
try {
// read in the data
arma::mat Xinput = as<mat>(XinputIn);
arma::mat U = as<mat>(UIn);
arma::mat W = as<mat>(WIn);
double phi = as<double>(phiIn);
int max_iter = Rcpp::as<int>(max_iterIn);
double epsilon = as<double>(epsilonIn);
arma::mat V = as<mat>(initV);
arma::vec b = as<vec>(initb);
double sigma_e2 = as<double>(initSigma_e2);
arma::vec lambda = as<vec>(initLambda);
// initialize some useful items
int nSample = (int)Xinput.n_cols; // number of spatial sample points
int mGene = (int)Xinput.n_rows; // number of genes in spatial deconvolution
int k = (int)U.n_cols; // number of cell type
arma::mat L = zeros<mat>(nSample,nSample);
arma::mat D = zeros<mat>(nSample,nSample);
arma::mat V_old = zeros<mat>(nSample,k);
arma::mat UtU = zeros<mat>(k,k);
arma::mat VtV = zeros<mat>(k,k);
arma::vec colsum_W = zeros<vec>(nSample);
arma::mat UtX = zeros<mat>(k,nSample);
arma::mat XtU = zeros<mat>(nSample,k);
arma::mat UtXV = zeros<mat>(k,k);
arma::mat temp = zeros<mat>(k,k);
arma::mat part1 = zeros<mat>(nSample,k);
arma::mat part2 = zeros<mat>(nSample,k);
arma::vec updateV_k = zeros<vec>(k);
arma::vec updateV_den_k = zeros<vec>(k);
arma::vec vecOne = ones<vec>( nSample);
arma::vec diag_UtU = zeros<vec>(k);
bool logicalLogL = FALSE;
double obj = 0;
double obj_old = 0;
double normNMF = 0;
double logX = 0;
double logV = 0;
double alpha = 1.0;
double beta = nSample / 2.0;
double logSigmaL2 = 0.0;
double accu_L = 0.0;
double trac_xxt = accu(Xinput % Xinput);
// initialize values
// constant matrix caculations for increasing speed
UtX = U.t() * Xinput;
XtU = UtX.t();
colsum_W = sum(W,1);
D = diagmat(colsum_W);// diagnol matrix whose entries are column
L = D - phi*W; // graph laplacian
accu_L = accu(L);
UtXV = UtX * V;
VtV = V.t() * V;
UtU = U.t() * U;
diag_UtU = UtU.diag();
// calculate initial objective function
normNMF = trac_xxt - 2.0 * trace(UtXV) + trace(UtU * VtV);
logX = -(double)(mGene * nSample) * 0.5 * log(sigma_e2) - 0.5 * (double)(normNMF / sigma_e2);
temp = (V.t() - b * vecOne.t()) * L * (V - vecOne * b.t());
logV = - (double)(nSample) * 0.5 * sum(log(lambda )) - 0.5 * (sum(temp.diag() / lambda ));
logSigmaL2 = -(alpha + 1.0) * sum(log(lambda)) - sum(beta / lambda);
obj_old = logX + logV + logSigmaL2;
V_old = V;
// iteration starts
for(int i = 1; i <= max_iter; ++i) {
logV = 0.0;
b = sum(V.t() * L, 1) / accu_L;
lambda = (temp.diag() / 2.0 + beta ) / (double(nSample) / 2.0 + alpha + 1.0);
part1 = sigma_e2 * (D * V + phi * colsum_W * b.t());
part2 = sigma_e2 * (phi * W * V + colsum_W * b.t());
for(int nCT = 0; nCT < k; ++nCT){
updateV_den_k = lambda(nCT) * (V.col(nCT) * diag_UtU(nCT) + (V * UtU.col(nCT) - V.col(nCT) * diag_UtU(nCT))) + part1.col(nCT);
updateV_k = (lambda(nCT) * XtU.col(nCT) + part2.col(nCT)) / updateV_den_k;
V.col(nCT) %= updateV_k;
}
UtXV = UtX * V;
VtV = V.t() * V;
normNMF = trac_xxt - 2.0 * trace(UtXV) + trace(UtU * VtV);
sigma_e2 = normNMF / (double)(mGene * nSample);
temp = (V.t() - b * vecOne.t()) * L * (V - vecOne * b.t());
logX = -(double)(nSample * mGene) * 0.5 * log(sigma_e2) - 0.5 * (double)(normNMF / sigma_e2);
logV = -(double)(nSample) * 0.5 * sum(log(lambda))- 0.5 * (sum(temp.diag() / lambda ));
logSigmaL2 = -(alpha + 1.0) * sum(log(lambda)) - sum(beta / lambda);
obj = logX + logV + logSigmaL2;
logicalLogL = (obj > obj_old) && (abs(obj - obj_old) * 2.0 / abs(obj + obj_old) < epsilon);
if(isnan(obj) || (sqrt(accu((V - V_old) % (V - V_old)) / double(nSample * k)) < epsilon) || logicalLogL){
if(i > 5){ // run at least 5 iterations
break;
}
}else{
obj_old = obj;
V_old = V;
}
}
return List::create(Named("V") = V,
Named("sigma_e2") = sigma_e2,
Named("lambda") = lambda,
Named("b") = b,
Named("Obj") = obj);
}//end try
catch (std::exception &ex)
{
forward_exception_to_r(ex);
}
catch (...)
{
::Rf_error("C++ exception (unknown reason)...");
}
return R_NilValue;
} // end funcs
Cpp 中的變量解釋:
trac_xxt = accu(Xinput % Xinput);
,與公式3-1的:2.0 * trace(UtXV)
:# 根據矩陣乘法的性質, 體現出加和的形式 UtX = U.t() * Xinput; UtXV = UtX * V; trace(UtXV);
UtX = U.t() * Xinput; UtXV = UtX * V;
,代表 BTXV
trace(UtU * VtV)
:# 根據矩陣乘法的性質, 體現出加和的形式 VtV = V.t() * V; UtU = U.t() * U; trace(UtU * VtV);
UtU = U.t() * U;
,代表 BTB;VtV = V.t() * V;
,代表 VTV
其中:
- 初始化各項矩陣:
UtX = U.t() * Xinput; XtU = UtX.t(); colsum_W = sum(W,1); D = diagmat(colsum_W);// diagnol matrix whose entries are column L = D - phi*W; // graph laplacian accu_L = accu(L); UtXV = UtX * V; VtV = V.t() * V; UtU = U.t() * U; diag_UtU = UtU.diag();
- 構造似然函數:
normNMF = trac_xxt - 2.0 * trace(UtXV) + trace(UtU * VtV); logX = -(double)(mGene * nSample) * 0.5 * log(sigma_e2) - 0.5 * (double)(normNMF / sigma_e2); temp = (V.t() - b * vecOne.t()) * L * (V - vecOne * b.t()); logV = -(double)(nSample) * 0.5 * sum(log(lambda )) - 0.5 * (sum(temp.diag() / lambda )); logSigmaL2 = -(alpha + 1.0) * sum(log(lambda)) - sum(beta / lambda); obj_old = logX + logV + logSigmaL2; V_old = V;
normNMF = trac_xxt - 2.0 * trace(UtXV) + trace(UtU * VtV);
代表公式3-1中的:logX = -(double)(mGene * nSample) * 0.5 * log(sigma_e2) - 0.5 * (double)(normNMF / sigma_e2);
代表公式3-1中的(不清楚為什么這里的符號是反的):temp = (V.t() - b * vecOne.t()) * L * (V - vecOne * b.t());
代表公式3-1中的:logV = - (double)(nSample) * 0.5 * sum(log(lambda )) - 0.5 * (sum(temp.diag() / lambda ));
代表公式3-1中的:logSigmaL2 = -(alpha + 1.0) * sum(log(lambda)) - sum(beta / lambda);
代表公式3-1中的:obj_old = logX + logV + logSigmaL2;
將他們加和- 迭代終止條件:
滿足下式小于定義的epsilon即可if(isnan(obj) || (sqrt(accu((V - V_old) % (V - V_old)) / double(nSample * k)) < epsilon) || logicalLogL){ if(i > 5){ // run at least 5 iterations break; }
- 每次迭代更新 V 矩陣:
for(int nCT = 0; nCT < k; ++nCT){ // nCT 相當于cell type k,V.col(nCT) 代表提取 V 矩陣的第 nCT 列 updateV_den_k = lambda(nCT) * (V.col(nCT) * diag_UtU(nCT) + (V * UtU.col(nCT) - V.col(nCT) * diag_UtU(nCT))) + part1.col(nCT); // 計算每一列的 updateV_k updateV_k = (lambda(nCT) * XtU.col(nCT) + part2.col(nCT)) / updateV_den_k; // V 矩陣的每一列除以 updateV_k,從而更新 V 矩陣 V.col(nCT) %= updateV_k; }
然后基于更新后的 V 矩陣更新 lambda :
temp = (V.t() - b * vecOne.t()) * L * (V - vecOne * b.t()); lambda = (temp.diag() / 2.0 + beta ) / (double(nSample) / 2.0 + alpha + 1.0);
然后基于更新后的 lambda 在更新 V 矩陣: