Seurat 4 is out in beta, and it brings two major changes:
(1) Multimodal analysis with WNN (weighted nearest neighbor), which makes it possible to jointly analyze single-cell transcriptome data together with protein, ATAC, spatial transcriptomics, and other modalities.
(2) Rapid mapping of query datasets to references, i.e. what we usually call mapping. With a good reference dataset, mapping can accomplish things that unsupervised analysis cannot, such as cell annotation and clustering.
The official site is here: Seurat4.
Here we mainly want to do our homework on the analytical ideas behind it.
What is WNN?
First, what is multimodal analysis? "Multimodal" is the direct translation of the English term: multiple data types measured in the same cell, for example the genome, transcriptome, proteome, and ATAC profile of one and the same cell. Multimodal analysis is then the joint analysis of those data types from the same cell. Multimodal analysis represents a new and exciting frontier for single-cell genomics; most of us now sit on plenty of single-cell data, and even the multimodal datasets published in papers can be reused, and the joint analysis of such data is exactly where WNN comes in.
WNN (weighted nearest neighbor analysis) is "an unsupervised strategy to learn the information content of each modality in each cell, and to define cellular state based on a weighted combination of both modalities." In other words, a weight is computed for each data type in each cell and used to combine the modalities; the details are in the paper's methods, which we now go through carefully.
The paper is here: Integrated analysis of multimodal single-cell data.
Let's distill the key points.
First, what WNN is for. The paper describes it as "an analytical framework to integrate multiple data types measured within a cell, and to obtain a joint definition of cellular state", that is, cell type and state are defined from the joint analysis of the multimodal data. "Our approach is based on an unsupervised strategy to learn cell-specific modality 'weights', which reflect the information content for each modality, and determine its relative importance in downstream analyses."
WNN construction has four steps:
(1) Constructing independent k-nearest neighbor (KNN) graphs for both modalities.
(2) Performing within- and across-modality prediction.
(3) Calculating cell-specific modality weights.
(4) Calculating a WNN graph.
Let's walk through each step in detail.
Step 1: Constructing independent k-nearest neighbor (KNN) graphs for both modalities
KNN itself has come up many times before in my machine learning posts, so I won't repeat it here; see my article 機(jī)器學(xué)習(xí)的常用算法(生信必備). What deserves attention is how each modality's KNN graph is built (including the choice of k), and the paper spells out the procedure for each data type; a minimal code sketch follows the three cases below:
1. For the scRNA-seq data: "We analyze scRNA-seq data using standard pipelines in Seurat which include normalization, feature selection, and dimensional reduction with PCA. We then construct a KNN graph after dimensional reduction." This is basically the standard Seurat workflow.
2. For single-cell protein expression: "We analyze single-cell protein data (representing the quantification of antibody-derived tags (ADTs) in CITE-seq or ASAP-seq data) using a similar workflow to scRNA-seq. We normalize protein expression levels within a cell using the centered log-ratio (CLR) transform, followed by dimensional reduction with PCA, and subsequently construct a KNN graph. Unless otherwise specified, we do not perform feature selection on protein data, and use all measured proteins during dimensional reduction." So for the protein data no feature selection is performed; all measured proteins go straight into dimensional reduction, and the KNN graph is built from that.
3. For single-cell ATAC data: "We analyze single-cell ATAC-seq data using our previously described workflow, as implemented in the Signac package. We reduced the dimensionality of the scATAC-seq data by performing latent semantic indexing (LSI) on the scATAC-seq peak matrix, as suggested by Cusanovich and Hill et al. We first computed the term frequency-inverse document frequency (TF-IDF) of the peak matrix by dividing the accessibility of each peak in each cell by the total accessibility in the cell (the 'term frequency'), and multiplied this by the inverse accessibility of the peak in the cell population. This step 'upweights' the contribution of highly variable peaks and downweights peaks that are accessible in all cells. We then multiplied these values by 10,000 and log-transformed this TF-IDF matrix, adding a pseudocount of 1 to avoid computing the log of 0. We decomposed the TF-IDF matrix via SVD to return LSI components, and scaled LSI loadings for each LSI component to mean 0 and standard deviation 1." If you have hands-on experience with ATAC data, feel free to share more.
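To make these three routes concrete, here is a minimal sketch with Seurat and Signac. It assumes a Seurat object called obj that already holds "RNA", "ADT", and "ATAC" assays; the object name, assay names, and dimension choices are my assumptions, not something taken from the paper:

library(Seurat)
library(Signac)

# RNA: standard Seurat pipeline, then PCA
DefaultAssay(obj) <- "RNA"
obj <- NormalizeData(obj)
obj <- FindVariableFeatures(obj)
obj <- ScaleData(obj)
obj <- RunPCA(obj, reduction.name = "pca")

# Protein (ADT): CLR normalization, no feature selection (use all proteins), then PCA
DefaultAssay(obj) <- "ADT"
VariableFeatures(obj) <- rownames(obj[["ADT"]])
obj <- NormalizeData(obj, normalization.method = "CLR", margin = 2)
obj <- ScaleData(obj)
obj <- RunPCA(obj, reduction.name = "apca", npcs = 18)  # keep npcs below the number of measured proteins

# ATAC: TF-IDF + LSI (SVD) via Signac
DefaultAssay(obj) <- "ATAC"
obj <- RunTFIDF(obj)
obj <- FindTopFeatures(obj, min.cutoff = "q0")
obj <- RunSVD(obj, reduction.name = "lsi")

# Step 1 then amounts to an independent KNN graph per modality, e.g. for RNA:
obj <- FindNeighbors(obj, reduction = "pca", dims = 1:30)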
Step 2: Performing within- and across-modality prediction
This step is the key one. Let's look at the principle, using the data setup in the paper as the example:
Suppose we have a CITE-seq dataset where two modalities, RNA and protein, are measured in each single cell. From the previous step, we define the following:
Here are the basic definitions you need to know:
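Roughly following the paper's notation (the exact symbols here are my reconstruction), the previous step gives, for every cell i:

r_i, p_i — the low-dimensional (post-PCA) RNA and protein profiles of cell i;
N_i^RNA, N_i^protein — the sets of k nearest neighbors of cell i in the RNA and the protein low-dimensional space, respectively.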
We average the low-dimensional profiles of each neighbor set, which represents a prediction for the molecular contents for cell i based on its local neighborhoods. We perform both within-modality and across-modality prediction:
Within-modality prediction (RNA):
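Written out with the notation above (my reconstruction of the formula, symbols assumed), the prediction is just the average of the RNA profiles of cell i's RNA neighbors:

\hat{r}_i^{RNA} = \frac{1}{k} \sum_{j \in N_i^{RNA}} r_j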
As you can see, this value is just the average of the low-dimensional profiles of a cell's neighbors, and the resulting vector is a prediction of that cell's molecular contents.
The same idea applies to the protein data:
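Likewise, with the same caveat about the symbols:

\hat{p}_i^{protein} = \frac{1}{k} \sum_{j \in N_i^{protein}} p_j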
Note that these are still predictions within one modality; next come the predictions across modalities:
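For the across-modality case the neighbor set comes from one modality while the profiles being averaged come from the other (again a reconstruction with assumed symbols):

\hat{r}_i^{protein} = \frac{1}{k} \sum_{j \in N_i^{protein}} r_j  (RNA predicted from the protein neighbors)

\hat{p}_i^{RNA} = \frac{1}{k} \sum_{j \in N_i^{RNA}} p_j  (protein predicted from the RNA neighbors)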
Looking at these formulas, we can see the following:
The key to across-modality prediction is that, to predict a cell's transcriptome from the protein side, we take that cell's neighbors in protein space and average their RNA profiles, and this average stands in for the predicted transcriptome; predicting the protein level from the RNA side works the same way. So every cell ends up with a protein profile predicted from its RNA neighbors and an RNA profile predicted from its protein neighbors. With the predictions in place, we move on to the weights.
Step 3: Calculating cell-specific modality weights
The thing to note here is the Euclidean distance between the predicted and the measured ("true") profile: the smaller the distance, the closer the prediction is to what sequencing actually measured. This distance is then converted into an affinity by a kernel, and the affinity tracks the discrepancy between prediction and truth: the larger the discrepancy, the smaller the affinity.
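As a simplified illustration of the shape of this conversion (the paper's actual exponential kernel also involves the distance to the nearest measured neighbor, so do not take this as the exact formula):

\theta_i \approx \exp\left( - \, d(r_i, \hat{r}_i) / \sigma_i \right)

where d is the Euclidean distance between measured and predicted profile and \sigma_i is the cell-specific kernel bandwidth discussed next.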
Our approach is inspired by the concept of large margin nearest neighbors, which aims to identify kernel bandwidths that separate data points in the same class from those in different classes, even if the classes are closely related. In the context of unsupervised single-cell analysis (where the data points are unlabeled), we aim to identify a kernel bandwidth that groups together cells in the same state, yet divides cells that originate from closely related (but different) states.
How this bandwidth is determined deserves a closer look; here is the reasoning behind it:
Recent work has clearly demonstrated that KNN-graphs are prone to the formation of spurious edges, which represent links between cells that share some similarity in molecular profiles, but are not in a matched molecular state. However, it is possible to identify these spurious edges through the use of the Jaccard metric. This identifies the number of shared nearest neighbors between two cells, thereby exploiting the local density of each data point to separate well-supported from spurious edges.
Here we need one concept: the Jaccard metric. It measures the similarity between finite sets; the larger the Jaccard coefficient, the more similar the two sets. Its definition is given below.
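For two finite sets A and B (here, the k-nearest-neighbor sets of two cells), the Jaccard index is

J(A, B) = \frac{|A \cap B|}{|A \cup B|}

i.e. the number of shared neighbors divided by the size of the union.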
Next comes the calculation of the bandwidth itself:
For each cell i, we therefore aim to identify the 20 cells in the dataset with the lowest non-zero Jaccard similarity. We expect that these represent cells that exhibit some similarity with cell i, but are unlikely to reside in the same molecular state. If more than 20 cells share the same Jaccard value, we select the 20 with the furthest euclidean distance to cell i. We take the average of the Euclidean distances from cell i to the 20 selected cells, and set this as the cell-specific kernel bandwidth.
In other words, a cell's bandwidth is the average Euclidean distance from that cell to these 20 selected "boundary" cells, cells that are somewhat similar to it but unlikely to share its molecular state.
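A toy sketch of this rule, not Seurat's actual implementation: it assumes an embedding matrix emb (cells x dimensions) and a KNN index matrix knn (cells x k, each row holding the neighbor indices of one cell); both names are made up for illustration, and the paper's tie-breaking by furthest distance is skipped here:

# Jaccard similarity between the neighbor set of cell i and that of every cell
jaccard_to_all <- function(knn, i) {
  ni <- knn[i, ]
  apply(knn, 1, function(nj) length(intersect(ni, nj)) / length(union(ni, nj)))
}

# Cell-specific kernel bandwidth: mean Euclidean distance from cell i to the
# 20 cells with the lowest non-zero Jaccard similarity
bandwidth_for_cell <- function(emb, knn, i, n_far = 20) {
  jac <- jaccard_to_all(knn, i)
  jac[i] <- NA                                   # ignore the cell itself
  cand <- which(!is.na(jac) & jac > 0)           # keep non-zero Jaccard only
  cand <- cand[order(jac[cand])]                 # lowest Jaccard first
  cand <- cand[seq_len(min(n_far, length(cand)))]
  d <- sqrt(colSums((t(emb[cand, , drop = FALSE]) - emb[i, ])^2))
  mean(d)
}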
Back now to the main thread of Step 3: calculating the cell-specific modality weights.
Computing the modality weights: yet another pile of formulas.
In short, for each modality we compute a ratio that measures how close its prediction is to the measured ("true") value; the larger the ratio, the better that modality predicts the cell. This yields two ratio values per cell, and the authors then apply a softmax transformation to turn these affinity ratios into the final weights.
Here we need a quick look at the softmax transformation, shown below:
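For values x_1, ..., x_n the softmax is

\mathrm{softmax}(x)_j = \frac{\exp(x_j)}{\sum_k \exp(x_k)}

In WNN the two per-cell ratio values (one for RNA, one for protein) go through it, so each cell's modality weights are positive and sum to 1.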
Complicated, isn't it? They say maths never lets you down; well, I just can't do it, and can't means can't.
Step 4: Calculating a WNN graph
Step 3 gave us the per-cell weights for the modalities; next we build the WNN graph.
We leverage the cell-specific modality weights calculated above to define a new similarity metric between any two cells, which reflects a weighted combination of RNA and protein affinities. For two cells i and j, we define their weighted similarity as:
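Written out (a reconstruction with assumed symbols), it is the modality weights of cell i applied to the corresponding affinities between i and j:

\theta_{weighted}(i, j) = w_i^{RNA} \, \theta^{RNA}(i, j) + w_i^{protein} \, \theta^{protein}(i, j)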
Pairwise similarity between cells is thus redefined, and it is at this point that the modalities truly get analyzed jointly.
We then construct a WNN graph, defined as a KNN graph constructed using this weighted similarity metric.
For each cell, we consider the set:
and identify the k-most similar cells within this set based on the weighted similarity metric as weighted nearest neighbors.
At this point the new joint structure, the WNN graph, is in place and downstream analysis can begin; it is the fused representation of the modalities, and this is how multimodal analysis is realized.
As for the code, it is simple, a single function in the Seurat package:

# pca: the RNA PCA reduction; apca: the protein (ADT) PCA reduction
bm <- FindMultiModalNeighbors(
  bm,
  reduction.list = list("pca", "apca"),
  dims.list = list(1:30, 1:18),
  modality.weight.name = "RNA.weight"
)
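For completeness, the downstream steps then run on the WNN graph. Roughly as in the Seurat WNN vignette, using Seurat's default names for the neighbor object ("weighted.nn") and the shared-neighbor graph ("wsnn"); treat the exact parameter values as assumptions:

bm <- RunUMAP(bm, nn.name = "weighted.nn",
              reduction.name = "wnn.umap", reduction.key = "wnnUMAP_")
bm <- FindClusters(bm, graph.name = "wsnn", algorithm = 3, resolution = 2)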
Stay angry, and let 王多魚 lose his entire fortune.