Dimensionality Reduction for 10X Single-Cell Data: PHATE

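There are many dimensionality-reduction methods for single-cell data (PCA, t-SNE, UMAP). You don't need to try them one by one. Master a few of the main tools, understand their principles and code in depth, and combine their complementary strengths to reach your analysis goal.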

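Today I am sharing a method from the paper "Visualizing structure and transitions in high-dimensional biological data". The journal's impact factor is above 36, which is quite high. Our task today is to work through the paper and share the code. Please study it carefully and grasp the essentials rather than simply copying code.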

The paper:

I. Abstract:

The high-dimensional data created by high-throughput technologies require visualization tools that reveal data structure and patterns in an intuitive form. We present PHATE, a visualization method that captures both local and global nonlinear structure using an information-geometric distance between data points. We compare PHATE to other tools on a variety of artificial and biological datasets, and find that it consistently preserves a range of patterns in data, including continual progressions, branches and clusters, better than other tools. We define a manifold preservation metric, which we call denoised embedding manifold preservation (DEMaP), and show that PHATE produces lower-dimensional embeddings that are quantitatively better denoised as compared to existing visualization methods. An analysis of a newly generated single-cell RNA sequencing dataset on human germ-layer differentiation demonstrates how PHATE reveals unique biological insight into the main developmental branches, including identification of three previously undescribed subpopulations. We also show that PHATE is applicable to a wide variety of data types, including mass cytometry, single-cell RNA sequencing, Hi-C and gut microbiome data. (Nothing much of interest here; the authors are simply praising their own software.)

II. Introduction

First, single-cell data really do need good visualization tools. Existing ones include principal component analysis (PCA), t-distributed stochastic neighbor embedding (t-SNE) and Uniform Manifold Approximation and Projection (UMAP); UMAP is probably the one most people use now. However, these methods are suboptimal for exploring high-dimensional biological data. The reasons are:
(1) Such methods tend to be sensitive to noise. (I wonder whether you have looked into denoising and droplet analysis for single-cell data.) Methods like PCA and Isomap fail to explicitly remove this noise for visualization, rendering fine-grained local structure impossible to recognize. (Take note of this: PCA really does have this problem.)
(2) Nonlinear visualization methods such as t-SNE often scramble the global structure in data. (The global structure is not accurate enough, which is why UMAP is used more often now.)
(3) Many dimensionality-reduction methods (for example, PCA and diffusion maps) fail to optimize for two-dimensional (2D) visualization as they are not specifically designed for visualization. (Those who have taken my course should find this familiar.)
(4) Common implementations of dimensionality-reduction methods often lack computational scalability. State-of-the-art methods such as multidimensional scaling (MDS) and t-SNE were originally presented as proofs-of-concept with somewhat naive implementations, which do not scale well to datasets with hundreds of thousands, let alone millions, of data points owing to speed or memory constraints. (I wonder whether you have looked into this. Once again: don't just copy code; think critically.)
(5) Some methods try to alleviate visualization challenges by directly imposing a fixed geometry or intrinsic structure on the data. However, methods that impose a structure on the data generally have no way of alerting the user whether the structural assumption is correct. (Many newer tools have fixed this.) The authors give an example: any data will be transformed to fit a tree with Monocle2 or clusters with t-SNE. While such methods are useful for data that fit their prior assumptions, they can generate misleading results otherwise, and are often ill suited for hypothesis generation or data exploration. (This should feel familiar; now you can see why clustering and Monocle2 results are so often unsatisfying.)
Next come the advantages of PHATE, which we will only skim. According to the paper, PHATE provides an accurate, denoised representation of both local and global structure of a dataset in the required number of dimensions without imposing any strong assumptions on the structure of the data, and is highly scalable both in memory and runtime.

[Figure]

III. Results

Let's first go over some basics:
(1) t-SNE focuses on preserving local structure, often at the expense of the global structure.
(2) PCA focuses on preserving global structure at the expense of the local structure.
(3) Although PCA is often used for denoising as a preprocessing step, both PCA and t-SNE provide noisy visualizations when the data is noisy, which can obscure the structure of the data. (Make sure you grasp this; otherwise you will finish an analysis without knowing whether it is right or wrong.)
(4) By contrast, diffusion maps effectively denoise data and learn the local and global structure. However, diffusion maps typically encode this information in higher dimensions, which are not amenable to visualization, and can introduce distortions in the visualization under certain conditions. (Diffusion maps were covered in an earlier lesson.)


[Figure]

Here is the key point: PHATE is designed to overcome these weaknesses and provide a visualization that preserves the local and global structure of the data, denoises the data and presents as much information as possible into low dimensions.


[Figure]

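Let's look at the main steps:

(1) Encode local data information via local similarities (local structure). The underlying distance is still Euclidean distance. (I covered how distances are defined in R in class; you need to know these basics.)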


[Figure]

(2) Encode global relationships in data using the potential distance. This step relies on the diffusion process behind diffusion maps, which I also covered in class.
(3) Embed potential distance information into low dimensions for visualization (low-dimensional visualization). This ensures that all variability is squeezed into two dimensions for a maximally informative embedding.
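
The three steps above can be sketched in plain NumPy. This is only a minimal illustration under my own assumptions, not the phate package's implementation: the function name `toy_phate` and the toy data are made up for this post, the kernel is the alpha-decay kernel as I understand it from the paper, and step 3 uses classical MDS as a simple stand-in (the real package, as far as I know, uses metric MDS plus landmarking and automatic selection of t).

```python
import numpy as np

def pairwise_dist(X):
    """Euclidean distance matrix between the rows of X."""
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.sqrt(np.maximum(d2, 0.0))

def toy_phate(X, knn=5, decay=15, t=12, n_components=2):
    """Hypothetical, simplified PHATE-style embedding (illustration only)."""
    D = pairwise_dist(X)
    # Step 1: local similarities. The bandwidth of each point is the distance
    # to its knn-th nearest neighbour (column 0 of the sorted row is the point itself).
    eps = np.sort(D, axis=1)[:, knn]
    K = (0.5 * np.exp(-(D / eps[:, None]) ** decay)
         + 0.5 * np.exp(-(D / eps[None, :]) ** decay))
    # Step 2: row-normalise into a Markov matrix, diffuse for t steps, take
    # the negative log, and measure distances between the resulting rows.
    P = K / K.sum(axis=1, keepdims=True)
    Pt = np.linalg.matrix_power(P, t)
    potential = -np.log(Pt + 1e-7)
    pot_dist = pairwise_dist(potential)
    # Step 3: embed the potential distances with classical MDS (assumption:
    # a simpler substitute for the metric MDS described in the paper).
    n = pot_dist.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (pot_dist**2) @ J
    w, V = np.linalg.eigh(B)
    top = np.argsort(w)[::-1][:n_components]
    Y = V[:, top] * np.sqrt(np.maximum(w[top], 0.0))
    return Y, P

# Toy data: three noisy branches leaving the origin in 10 dimensions.
rng = np.random.default_rng(0)
branches = []
for b in range(3):
    direction = np.zeros(10)
    direction[b] = 1.0
    steps = np.linspace(0.05, 1.0, 40)[:, None]
    branches.append(steps * direction + 0.01 * rng.normal(size=(40, 10)))
X = np.vstack(branches)

Y, P = toy_phate(X, knn=5, decay=15, t=12)
print(Y.shape)
```

The dense matrices here cost O(n^2) memory, so this sketch is only suitable for a few thousand points at most; for real data use the phate package itself.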


[Figure]

Analysis strategy recommended by the paper

Here we present new methods that provide suggested end points, branch points and branches on the basis of the information from higher-dimensional PHATE embeddings. (This is an analysis of the data's structure; as you can probably see, the structure is much like Monocle2's tree.)
(1) Branch-point identification with local intrinsic dimensionality. See the figure below for how branch points are defined: branch points often encapsulate switch-like decisions where cells sharply veer towards one of a small number of fates.


[Figure]

[Figure]
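
To make "local intrinsic dimensionality" more concrete, here is a rough sketch of one kNN-based estimator, the Levina-Bickel maximum-likelihood estimate. I am not claiming this is the exact estimator used in the paper; the function name `local_intrinsic_dim` and the toy data are my own. The idea is that cells along a single branch look roughly one-dimensional, while cells near a branch point have more directions available and therefore score higher. The toy below only contrasts a 1-D trajectory with a 2-D patch to show how the estimator behaves.

```python
import numpy as np

def local_intrinsic_dim(X, k=20):
    """Levina-Bickel MLE of the intrinsic dimension around each row of X."""
    sq = np.sum(X**2, axis=1)
    D = np.sqrt(np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0))
    # Distances to the k nearest neighbours (column 0 is the point itself).
    T = np.sort(D, axis=1)[:, 1:k + 1]
    # log(T_k / T_j) for j = 1 .. k-1, summed per point.
    log_ratio = np.log(T[:, -1][:, None] / T[:, :-1])
    return (k - 1) / log_ratio.sum(axis=1)

rng = np.random.default_rng(0)
line = np.column_stack([rng.uniform(0, 1, 400), np.zeros(400)])  # 1-D trajectory
sheet = rng.uniform(0, 1, size=(400, 2))                         # 2-D patch

lid_line = local_intrinsic_dim(line).mean()
lid_sheet = local_intrinsic_dim(sheet).mean()
print(lid_line, lid_sheet)
```

On real data you would compute the estimate per cell in a higher-dimensional PHATE embedding and inspect the cells with the largest values as candidate branch points.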

(2) End-point identification with diffusion extrema. (This software even identifies end points, which puts it on a par with URD.) We identify end points in the PHATE embedding as those that are least central and most distinct by computing the eigenvector centrality and the distinctness of a cellular state relative to the general data by considering the minima and maxima of diffusion eigenvectors as motivated by ref. This part is worth studying closely if you are interested. Identifying branch points and end points, and assigning cells to the trajectory, relies heavily on prior knowledge, which of course also means greater accuracy. Let's look at the result of the assignment:


[Figure]

It looks much like a force-directed layout.
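
The "diffusion extrema" idea for end points can be sketched as follows. This is a toy under my own assumptions: a Gaussian kernel with a hand-picked bandwidth of 0.1, a straight-line trajectory, and only the first non-trivial diffusion eigenvector. The eigenvector-centrality part of the paper's procedure is left out.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy trajectory: cells along a straight path in 2-D, stored in shuffled order.
pseudotime = rng.permutation(np.linspace(0.0, 1.0, 60))
X = np.column_stack([pseudotime, 2.0 * pseudotime])

# Gaussian affinities between all pairs of cells.
diff = X[:, None, :] - X[None, :, :]
dist2 = np.sum(diff**2, axis=-1)
K = np.exp(-dist2 / (2 * 0.1**2))

# The Markov matrix P = D^-1 K is not symmetric, so work with its symmetric
# conjugate D^-1/2 K D^-1/2, which has the same eigenvalues.
deg = K.sum(axis=1)
A = K / np.sqrt(np.outer(deg, deg))
eigvals, eigvecs = np.linalg.eigh(A)

# eigh sorts eigenvalues in ascending order: the last eigenvector is the
# trivial one, the one before it is the first non-trivial diffusion eigenvector.
psi = eigvecs[:, -2] / np.sqrt(deg)

# The minimum and maximum of that eigenvector mark the two ends of the path.
end_points = sorted([int(np.argmin(psi)), int(np.argmax(psi))])
print(end_points, pseudotime[end_points])
```

With more branches you would look at the extrema of several eigenvectors rather than just the first one.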

Comparison between tools.

A quick look at this part is enough.


[Figure]

Looking at the results, PHATE is unsurprisingly the most accurate. In theory this is to be expected, because PHATE requires more manual supervision. PHATE had the highest DEMaP score in 22 of 24 comparisons and was the top-performing method overall. Uniform manifold approximation and projection (UMAP) was the second best performing method overall but had the highest DEMaP score in only two of the comparisons, one of which is equal with PHATE. (This is where UMAP's strength shows.)
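
For readers wondering what a DEMaP score measures: as I read the paper, it is the Spearman correlation between geodesic distances on the noiseless ground-truth data and Euclidean distances in the embedding computed from the noisy data. The sketch below is my own simplified version, not the authors' code. The names `geodesic_distances`, `spearman` and `demap_like` are made up, the geodesics come from a small kNN graph with Floyd-Warshall, and two hand-made embeddings stand in for the output of a real method.

```python
import numpy as np

def geodesic_distances(X, k=5):
    """Shortest-path distances on a symmetric kNN graph (Floyd-Warshall)."""
    diff = X[:, None, :] - X[None, :, :]
    D = np.sqrt(np.sum(diff**2, axis=-1))
    n = D.shape[0]
    G = np.full((n, n), np.inf)
    nn = np.argsort(D, axis=1)[:, :k + 1]        # each point plus its k neighbours
    rows = np.repeat(np.arange(n), k + 1)
    G[rows, nn.ravel()] = D[rows, nn.ravel()]
    G = np.minimum(G, G.T)                       # make the graph undirected
    for m in range(n):                           # relax paths through node m
        G = np.minimum(G, G[:, m][:, None] + G[m, :][None, :])
    return G

def spearman(a, b):
    """Spearman correlation as Pearson correlation of ranks (assumes no ties)."""
    ra = np.argsort(np.argsort(a))
    rb = np.argsort(np.argsort(b))
    return float(np.corrcoef(ra, rb)[0, 1])

def demap_like(clean_data, embedding, k=5):
    """Rank correlation of ground-truth geodesics with embedding distances."""
    iu = np.triu_indices(clean_data.shape[0], k=1)
    geo = geodesic_distances(clean_data, k)[iu]
    diff = embedding[:, None, :] - embedding[None, :, :]
    emb = np.sqrt(np.sum(diff**2, axis=-1))[iu]
    return spearman(geo, emb)

n = 80
theta = np.pi * np.linspace(0.0, 1.0, n) ** 1.5          # unevenly spaced angles
clean = np.column_stack([np.cos(theta), np.sin(theta)])  # noiseless half circle
good = theta[:, None]                                    # embedding that unrolls the arc
rng = np.random.default_rng(0)
bad = rng.permutation(theta)[:, None]                    # embedding that scrambles it

score_good = demap_like(clean, good)
score_bad = demap_like(clean, bad)
print(score_good, score_bad)
```

An embedding that keeps the ordering of distances along the manifold scores close to 1, while one that scrambles it scores near 0, which is the sense in which the paper ranks PHATE, UMAP and the other methods.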

Comparison of dimensionality-reduction visualizations across methods
[Figure]

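PHATE provides a clean and relatively denoised visualization of the data that highlights both the local and global structure. The paper then presents further data-analysis results; these follow the usual pattern, so a quick look is enough.

To sum up in one sentence: the problem PHATE solves is making the low-dimensional visualization correspond to the intrinsic relationships among the cells themselves. PHATE does this best, with UMAP second.

Next, let's look at the code:

Load the modules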

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import phate
import scprep
import sklearn.decomposition # PCA
import sklearn.manifold # t-SNE
import umap

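We won't cover reading the data, quality control and so on here; we will only look at PHATE dimensionality reduction and visualization: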

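# EBT_counts is the filtered, normalized expression matrix from the preprocessing skipped above
phate_operator = phate.PHATE()  # the operator must be created before its parameters can be set
phate_operator.set_params(knn=4, decay=15, t=12)
Y_phate = phate_operator.fit_transform(EBT_counts)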
Let's pay attention to the parameters here:
    knn : Number of nearest neighbors (default: 5). Increase this (e.g. to 20) if your PHATE embedding appears very disconnected. You should also consider increasing knn if your dataset is extremely large (e.g. >100k cells)
    decay : Alpha decay (default: 15). Decreasing decay increases connectivity on the graph, increasing decay decreases connectivity. This rarely needs to be tuned. Set it to None for a k-nearest neighbors kernel.
    t : Number of times to power the operator (default: 'auto'). This is equivalent to the amount of smoothing done to the data. It is chosen automatically by default, but you can increase it if your embedding lacks structure, or decrease it if the structure looks too compact.
    gamma : Informational distance constant (default: 1). gamma=1 gives the PHATE log potential, but other informational distances can be interesting. If most of the points seem concentrated in one section of the plot, you can try gamma=0.
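
To build some intuition for `knn` and `decay`, here is a small sketch of the alpha-decay kernel as I understand it from the paper: the affinity between two points is exp(-(d / bandwidth) ** decay), where the bandwidth is the distance to the knn-th nearest neighbour. The helper name `alpha_decay_affinity` and the numbers are my own, chosen purely for illustration.

```python
import numpy as np

def alpha_decay_affinity(distance, bandwidth, decay):
    """Affinity exp(-(d / bandwidth) ** decay) for two points at a given distance."""
    return np.exp(-(distance / bandwidth) ** decay)

# Effect of decay: affinity for a pair lying beyond the bandwidth (d = 1.5, bandwidth = 1).
outside_by_decay = [alpha_decay_affinity(1.5, 1.0, d) for d in (2, 15, 40)]

# Effect of knn: a larger knn means a larger bandwidth for the same pair of points.
small_bandwidth = alpha_decay_affinity(1.5, 1.0, 15)
large_bandwidth = alpha_decay_affinity(1.5, 2.0, 15)

print(outside_by_decay)
print(small_bandwidth, large_bandwidth)
```

A larger decay makes the kernel drop off more sharply beyond the bandwidth, so the graph becomes less connected; a larger knn widens the bandwidth and reconnects points that were previously cut off. That matches the tuning advice quoted above.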

If, as the paper claims, PHATE really can learn and maintain local and global distances in low dimensional space, then its visualization is better than UMAP's and is the most suitable choice.


[Figure]