單細(xì)胞轉(zhuǎn)錄組數(shù)據(jù)分析|| scanpy教程:預(yù)處理與聚類(lèi)

scanpy 是一個(gè)用于分析單細(xì)胞轉(zhuǎn)錄組(single cell rna sequencing)數(shù)據(jù)的python庫(kù),文章2018發(fā)表在Genome Biology。其實(shí)它的許多分析思路借鑒了以seurat為中心的R語(yǔ)言單細(xì)胞轉(zhuǎn)錄數(shù)據(jù)分析生態(tài)的,scanpy以一己之力在python生態(tài)構(gòu)建了單細(xì)胞轉(zhuǎn)錄組數(shù)據(jù)分析框架。我相信借助python的工業(yè)應(yīng)用實(shí)力,其擴(kuò)展性大于R語(yǔ)言分析工具。當(dāng)然,選擇走一遍scanpy的原因,不是因?yàn)樗膹?qiáng)大,只是因?yàn)橄矚g。

在我搬這塊磚之前,中文世界關(guān)于scanpy的介紹已經(jīng)很多了,這當(dāng)然是好事。不知道誰(shuí)會(huì)以怎樣的方式遇見(jiàn)誰(shuí),所以,還是讓我們開(kāi)始吧。

所做的第一步就是配置好python環(huán)境,我建議是用conda來(lái)構(gòu)建,這樣軟件管理起來(lái)很方便。然后是安裝scanpy這個(gè)庫(kù),當(dāng)然可能會(huì)遇到一些問(wèn)題,但是花點(diǎn)時(shí)間總是可以Google掉的。在Windows、mac、linux平臺(tái)scanpy都是可以運(yùn)行的。

在學(xué)習(xí)新的庫(kù)時(shí),文檔是不可不看的。有統(tǒng)計(jì)表明,程序員讀代碼的時(shí)間一般三倍于寫(xiě)代碼的時(shí)間。所以這基本上是一次閱讀體驗(yàn)。我們假裝可愛(ài)的讀者朋友,已經(jīng)配置好scanpy的環(huán)境,也許花了兩三天的時(shí)間。會(huì)不會(huì)python沒(méi)關(guān)系,英語(yǔ)不好沒(méi)關(guān)系,安裝個(gè)某道詞典就可以了。

三天后。。。

我們打開(kāi)scanpy的教程官網(wǎng):https://scanpy-tutorials.readthedocs.io/en/latest/pbmc3k.html ,對(duì),就是這么直白。開(kāi)始在我們的編輯器里復(fù)制粘貼教程代碼,并嘗試運(yùn)行,并理解結(jié)果。

首先我們導(dǎo)入需要的庫(kù),讀入10X標(biāo)準(zhǔn)的數(shù)據(jù)filtered_feature_bc_matrix,當(dāng)然也可以是h5格式的文件,或者一個(gè)二維的表達(dá)譜。

讀入數(shù)據(jù)
# -*- coding: utf-8 -*-
"""
Spyder Editor
https://scanpy.readthedocs.io/en/stable/
This is a temporary script file.
"""
import numpy as np
import pandas as pd
import scanpy as sc
import seaborn as sns 

sc.settings.verbosity = 3             # verbosity: errors (0), warnings (1), info (2), hints (3)
sc.settings.set_figure_params(dpi=80)
sc.logging.print_versions()
scanpy==1.4.5.1 
anndata==0.7.1 
umap==0.3.10 
numpy==1.16.5 
scipy==1.3.1 
pandas==0.25.1 
scikit-learn==0.21.3 
statsmodels==0.10.1 
python-igraph==0.8.0

不用擔(dān)心這些庫(kù),我們都會(huì)慢慢熟悉的,這里python-igraph要注意一下,不是igraph。這些庫(kù)在《利用python進(jìn)行數(shù)據(jù)分析》這本書(shū)里面都有介紹,特別是pandas和numpy,以后的日子里我們會(huì)看到它們幾乎是構(gòu)成scanpy數(shù)據(jù)結(jié)構(gòu)的核心。

results_file = 'E:\learnscanpy\write\pbmc3k.h5ad' 
help(sc.read_10x_mtx)
adata = sc.read_10x_mtx(
    'E:/learnscanpy/data/filtered_feature_bc_matrix',  # the directory with the `.mtx` file
    var_names='gene_symbols',                  # use gene symbols for the variable names (variables-axis index)
    cache=True) 
adata.var_names_make_unique()  # this is unnecessary if using `var_names='gene_ids'` in `sc.read_10x_mtx`

這樣就把數(shù)據(jù)讀進(jìn)來(lái)了,他的名字叫做adata,那么這是個(gè)什么東西呢?我認(rèn)為搞清楚這個(gè)基本上學(xué)會(huì)一半scanpy了。

adata
Out[21]: 
AnnData object with n_obs × n_vars = 5025 × 33694 
    var: 'gene_ids', 'feature_types'

adata.var["gene_ids"]
Out[22]: 
RP11-34P13.3    ENSG00000243485
FAM138A         ENSG00000237613
OR4F5           ENSG00000186092
RP11-34P13.7    ENSG00000238009
RP11-34P13.8    ENSG00000239945
      
AC233755.2      ENSG00000277856
AC233755.1      ENSG00000275063
AC240274.1      ENSG00000271254
AC213203.1      ENSG00000277475
FAM231B         ENSG00000268674
Name: gene_ids, Length: 33694, dtype: object

adata.var["feature_types"]
Out[23]: 
RP11-34P13.3    Gene Expression
FAM138A         Gene Expression
OR4F5           Gene Expression
RP11-34P13.7    Gene Expression
RP11-34P13.8    Gene Expression
      
AC233755.2      Gene Expression
AC233755.1      Gene Expression
AC240274.1      Gene Expression
AC213203.1      Gene Expression
FAM231B         Gene Expression
Name: feature_types, Length: 33694, dtype: object

adata.to_df()
Out[24]: 
                    RP11-34P13.3  FAM138A  ...  AC213203.1  FAM231B
AAACCCAAGCGTATGG-1           0.0      0.0  ...         0.0      0.0
AAACCCAGTCCTACAA-1           0.0      0.0  ...         0.0      0.0
AAACCCATCACCTCAC-1           0.0      0.0  ...         0.0      0.0
AAACGCTAGGGCATGT-1           0.0      0.0  ...         0.0      0.0
AAACGCTGTAGGTACG-1           0.0      0.0  ...         0.0      0.0
                         ...      ...  ...         ...      ...
TTTGTTGCAGGTACGA-1           0.0      0.0  ...         0.0      0.0
TTTGTTGCAGTCTCTC-1           0.0      0.0  ...         0.0      0.0
TTTGTTGGTAATTAGG-1           0.0      0.0  ...         0.0      0.0
TTTGTTGTCCTTGGAA-1           0.0      0.0  ...         0.0      0.0
TTTGTTGTCGCACGAC-1           0.0      0.0  ...         0.0      0.0

[5025 rows x 33694 columns]

第一句AnnData object with n_obs × n_vars = 5025 × 33694 ,告訴我們adata是一個(gè)AnnData 對(duì)象,就像seurat也是一個(gè)對(duì)象一樣。什么叫對(duì)象呢?對(duì)象就是一個(gè)實(shí)體、物體,它是一種存在而不是一種動(dòng)作。當(dāng)然,我們可以對(duì)它做一些操作,一個(gè)對(duì)象可以通過(guò)具體的屬性為人們感知。

具體地,anndata是這樣一種構(gòu)象:

單細(xì)胞轉(zhuǎn)錄組的核心就是一個(gè)cell X gene的二維表,但是分群后要存儲(chǔ)cell的分群結(jié)果,特征選擇是要記錄gene的信息,降維后要存儲(chǔ)降維后的結(jié)果。所以,這張表.X的對(duì)象cell相關(guān)的信息記錄在.obs中,屬性gene的信息記錄在.var中,其他的信息在.uns中。那么每一部分是什么呢?

type(adata.var["gene_ids"])
Out[205]: pandas.core.series.Series

哦,原來(lái)是pandas的Series,下面是這個(gè)數(shù)據(jù)結(jié)構(gòu)的詳細(xì)介紹,這個(gè)數(shù)據(jù)結(jié)構(gòu)能存儲(chǔ)的信息一點(diǎn)也不亞于seurat啊。

Type:        AnnData
String form:
AnnData object with n_obs × n_vars = 5025 × 33694 
    var: 'gene_ids', 'feature_types'
Length:      5025
File:        d:\program files (x86)\anconda\lib\site-packages\anndata\_core\anndata.py
Docstring:  
An annotated data matrix.

:class:`~anndata.AnnData` stores a data matrix :attr:`X` together with annotations
of observations :attr:`obs` (:attr:`obsm`, :attr:`obsp`),
variables :attr:`var` (:attr:`varm`, :attr:`varp`),
and unstructured annotations :attr:`uns`.

.. figure:: https://falexwolf.de/img/scanpy/anndata.svg
   :width: 350px

An :class:`~anndata.AnnData` object `adata` can be sliced like a
:class:`~pandas.DataFrame`,
for instance `adata_subset = adata[:, list_of_variable_names]`.
:class:`~anndata.AnnData`’s basic structure is similar to R’s ExpressionSet
[Huber15]_. If setting an `.h5ad`-formatted HDF5 backing file `.filename`,
data remains on the disk but is automatically loaded into memory if needed.
See this `blog post`_ for more details.

.. _blog post: http://falexwolf.de/blog/171223_AnnData_indexing_views_HDF5-backing/

Parameters
----------
X
    A #observations × #variables data matrix. A view of the data is used if the
    data type matches, otherwise, a copy is made.
obs
    Key-indexed one-dimensional observations annotation of length #observations.
var
    Key-indexed one-dimensional variables annotation of length #variables.
uns
    Key-indexed unstructured annotation.
obsm
    Key-indexed multi-dimensional observations annotation of length #observations.
    If passing a :class:`~numpy.ndarray`, it needs to have a structured datatype.
varm
    Key-indexed multi-dimensional variables annotation of length #variables.
    If passing a :class:`~numpy.ndarray`, it needs to have a structured datatype.
layers
    Key-indexed multi-dimensional arrays aligned to dimensions of `X`.
dtype
    Data type used for storage.
shape
    Shape tuple (#observations, #variables). Can only be provided if `X` is `None`.
filename
    Name of backing file. See :class:`File`.
filemode
    Open mode of backing file. See :class:`File`.

See Also
--------
read_h5ad
read_csv
read_excel
read_hdf
read_loom
read_zarr
read_mtx
read_text
read_umi_tools

Notes
-----
:class:`~anndata.AnnData` stores observations (samples) of variables/features
in the rows of a matrix.
This is the convention of the modern classics of statistics [Hastie09]_
and machine learning [Murphy12]_,
the convention of dataframes both in R and Python and the established statistics
and machine learning packages in Python (statsmodels_, scikit-learn_).

Single dimensional annotations of the observation and variables are stored
in the :attr:`obs` and :attr:`var` attributes as :class:`~pandas.DataFrame`\ s.
This is intended for metrics calculated over their axes.
Multi-dimensional annotations are stored in :attr:`obsm` and :attr:`varm`,
which are aligned to the objects observation and variable dimensions respectively.
Square matrices representing graphs are stored in :attr:`obsp` and :attr:`varp`,
with both of their own dimensions aligned to their associated axis.
Additional measurements across both observations and variables are stored in
:attr:`layers`.

Indexing into an AnnData object can be performed by relative position
with numeric indices (like pandas’ :attr:`~pandas.DataFrame.iloc`),
or by labels (like :attr:`~pandas.DataFrame.loc`).
To avoid ambiguity with numeric indexing into observations or variables,
indexes of the AnnData object are converted to strings by the constructor.

Subsetting an AnnData object by indexing into it will also subset its elements
according to the dimensions they were aligned to.
This means an operation like `adata[list_of_obs, :]` will also subset :attr:`obs`,
:attr:`obsm`, and :attr:`layers`.

Subsetting an AnnData object returns a view into the original object,
meaning very little additional memory is used upon subsetting.
This is achieved lazily, meaning that the constituent arrays are subset on access.
Copying a view causes an equivalent “real” AnnData object to be generated.
Attempting to modify a view (at any attribute except X) is handled
in a copy-on-modify manner, meaning the object is initialized in place.
Here’s an example::

    batch1 = adata[adata.obs["batch"] == "batch1", :]
    batch1.obs["value"] = 0  # This makes batch1 a “real” AnnData object

At the end of this snippet: `adata` was not modified,
and `batch1` is its own AnnData object with its own data.

Similar to Bioconductor’s `ExpressionSet` and :mod:`scipy.sparse` matrices,
subsetting an AnnData object retains the dimensionality of its constituent arrays.
Therefore, unlike with the classes exposed by :mod:`pandas`, :mod:`numpy`,
and `xarray`, there is no concept of a one dimensional AnnData object.
AnnDatas always have two inherent dimensions, :attr:`obs` and :attr:`var`.
Additionally, maintaining the dimensionality of the AnnData object allows for
consistent handling of :mod:`scipy.sparse` matrices and :mod:`numpy` arrays.

.. _statsmodels: http://www.statsmodels.org/stable/index.html
.. _scikit-learn: http://scikit-learn.org/

繪制高表達(dá)的基因:

sc.pl.highest_expr_genes(adata, n_top=20)

本著本質(zhì)的好奇心,我們不禁要問(wèn)一句:這張圖是用什么畫(huà)的?一定要養(yǎng)成閱讀文檔的習(xí)慣:

help(sc.pl.highest_expr_genes)
Help on function highest_expr_genes in module scanpy.plotting._qc:

highest_expr_genes(adata: anndata._core.anndata.AnnData, n_top: int = 30, show: Union[bool, NoneType] = None, save: Union[str, bool, NoneType] = None, ax: Union[matplotlib.axes._axes.Axes, NoneType] = None, gene_symbols: Union[str, NoneType] = None, log: bool = False, **kwds)
   Fraction of counts assigned to each gene over all cells.
   
   Computes, for each gene, the fraction of counts assigned to that gene within
   a cell. The `n_top` genes with the highest mean fraction over all cells are
   plotted as boxplots.
   
   This plot is similar to the `scater` package function `plotHighestExprs(type
   = "highest-expression")`, see `here
   <https://bioconductor.org/packages/devel/bioc/vignettes/scater/inst/doc/vignette-qc.html>`__. Quoting
   from there:
   
       *We expect to see the “usual suspects”, i.e., mitochondrial genes, actin,
       ribosomal protein, MALAT1. A few spike-in transcripts may also be
       present here, though if all of the spike-ins are in the top 50, it
       suggests that too much spike-in RNA was added. A large number of
       pseudo-genes or predicted genes may indicate problems with alignment.*
       -- Davis McCarthy and Aaron Lun
   
   Parameters
   ----------
   adata
       Annotated data matrix.
   n_top
       Number of top
   show
        Show the plot, do not return axis.
   save
       If `True` or a `str`, save the figure.
       A string is appended to the default filename.
       Infer the filetype if ending on {`'.pdf'`, `'.png'`, `'.svg'`}.
   ax
       A matplotlib axes object. Only works if plotting a single component.
   gene_symbols
       Key for field in .var that stores gene symbols if you do not want to use .var_names.
   log
       Plot x-axis in log scale
   **kwds
       Are passed to :func:`~seaborn.boxplot`.
   
   Returns
   -------
   If `show==False` a :class:`~matplotlib.axes.Axes`.

可以看出這個(gè)功能借鑒了知名R包scater的一個(gè)函數(shù):plotHighestExprs,繪圖用的是python里面兩個(gè)響當(dāng)當(dāng)?shù)膕eaborn和matplotlib庫(kù)。那我們不禁要自己微調(diào)一下了。

current_palette = sns.color_palette("dark")
sns.palplot(current_palette)
 sc.pl.highest_expr_genes(adata, n_top=20,palette="dark" )
normalizing counts per cell
    finished (0:00:00)

借助seaborn的繪圖系統(tǒng)圖我們還可以這樣設(shè)置顏色:

with sns.color_palette("PuBuGn_d"):
    sc.pl.highest_expr_genes(adata, n_top=20, )

help(sns.set_style)
sns.set_style("white")
sc.pl.highest_expr_genes(adata, n_top=20, )
cell QC

提到cell QC ,我們需要明白在10的體系里面什么是cell。有的同學(xué)說(shuō)了,我上機(jī)6000個(gè)細(xì)胞,為什么cellranger檢出的細(xì)胞不是6000,有時(shí)候差的還不小。因?yàn)樵赾ellranger的邏輯里,cell不過(guò)是每個(gè)umi對(duì)barcode的支持程度。它說(shuō)的cell是基于序列比對(duì)的,盡管這個(gè)理應(yīng)和上機(jī)的cell保持一致,但是這種一致性是通過(guò)序列比對(duì)來(lái)保持的。其實(shí)我們說(shuō)的cell QC指的是對(duì)barcode的質(zhì)控,使留下的barcode能表征下游分析的cell。

sc.pp.filter_cells(adata, min_genes=200)
sc.pp.filter_genes(adata, min_cells=0)
mito_genes = adata.var_names.str.startswith('MT-')
# for each cell compute fraction of counts in mito genes vs. all genes
# the `.A1` is only necessary as X is sparse (to transform to a dense array after summing)
adata.obs['percent_mito'] = np.sum(
    adata[:, mito_genes].X, axis=1).A1 / np.sum(adata.X, axis=1).A1
# add the total counts per cell as observations-annotation to adata
adata.obs['n_counts'] = adata.X.sum(axis=1).A1

參數(shù)的意思就是字面的意思,有不明白的可以直接help相應(yīng)的函數(shù),看它的說(shuō)明文檔。我基本上就是靠說(shuō)明文檔來(lái)學(xué)習(xí)的。下面讓我們看看過(guò)濾后的adata發(fā)生了哪些變化。

adata
Out[9]: 
AnnData object with n_obs × n_vars = 4960 × 33694 
    obs: 'n_genes', 'percent_mito', 'n_counts'
    var: 'gene_ids', 'feature_types', 'n_cells'

 adata.obs['percent_mito']
Out[11]: 
AAACCCAAGCGTATGG-1    0.106588
AAACCCAGTCCTACAA-1    0.056101
AAACCCATCACCTCAC-1    0.532847
AAACGCTAGGGCATGT-1    0.106913
AAACGCTGTAGGTACG-1    0.078654
  
TTTGTTGCAGGTACGA-1    0.071226
TTTGTTGCAGTCTCTC-1    0.055394
TTTGTTGGTAATTAGG-1    0.183441
TTTGTTGTCCTTGGAA-1    0.081037
TTTGTTGTCGCACGAC-1    0.070987
Name: percent_mito, Length: 4960, dtype: float32

adata.var['n_cells']
Out[12]: 
RP11-34P13.3     0
FAM138A          0
OR4F5            0
RP11-34P13.7    43
RP11-34P13.8     0
                ..
AC233755.2       1
AC233755.1       1
AC240274.1      83
AC213203.1       0
FAM231B          0
Name: n_cells, Length: 33694, dtype: int32

細(xì)胞屬性obs中有了percent_mito;有了基因?qū)傩缘膙ar。

像seurat一樣,我們看看cell三個(gè)屬性的小提琴圖:

sc.pl.violin(adata, ['n_genes', 'n_counts', 'percent_mito'],
             jitter=0.4, multi_panel=True)
sc.pl.scatter(adata, x='n_counts', y='percent_mito')
sc.pl.scatter(adata, x='n_counts', y='n_genes')

顯然這是分別畫(huà)的兩張圖,我們能不能把它們組合到一起呢?顯然是可以的:

help(sns.FacetGrid)
adata.obs['percent_mito1'] = adata.obs['percent_mito']*10000
adata.obs.melt(id_vars=['n_counts','percent_mito'], var_name='myVariable', value_name='myValue')
import matplotlib.pyplot as plt
sns.set(style="ticks")
g = sns.FacetGrid(adata.obs.melt(id_vars=['n_counts','percent_mito'], var_name='myVariable', value_name='myValue'),col = 'myVariable')
g.map(plt.scatter,'n_counts',"myValue", alpha=.7)
g.add_legend();

有不少人問(wèn)這兩張圖怎么看。就拿n_counts與n_genes的散點(diǎn)圖(scatter)來(lái)說(shuō)吧:每個(gè)點(diǎn)代表一個(gè)細(xì)胞,斜率代表隨著count的增加gene的增加程度。count和gene一般呈現(xiàn)線性關(guān)系,斜率越大也就是較少的count就可以檢出較多的gene,說(shuō)明這批細(xì)胞基因的豐度偏低(普遍貧窮);反之,斜率較小,說(shuō)明這批細(xì)胞基因豐度較高(少數(shù)富有)。有的時(shí)候不是一條可擬合的線,或者是兩條可擬合的線,也反映出細(xì)胞的異質(zhì)性。總之,他就是一個(gè)散點(diǎn)圖,描述的是兩個(gè)變量的關(guān)系。

根據(jù)統(tǒng)計(jì)的n_genes 和線粒體基因含量percent_mito 進(jìn)一步過(guò)濾:

adata = adata[adata.obs.n_genes < 2500, :]
adata = adata[adata.obs.percent_mito < 0.1, :]

過(guò)濾掉基因數(shù)大于2500,線粒體基因含量大于0.1的barcode。這個(gè)過(guò)濾方法就像filter一樣,操作的是adata.obs這張表。

adata.obs
Out[82]: 
                    n_genes  percent_mito  n_counts  percent_mito1
AAACGCTGTGTGATGG-1     2348      0.099821    6131.0     998.205811
AAACGCTTCTAGGCAT-1     2470      0.065564    8465.0     655.640869
AAAGAACCACCGTCTT-1      420      0.021944     638.0     219.435730
AAAGAACTCACCTGGG-1     2479      0.074067   10801.0     740.672119
AAAGAACTCGGAACTT-1     2318      0.079713    6837.0     797.133240
                    ...           ...       ...            ...
TTTGGTTTCCGGTAAT-1     2277      0.079324    9291.0     793.240723
TTTGTTGCAGGTACGA-1     1986      0.071226    8101.0     712.257751
TTTGTTGCAGTCTCTC-1     2485      0.055394    9604.0     553.935852
TTTGTTGTCCTTGGAA-1     1998      0.081037    5479.0     810.366882
TTTGTTGTCGCACGAC-1     2468      0.070987    7438.0     709.868225

[2223 rows x 4 columns]
特征提取

cellQC可以看做是對(duì)細(xì)胞的過(guò)濾,即保留可用的細(xì)胞。但是每個(gè)細(xì)胞又有成千的基因(細(xì)胞的特征),所以一般需要做特征的選擇和提取(也就是降維)。什么意思呢?就是比如每個(gè)人都有很多特征,年齡、性別、身高、國(guó)籍、學(xué)歷、生日、愛(ài)好、血型。。。我要對(duì)一百個(gè)人分類(lèi),所有的特征都拿來(lái)比較,可能得不到準(zhǔn)確的劃分。而用幾個(gè)關(guān)鍵的特征可以分出某個(gè)國(guó)家某個(gè)年齡段的人。特征的選擇是選擇一部分特征,比如選擇高變基因;特征提取是提取主要的特征結(jié)構(gòu),如主成分分析。

在特征提取之前要保證細(xì)胞之間是有可比性的,一般用的是歸一化的方法,得到高變基因之后,為了使同一個(gè)基因在不同細(xì)胞之間具有可比性采用標(biāo)準(zhǔn)化。

sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
adata.raw = adata

adata
AnnData object with n_obs × n_vars = 2223 × 33694 
    obs: 'n_genes', 'percent_mito', 'n_counts', 'percent_mito1'
    var: 'gene_ids', 'feature_types', 'n_cells'
    uns: 'log1p'
sc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=3, min_disp=0.5)
sc.pl.highly_variable_genes(adata)
adata = adata[:, adata.var.highly_variable]

#時(shí)刻關(guān)注我們數(shù)據(jù)的變化
adata
Out[117]: 
View of AnnData object with n_obs × n_vars = 2223 × 2208 
    obs: 'n_genes', 'percent_mito', 'n_counts', 'percent_mito1'
    var: 'gene_ids', 'feature_types', 'n_cells', 'highly_variable', 'means', 'dispersions', 'dispersions_norm'
    uns: 'log1p'

回歸每個(gè)細(xì)胞的總計(jì)數(shù)和線粒體基因表達(dá)百分比的影響,將數(shù)據(jù)縮放到單位方差。

sc.pp.regress_out(adata, ['n_counts', 'percent_mito'])
sc.pp.scale(adata, max_value=10)
sc.tl.pca(adata, svd_solver='arpack')


adata
Out[121]: 
AnnData object with n_obs × n_vars = 2223 × 2208 
    obs: 'n_genes', 'percent_mito', 'n_counts', 'percent_mito1'
    var: 'gene_ids', 'feature_types', 'n_cells', 'highly_variable', 'means', 'dispersions', 'dispersions_norm'
    uns: 'log1p', 'pca'
    obsm: 'X_pca'
    varm: 'PCs

adata.obsm['X_pca']
adata.varm['PCs']
sc.pl.pca(adata, color='CST3')
sc.pl.pca_variance_ratio(adata, log=True)
adata.write(results_file)

特征選擇我們用高變基因,特征提取我們用pca。這里就有一個(gè)都的問(wèn)題:如何選擇高變基因以及pc的維度?

聚類(lèi)

scanpy提供leiden和louvain兩種圖聚類(lèi)算法,圖聚類(lèi)在開(kāi)始之前要先找鄰居:

sc.pp.neighbors(adata, n_neighbors=10, n_pcs=40)
adata

Out[128]: 
AnnData object with n_obs × n_vars = 2223 × 2208 
    obs: 'n_genes', 'percent_mito', 'n_counts', 'percent_mito1'
    var: 'gene_ids', 'feature_types', 'n_cells', 'highly_variable', 'means', 'dispersions', 'dispersions_norm'
    uns: 'log1p', 'pca', 'neighbors'
    obsm: 'X_pca'
    varm: 'PCs'

sc.tl.leiden(adata)

adata.obs['leiden']
Out[188]: 
AAACGCTGTGTGATGG-1    10
AAACGCTTCTAGGCAT-1     6
AAAGAACCACCGTCTT-1     4
AAAGAACTCACCTGGG-1     2
AAAGAACTCGGAACTT-1     6
                      ..
TTTGGTTTCCGGTAAT-1     3
TTTGTTGCAGGTACGA-1     1
TTTGTTGCAGTCTCTC-1     3
TTTGTTGTCCTTGGAA-1     7
TTTGTTGTCGCACGAC-1     3
Name: leiden, Length: 2223, dtype: category
Categories (12, object): [0, 1, 2, 3, ..., 8, 9, 10, 11]
#sc.tl.louvain(adata)
#sc.tl.paga(adata)

和Seurat等人一樣,scanpy推薦Traag *等人(2018)提出的Leiden graph-clustering方法(基于優(yōu)化模塊化的社區(qū)檢測(cè))。注意,Leiden集群直接對(duì)cell的鄰域圖進(jìn)行聚類(lèi),我們?cè)趕c.pp.neighbors已經(jīng)計(jì)算過(guò)了。
讀一下leiden的說(shuō)明文檔:

 help(sc.tl.leiden)
Help on function leiden in module scanpy.tools._leiden:

leiden(adata: anndata._core.anndata.AnnData, resolution: float = 1, *, restrict_to: Union[Tuple[str, Sequence[str]], NoneType] = None, random_state: Union[int, mtrand.RandomState, NoneType] = 0, key_added: str = 'leiden', adjacency: Union[scipy.sparse.base.spmatrix, NoneType] = None, directed: bool = True, use_weights: bool = True, n_iterations: int = -1, partition_type: Union[Type[leidenalg.VertexPartition.MutableVertexPartition], NoneType] = None, copy: bool = False, **partition_kwargs) -> Union[anndata._core.anndata.AnnData, NoneType]
    Cluster cells into subgroups [Traag18]_.
    
    Cluster cells using the Leiden algorithm [Traag18]_,
    an improved version of the Louvain algorithm [Blondel08]_.
    It has been proposed for single-cell analysis by [Levine15]_.
    
    This requires having ran :func:`~scanpy.pp.neighbors` or
    :func:`~scanpy.external.pp.bbknn` first.
    
    Parameters
    ----------
    adata
        The annotated data matrix.
    resolution
        A parameter value controlling the coarseness of the clustering.
        Higher values lead to more clusters.
        Set to `None` if overriding `partition_type`
        to one that doesn’t accept a `resolution_parameter`.
    random_state
        Change the initialization of the optimization.
    restrict_to
        Restrict the clustering to the categories within the key for sample
        annotation, tuple needs to contain `(obs_key, list_of_categories)`.
    key_added
        `adata.obs` key under which to add the cluster labels.
    adjacency
        Sparse adjacency matrix of the graph, defaults to
        `adata.uns['neighbors']['connectivities']`.
    directed
        Whether to treat the graph as directed or undirected.
    use_weights
        If `True`, edge weights from the graph are used in the computation
        (placing more emphasis on stronger edges).
    n_iterations
        How many iterations of the Leiden clustering algorithm to perform.
        Positive values above 2 define the total number of iterations to perform,
        -1 has the algorithm run until it reaches its optimal clustering.
    partition_type
        Type of partition to use.
        Defaults to :class:`~leidenalg.RBConfigurationVertexPartition`.
        For the available options, consult the documentation for
        :func:`~leidenalg.find_partition`.
    copy
        Whether to copy `adata` or modify it inplace.
    **partition_kwargs
        Any further arguments to pass to `~leidenalg.find_partition`
        (which in turn passes arguments to the `partition_type`).
    
    Returns
    -------
    `adata.obs[key_added]`
        Array of dim (number of samples) that stores the subgroup id
        (`'0'`, `'1'`, ...) for each cell.
    `adata.uns['leiden']['params']`
        A dict with the values for the parameters `resolution`, `random_state`,
        and `n_iterations`.

seurat的findclusters中聚類(lèi)的算法亦是有Leiden的,關(guān)于leiden和louvain兩者的區(qū)別,可以看看From Louvain to Leiden: guaranteeing well-connected communities

algorithm   
Algorithm for modularity optimization (
1 = original Louvain algorithm;
2 = Louvain algorithm with multilevel refinement; 
3 = SLM algorithm;
4 = Leiden algorithm). Leiden requires the leidenalg python.

為了可視化聚類(lèi)的結(jié)果我們還是先做一下umap降維吧,然后看看分群結(jié)果。

sc.tl.umap(adata)
sc.pl.umap(adata, color=['leiden', 'CST3', 'NKG7'])
adata
Out[139]: 
AnnData object with n_obs × n_vars = 2223 × 2208 
    obs: 'n_genes', 'percent_mito', 'n_counts', 'percent_mito1', 'leiden'
    var: 'gene_ids', 'feature_types', 'n_cells', 'highly_variable', 'means', 'dispersions', 'dispersions_norm'
    uns: 'log1p', 'pca', 'neighbors', 'leiden', 'paga', 'leiden_sizes', 'umap', 'leiden_colors'
    obsm: 'X_pca', 'X_umap'
    varm: 'PCs'
差異分析

分群后我們得到以數(shù)值為標(biāo)識(shí)的亞群編號(hào),我們急于知道每一個(gè)數(shù)值背后的含義,意義還得從基因上找。差異分析就是為了這目的的。

 sc.tl.rank_genes_groups(adata, 'leiden', method='t-test')
ranking genes
D:\Program Files (x86)\anconda\lib\site-packages\scanpy\tools\_rank_genes_groups.py:252: RuntimeWarning: invalid value encountered in log2
  rankings_gene_logfoldchanges.append(np.log2(foldchanges[global_indices]))
    finished: added to `.uns['rank_genes_groups']`
    'names', sorted np.recarray to be indexed by group ids
    'scores', sorted np.recarray to be indexed by group ids
    'logfoldchanges', sorted np.recarray to be indexed by group ids
    'pvals', sorted np.recarray to be indexed by group ids
    'pvals_adj', sorted np.recarray to be indexed by group ids (0:00:00)

為了使我們的文章圖文并茂一些,來(lái)看看't-test'檢驗(yàn)的每個(gè)亞群的差異基因排序:

sc.pl.rank_genes_groups(adata, n_genes=25, sharey=False)
sc.settings.verbosity = 2  # reduce the verbosity
help(sc.tl.rank_genes_groups)
Help on function rank_genes_groups in module scanpy.tools._rank_genes_groups:

rank_genes_groups(adata: anndata._core.anndata.AnnData, groupby: str, use_raw: bool = True, groups: Union[scanpy._compat.Literal_, Iterable[str]] = 'all', reference: str = 'rest', n_genes: int = 100, rankby_abs: bool = False, key_added: Union[str, NoneType] = None, copy: bool = False, method: scanpy._compat.Literal_ = 't-test_overestim_var', corr_method: scanpy._compat.Literal_ = 'benjamini-hochberg', layer: Union[str, NoneType] = None, **kwds) -> Union[anndata._core.anndata.AnnData, NoneType]
    Rank genes for characterizing groups.
    
    Parameters
    ----------
    adata
        Annotated data matrix.
    groupby
        The key of the observations grouping to consider.
    use_raw
        Use `raw` attribute of `adata` if present.
    layer
        Key from `adata.layers` whose value will be used to perform tests on.
    groups
        Subset of groups, e.g. [`'g1'`, `'g2'`, `'g3'`], to which comparison
        shall be restricted, or `'all'` (default), for all groups.
    reference
        If `'rest'`, compare each group to the union of the rest of the group.
        If a group identifier, compare with respect to this group.
    n_genes
        The number of genes that appear in the returned tables.
    method
        The default 't-test_overestim_var' overestimates variance of each group,
        `'t-test'` uses t-test, `'wilcoxon'` uses Wilcoxon rank-sum,
        `'logreg'` uses logistic regression. See [Ntranos18]_,
        `here <https://github.com/theislab/scanpy/issues/95>`__ and `here
        <http://www.nxn.se/valent/2018/3/5/actionable-scrna-seq-clusters>`__,
        for why this is meaningful.
    corr_method
        p-value correction method.
        Used only for `'t-test'`, `'t-test_overestim_var'`, and `'wilcoxon'`.
    rankby_abs
        Rank genes by the absolute value of the score, not by the
        score. The returned scores are never the absolute values.
    key_added
        The key in `adata.uns` information is saved to.
    **kwds
        Are passed to test methods. Currently this affects only parameters that
        are passed to :class:`sklearn.linear_model.LogisticRegression`.
        For instance, you can pass `penalty='l1'` to try to come up with a
        minimal set of genes that are good predictors (sparse solution meaning
        few non-zero fitted coefficients).
    
    Returns
    -------
    **names** : structured `np.ndarray` (`.uns['rank_genes_groups']`)
        Structured array to be indexed by group id storing the gene
        names. Ordered according to scores.
    **scores** : structured `np.ndarray` (`.uns['rank_genes_groups']`)
        Structured array to be indexed by group id storing the z-score
        underlying the computation of a p-value for each gene for each
        group. Ordered according to scores.
    **logfoldchanges** : structured `np.ndarray` (`.uns['rank_genes_groups']`)
        Structured array to be indexed by group id storing the log2
        fold change for each gene for each group. Ordered according to
        scores. Only provided if method is 't-test' like.
        Note: this is an approximation calculated from mean-log values.
    **pvals** : structured `np.ndarray` (`.uns['rank_genes_groups']`)
        p-values.
    **pvals_adj** : structured `np.ndarray` (`.uns['rank_genes_groups']`)
        Corrected p-values.
    
    Notes
    -----
    There are slight inconsistencies depending on whether sparse
    or dense data are passed. See `here <https://github.com/theislab/scanpy/blob/master/scanpy/tests/test_rank_genes_groups.py>`__.
    
    Examples
    --------
import scanpy as sc
adata = sc.datasets.pbmc68k_reduced()
sc.tl.rank_genes_groups(adata, 'bulk_labels', method='wilcoxon')
    
    # to visualize the results
sc.pl.rank_genes_groups(adata)

scanpy差異分析的方法沒(méi)有seurat豐富了,除了t-test,還有wilcoxon和logreg。

查看差異分析的結(jié)果:

sc.get.rank_genes_groups_df(adata, group="0")
Out[148]: 
       scores     names  logfoldchanges         pvals     pvals_adj
0   15.179968      IL7R             NaN  1.472705e-43  1.711438e-42
1   13.268551     RGS10             NaN  6.462721e-35  5.595956e-34
2   11.459023     LRRN3             NaN  9.208903e-27  5.465930e-26
3   10.671810       CD7             NaN  3.213348e-24  1.665510e-23
4    8.615908     INTS6             NaN  1.053249e-16  3.856673e-16
..        ...       ...             ...           ...           ...
95   1.755937    DUSP16             NaN  7.978160e-02  1.094148e-01
96   1.724632      GALM             NaN  8.523017e-02  1.163811e-01
97   1.713933  KCNQ1OT1             NaN  8.719264e-02  1.188403e-01
98   1.706563      PASK             NaN  8.851861e-02  1.202764e-01
99   1.696579      SUCO             NaN  9.047560e-02  1.227089e-01

[100 rows x 5 columns]

指定哪兩組進(jìn)行差異分析:

sc.tl.rank_genes_groups(adata, 'leiden', groups=['0'], reference='1', method='wilcoxon')
sc.pl.rank_genes_groups(adata, groups=['0'], n_genes=20)
sc.get.rank_genes_groups_df(adata, group="0")
Out[169]: 
       scores     names  logfoldchanges         pvals     pvals_adj
0   19.284122     EFNB2        2.265986  7.301753e-83  4.030568e-81
1   19.205927  C11orf96             NaN  3.301754e-82  1.350050e-80
2   19.158936     NPDC1       -2.627818  8.152369e-82  2.686632e-80
3   18.702019      EGR4             NaN  4.766577e-78  6.190942e-77
4   18.405581     FBXL8       -0.640331  1.185097e-75  1.104091e-74
..        ...       ...             ...           ...           ...
95  11.469543     AP4S1       -3.958649  1.876464e-30  3.019849e-30
96  11.459700    FBXO33        3.057788  2.102432e-30  3.378580e-30
97  11.423295   LDLRAD4        0.052481  3.198732e-30  5.121683e-30
98  11.368316      H1F0             NaN  6.013658e-30  9.587116e-30
99  11.310737    ZCWPW1        0.455119  1.161100e-29  1.836468e-29

[100 rows x 5 columns]
差異基因的可視化
marker_genes = ['IL7R', 'CD79A', 'MS4A1', 'CD8A', 'CD8B', 'LYZ', 'CD14',
                'LGALS3', 'S100A8', 'GNLY', 'NKG7', 'KLRB1',
                'FCGR3A', 'MS4A7', 'FCER1A', 'CST3', 'PPBP']

pd.DataFrame(adata.uns['rank_genes_groups']['names']).head(5)
result = adata.uns['rank_genes_groups']
groups = result['names'].dtype.names
pd.DataFrame(
    {group + '_' + key[:1]: result[key][group]
    for group in groups for key in ['names', 'pvals']}).head(5)

        0_n           0_p
0     EFNB2  7.301753e-83
1  C11orf96  3.301754e-82
2     NPDC1  8.152369e-82
3      EGR4  4.766577e-78
4     FBXL8  1.185097e-75
sc.pl.violin(adata, ['CST3', 'NKG7', 'PPBP'], groupby='leiden')
new_cluster_names = [
    'CD4 T', 'CD14 Monocytes',
    'B', 'CD8 T',
    'NK', 'FCGR3A Monocytes',
    'Dendritic', 'Megakaryocytes',
   'NewCluster1','NewCluster2','NewCluster3','NewCluster4' ]
adata.rename_categories('leiden', new_cluster_names)

\#Omitting rank_genes_groups/scores as old categories do not match.
\#Omitting rank_genes_groups/names as old categories do not match.
\#Omitting rank_genes_groups/logfoldchanges as old categories do not match.
\#Omitting rank_genes_groups/pvals as old categories do not match.
\#Omitting rank_genes_groups/pvals_adj as old categories do not match.

ax = sc.pl.dotplot(adata, marker_genes, groupby='leiden')
ax = sc.pl.stacked_violin(adata, marker_genes, groupby='leiden', rotation=90)

像seurat一樣,scanpy也有豐富的數(shù)據(jù)可視化的方式,太多以至于需要單獨(dú)再寫(xiě)另一篇文章,這里就不介紹了。

保存分析結(jié)果以供下一次調(diào)用
adata.write(results_file, compression='gzip')  # `compression='gzip'` saves disk space, but slows down writing and subsequent reading

adata.X = None
adata.write('E:\learnscanpy\write\pbmc3k_withoutX.h5ad', compression='gzip')
adata.obs[['n_counts', 'leiden']].to_csv(
     'E:\learnscanpy\write\pbmc3k_corrected_lei_groups.csv')

\# Export single fields of the annotation of observations
\# adata.obs[['n_counts', 'louvain_groups']].to_csv(
\#     './write/pbmc3k_corrected_louvain_groups.csv')

\# Export single columns of the multidimensional annotation
\# adata.obsm.to_df()[['X_pca1', 'X_pca2']].to_csv(
\#     './write/pbmc3k_corrected_X_pca.csv')

\# Or export everything except the data using `.write_csvs`.
\# Set `skip_data=False` if you also want to export the data.
\# adata.write_csvs(results_file[:-5], )

今天,我們學(xué)習(xí)了scanpy的一般流程,我們發(fā)現(xiàn)不管工具如何變,單細(xì)胞轉(zhuǎn)錄組的數(shù)據(jù)分析的大框架是沒(méi)有變化的,幾個(gè)分析的工具也是相互借鑒的。但是python的就不值得一學(xué)了嗎?



SCANPY: large-scale single-cell gene expression data analysis

最后編輯于
?著作權(quán)歸作者所有,轉(zhuǎn)載或內(nèi)容合作請(qǐng)聯(lián)系作者
平臺(tái)聲明:文章內(nèi)容(如有圖片或視頻亦包括在內(nèi))由作者上傳并發(fā)布,文章內(nèi)容僅代表作者本人觀點(diǎn),簡(jiǎn)書(shū)系信息發(fā)布平臺(tái),僅提供信息存儲(chǔ)服務(wù)。

推薦閱讀更多精彩內(nèi)容