10X單細胞(10X空間轉錄組)BCR(TCR)數據分析之(9)changeo

hello,大家好,前面分享了很多有關TCR的數據分析,今天我們來繼續這個專題,今天的分享都是為后續最終的目的做準備,現在還都只是中間過程,好了,我們今天我分享的軟件是changeo,大家可以參考文章Change-O: a toolkit for analyzing large-scale B cell immunoglobulin repertoire sequencing data,2015年便發表了,而我們分享這個的目的在于集成這個軟件的算法,達到我們最后這個專題的目的,這個軟件主要關注在BCR的數據分析,我們來看看這個軟件的算法和原理。

背景

高通量測序技術的進步現在允許對 B 細胞受體 (BCR) 和 T 細胞受體 (TCR) 庫進行大規模表征。 適應性免疫受體庫 (AIRR) 的高種系和體細胞多樣性對具有生物學意義的分析提出了挑戰——需要開發專門的計算方法。
Imcantation 框架為高通量 AIRR-seq 數據集提供了一個從頭到尾的分析生態系統。 從原始讀取開始,提供 Python 和 R 包用于預處理、群體結構確定和repertoire分析。 (這個我們當然喜聞樂見)。
核心軟件(軟件非常的多,這個專題我們也要持續很久)。
圖片.png
Contributed Packages
圖片.png

當然,時至今日,我們只關心軟件對于10X數據的分析(限于個人背景),我們來看看軟件的處理和結果。

1、Converting 10X V(D)J data into the AIRR Community standardized format

To process 10X V(D)J data, a combination of AssignGenes.py and MakeDb.py(這是兩個集成的分析模塊,作者已經封裝好,這會兒我們只需要知道其作用,后續會做詳細的分享) can be used to generate a TSV file compliant with the AIRR Community Rearrangement(當然這里又是一個集成的算法,我們也需要了解,后續我們詳細的分享,先了解一下概念,重排是描述重排的適應性免疫受體鏈(例如,抗體重鏈或 TCR β 鏈)以及大量注釋的序列。 這些注釋由 AIRR Rearrangement 模式定義,包括八個類別。 ) schema that incorporates annotation information provided by the Cell Ranger pipeline. The --10x filtered_contig_annotations.csv specifies the path of the contig annotations file generated by cellranger vdj, which can be found in the outs directory.免疫真的很難,需要耐心和時間。
Generate AIRR Rearrangement data from the 10X V(D)J FASTA files using the steps below:
AssignGenes.py igblast -s filtered_contig.fasta -b ~/share/igblast \
   --organism human --loci ig --format blast
MakeDb.py igblast -i filtered_contig_igblast.fmt7 -s filtered_contig.fasta \
   -r IMGT_Human_*.fasta --10x filtered_contig_annotations.csv --extended
  • 注:all_contig.fasta can be exchanged for filtered_contig.fasta, and all_contig_annotations.csv can be exchanged for filtered_contig_annotations.csv.
一些參數的轉換
  • To process mouse data and/or TCR data alter the --organism and --loci arguments to AssignGenes.py accordingly (e.g., --organism mouse, --loci tcr) and use the appropriate V, D and J IMGT reference databases (e.g., IMGT_Mouse_TR*.fasta)

2、 Identifying clones from B cells in AIRR formatted 10X V(D)J data

Splitting into separate light and heavy chain files(這個地方跟之前TCR的處理類似,但是方式不一樣)。
要將 B 細胞從 AIRR 重排數據分組為克隆,必須將 MakeDb.py 的輸出解析為輕鏈文件和重鏈文件:
ParseDb.py select -d 10x_igblast_db-pass.tsv -f locus -u "IGH" \
        --logic all --regex --outname heavy
ParseDb.py select -d 10x_igblast_db-pass.tsv -f locus -u "IG[LK]" \
        --logic all --regex --outname light

接下來就是要進行數據的過濾了,這個過濾和我們的普通轉錄組差別巨大。

The ParseDb.py(又是一個集成模塊) tool provides a basic set of operations for manipulating Change-O database files from the commandline, including removing or updating rows and columns.

Removing non-productive sequences

After building a Change-O database from either IMGT/HighV-QUEST or IgBLAST output, you may wish to subset your data to only productive sequences. This can be done in one of two roughly equivalent ways using the ParseDb.py tool:(挑選有用的數據)
ParseDb.py select -d HD13M_db-pass.tsv -f productive -u T
ParseDb.py split -d HD13M_db-pass.tsv -f productive
上面的第一行使用 select 子命令輸出標記為 parse-select 的單個文件,該文件僅包含productive列 (-f productive) 中值為 T (-u T) 的記錄。
或者,上面的第二行使用 split 子命令輸出多個文件,每個文件包含具有productive列中找到的值之一的記錄 (-f productive)。 這將生成標記為productive-Tproductive-F的兩個文件。

消除 C 區引物和參考比對之間的分歧

If you have data that includes both heavy and light chains in the same library, the V-segment and J-segment alignments from IMGT/HighV-QUEST or IgBLAST may not always agree with the isotype assignments from the C-region primers. In these cases, you can filter out such reads with the select subcommand of ParseDb.py. An example function call using an imaginary file db.tsv is provided below:
ParseDb.py select -d db.tsv -f v_call j_call c_call -u "IGH" \
    --logic all --regex --outname heavy
ParseDb.py select -d db.tsv -f v_call j_call c_call -u "IG[LK]" \
    --logic all --regex --outname light
這些命令將要求所有 v_call、j_call 和 c_call 區域(-f v_call j_call c_call 和 --logic all)包含字符串 IGH(第 1-2 行)或 IGK 或 IGL(第 3-4 行)之一。 --regex參數允許對正則表達式進行部分匹配和解釋。 這兩個命令的輸出是兩個文件,一個只包含重鏈(heavy_parse-select.tsv),一個只包含輕鏈(light_parse-select.tsv)。

Exporting records to FASTA files

You may want to use external tools, or tools from pRESTO, on your Change-O result files. The ConvertDb.py tool provides two options for exporting data from tab-delimited files to FASTA format.
Standard FASTA
ConvertDb.py fasta -d HD13M_db-pass.tsv --if sequence_id \
    --sf sequence_alignment --mf v_call duplicate_count
Where the column containing the sequence identifier is specified by --if sequence_id, the nucleotide sequence column is specified by --sf sequence_id, and additional annotations to be added to the sequence header are specified by --mf v_call duplicate_count.(我們一般也就需要轉換成fasta)

Assign clonal groups to the heavy chain data(最為關鍵的聚類)

The heavy chain file must then be clonally clustered separately.(重鏈需要單獨聚類,那到底如何做呢?)

Clustering sequences into clonal groups

首先第一步:Determining a clustering threshold
Before running DefineClones.py(集成模塊,我們暫時先知道是做聚類的), it is important to determine an appropriate threshold for trimming the hierarchical clustering into B cell clones.(看來這里的聚類是層次聚類)。The distToNearest function in the SHazaM R package calculates the distance between each sequence in the data and its nearest-neighbor.(利用了R包SHazaM來計算序列的distance,這里就像是之前分享的利用TCRdist計算TCR之間的序列距離),結果分布應該是雙峰的,第一種模式表示數據集中具有克隆親屬的序列,第二種模式表示singletons(這個單詞大家意會一下)。The ideal threshold for separating clonal groups is the value that separates the two modes of this distribution and can be found using the findThreshold function in the SHazaM R package.(閾值的選擇既有人工也有軟件識別),The distToNearest function allows selection of all parameters that are available in DefineClones.py. Using the length normalization parameter ensures that mutations are weighted equally regardless of junction sequence length.The distance to nearest-neighbor distribution for the example data is shown below. The threshold is approximately 0.16 - indicated by the red dotted line.(看來最好還是人工選擇,當然關于閾值的劃分我們還會再分享一篇文章)。
圖片.png
第二步、克隆聚類(主要針對B細胞)
There are several parameter choices when grouping Ig sequences into B cell clones. The argument --act set accounts for ambiguous V gene and J gene calls when grouping similar sequences. The distance metric --model ham is nucleotide Hamming distance. Because the threshold was generated using length normalized distances, the --norm len argument is selected with the previously determined threshold --dist 0.16:
DefineClones.py -d HD13M_db-pass.tsv --act set --model ham \
    --norm len --dist 0.16

Correct clonal groups based on light chain data

DefineClones.py currently does not support light chain cloning. However, cloning can be performed after heavy chain cloning using light_cluster.py provided on the Immcantation Bitbucket repository in the scripts directory:
light_cluster.py -d heavy_select-pass_clone-pass.tsv -e light_select-pass.tsv \
        -o 10X_clone-pass.tsv
Here, heavy_select-pass_clone-pass.tsv refers to the cloned heavy chain AIRR Rearrangement file, light_select-pass.tsv refers to the light chain file, and 10X_clone-pass.tsv is the resulting output file.
算法優勢
  • remove cells associated with more than one heavy chain
  • correct heavy chain clone definitions based on an analysis of the light chain partners associated with the heavy chain clone.(相互輔助的結果)


    圖片.png

接下來就是Reconstructing germline sequences(重建種系序列)

Adding germline sequences to the database
The CreateGermlines.py tool is used to reconstruct the germline V(D)J sequence, from which the Ig lineage and mutations can be inferred(推斷ig的譜系和突變)。 In addition to the alignment information parsed by MakeDb.py to generate the initial database, CreateGermlines.py also requires the set of germline sequences that were used for the alignment passed to the -r argument. In the case of V-segment germlines, the reference sequences must be IMGT-gapped. Because the D-segment call for B cell receptor alignments is often low confidence, the default germline format (-g dmask) places Ns in the N/P and D-segments of the junction region rather than using the D-segment assigned during reference alignment; this can be modified to generate a complete germline (-g full) or a V-segment only germline (-g vonly) if you wish. The command below adds the germline sequence to the germline_alignment_d_mask column of the output database:(這一部分應該是可選的,但是譜系示蹤確實非常重要)
CreateGermlines.py -d HD13M_db-pass.tsv -g dmask \
    -r IMGT_Human_IGHV.fasta IMGT_Human_IGHD.fasta IMGT_Human_IGHJ.fasta

Alternatively, if you have run the clonal assignment task prior to invoking CreateGermlines.py, then adding the --cloned argument is recommended, as this will generate a single germline of consensus length for each clone:(當然,會涉及到很多其他軟件的適用,我們慢慢來,欲速則不達)。

CreateGermlines.py -d HD13M_db-pass_clone-pass.tsv -g dmask --cloned \
    -r IMGT_Human_IGHV.fasta IMGT_Human_IGHD.fasta IMGT_Human_IGHJ.fasta
圖片.png

IgPhyML lineage tree analysis(這也是這個軟件的最終目的)

IgPhyML 是一個程序,旨在構建系統發育樹并測試有關 B 細胞親和力成熟的進化假設。

B 細胞體細胞超突變 (SHM) 的生物學違反了大多數標準系統發育替代模型中的重要假設; 此外,雖然大多數系統發育程序旨在分析單個譜系,但 B 細胞庫通常包含數千個譜系。 IgPhyML 通過實施替代模型來解決這兩個問題,這些模型糾正了 SHM 的上下文敏感性質,并結合了來自多個譜系的信息,以提供更精確估計的全庫模型參數估計。

An in-depth description of IgPhyML installation and usage can be found at the IgPhyML website.(這個軟件我們也需要深入分析和分享了,真的太多了).

分析開始
Once installed, IgPhyML can be run through BuildTrees by specifying the --igphyml option. IgPhyML is easiest to run through the Immcantation Docker image. If this is not possible, these instructions require Change-O 0.4.6 or higher, Alakazam 0.3.0 or higher, and IgPhyML to be installed, with the executable in your PATH variable.
The following commands should work as a first pass on many reasonably sized datasets, but if you really want to understand what’s going on or make sure what you’re doing makes sense, please check out the IgPhyML website.(后續分享這個軟件)。
Build trees and estimate model parameters
# Clone IgPhyML repository to get example files
git clone https://bitbucket.org/kleinstein/igphyml

# Move to examples directory
cd igphyml/examples

# Run BuildTrees
BuildTrees.py -d example.tsv --outname ex --log ex.log --collapse \
    --sample 3000 --igphyml --clean all --nproc 1
此命令處理 BCR 序列的 AIRR 格式數據集,該數據集已與重建的生殖系進行克隆聚類。 然后使用 GY94 模型快速構建樹,并使用這些固定拓撲估計 HLP19 模型參數。 這可以通過增加 --nproc 選項來加速。 使用 --sample 選項進行子采樣并不是絕對必要的,但 IgPhyML 在應用于大型數據集時會運行緩慢。 在這里, --collapse 標志用于折疊相同的序列。 強烈建議這樣做,因為相同的序列會減慢計算速度,而不會影響 IgPhyML 中的似然值。

可視化

可以使用 Alakazam 的 readIgphyml 函數讀取上述命令的輸出文件(又是一個集成模塊)。 在示例子文件夾中打開 R 會話后,輸入以下命令。 請注意,使用 Docker 容器時,您需要在繪制樹后運行 dev.off() 以在示例目錄中創建 pdf 繪圖:

library(alakazam)
library(igraph)

db = readIgphyml("ex_igphyml-pass.tab")

# Plot largest lineage tree
plot(db$trees[[1]],layout=layout_as_tree)

# Show HLP10 parameters
print(t(db$param[1,]))
CLONE         "REPERTOIRE"
NSEQ          "4"
NSITE         "107"
TREE_LENGTH   "0.286"
LHOOD         "-290.7928"
KAPPA_MLE     "2.266"
OMEGA_FWR_MLE "0.5284"
OMEGA_CDR_MLE "2.3324"
WRC_2_MLE     "4.8019"
GYW_0_MLE     "3.4464"
WA_1_MLE      "5.972"
TW_0_MLE      "0.8131"
SYC_2_MLE     "-0.99"
GRS_0_MLE     "0.2583"
圖片.png
To visualize a larger dataset with bigger trees, and bifurcating tree topologies, again open an R session in the examples directory:
library(alakazam)
library(ape)

db = readIgphyml("sample1_igphyml-pass.tab",format="phylo")

# Plot largest lineage tree
plot(ladderize(db$trees[[1]]),cex=0.7,no.margin=TRUE)
圖片.png

真的是又多又難,生活很好,有你更好

?著作權歸作者所有,轉載或內容合作請聯系作者
平臺聲明:文章內容(如有圖片或視頻亦包括在內)由作者上傳并發布,文章內容僅代表作者本人觀點,簡書系信息發布平臺,僅提供信息存儲服務。
禁止轉載,如需轉載請通過簡信或評論聯系作者。

推薦閱讀更多精彩內容