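Cicero stores data in objects of the CellDataSet (CDS) class, which inherits from Bioconductor's ExpressionSet class. (In the Monocle 3 version used in the code in this post, the corresponding class is cell_data_set, which is built on SingleCellExperiment.) Three functions operate on this object:
- fData: get the feature meta-information
- pData: get the cell/sample meta-information
- exprs: get the cell-by-peak count matrix
Loading data
Cicero uses peaks, rather than genes or transcripts, as its feature data (fData). Many Cicero functions require peak information in the form chr1_10390134_10391134, for example: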
                                          site_name chromosome       bp1       bp2
chr10_100002625_100002940 chr10_100002625_100002940         10 100002625 100002940
chr10_100006458_100007593 chr10_100006458_100007593         10 100006458 100007593
chr10_100011280_100011780 chr10_100011280_100011780         10 100011280 100011780
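As a minimal sketch of how a table like the one above can be derived from peak names alone, the base R snippet below splits each name into chromosome, bp1 and bp2. The regular expression and the variable names are illustrative assumptions, not part of the Cicero API.

```r
# Peak names in the chr_start_end form (taken from the table above)
site_name <- c("chr10_100002625_100002940",
               "chr10_100006458_100007593",
               "chr10_100011280_100011780")

# Match the LAST two underscore-separated numeric fields, so a contig
# name that itself contains underscores would still parse
pattern <- "^(.*)_([0-9]+)_([0-9]+)$"

peaks <- data.frame(
  site_name  = site_name,
  # drop the "chr" prefix to mirror the chromosome column shown above
  chromosome = sub("^chr", "", sub(pattern, "\\1", site_name)),
  bp1        = as.integer(sub(pattern, "\\2", site_name)),
  bp2        = as.integer(sub(pattern, "\\3", site_name)),
  row.names  = site_name,
  stringsAsFactors = FALSE
)
print(peaks)
```

Anchoring the pattern at the end of the string is a deliberate choice: a plain split on "_" would mis-parse any contig whose name contains an underscore.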
1. Creating a CDS object as input for Cicero
1.1. Loading data from a simple sparse matrix format
- Cicero includes a function called make_atac_cds. Its input has three columns: the first is the peak coordinates in the form "chr10_100013372_100013596", the second is the cell name, and the third is an integer giving the number of reads from that cell that overlap that peak. The file should not contain a header line. For example:
chr10_100002625_100002940 cell1 1
chr10_100006458_100007593 cell2 2
chr10_100006458_100007593 cell3 1
chr10_100013372_100013596 cell2 1
chr10_100015079_100015428 cell4 3
- Use make_atac_cds to convert this file into a CDS object for the following steps (most functions in the cicero package expect their data in CDS form):
# read in the data
cicero_data <- read.table("D:/biowork/cicero/kidney_data.txt")
input_cds <- make_atac_cds(cicero_data, binarize = TRUE)
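To illustrate what the three-column format encodes, here is a small base R sketch that turns the example rows into a peak-by-cell matrix and then binarizes it. This only illustrates the data layout under the assumption that each (peak, cell) pair appears once; it is not how make_atac_cds is implemented.

```r
# The five example rows: peak, cell, read count (no header line)
long <- data.frame(
  peak  = c("chr10_100002625_100002940", "chr10_100006458_100007593",
            "chr10_100006458_100007593", "chr10_100013372_100013596",
            "chr10_100015079_100015428"),
  cell  = c("cell1", "cell2", "cell3", "cell2", "cell4"),
  count = c(1L, 2L, 1L, 1L, 3L),
  stringsAsFactors = FALSE
)

peaks <- unique(long$peak)
cells <- unique(long$cell)

# Dense peak-by-cell matrix, filled from the (peak, cell, count) triplets
mat <- matrix(0L, nrow = length(peaks), ncol = length(cells),
              dimnames = list(peaks, cells))
mat[cbind(match(long$peak, peaks), match(long$cell, cells))] <- long$count

# Binarize: any nonzero count becomes 1, comparable to binarize = TRUE
binarized <- (mat > 0) * 1L
print(binarized)
```

A dense matrix is only practical for a toy example; real data of this kind are kept sparse.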
1.2. Loading 10X scATAC-seq data
- Cicero can also work with scATAC-seq data processed by cellranger-atac. The output contains a folder called filtered_peak_bc_matrix (the filtered peak-barcode matrix), which holds three files:
- matrix.mtx
%%MatrixMarket matrix coordinate integer general
%metadata_json: {"format_version": 2, "software_version": "1.2.0"}
125987 427 1849286
125903 1 2
125834 1 2
- barcodes.tsv
AAACTGCAGAGAGTTT-1
AAAGATGAGGCTAAAT-1
AAAGGATAGAGTTCGG-1
AAAGGATTCTACTTTG-1
- peaks.bed
chr1 10035 10358
chr1 629447 630122
chr1 633794 634270
chr1 775088 775154
Process these three files and store them as a CDS object in the same way:
# read in matrix data using the Matrix package
indata <- Matrix::readMM("filtered_peak_bc_matrix/matrix.mtx")
# binarize the matrix
indata@x[indata@x > 0] <- 1
# format cell info
cellinfo <- read.table("filtered_peak_bc_matrix/barcodes.tsv")
row.names(cellinfo) <- cellinfo$V1
names(cellinfo) <- "cells"
# format peak info
peakinfo <- read.table("filtered_peak_bc_matrix/peaks.bed")
names(peakinfo) <- c("chr", "bp1", "bp2")
peakinfo$site_name <- paste(peakinfo$chr, peakinfo$bp1, peakinfo$bp2, sep="_")
row.names(peakinfo) <- peakinfo$site_name
row.names(indata) <- row.names(peakinfo)
colnames(indata) <- row.names(cellinfo)
# make CDS
input_cds <- suppressWarnings(new_cell_data_set(indata,
cell_metadata = cellinfo,
gene_metadata = peakinfo))
# For each gene (here, each peak) in the cell_data_set, detect_genes counts how
# many cells have expression above a minimum threshold; for each cell, it also
# counts how many genes are detectable above that threshold. The results are
# added as columns num_cells_expressed and num_genes_expressed to the rowData
# and colData tables, respectively.
input_cds <- monocle3::detect_genes(input_cds)
# Ensure there are no peaks included with zero reads
input_cds <- input_cds[Matrix::rowSums(exprs(input_cds)) != 0,]
- After processing, the contents of the three files look as follows:
matrix.mtx, binarized (indata):
6 x 427 sparse Matrix of class "dgTMatrix"
[1,] . . . . . . . . . . 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
[2,] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 . . ......
[3,] . . . . . . . . . . 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 . . ......
barcodes.tsv, formatted as cell info (cellinfo):
                                cells
AAACTGCAGAGAGTTT-1 AAACTGCAGAGAGTTT-1
AAAGATGAGGCTAAAT-1 AAAGATGAGGCTAAAT-1
AAAGGATAGAGTTCGG-1 AAAGGATAGAGTTCGG-1
peaks.bed, formatted as peak info (peakinfo):
                    chr    bp1    bp2          site_name
chr1_10035_10358   chr1  10035  10358   chr1_10035_10358
chr1_629447_630122 chr1 629447 630122 chr1_629447_630122
chr1_633794_634270 chr1 633794 634270 chr1_633794_634270
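The header of matrix.mtx is worth reading before loading the whole file: the first non-comment line of a Matrix Market coordinate file gives the number of rows, columns and nonzero entries, which should agree with the number of lines in peaks.bed and barcodes.tsv. The following base R sketch parses the example lines shown earlier; the variable names are illustrative, and in practice Matrix::readMM does this parsing for you.

```r
# The example lines of matrix.mtx, held in memory for illustration
mtx_lines <- c(
  "%%MatrixMarket matrix coordinate integer general",
  '%metadata_json: {"format_version": 2, "software_version": "1.2.0"}',
  "125987 427 1849286",
  "125903 1 2",
  "125834 1 2"
)

# Lines starting with "%" are the banner and comments
body <- mtx_lines[!startsWith(mtx_lines, "%")]

# First data line: number of rows (peaks), columns (cells), nonzero entries
dims <- as.integer(strsplit(body[1], " ", fixed = TRUE)[[1]])
names(dims) <- c("n_peaks", "n_cells", "n_nonzero")

# Remaining lines: 1-based row index, 1-based column index, value
entries <- read.table(text = body[-1],
                      col.names = c("peak_index", "cell_index", "count"))

print(dims)
print(entries)
```

Here n_cells is 427, which matches the 427 columns of the binarized matrix printed above.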
2. Constructing cis-regulatory networks
2.1 Running Cicero
- Create a Cicero CDS:
- Single-cell chromatin accessibility data are extremely sparse, so estimating co-accessibility scores accurately requires aggregating similar cells into denser count data. Cicero does this with a k-nearest-neighbors approach that creates overlapping sets of cells. It builds these sets from a reduced-dimension coordinate map of cell similarity (for example UMAP or t-SNE).
- Using Monocle 3 functions, we first find the UMAP coordinates of input_cds:
# set the random seed so the results are reproducible
set.seed(2017)
# count the features detected above the minimum threshold and store the
# counts in the CDS metadata tables
input_cds <- detect_genes(input_cds)
# estimate size factors (a step originally designed for single-cell RNA-seq data)
input_cds <- estimate_size_factors(input_cds)
# preprocess the CDS in preparation for trajectory analysis
input_cds <- preprocess_cds(input_cds, method = "LSI")
# reduce dimensionality with UMAP
input_cds <- reduce_dimension(input_cds, reduction_method = 'UMAP',
preprocess_method = "LSI")
# visualize the reduced-dimension map with Monocle's plotting function
plot_cells(input_cds)
- Use make_cicero_cds to create the aggregated CDS object. Its inputs are the input_cds object and a reduced-dimension coordinate map. This map, reduced_coordinates, should be a data.frame or matrix whose row names match the cell IDs in the CDS's pData table and whose columns are the coordinates in the reduced-dimension space.
- Read the UMAP coordinates of input_cds:
# access the UMAP coordinates stored in the input CDS object
umap_coords <- reducedDims(input_cds)$UMAP
# This reduces the dimensionality of the cells. The original matrix has peaks
# as rows and cells as columns; to compare how similar cells are, all peaks are
# summarized into two features because the original matrix has far too many
# dimensions. The result is two coordinates per cell, umap1 and umap2, which
# fix each cell's position in the plot. If cells are colored by a preassigned
# label, the plot shows how well the cells group together.
      umap_coord1 umap_coord2
cell1  -0.7084047  -0.7232994
cell2  -4.4767964   0.8237284
cell3   1.4870098  -0.4723493
- Create cicero_cds from input_cds and its reduced-dimension coordinates:
# create the Cicero CDS
cicero_cds <- make_cicero_cds(input_cds, reduced_coordinates = umap_coords)
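To make the aggregation idea concrete, here is a minimal base R sketch that sums binary accessibility counts over each cell's k nearest neighbors in a two-dimensional map. It is a simplified illustration on simulated data, not the procedure make_cicero_cds actually follows; the toy data, the value of k and all variable names are assumptions.

```r
set.seed(1)

# Toy reduced-dimension coordinates for 10 cells (stand-in for UMAP output)
coords <- matrix(rnorm(20), ncol = 2,
                 dimnames = list(paste0("cell", 1:10), c("umap1", "umap2")))

# Toy binary peak-by-cell accessibility matrix (5 peaks, 10 cells)
counts <- matrix(rbinom(5 * 10, 1, 0.2), nrow = 5,
                 dimnames = list(paste0("peak", 1:5), rownames(coords)))

k <- 3  # hypothetical neighborhood size, including the cell itself

# Pairwise Euclidean distances in the reduced space
d <- as.matrix(dist(coords))

# For every cell, the indices of its k closest cells (a k x n_cells matrix)
nn <- apply(d, 1, function(row) order(row)[seq_len(k)])

# Sum the counts of each neighborhood: one aggregated column per cell
agg <- sapply(seq_len(ncol(nn)),
              function(j) rowSums(counts[, nn[, j], drop = FALSE]))
colnames(agg) <- paste0("agg", seq_len(ncol(agg)))

print(agg)
```

Because neighborhoods overlap, the same cell contributes to several aggregated columns; the aggregated matrix is denser than the input, which is the point of the step.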
- Run Cicero: the main function of the Cicero package is to estimate the co-accessibility of sites in the genome in order to predict cis-regulatory interactions. There are two ways to get this information:
- Option 1: run run_cicero directly
- run_cicero needs a Cicero CDS object (created above) and a genome coordinates file containing the length of each chromosome in the organism.
- The cicero package ships with human hg19 and mouse mm9 coordinates, which can be loaded with data("human.hg19.genome") and data("mouse.mm9.genome").
- Run Cicero:
data("mouse.mm9.genome")
# use only a small part of the genome for speed
sample_genome <- subset(mouse.mm9.genome, V1 == "chr2")
sample_genome$V2[1] <- 10000000
## Usually use the whole mouse.mm9.genome ##
## Usually run with sample_num = 100 ##
conns <- run_cicero(cicero_cds, sample_genome, sample_num = 2)
head(conns)
- run_cicero returns data in the following format:
# co-accessibility of each pair of peaks in this group of cells
               Peak1                Peak2    coaccess
chr2_3005180_3006128 chr2_3006405_3006928  0.35327965
chr2_3005180_3006128 chr2_3019616_3020066 -0.02461107
chr2_3005180_3006128 chr2_3021952_3022152  0.00000000
- Option 2: call the functions separately, step by step
- estimate_distance_parameter: computes the distance penalty parameter from small random windows of the genome.
- generate_cicero_models: uses the distance parameter determined above and graphical LASSO with a distance-based penalty to compute co-accessibility scores for overlapping windows of the genome.
- assemble_connections: takes the output of generate_cicero_models and reconciles the overlapping models to produce the final list of co-accessibility scores.
2.2 Visualizing Cicero Connections
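plot_connections: a general plotting function included in the Cicero package for visualizing co-accessibility. plot_connections has many options, which the official documentation covers in detail in its "Advanced Visualization" section, but getting a basic plot from a co-accessibility table is simple.
First download the GTF associated with this data (mm9) from Ensembl and load it; it provides the gene models for the plot.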
# Download the GTF associated with this data (mm9) from ensembl and load it
# using rtracklayer
# download and unzip
temp <- tempfile()
download.file("ftp://ftp.ensembl.org/pub/release-65/gtf/mus_musculus/Mus_musculus.NCBIM37.65.gtf.gz", temp)
gene_anno <- rtracklayer::readGFF(temp)
# gene_anno <- rtracklayer::readGFF("Mus_musculus.NCBIM37.65.gtf.gz")
unlink(temp)
# rename some columns to match requirements
gene_anno$chromosome <- paste0("chr", gene_anno$seqid)
gene_anno$gene <- gene_anno$gene_id
gene_anno$transcript <- gene_anno$transcript_id
gene_anno$symbol <- gene_anno$gene_name
# the first positional argument after conns is the chromosome of the region to plot, in the form "chr2"
plot_connections(conns, "chr2", 9773451, 9848598,
gene_model = gene_anno,
coaccess_cutoff = .25,
connection_width = .5,
collapseTranscripts = "longest" )
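As a rough illustration of what a region plus a coaccess_cutoff selects, the base R sketch below filters a connection table to the pairs that lie inside a window and reach the cutoff. The three rows are taken from the run_cicero output shown earlier; the window, the helper names and the use of ">=" for the cutoff are assumptions, and plot_connections may apply its filters differently.

```r
# Three connections from the run_cicero output shown earlier
conns <- data.frame(
  Peak1    = rep("chr2_3005180_3006128", 3),
  Peak2    = c("chr2_3006405_3006928", "chr2_3019616_3020066",
               "chr2_3021952_3022152"),
  coaccess = c(0.35327965, -0.02461107, 0.00000000),
  stringsAsFactors = FALSE
)

# Hypothetical helpers: pull one underscore-separated field out of a peak name
field <- function(peak, i) sapply(strsplit(peak, "_", fixed = TRUE), `[`, i)
bp    <- function(peak, i) as.numeric(field(peak, i))

# Hypothetical plotting window and cutoff
region <- list(chr = "chr2", minbp = 3004000, maxbp = 3040000)
cutoff <- 0.25

in_region <- field(conns$Peak1, 1) == region$chr &
  bp(conns$Peak1, 2) >= region$minbp &
  bp(conns$Peak2, 3) <= region$maxbp

shown <- conns[in_region & conns$coaccess >= cutoff, ]
print(shown)
```

With a cutoff of 0.25 only the first pair remains, which is why raising coaccess_cutoff quickly thins out a plot.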
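2.3 Comparing Cicero connections to other datasets
- compare_connections: compares Cicero connections with other datasets that contain similar kinds of links. The function takes two data frames of connection pairs, conns1 and conns2, and returns a logical vector indicating which connections in conns1 are also found in conns2.
# make up a set of ChIA-PET connections for illustration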
chia_conns <- data.frame(Peak1 = c("chr2_3005100_3005200", "chr2_3004400_3004600",
"chr2_3004900_3005100"),
Peak2 = c("chr2_3006400_3006600", "chr2_3006400_3006600",
"chr2_3035100_3035200"))
head(chia_conns)
# Peak1 Peak2
# 1 chr2_3005100_3005200 chr2_3006400_3006600
# 2 chr2_3004400_3004600 chr2_3006400_3006600
# 3 chr2_3004900_3005100 chr2_3035100_3035200
conns$in_chia <- compare_connections(conns, chia_conns)
Comparing the two sets of connections gives data in the following format; TRUE means that the interaction between Peak1 and Peak2 is also found in conns2:
# Peak1 Peak2 coaccess in_chia
# 2 chr2_3005180_3006128 chr2_3006405_3006928 0.35327965 TRUE
# 3 chr2_3005180_3006128 chr2_3019616_3020066 -0.02461107 FALSE
# 4 chr2_3005180_3006128 chr2_3021952_3022152 0.00000000 FALSE
# 5 chr2_3005180_3006128 chr2_3024576_3025188 0.05050716 FALSE
# 6 chr2_3005180_3006128 chr2_3026145_3026392 -0.03521344 FALSE
# 7 chr2_3005180_3006128 chr2_3035075_3037296 0.01305855 FALSE
If this overlap criterion is too strict, you can relax it with the maxgap parameter.
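The effect of allowing a gap can be sketched in base R. The function below treats two connections as matching when both ends overlap (in either orientation), optionally allowing a gap of up to `gap` base pairs between peaks. It is a simplified stand-in written for this post, not the implementation of compare_connections, and its exact gap arithmetic is an assumption.

```r
# Split peak names of the form chr_start_end into columns
split_peak <- function(peak) {
  parts <- do.call(rbind, strsplit(peak, "_", fixed = TRUE))
  data.frame(chr = parts[, 1],
             start = as.numeric(parts[, 2]),
             end = as.numeric(parts[, 3]),
             stringsAsFactors = FALSE)
}

# TRUE where peak a[i] and peak b[i] overlap, or lie within `gap` bp
peaks_overlap <- function(a, b, gap = 0) {
  pa <- split_peak(a)
  pb <- split_peak(b)
  pa$chr == pb$chr & pa$start <= pb$end + gap & pb$start <= pa$end + gap
}

# For every row of conns1: is there a row of conns2 matching both ends?
compare_sketch <- function(conns1, conns2, gap = 0) {
  vapply(seq_len(nrow(conns1)), function(i) {
    p1 <- rep(conns1$Peak1[i], nrow(conns2))
    p2 <- rep(conns1$Peak2[i], nrow(conns2))
    same    <- peaks_overlap(p1, conns2$Peak1, gap) &
               peaks_overlap(p2, conns2$Peak2, gap)
    flipped <- peaks_overlap(p1, conns2$Peak2, gap) &
               peaks_overlap(p2, conns2$Peak1, gap)
    any(same | flipped)
  }, logical(1))
}

# Three of the Cicero connections from the table above
cicero_conns <- data.frame(
  Peak1 = rep("chr2_3005180_3006128", 3),
  Peak2 = c("chr2_3006405_3006928", "chr2_3019616_3020066",
            "chr2_3035075_3037296"),
  stringsAsFactors = FALSE
)
# The made-up ChIA-PET connections from the example
chia_conns <- data.frame(
  Peak1 = c("chr2_3005100_3005200", "chr2_3004400_3004600",
            "chr2_3004900_3005100"),
  Peak2 = c("chr2_3006400_3006600", "chr2_3006400_3006600",
            "chr2_3035100_3035200"),
  stringsAsFactors = FALSE
)

strict  <- compare_sketch(cicero_conns, chia_conns, gap = 0)
relaxed <- compare_sketch(cicero_conns, chia_conns, gap = 100)
print(data.frame(cicero_conns, strict, relaxed))
```

The third pair only matches once a gap is allowed, because chr2_3005180_3006128 starts 80 bp after chr2_3004900_3005100 ends.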
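- comparison_track: Cicero's plotting function can also compare datasets visually. Besides the two peak columns, the comparison data frame must include a third column called "coaccess", which sets the height of the plotted connection curves.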
# Add a column of 1s called "coaccess"
chia_conns <- data.frame(Peak1 = c("chr2_3005100_3005200", "chr2_3004400_3004600",
"chr2_3004900_3005100"),
Peak2 = c("chr2_3006400_3006600", "chr2_3006400_3006600",
"chr2_3035100_3035200"),
coaccess = c(1, 1, 1))
plot_connections(conns, "chr2", 3004000, 3040000,
gene_model = gene_anno,
coaccess_cutoff = 0,
connection_width = .5,
comparison_track = chia_conns,
comparison_connection_width = .5,
include_axis_track = FALSE,
collapseTranscripts = "longest")
2.4 Finding cis-Co-accessibility Networks (CCANs)
- generate_ccans: Cicero can also find cis-co-accessibility networks (CCANs). The function takes a connection data frame as input and outputs a data frame with a CCAN assignment for each input peak. Sites that do not appear in the output data frame were not assigned a CCAN.
generate_ccans has an optional input called coaccess_cutoff_override. When coaccess_cutoff_override is NULL, the function determines and reports a suitable co-accessibility score cutoff for CCAN generation, based on the total number of CCANs obtained at different cutoffs. You can also set coaccess_cutoff_override to a number between 0 and 1 to skip the function's cutoff-finding step.
CCAN_assigns <- generate_ccans(conns)
# [1] "Coaccessibility cutoff used: 0.14"
- The output of generate_ccans has the following format:
# Peak CCAN
# chr2_3005180_3006128 chr2_3005180_3006128 6
# chr2_3006405_3006928 chr2_3006405_3006928 6
# chr2_3019616_3020066 chr2_3019616_3020066 1
# chr2_3024576_3025188 chr2_3024576_3025188 6
# chr2_3026145_3026392 chr2_3026145_3026392 1
# chr2_3045478_3046610 chr2_3045478_3046610 6
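For intuition about how a cutoff turns a connection table into groups of peaks, the base R sketch below keeps connections above a cutoff and labels the connected components of the resulting graph. This is a deliberately simplified stand-in: grouping by connected components is an assumption made for illustration and should not be read as the grouping method generate_ccans uses. The peak names p1 to p7 and the scores are hypothetical.

```r
# Hypothetical connection table
conns <- data.frame(
  Peak1    = c("p1", "p2", "p4", "p1", "p6"),
  Peak2    = c("p2", "p3", "p5", "p4", "p7"),
  coaccess = c(0.35, 0.20, 0.50, 0.05, -0.03),
  stringsAsFactors = FALSE
)

cutoff <- 0.14  # the value reported in the example output above
kept <- conns[conns$coaccess > cutoff, ]

# Start with one label per peak, then merge labels along every kept edge
peaks <- unique(c(kept$Peak1, kept$Peak2))
label <- setNames(seq_along(peaks), peaks)
for (i in seq_len(nrow(kept))) {
  a <- label[[kept$Peak1[i]]]
  b <- label[[kept$Peak2[i]]]
  if (a != b) label[label == max(a, b)] <- min(a, b)
}

# Renumber the groups 1, 2, ... and present them like the table above
ccan_sketch <- data.frame(Peak = names(label),
                          CCAN = as.integer(factor(label)),
                          row.names = names(label),
                          stringsAsFactors = FALSE)
print(ccan_sketch)
```

Peaks p6 and p7 have no connection above the cutoff, so they are absent from the result, mirroring the statement that sites missing from the output were not assigned a CCAN.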
2.5 Cicero gene activity scores
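A combined score of regional accessibility agrees better with gene expression. We call this score the Cicero gene activity score; it is computed with two functions.
- build_gene_activity_matrix: takes an input CDS and a Cicero connection list and outputs an unnormalized table of gene activity scores. Important: the fData table of the input CDS must have a column called "gene" that gives the gene if the peak is a promoter and NA if the peak is distal.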
#### Add a column for the fData table indicating the gene if a peak is a promoter ####
# Create a gene annotation set that only marks the transcription start sites of
# the genes. We use this as a proxy for promoters.
# To do this we need the first exon of each transcript
pos <- subset(gene_anno, strand == "+")
pos <- pos[order(pos$start),]
# remove all but the first exons per transcript
pos <- pos[!duplicated(pos$transcript),]
# make a 1 base pair marker of the TSS
pos$end <- pos$start + 1
neg <- subset(gene_anno, strand == "-")
neg <- neg[order(neg$start, decreasing = TRUE),]
# remove all but the first exons per transcript
neg <- neg[!duplicated(neg$transcript),]
neg$start <- neg$end - 1
gene_annotation_sub <- rbind(pos, neg)
# Make a subset of the TSS annotation columns containing just the coordinates
# and the gene name
gene_annotation_sub <- gene_annotation_sub[,c("chromosome", "start", "end", "symbol")]
# Rename the gene symbol column to "gene"
names(gene_annotation_sub)[4] <- "gene"
# annotate the sites of the CDS with feature data based on coordinate overlap
input_cds <- annotate_cds_by_site(input_cds, gene_annotation_sub)
tail(fData(input_cds))
# DataFrame with 6 rows and 7 columns
# site_name chr bp1 bp2
# <factor> <character> <numeric> <numeric>
# chrY_590469_590895 chrY_590469_590895 Y 590469 590895
# chrY_609312_609797 chrY_609312_609797 Y 609312 609797
# chrY_621772_623366 chrY_621772_623366 Y 621772 623366
# chrY_631222_631480 chrY_631222_631480 Y 631222 631480
# chrY_795887_796426 chrY_795887_796426 Y 795887 796426
# chrY_2397419_2397628 chrY_2397419_2397628 Y 2397419 2397628
# num_cells_expressed overlap gene
# <integer> <integer> <character>
# chrY_590469_590895 5 NA NA
# chrY_609312_609797 7 NA NA
# chrY_621772_623366 106 2 Ddx3y
# chrY_631222_631480 2 NA NA
# chrY_795887_796426 1 2 Usp9y
# chrY_2397419_2397628 4 NA NA
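The overlap logic behind the gene column can be sketched in base R: a peak receives a gene name when a 1 bp TSS marker falls inside it, and NA otherwise. The coordinates and gene names below are made up for illustration, and the real annotate_cds_by_site does more than this (for example, it also reports the overlap column shown above).

```r
# Hypothetical peaks
peaks <- data.frame(
  chr = c("chrY", "chrY", "chrY"),
  bp1 = c(1000, 5000, 9000),
  bp2 = c(1500, 5600, 9400),
  stringsAsFactors = FALSE
)

# Hypothetical 1 bp TSS markers, shaped like gene_annotation_sub
tss <- data.frame(
  chromosome = c("chrY", "chrY"),
  start      = c(5200, 9399),
  end        = c(5201, 9400),
  gene       = c("geneA", "geneB"),
  stringsAsFactors = FALSE
)

# Default: distal peaks carry NA in the gene column
peaks$gene <- NA_character_

# Assign a gene wherever a TSS interval overlaps the peak interval
for (i in seq_len(nrow(tss))) {
  hit <- peaks$chr == tss$chromosome[i] &
    peaks$bp1 <= tss$end[i] &
    peaks$bp2 >= tss$start[i]
  peaks$gene[hit] <- tss$gene[i]
}
print(peaks)
```

If several TSSs fall in one peak, this loop simply keeps the last one; a real annotation step needs an explicit rule for such ties.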
#### Generate gene activity scores ####
# generate unnormalized gene activity matrix
unnorm_ga <- build_gene_activity_matrix(input_cds, conns)
# remove any rows/columns with all zeroes
unnorm_ga <- unnorm_ga[!Matrix::rowSums(unnorm_ga) == 0,
!Matrix::colSums(unnorm_ga) == 0]
# make a list of num_genes_expressed
num_genes <- pData(input_cds)$num_genes_expressed
names(num_genes) <- row.names(pData(input_cds))
# normalize
cicero_gene_activities <- normalize_gene_activities(unnorm_ga, num_genes)
# if you had two datasets to normalize, you would pass both:
# num_genes should then include all cells from both sets
unnorm_ga2 <- unnorm_ga
cicero_gene_activities <- normalize_gene_activities(list(unnorm_ga, unnorm_ga2),
num_genes)
- normalize_gene_activities: normalizes the unnormalized result from the previous step. It also needs a named vector with the total number of accessible sites per cell, which is available in the CDS's pData table as the column "num_genes_expressed". Normalized gene activity scores range from 0 to 1.
2.6 Advanced visualization
- Some useful parameters of the plot_connections function
- Viewpoints: lets you view only the connections that originate from a specific location in the genome. This can be useful when comparing the data with 4C-seq data.
- alpha_by_coaccess: useful when you are worried about overplotting. This parameter scales the alpha (transparency) of each connection curve by the magnitude of its co-accessibility.
- Colors: several parameters control color: peak_color, comparison_peak_color, connection_color, comparison_connection_color, gene_model_color, viewpoint_color, viewpoint_fill
- Use return_as_list to customize everything
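3. Single-cell accessibility trajectories
The second major function of the Cicero package is to extend Monocle 3 for use with single-cell accessibility data. The main obstacle with chromatin accessibility data is sparsity, so most of the extensions and methods are designed to address it.
3.1 Constructing trajectories with accessibility data
- In short, Monocle infers pseudotime trajectories in five steps: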
- Preprocess the data
- Reduce the dimensionality of the data
- Cluster the cells
- Learn the trajectory graph
- Order the cells in pseudotime
- These steps can be run on accessibility data with only minor modifications:
- First, we download and load the data (same as above):
# Code to download (54M) and unzip the file - can take a couple minutes
# depending on internet connection:
temp <- textConnection(readLines(gzcon(url("http://staff.washington.edu/hpliner/data/kidney_data.txt.gz"))))
# read in the data
cicero_data <- read.table(temp)
input_cds <- make_atac_cds(cicero_data)
- Next, we preprocess the data using latent semantic indexing (LSI) and then continue with the standard dimensionality reduction used in Monocle 3:
set.seed(2017)
input_cds <- estimate_size_factors(input_cds)
# step 1: preprocess the data
input_cds <- preprocess_cds(input_cds, method = "LSI")
# step 2: reduce the dimensionality of the data
input_cds <- reduce_dimension(input_cds, reduction_method = 'UMAP',
preprocess_method = "LSI")
# step 3: cluster the cells
input_cds <- cluster_cells(input_cds)
# step 4: learn the trajectory graph
input_cds <- learn_graph(input_cds)
# step 5: order the cells in pseudotime
# cell ordering can be done interactively by leaving out "root_cells"
input_cds <- order_cells(input_cds,
root_cells = "GAGATTCCAGTTGAATCACTCCATCGAGATAGAGGC")
- Plot the results:
plot_cells(input_cds, color_cells_by = "pseudotime")
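3.2 Differential accessibility analysis
- Aggregation: addressing sparsity for differential analysis. The main way the Cicero package handles sparse single-cell chromatin accessibility data is through aggregation.
- Visualizing accessibility across pseudotime
- Running fit_models with single-cell chromatin accessibility data: to test whether a site changes across pseudotime, we first aggregate similar cells. For this, Cicero provides the function aggregate_by_cell_bin.
- Visualizing accessibility across pseudotime
# pseudotime(): extract pseudotime from the CDS object; keep only cells with finite pseudotime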
input_cds_lin <- input_cds[,is.finite(pseudotime(input_cds))]
# plot accessibility as a function of pseudotime
plot_accessibility_in_pseudotime(input_cds_lin[c("chr1_3238849_3239700",
"chr1_3406155_3407044",
"chr1_3397204_3397842")])
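plot_accessibility_in_pseudotime:
Each bar summarizes several cells: its height is the percentage of cells in that pseudotime bin in which the site is accessible (for example, if 2 of 10 cells are open, the bar height is 20%).
The black line is the pseudotime-dependent average accessibility, smoothed by a binomial regression.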
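The two elements of this plot can be reproduced on simulated data in base R: per-bin percentages for the bars, and a binomial regression for the smoothed line. The data, the number of bins and the plain logistic model are assumptions for illustration; the function's own binning and smoothing may differ.

```r
set.seed(42)

# Simulated cells: pseudotime, and a site that opens as pseudotime advances
pseudotime <- runif(200, min = 0, max = 10)
accessible <- rbinom(200, size = 1, prob = pseudotime / 20)

# Bars: percentage of accessible cells within each pseudotime bin
bins <- cut(pseudotime, breaks = 10)
pct_accessible <- 100 * tapply(accessible, bins, mean)

# Line: accessibility as a function of pseudotime via binomial regression
fit <- glm(accessible ~ pseudotime, family = binomial())
smoothed <- 100 * predict(fit, type = "response")

# The worked example from the text: 2 open cells out of 10
example_pct <- 100 * mean(c(1, 1, rep(0, 8)))

print(round(pct_accessible, 1))
print(example_pct)
```

Working in percentages keeps bars comparable across bins even when the bins contain different numbers of cells.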
- Running fit_models with single-cell chromatin accessibility data
# First, assign a column in the pData table to umap pseudotime
pData(input_cds_lin)$Pseudotime <- pseudotime(input_cds_lin)
# cut the pseudotime trajectory into 10 parts, assigning each cell to a bin
pData(input_cds_lin)$cell_subtype <- cut(pseudotime(input_cds_lin), 10)
binned_input_lin <- aggregate_by_cell_bin(input_cds_lin, "cell_subtype")
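What binning by pseudotime amounts to can be sketched with base R's cut and rowsum: cells are assigned to one of 10 equal-width pseudotime intervals, and the counts of all cells in an interval are summed per site. The simulated data and names are assumptions, and aggregate_by_cell_bin itself also has to carry the cell metadata along, which this sketch ignores.

```r
set.seed(7)
n_cells <- 50

# Toy binary site-by-cell matrix and a pseudotime value per cell
counts <- matrix(rbinom(4 * n_cells, 1, 0.3), nrow = 4,
                 dimnames = list(paste0("site", 1:4),
                                 paste0("cell", seq_len(n_cells))))
pseudotime <- runif(n_cells, min = 0, max = 20)

# Assign every cell to one of 10 equal-width pseudotime bins
cell_bin <- cut(pseudotime, 10)

# Sum the counts of all cells in a bin; rowsum groups rows, so transpose
binned <- t(rowsum(t(counts), group = cell_bin))

print(binned)
```

Each column of binned is now a pseudo-sample with far fewer zeros than any single cell, which makes regression on the binned data more stable.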
- Run fit_models
# For speed, run fit_models on 1000 randomly chosen genes
set.seed(1000)
acc_fits <- fit_models(binned_input_lin[sample(1:nrow(fData(binned_input_lin)), 1000),],
model_formula_str = "~Pseudotime + num_genes_expressed" )
fit_coefs <- coefficient_table(acc_fits)
# Subset out the differentially accessible sites with respect to Pseudotime
pseudotime_terms <- subset(fit_coefs, term == "Pseudotime" & q_value < .05)
head(pseudotime_terms)
# # A tibble: 2 x 12
# site_name num_cells_expre… use_for_ordering status term estimate std_err
# <fct> <int> <lgl> <chr> <chr> <dbl> <dbl>
# 1 chr12_32… 3 FALSE OK Pseu… -0.352 0.0404
# 2 chr14_61… 2 FALSE OK Pseu… -0.413 0.0280
# # … with 5 more variables: test_val <dbl>, p_value <dbl>,
# # normalized_effect <dbl>, model_component <chr>, q_value <dbl>
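The q_value filter is a multiple-testing correction followed by a threshold. The base R sketch below shows that pattern on made-up p-values using the Benjamini-Hochberg method. Which adjustment coefficient_table applies is not stated here, so the choice of "BH" is an assumption, as are the site names and p-values.

```r
# Hypothetical coefficient table with raw p-values for the Pseudotime term
fit_coefs <- data.frame(
  site_name = paste0("site", 1:6),
  term      = "Pseudotime",
  p_value   = c(0.0001, 0.004, 0.03, 0.2, 0.5, 0.9),
  stringsAsFactors = FALSE
)

# Adjust for multiple testing (Benjamini-Hochberg, chosen for illustration)
fit_coefs$q_value <- p.adjust(fit_coefs$p_value, method = "BH")

# Keep the sites that pass the threshold, as in the subset() call above
significant <- subset(fit_coefs, term == "Pseudotime" & q_value < 0.05)
print(significant)
```

Note that site3 has a raw p-value of 0.03 but an adjusted value of 0.06, so it no longer passes once the correction is applied.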
4. Useful Functions
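- annotate_cds_by_site: adds extra annotations about the peaks to a CDS object. For example, you may want to know which peaks overlap exons or transcription start sites. Its inputs are:
- a CDS
- a data frame, or a file path, with information in BED format (chromosome, bp1, bp2, further columns)
- find_overlapping_coordinates: use this when you only want to know which peaks overlap a specific region of the genome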
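The question that find_overlapping_coordinates answers can be sketched in base R as an interval-overlap test on parsed peak names. The function name overlapping_peaks_sketch, its arguments and the region string format are hypothetical; consult the Cicero documentation for the real function's signature.

```r
# Parse names of the form chr_start_end into columns
parse_coords <- function(x) {
  pattern <- "^(.*)_([0-9]+)_([0-9]+)$"
  data.frame(chr   = sub(pattern, "\\1", x),
             start = as.numeric(sub(pattern, "\\2", x)),
             end   = as.numeric(sub(pattern, "\\3", x)),
             stringsAsFactors = FALSE)
}

# Return the peaks that overlap a single region (hypothetical helper)
overlapping_peaks_sketch <- function(peaks, region) {
  p <- parse_coords(peaks)
  r <- parse_coords(region)
  peaks[p$chr == r$chr & p$start <= r$end & p$end >= r$start]
}

# Peaks from the peaks.bed example earlier in this post
peaks <- c("chr1_10035_10358", "chr1_629447_630122",
           "chr1_633794_634270", "chr1_775088_775154")

hits <- overlapping_peaks_sketch(peaks, "chr1_629000_634000")
print(hits)
```

A peak that only partly overlaps the region, such as chr1_633794_634270 here, still counts as a hit.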