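Basic concepts
基迪奧 has an article that explains the basics simply and clearly, so I won't repeat them here; please read it first to get the fundamentals straight.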
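Guidelines for using STRUCTURE
STRUCTURE assumes that the markers in the input data are independent of one another, so the markers need to be filtered by some rule before the analysis. Three filtering approaches are common (Nat Rev Genet, 2015):
- Keep one representative marker per fixed physical distance
- Randomly sample a subset of markers across the whole genome
- Prune by LD: among markers whose LD exceeds a given threshold, keep only one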
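To make the first approach concrete, here is a minimal awk sketch that keeps one marker per stretch of physical distance. It is only an illustration and not part of the workflow in this post, which uses LD pruning. The function name thin_by_distance is my own placeholder, and the sketch assumes the VCF is sorted by chromosome and position.

```shell
# thin_by_distance VCF MIN_GAP_BP  (hypothetical helper name)
# Prints the header unchanged, then keeps a variant only if it starts a new
# chromosome or lies at least MIN_GAP_BP beyond the last variant that was kept.
# Assumes the VCF is sorted by chromosome and position.
thin_by_distance() {
  awk -F'\t' -v gap="$2" '
    /^#/ { print; next }
    $1 != chrom || $2 - last >= gap { print; chrom = $1; last = $2 }
  ' "$1"
}
```

For example, thin_by_distance YourDataName-id.vcf 1000 would write to standard output a VCF in which kept markers on the same sequence are at least 1000 bp apart.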
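STRUCTURE in practice:
Preparation
Adding IDs to the markers
SNP data usually comes as a VCF file, and the first thing to do with a new VCF file is to give every SNP site an ID.
First, take a look at the VCF file as originally generated:
As you can see, every entry in the ID column is ".", so we have to add the IDs ourselves. I use a perl script written by some unknown expert, which you can download from my github. Usage:
perl path2file/VCF_add_id.pl YourDataName.vcf YourDataName-id.vcf
You could also add the IDs by hand in Excel. The file after adding IDs is shown in the figure below (format: CHROMID__POS):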
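If you do not have the perl script at hand, the same CHROMID__POS naming can be sketched in a few lines of awk. This is my own stand-in and not the script mentioned above; the function name vcf_add_id is hypothetical, and it only fills in IDs that are currently ".".

```shell
# vcf_add_id VCF  (hypothetical helper, not the perl script from the post)
# Replaces a "." in the ID column (column 3) with CHROM__POS and leaves
# header lines and already-named variants untouched.
vcf_add_id() {
  awk 'BEGIN { FS = OFS = "\t" }
       /^#/ { print; next }
       $3 == "." { $3 = $1 "__" $2 }
       { print }' "$1"
}
```

Usage would be vcf_add_id YourDataName.vcf > YourDataName-id.vcf. PLINK 1.9 also has a --set-missing-var-ids option that names unnamed variants from a template, though I have not used it in this workflow.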
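SNP filtering (missing rate and MAF filtering)
Before filtering SNPs, ask yourself one question: does my data need filtering at all?
It mostly depends on whether you will run an association analysis (GWAS) later. If you only want to study population structure, I suggest not filtering, because removing low-frequency sites may change the relationships between some samples. If you need to link the genotypes to phenotypes for association analysis, I suggest filtering, because low-frequency sites are excluded from that later analysis and the two steps should stay consistent.
If you do filter, the powerful plink software does the job. Usage: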
plink --vcf YourDataName-id.vcf --maf 0.05 --geno 0.2 --recode vcf-iid --out YourDataName-id-maf0.05 --allow-extra-chr
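Parameters: --maf 0.05 removes sites whose minor allele frequency is below 0.05; --geno 0.2 removes SNP sites whose genotype is missing in more than 20% of the samples; --allow-extra-chr is needed because my reference is at contig level, with far more sequences than the chromosomes used in a typical analysis, and their names are not standard chromosome codes.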
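To see what these two thresholds are measuring, the per-site missing rate and minor allele frequency can be computed directly from the GT fields. The awk sketch below is purely illustrative (plink does the real filtering); the function name site_stats is hypothetical, and it assumes biallelic diploid sites with GT as the first FORMAT subfield, at least one sample, and any genotype containing "." counted as missing.

```shell
# site_stats VCF  (hypothetical helper for illustration only)
# Prints: variant ID, fraction of samples with a missing genotype, MAF.
# Assumes biallelic diploid calls and GT as the first FORMAT subfield.
site_stats() {
  awk 'BEGIN { FS = "\t" }
    /^#/ { next }
    {
      n = NF - 9; miss = 0; alt = 0; called = 0
      for (i = 10; i <= NF; i++) {
        split($i, f, ":")                      # f[1] is the GT string
        if (f[1] ~ /\./) { miss++; continue }  # any "." counts as missing
        k = split(f[1], al, "[/|]")            # split alleles on / or |
        for (j = 1; j <= k; j++) { called++; if (al[j] != "0") alt++ }
      }
      af = called ? alt / called : 0
      maf = af > 0.5 ? 1 - af : af
      printf "%s\t%.3f\t%.3f\n", $3, miss / n, maf
    }' "$1"
}
```

A site would fail --geno 0.2 when the second column exceeds 0.2, and fail --maf 0.05 when the third column is below 0.05.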
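LD pruning (LD pruning and make bed file)
As mentioned above, STRUCTURE assumes that the markers in the input data are independent, so we need to filter the markers by some rule. Here I use one of those approaches, LD pruning.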
plink --vcf YourDataName-id-maf0.05.vcf --indep-pairwise 100 50 0.2 --out YourDataName-id-maf0.05-LD --allow-extra-chr --make-bed
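100: the window size, here 100 variants (PLINK counts variants unless a kb suffix is added to this number); 50: the step size, i.e. the window moves forward 50 SNPs at a time; 0.2: the LD (r^2) threshold above which one SNP of a pair is removed.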
Converting to STRUCTURE format
plink --bfile YourDataName-id-maf0.05-LD --extract YourDataName-id-maf0.05-LD.prune.in --out YourDataName-id-maf0.05-LD-structure --recode structure --allow-extra-chr
Filling in the STRUCTURE configuration files:
There are two configuration files, mainparams and extraparams. We need to fill in mainparams and also create an empty extraparams file.
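Among other things, mainparams asks for the number of individuals (NUMINDS) and the number of loci (NUMLOCI). A small sketch for reading both off the PLINK outputs of the previous steps follows. The function name structure_dims is hypothetical, and it assumes that no samples were dropped along the way and that the .prune.in list is exactly the set of markers kept by --extract.

```shell
# structure_dims PREFIX  (hypothetical helper)
# Counts samples in PREFIX.fam and pruned-in markers in PREFIX.prune.in,
# i.e. the values to enter as NUMINDS and NUMLOCI in mainparams.
structure_dims() {
  prefix=$1
  n_inds=$(awk 'NF { n++ } END { print n + 0 }' "$prefix.fam")
  n_loci=$(awk 'NF { n++ } END { print n + 0 }' "$prefix.prune.in")
  echo "NUMINDS=$n_inds NUMLOCI=$n_loci"
}
```

With the file names used in this post, the call would be structure_dims YourDataName-id-maf0.05-LD.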
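Note: the number of mainparams files is the maximum K value times the number of replicates. For example, running K from 1 to 10 with 3 replicates each requires 30 of these files, and 30 matching command lines.
For K from 1 to 10 with 3 replicates each, the 30 configuration files can be named like this: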
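Writing 30 files by hand is tedious, so here is a sketch that stamps them out from one filled-in template. It assumes the mainparams_<K>_<replicate> naming that the run commands in this post suggest, and a template that contains "#define MAXPOPS" and "#define OUTFILE" lines. The function name make_mainparams and the output names out/K<K>_rep<replicate> are my own placeholders.

```shell
# make_mainparams TEMPLATE MAX_K REPS [DEST_DIR]  (hypothetical helper)
# Copies TEMPLATE once per (K, replicate) pair, rewriting the MAXPOPS and
# OUTFILE lines, and records the file names in mainparams.list.
make_mainparams() {
  template=$1; max_k=$2; reps=$3; dest=${4:-.}
  : > "$dest/mainparams.list"
  for k in $(seq 1 "$max_k"); do
    for r in $(seq 1 "$reps"); do
      name="mainparams_${k}_${r}"
      # out/K<k>_rep<r> is an assumed output name; adjust to your own layout
      sed -e "s|^#define MAXPOPS[[:space:]].*|#define MAXPOPS $k|" \
          -e "s|^#define OUTFILE[[:space:]].*|#define OUTFILE out/K${k}_rep${r}|" \
          "$template" > "$dest/$name"
      echo "$name" >> "$dest/mainparams.list"
    done
  done
  : > "$dest/extraparams"   # the empty extraparams file described above
}
```

Calling make_mainparams mainparams_template 10 3 would produce the 30 files plus the list; remember to create the out directory before running STRUCTURE if you keep that output name.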
Running STRUCTURE
Running STRUCTURE is simple:
# Run one at a time:
structure -m mainparams_1_1 -e extraparams
structure -m mainparams_1_2 -e extraparams
structure -m mainparams_1_3 -e extraparams
# ... (and so on for the remaining files)
# Run them all at once: put the mainparams file names into a list and call STRUCTURE in a for loop:
for i in $(cat mainparams.list); do nohup structure -m ${i} -e extraparams & done
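The loop above starts every run at the same moment, which can overload a machine that has fewer cores than runs. One possible variation, which is my own addition and not part of the original workflow, caps the number of simultaneous runs with xargs -P and gives each run its own log file. The function name run_structure_jobs and the logs directory are placeholders; the optional third argument exists only so the wrapper can be tried with another command.

```shell
# run_structure_jobs LIST MAX_JOBS [COMMAND]  (hypothetical helper)
# Runs "COMMAND -m <file> -e extraparams" for every file name in LIST,
# at most MAX_JOBS at a time, logging each run to logs/<file>.log.
run_structure_jobs() {
  list=$1; max_jobs=$2; cmd=${3:-structure}
  mkdir -p logs
  # xargs passes each file name to a small sh wrapper:
  # $0 is the command to run, $1 is the mainparams file
  xargs -P "$max_jobs" -I{} \
    sh -c '"$0" -m "$1" -e extraparams > "logs/$1.log" 2>&1' "$cmd" {} < "$list"
}
```

For example, run_structure_jobs mainparams.list 4 would keep four STRUCTURE processes busy until the list is exhausted.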
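Visualizing the results
The STRUCTURE results are visualized with an R package, pophelper, which has to be installed in R before it can be loaded. Note: newer versions of pophelper throw errors with the commands below, so it is best to use v2.2.9.
# Install pophelper 2.2.9: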
install.packages(c("Cairo","devtools","ggplot2","gridExtra","gtable","tidyr"),dependencies=T)
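# Note: without a ref argument, install_github() fetches the repository's current default branch,
# which is not necessarily 2.2.9; check the repository's releases for the tag to pin.
devtools::install_github('royfrancis/pophelper')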
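Visualization covers two things: 1) calculating the best K and plotting it, and 2) drawing the STRUCTURE stacked bar plots. The method is simple: put all the results in the same folder and call pophelper. The R commands are given below; run the parts you need:
You also need to prepare a grouping file (pop_list.txt). Mine has the columns shown in the figure below, and you can design your own. Note: the samples in this file must be in the same order as the samples in the VCF.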
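Because a wrong sample order silently mislabels every bar in the plot, it may be worth checking the order before going into R. The sketch below compares the sample columns of the VCF header with the first column of pop_list.txt. The function name check_sample_order is hypothetical, and it assumes pop_list.txt has a header row with the sample IDs in its first column, which is how the R code reads it.

```shell
# check_sample_order VCF POP_LIST  (hypothetical helper)
# Succeeds (exit status 0) only if the sample names on the #CHROM line of the
# VCF match column 1 of POP_LIST (after its header row) in the same order.
check_sample_order() {
  vcf=$1; pop_list=$2
  vcf_ids=$(grep -m1 '^#CHROM' "$vcf" | cut -f10- | tr '\t' '\n')
  list_ids=$(tail -n +2 "$pop_list" | awk '{ print $1 }')
  [ "$vcf_ids" = "$list_ids" ]
}
```

A typical call would be check_sample_order YourDataName-id.vcf pop_list.txt || echo "sample order differs".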
# read structure results
# change the working directory (it holds all the STRUCTURE run results)
setwd("F:structure_results")
# load pophelper
library(pophelper)
file_list <- list.files(path = "./out/", full.names = T) # list file directory
qlist <- readQ(file_list) # read result files
# evanno method to calculate deltaK
tbq <- tabulateQ(qlist)
smq <- summariseQ(tbq)
### plot the best-K (Evanno deltaK) curves
evannoMethodStructure(smq, exportplot = T, writetable = T,
imgtype = "png", height = 16, width = 18,outputfilename = "evanno")
evannoMethodStructure(smq, exportplot = T, writetable = T,
imgtype = "pdf", height = 16, width = 18,outputfilename = "evanno")
# clumpp repeat results
clumppExport(qlist = qlist, parammode = 3, prefix = "pop", useexe = T) # run clumpp
collectClumppOutput(prefix = "pop", filetype = "both", runsdir = getwd()) # collect clumpp results
# read clumpp merged results
fclum <- list.files(path = "pop-both", full.names = T, pattern = "merge")
qclum <- readQ(fclum)
sample_order <- read.table("./pop_list.txt", header = T, stringsAsFactors = F)
ind_name <- sample_order[,1]
for(i in 1:length(qclum)){
row.names(qclum[[i]]) <- ind_name
}
mink <- 2
maxk <- 10
k_order <- vector()
if(maxk < 10){
k_order <- 1:length(qclum)
} else if (maxk < 20) {
end1 <- maxk - 10 + 1
start2 <- end1 + 1
k_order <- c(start2:length(qclum), 1:end1)
}
klab <- vector()
if(mink == 1){
klab <- 2:maxk
} else {
klab <- mink:maxk
}
# plot global barplot without group information
prefix <- "demo"
height <- 2
width <- 16
plotQ(qclum[k_order], imgoutput="join",showindlab=T, showlegend=F, sortind = "all",
indlabsize=0.5,indlabheight=0,indlabspacer=0.05,barbordersize=NA,
outputfilename=prefix,imgtype="pdf", sharedindlab = F,
useindlab = T, showyaxis = T, basesize = 10, sppos = "right", showticks = T,
splab = paste0("K = ", klab), splabsize = 6, splabface = "bold",
width = width, height = height, panelspacer = 0.02, dpi = 600, barbordercolour = NA)
plotQ(qclum[k_order], imgoutput="join",showindlab=T, showlegend=F, sortind = "all",
indlabsize=0.5,indlabheight=0,indlabspacer=0.05,barbordersize=0.1,
outputfilename=prefix,imgtype="png", sharedindlab = F,
useindlab = T, showyaxis = T, basesize = 10, sppos = "right", showticks = T,
splab = paste0("K = ", klab), splabsize = 6, splabface = "bold",
width = width, height = height, panelspacer = 0.02, dpi = 600, barbordercolour = NA)
# plot global barplot with group information
prefix <- "demo_label"
plotQ(qclum[k_order], imgoutput="join",showindlab=T, showlegend=F, sortind = "all",
indlabsize=0.5,indlabheight=0,indlabspacer=0.05,barbordersize=NA,
outputfilename=prefix,imgtype="pdf", sharedindlab = F,
useindlab = T, showyaxis = T, basesize = 10, sppos = "right", showticks = T,
splab = paste0("K = ", klab), splabsize = 6, splabface = "bold",
width = width, height = height, panelspacer = 0.02, dpi = 600, barbordercolour = NA,
grplab=sample_order[,2:3,drop=FALSE],ordergrp=T, grplabsize=2, grplabheight = 4)
plotQ(qclum[k_order], imgoutput="join",showindlab=T, showlegend=F, sortind = "all",
indlabsize=0.5,indlabheight=0,indlabspacer=0.05,barbordersize=0.1,
outputfilename=prefix,imgtype="png", sharedindlab = F,
useindlab = T, showyaxis = T, basesize = 10, sppos = "right", showticks = T,
splab = paste0("K = ", klab), splabsize = 6, splabface = "bold",
width = width, height = height, panelspacer = 0.02, dpi = 600, barbordercolour = NA,
grplab=sample_order[,2:3,drop=FALSE],ordergrp=T,grplabsize=2, grplabheight = 4)
# plot single K barplot (one figure per K value)
plotQ(qclum, imgoutput = "sep",showindlab=T, showlegend=F, sortind = "all",
indlabsize=0.5,indlabheight=0,indlabspacer=0.05,barbordersize=NA,
imgtype="pdf", sharedindlab = F,
useindlab = T, showyaxis = T, basesize = 10, sppos = "right", showticks = T,
splabsize = 6, splabface = "bold",
width = width, height = height, panelspacer = 0.02, dpi = 600, barbordercolour = NA)
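# plot single K barplot with group information
# (no outputfilename is set here either, so these PDFs may overwrite the ones written by the call above)
plotQ(qclum, imgoutput = "sep",showindlab=T, showlegend=F, sortind = "all",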
indlabsize=0.5,indlabheight=0,indlabspacer=0.05,barbordersize=NA,
imgtype="pdf", sharedindlab = F,
useindlab = T, showyaxis = T, basesize = 10, sppos = "right", showticks = T,
splabsize = 6, splabface = "bold",
width = width, height = height, panelspacer = 0.02, dpi = 600, barbordercolour = NA,
grplab=sample_order[,2:3,drop=FALSE],ordergrp=T,grplabsize=2, grplabheight = 4)
## for admixture plot
library(pophelper)
setwd("F:/works/developing/course/gwas/data/lecture07/admixture_results")
file_list_admix <- list.files("admixture_output/", pattern = "\\.Q$", full.names = T) # match only files ending in .Q
info <- read.table("sample_order.txt", header = T, stringsAsFactors = F)
qlist_admix <- readQ(file_list_admix)
for(i in 1:length(qlist_admix)){
row.names(qlist_admix[[i]]) <- info$sample
}
k_order <- vector()
mink <- 1
maxk <- 10
if(maxk < 10){
k_order <- 1:length(qlist_admix)
} else if (maxk < 20) {
end1 <- maxk - 10 + 1
start2 <- end1 + 1
k_order <- c(start2:length(qlist_admix), 1:end1)
}
klab <- vector()
if(mink == 1){
klab <- 2:maxk
} else {
klab <- mink:maxk
}
prefix <- "admix"
height <- 1
width <- 16
plotQ(qlist_admix[k_order], imgoutput="join",showindlab=T, showlegend=F, sortind = "all",
indlabsize=0.5,indlabheight=0,indlabspacer=0.05,barbordersize=NA,
outputfilename=prefix,imgtype="pdf", sharedindlab = F,
useindlab = T, showyaxis = T, basesize = 10, sppos = "right", showticks = T,
splab = paste0("K = ", klab), splabsize = 6, splabface = "bold",
width = width, height = height, panelspacer = 0.02, dpi = 600, barbordercolour = NA)
plotQ(qlist_admix[k_order], imgoutput="join",showindlab=T, showlegend=F, sortind = "all",
indlabsize=0.5,indlabheight=0,indlabspacer=0.05,barbordersize=NA,
outputfilename=prefix,imgtype="png", sharedindlab = F,
useindlab = T, showyaxis = T, basesize = 10, sppos = "right", showticks = T,
splab = paste0("K = ", klab), splabsize = 6, splabface = "bold",
width = width, height = height, panelspacer = 0.02, dpi = 600, barbordercolour = NA)
References:
Population structure plots: STRUCTURE stacked bar plots (article in Chinese)
Schraiber J G, Akey J M. Methods and models for unravelling human evolutionary history. Nature Reviews Genetics, 2015