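0. Software installation is omitted here.
1. Building the species tree
See http://www.lxweimin.com/p/336b65ca1b67
We assume that A-c.phy and the corresponding species tree file have already been obtained by following that page.
2. Estimating the nucleotide substitution rate (these settings are for PAML 4.8; the parameters in 4.9 differ from the ones shown here)
Software version 4.8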
/data/01/user157/software/paml4.8/bin/baseml baseml.ctl
Contents of baseml.input.tree:
6 1
((Bamboo,XD),((sppCa,sppJu)'@.001',(sppGa,sppGo)));
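# A time calibration is required; @.001 means an age of about 100,000 years
# Calibration times can be looked up on TimeTree: enter two species to get the divergence time of the pair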
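The calibration value has to be written in the time unit of the analysis. Reading .001 as 100,000 years implies that one time unit is 100 million years; that unit is an assumption inferred from the comment above, not something stated in the control file. A minimal sketch of the conversion (the function name years_to_units is made up for illustration):

```bash
# Convert an age in years to tree time units.
# Assumption: 1 time unit = 100 million years, so 100,000 years -> 0.001.
years_to_units() {
    # $1: age in years; $2: years per time unit (defaults to 1e8)
    awk -v y="$1" -v u="${2:-100000000}" 'BEGIN { printf "%g\n", y / u }'
}

years_to_units 100000    # prints 0.001
```

If a different unit is used, pass it as the second argument, and keep RootAge and every calibration in the same unit.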
Contents of baseml.ctl; the first two lines point to the input files described above:
seqfile = A-c.phy
treefile = baseml.input.tree
outfile = mlb * main result file
noisy = 3 * 0,1,2,3: how much rubbish on the screen
verbose = 0 * 1: detailed output, 0: concise output
runmode = 0 * 0: user tree; 1: semi-automatic; 2: automatic
* 3: StepwiseAddition; (4,5):PerturbationNNI
model = 7 * 0:JC69, 1:K80, 2:F81, 3:F84, 4:HKY85
* 5:T92, 6:TN93, 7:REV, 8:UNREST, 9:REVu; 10:UNRESTu
Mgene = 0 * 0:rates, 1:separate; 2:diff pi, 3:diff kapa, 4:all diff
clock = 1 * 0:no clock, 1:clock; 2:local clock; 3:CombinedAnalysis
fix_kappa = 0 * 0: estimate kappa; 1: fix kappa at value below; 2: kappa for branches
kappa = 2 * initial or fixed kappa
fix_alpha = 0 * 0: estimate alpha; 1: fix alpha at value below
alpha = 0.5 * initial or fixed alpha, 0:infinity (constant rate)
Malpha = 0 * 1: different alpha's for genes, 0: one alpha
ncatG = 5 * # of categories in the dG, AdG, or nparK models of rates
nparK = 0 * rate-class models. 1:rK, 2:rK&fK, 3:rK&MK(1/K), 4:rK&MK
nhomo = 0 * 0 & 1: homogeneous, 2: kappa for branches, 3: N1, 4: N2
getSE = 1 * 0: don't want them, 1: want S.E.s of estimates
RateAncestor = 0 * (0,1,2): rates (alpha>0) or ancestral states
Small_Diff = 7e-6
cleandata = 0 * remove sites with ambiguity data (1:yes, 0:no)?
* icode = 0 * (with RateAncestor=1. try "GC" in data,model=4,Mgene=4)
* fix_blength = 1 * 0: ignore, -1: random, 1: initial, 2: fixed, 3: proportional
method = 0 * Optimization method 0: simultaneous; 1: one branch a time
The output file mlb contains the following lines. Note the number: it is the substitution rate.
Substitution rate is per time unit
0.395351 +- 0.000685
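The mcmctree control file below uses the reciprocal of this rate in rgene_gamma. A small sketch for pulling the rate out of mlb and inverting it, assuming the rate is the first field on the first non-empty line after the "Substitution rate is per time unit" header, as in the excerpt above (the function name rate_reciprocal is hypothetical):

```bash
# Print 1/rate from a baseml result file.
# Assumption: the rate is the first field of the first non-empty line
# that follows the "Substitution rate is per time unit" header.
rate_reciprocal() {
    awk 'found && NF { printf "%.6f\n", 1 / $1; exit }
         /Substitution rate is per time unit/ { found = 1 }' "$1"
}

# Demo on the two lines shown above, written to a temporary file.
demo=$(mktemp)
printf 'Substitution rate is per time unit\n   0.395351 +- 0.000685\n' > "$demo"
rate_reciprocal "$demo"    # prints 2.529398
rm -f "$demo"
```

On a real run this would be called as rate_reciprocal mlb.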
3. First mcmctree run
/data/01/user157/software/paml4.8/bin/mcmctree mcmctree.ctl
This run writes out.BV, which is used as input in the next step.
Contents of mcmc.input.tree:
6 1
((Bamboo,XD),((sppCa,sppJu)'>.001',(sppGa,sppGo)));
# A time calibration is required; >.001 means older than 100,000 years
Contents of mcmctree.ctl for this run:
seed = -1
seqfile = A-c.phy
treefile = mcmc.input.tree
outfile = out
ndata = 1
seqtype = 0 * 0: nucleotides; 1:codons; 2:AAs
usedata = 3 * 0: no data; 1:seq like; 2:use in.BV; 3: out.BV
clock = 2 * 1: global clock; 2: independent rates; 3: correlated rates
RootAge = <1.0 * safe constraint on root age, used if no fossil for root.
model = 0 * 0:JC69, 1:K80, 2:F81, 3:F84, 4:HKY85
alpha = 0 * alpha for gamma rates at sites
ncatG = 5 * No. categories in discrete gamma
cleandata = 0 * remove sites with ambiguity data (1:yes, 0:no)?
BDparas = 1 1 0 * birth, death, sampling
kappa_gamma = 6 2 * gamma prior for kappa
alpha_gamma = 1 1 * gamma prior for alpha
rgene_gamma = 1 2.529397927411 1 * gamma prior for overall rates for genes ### 1 / substitution rate
sigma2_gamma = 1 10 1 * gamma prior for sigma^2 (for clock=2 or 3)
finetune = 1: .05 0.1 0.12 0.1 .3 * auto (0 or 1) : times, rates, mixing, paras, RateParas, FossilErr
print = 1
burnin = 500000
sampfreq = 5000
nsample = 20000
cat out.BV > in.BV # used as the input file in the next step
4. Second mcmctree run; copy the in.BV created above into the directory of the second run
/data/01/user157/software/paml4.8/bin/mcmctree mcmctree.ctl
Contents of mcmctree.ctl for this run (the difference is usedata = 2):
seed = -1
seqfile = A-c.phy
treefile = mcmc.input.tree
outfile = out
ndata = 1
seqtype = 0 * 0: nucleotides; 1:codons; 2:AAs
usedata = 2 * 0: no data; 1:seq like; 2:use in.BV; 3: out.BV
clock = 2 * 1: global clock; 2: independent rates; 3: correlated rates
RootAge = <1.0 * safe constraint on root age, used if no fossil for root.
model = 0 * 0:JC69, 1:K80, 2:F81, 3:F84, 4:HKY85
alpha = 0 * alpha for gamma rates at sites
ncatG = 5 * No. categories in discrete gamma
cleandata = 0 * remove sites with ambiguity data (1:yes, 0:no)?
BDparas = 1 1 0 * birth, death, sampling
kappa_gamma = 6 2 * gamma prior for kappa
alpha_gamma = 1 1 * gamma prior for alpha
rgene_gamma = 1 2.292111240743 1 * gamma prior for overall rates for genes ### 1 / substitution rate
sigma2_gamma = 1 10 1 * gamma prior for sigma^2 (for clock=2 or 3)
finetune = 1: .05 0.1 0.12 0.1 .3 * auto (0 or 1) : times, rates, mixing, paras, RateParas, FossilErr
print = 1
burnin = 500000
sampfreq = 5000
nsample = 20000
*** Note: Make your window wider (100 columns) before running the program.
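FigTree.tre is then written; this is the final result.
5. Step 4 can be repeated; the result is acceptable once it is stable across runs.
The last few parameters in mcmctree.ctl (burnin, sampfreq, nsample) can be varied.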
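One rough way to judge whether repeated runs are stable is to compare the posterior means of each parameter between runs. The sketch below assumes that each run leaves a sample table named mcmc.txt with one header line and the generation index in the first column; the file name, the layout, and the run1/run2 directory names are all assumptions, and column_means is a hypothetical helper.

```bash
# Print the mean of every column except the first (the generation index)
# of a whitespace-separated sample table with a single header line.
column_means() {
    awk 'NR == 1 { for (i = 2; i <= NF; i++) name[i] = $i; n = NF; next }
         NF      { for (i = 2; i <= n; i++) sum[i] += $i; rows++ }
         END     { if (rows) for (i = 2; i <= n; i++)
                       printf "%s\t%.4f\n", name[i], sum[i] / rows }' "$1"
}

# Usage, guarded so nothing happens when the files are absent.
# run1/ and run2/ are hypothetical directories holding two repeated runs.
if [ -f run1/mcmc.txt ] && [ -f run2/mcmc.txt ]; then
    echo "run 1:"; column_means run1/mcmc.txt
    echo "run 2:"; column_means run2/mcmc.txt
fi
```

Means that agree closely between runs suggest convergence; large differences suggest a longer burnin or more samples.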
6. We now have the raw input tree for CAFE
Its contents:
#NEXUS
BEGIN TREES;
UTREE 1 = ((Bamboo: 0.507727, XD: 0.507727) [&95%={0.266, 0.907}]: 0.055557, ((sppCa: 0.010037, sppJu: 0.010037) [&95%={0.004, 0.020}]: 0.010716, (sppGa: 0.012062, sppGo: 0.012062) [&95%={0.005, 0.024}]: 0.008690) [&95%={0.009, 0.041}]: 0.542531) [&95%={0.286, 0.978}];
END;
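The spaces and the confidence intervals must be removed, and every single space has to go; this is the input format CAFE requires.
I named the cleaned file FigTree.nwk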
#grep "UTREE 1 =" FigTree.tre | sed -E -e "s/\[[^]]*\]//g" -e "s/[ \t]//g" -e "/^$/d" -e "s/UTREE1=//" > FigTree.nwk
((Bamboo:0.507727,XD:0.507727):0.055557,((sppCa:0.010037,sppJu:0.010037):0.010716,(sppGa:0.012062,sppGo:0.012062):0.008690):0.542531);
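Because a single leftover space or bracket is easy to miss by eye, a quick check of the cleaned file can help. This is only a sketch of such a check (check_newick is a hypothetical helper, not part of CAFE or PAML):

```bash
# Return success only if the Newick file contains no spaces or tabs
# and no bracketed annotations such as [&95%={...}].
check_newick() {
    if grep -q '[[:space:]]' "$1"; then
        echo "whitespace found in $1" >&2
        return 1
    fi
    if grep -q '[][]' "$1"; then
        echo "bracketed annotation found in $1" >&2
        return 1
    fi
    return 0
}

# Usage, guarded so the snippet is harmless when the file is absent.
if [ -f FigTree.nwk ]; then
    check_newick FigTree.nwk && echo "FigTree.nwk looks clean"
fi
```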
7. CAFE can be installed with conda (omitted here). Next, prepare the second input file, which also comes from the OrthoFinder results.
awk -v OFS="\t" '{if($1=="Orthogroup"){print"Descript",$1,$2,$3,$4,$5,$6,$7}else{print"(null)",$1,$2,$3,$4,$5,$6,$7}}' Orthogroups.GeneCount.tsv > GeneCounts.tsv
python /data/01/user158/kuangzhuoran/software/CAFE5/docs/tutorial/clade_and_size_filter.py -i GeneCounts.tsv -o gene_family_filter.txt -s
cafe5 -i gene_family_filter.txt -t FigTree.nwk -o out -c 20
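Before running cafe5 it can be worth confirming that the tip names in the tree match the species columns of the count table. The sketch below assumes that the tree has branch lengths and no internal node labels, and that the first two columns of the table are the description and the family ID, as produced by the awk command above. The three function names are hypothetical.

```bash
# Tip labels of a Newick tree with branch lengths, one per line, sorted.
tree_tips() {
    tr '(),' '\n\n\n' < "$1" | sed -n 's/^\([^:;]\{1,\}\):.*/\1/p' | sort
}

# Species columns of the count table (everything after the first two
# header fields), one per line, sorted.
table_species() {
    head -n 1 "$1" | tr '\t' '\n' | tail -n +3 | sort
}

# Succeed only if both name sets are identical.
same_species() {
    [ "$(tree_tips "$1")" = "$(table_species "$2")" ]
}

# Usage, guarded so nothing happens when the files are absent.
if [ -f FigTree.nwk ] && [ -f gene_family_filter.txt ]; then
    same_species FigTree.nwk gene_family_filter.txt || echo "species names differ" >&2
fi
```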
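8. Look through all of the output files carefully, then extract the gene families that are significantly expanded or contracted in a given species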
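awk '$NF=="y" {print $1}' Base_family_results.txt > p0.05.significant
# extract the significant OGs; testing the last field avoids also matching the header line, which contains the letter y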
head -n 1 Base_change.tab > tmp1
grep -f p0.05.significant Base_change.tab > tmp2
cat tmp1 tmp2 > Base_p0.05change.tab
# extract the expansion and contraction counts for those OGs
cat Base_p0.05change.tab | cut -f1,4 | grep "+[1-9]" | cut -f1 > sppJu.significant.expand
# the 4 in cut -f1,4 is the column of the node/species of interest
# grep "+[1-9]" keeps the significantly expanded families
cat Base_p0.05change.tab | cut -f1,4 | awk '{if($2!="+0") print}' | grep "-" | cut -f1 > sppJu.significant.contraction
# the 4 in cut -f1,4 is the column of the node/species of interest
# grep "-" keeps the significantly contracted families
# Orthogroups.tsv is also part of the OrthoFinder results
grep -f sppJu.significant.contraction Orthogroups.tsv | cut -f7 | sed "s/ /\n/g" | sed "s/\t/\n/g" | sed "s/,//g" | sort | uniq > sppJu.significant.contraction.gene
# cut -f7 selects the column of the species of interest
# this gives the genes in the significantly contracted families
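The column numbers passed to cut depend on the species order in each file, so looking them up by name is less error-prone than counting by hand. A sketch, assuming the header is the first line and tab-separated, and that CAFE5 may append a node id such as <3> to each name (hence the prefix match). The function name column_index is hypothetical.

```bash
# Print the 1-based number of the first column whose header starts with
# the given name; print nothing if no header matches.
column_index() {
    awk -F'\t' -v name="$2" '
        NR == 1 {
            for (i = 1; i <= NF; i++)
                if (index($i, name) == 1) { print i; exit }
        }
        NR > 1 { exit }' "$1"
}

# Usage, guarded so nothing happens when the file is absent.
if [ -f Base_p0.05change.tab ]; then
    col=$(column_index Base_p0.05change.tab sppJu)
    echo "sppJu is column $col"
fi
```

The same helper can be pointed at Orthogroups.tsv to find the value for cut -f7.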
9. To extract the genes of families significantly expanded or contracted at a given internal node instead
awk 'NR!=1 && $13>0 {print $0}' Base_count.tab | cut -f1 > node11.orthogroups
# for example, to extract node Node11
grep -f node11.orthogroups Orthogroups.txt |sed "s/ /\n/g" | grep -E "carmeli|galili|golani|judaei" | sort | uniq > node11.genes
# this is the background gene set
cat Base_p0.05change.tab | cut -f1,13 | grep "+[1-9]" | cut -f1 > node11.significant.expand
grep -f node11.significant.expand Orthogroups.tsv | cut -f7,8,9,10 | sed "s/ /\n/g" | sed "s/\t/\n/g" | sed "s/,//g" | sort | uniq > node11.significant.expand.gene
cat Base_p0.05change.tab | cut -f1,13 | awk '{if($2!="+0") print}' | grep "-" | cut -f1 > node11.significant.contraction
grep -f node11.significant.contraction Orthogroups.tsv | cut -f7,8,9,10 | sed "s/ /\n/g" | sed "s/\t/\n/g" | sed "s/,//g" | sort | uniq > node11.significant.contraction.gene
# these are the genes in the families significantly expanded and contracted at Node11
# the 13 in cut -f1,13 is the column for node Node11
# node11.genes, node11.significant.expand.gene and node11.significant.contraction.gene are the files used later for the enrichment analysis