基因家族的收縮與擴張 與 物種分歧時間的估計

0.軟件的安裝略
1.物種樹的構建
參見 http://www.lxweimin.com/p/336b65ca1b67
假設我們已經通過上述網頁拿到了A-c.phy和相應的物種樹文件

2.估計堿基替換速率(對應軟件版本4.8, 4.9的參數和本文不同)

軟件版本4.8

/data/01/user157/software/paml4.8/bin/baseml baseml.ctl

baseml.input.tree的內容:

6 1
((Bamboo,XD),((sppCa,sppJu)'@.001',(sppGa,sppGo)));
#需要標定時間,@.001表示大約在10萬年
#標定的時間可以用timetree獲得,輸入兩個物種,可以得到這兩個物種的最晚分化時間

baseml.ctl內容如下, 前兩行是上面提到的輸入文件:

      seqfile = A-c.phy
     treefile = baseml.input.tree

      outfile = mlb       * main result file
        noisy = 3   * 0,1,2,3: how much rubbish on the screen
      verbose = 0   * 1: detailed output, 0: concise output
      runmode = 0   * 0: user tree;  1: semi-automatic;  2: automatic
                    * 3: StepwiseAddition; (4,5):PerturbationNNI 

        model = 7   * 0:JC69, 1:K80, 2:F81, 3:F84, 4:HKY85
                    * 5:T92, 6:TN93, 7:REV, 8:UNREST, 9:REVu; 10:UNRESTu

        Mgene = 0   * 0:rates, 1:separate; 2:diff pi, 3:diff kapa, 4:all diff

        clock = 1   * 0:no clock, 1:clock; 2:local clock; 3:CombinedAnalysis
    fix_kappa = 0   * 0: estimate kappa; 1: fix kappa at value below; 2: kappa for branches
        kappa = 2  * initial or fixed kappa

    fix_alpha = 0    * 0: estimate alpha; 1: fix alpha at value below
        alpha = 0.5   * initial or fixed alpha, 0:infinity (constant rate)
       Malpha = 0   * 1: different alpha's for genes, 0: one alpha
        ncatG = 5   * # of categories in the dG, AdG, or nparK models of rates
        nparK = 0   * rate-class models. 1:rK, 2:rK&fK, 3:rK&MK(1/K), 4:rK&MK 

        nhomo = 0   * 0 & 1: homogeneous, 2: kappa for branches, 3: N1, 4: N2
        getSE = 1   * 0: don't want them, 1: want S.E.s of estimates
 RateAncestor = 0   * (0,1,2): rates (alpha>0) or ancestral states

   Small_Diff = 7e-6
    cleandata = 0  * remove sites with ambiguity data (1:yes, 0:no)?
*        icode = 0  * (with RateAncestor=1. try "GC" in data,model=4,Mgene=4)
*  fix_blength = 1  * 0: ignore, -1: random, 1: initial, 2: fixed, 3: proportional
       method = 0  * Optimization method 0: simultaneous; 1: one branch a time

輸出文件mlb中會有這么幾行,記住這個數字, 這個就是替換率

Substitution rate is per time unit
   0.395351 +- 0.000685

3.第一次運行mcmctree
/data/01/user157/software/paml4.8/bin/mcmctree mcmctree.ctl
輸出文件out.BV, 用作下面步驟的輸入文件

mcmctreel.input.tree的內容:

6 1
((Bamboo,XD),((sppCa,sppJu)'>.001',(sppGa,sppGo)));
#需要標定時間,>.001表示大于10萬年
          seed = -1
       seqfile = A-c.phy
      treefile = mcmc.input.tree
       outfile = out

         ndata = 1
       seqtype = 0  * 0: nucleotides; 1:codons; 2:AAs
       usedata = 3    * 0: no data; 1:seq like; 2:use in.BV; 3: out.BV
         clock = 2    * 1: global clock; 2: independent rates; 3: correlated rates
       RootAge = <1.0  * safe constraint on root age, used if no fossil for root.

         model = 0    * 0:JC69, 1:K80, 2:F81, 3:F84, 4:HKY85
         alpha = 0    * alpha for gamma rates at sites
         ncatG = 5    * No. categories in discrete gamma

     cleandata = 0    * remove sites with ambiguity data (1:yes, 0:no)?

       BDparas = 1 1 0    * birth, death, sampling
   kappa_gamma = 6 2      * gamma prior for kappa
   alpha_gamma = 1 1      * gamma prior for alpha

   rgene_gamma = 1 2.529397927411 1  * gamma prior for overall rates for genes ### 1/替換率
  sigma2_gamma = 1 10 1   * gamma prior for sigma^2     (for clock=2 or 3)

      finetune = 1: .05  0.1  0.12  0.1 .3  * auto (0 or 1) : times, rates, mixing, paras, RateParas, FossilErr

         print = 1
        burnin = 500000
      sampfreq = 5000
       nsample = 20000

cat out.BV > in.BV #用作后面的輸入文件

4.第二次運行mcmctree,把上面的in.BV拷貝在第二次運行的目錄下
/data/01/user157/software/paml4.8/bin/mcmctree mcmctree.ctl

這一次的mcmctree.ctl的內容:(區(qū)別在于usedata = 2

         seed = -1
       seqfile = A-c.phy
      treefile = mcmc.input.tree
       outfile = out

         ndata = 1
       seqtype = 0  * 0: nucleotides; 1:codons; 2:AAs
       usedata = 2    * 0: no data; 1:seq like; 2:use in.BV; 3: out.BV
         clock = 2    * 1: global clock; 2: independent rates; 3: correlated rates
       RootAge = <1.0  * safe constraint on root age, used if no fossil for root.

         model = 0    * 0:JC69, 1:K80, 2:F81, 3:F84, 4:HKY85
         alpha = 0    * alpha for gamma rates at sites
         ncatG = 5    * No. categories in discrete gamma

     cleandata = 0    * remove sites with ambiguity data (1:yes, 0:no)?

       BDparas = 1 1 0    * birth, death, sampling
   kappa_gamma = 6 2      * gamma prior for kappa
   alpha_gamma = 1 1      * gamma prior for alpha

   rgene_gamma = 1 2.292111240743 1  * gamma prior for overall rates for genes  ### 1/替換率
  sigma2_gamma = 1 10 1   * gamma prior for sigma^2     (for clock=2 or 3)

      finetune = 1: .05  0.1  0.12  0.1 .3  * auto (0 or 1) : times, rates, mixing, paras, RateParas, FossilErr

         print = 1
        burnin = 500000
      sampfreq = 5000
       nsample = 20000

*** Note: Make your window wider (100 columns) before running the program.

然后就會輸出FigTree.tre,也就是最后的結果了

5.可以重復第四步,結果穩(wěn)定即可
mcmc.ctl的最后幾個參數可以嘗試更換

6.我們就拿到了Cafe的輸入文件(raw)
內容如下:

#NEXUS
BEGIN TREES;

        UTREE 1 = ((Bamboo: 0.507727, XD: 0.507727) [&95%={0.266, 0.907}]: 0.055557, ((sppCa: 0.010037, sppJu: 0.010037) [&95%={0.004, 0.020}]: 0.010716, (sppGa: 0.012062, sppGo: 0.012062) [&95%={0.005, 0.024}]: 0.008690) [&95%={0.009, 0.041}]: 0.542531) [&95%={0.286, 0.978}];

END;

需要刪掉空格和置信區(qū)間,空格一定要刪干凈,這個就是cafe要求的輸入文件格式
我命名為FigTree.nwk

#grep  "UTREE 1 =" FigTree.tre | sed -E -e "s/\[[^]]*\]//g" -e "s/[ \t]//g" -e "/^$/d" -e "s/UTREE1=//" > FigTree.nwk

((Bamboo:0.507727,XD:0.507727):0.055557,((sppCa:0.010037,sppJu: 0.010037):0.010716,(sppGa:0.012062,sppGo:0.012062):0.008690):0.542531);

7.cafe的安裝可以用conda,這里略,然后準備第二個輸入文件,也是OrthoFinder的結果

awk -v  OFS="\t" '{if($1=="Orthogroup"){print"Descript",$1,$2,$3,$4,$5,$6,$7}else{print"(null)",$1,$2,$3,$4,$5,$6,$7}}' Orthogroups.GeneCount.tsv > GeneCounts.tsv

python /data/01/user158/kuangzhuoran/software/CAFE5/docs/tutorial/clade_and_size_filter.py -i GeneCounts.tsv -o gene_family_filter.txt -s

cafe5 -i gene_family_filter.txt -t FigTree.nwk -o out -c 20

8.輸出的文件都仔細看一下, 提取某個物種顯著擴張收縮的基因

cat Base_family_results.txt | grep "y"|cut -f1 >p0.05.significant
#提取顯著的OG

head -n 1 Base_change.tab > tmp1
grep -f p0.05.significant Base_change.tab > tmp2
cat tmp1 tmp2 > Base_p0.05change.tab
#提取對應OG的expand和expansion數

cat Base_p0.05change.tab | cut -f1,4 | grep "+[1-9]" | cut -f1  > sppJu.significant.expand
#cut -f1,4 的這個4,就是對應的節(jié)點/物種,我想要研究的物種
#grep "+[1-9]" 就是提取顯著擴張的

cat Base_p0.05change.tab | cut -f1,4 | awk '{if($2!="+0") print}' | grep "-" | cut -f1 > sppJu.significant.expansion
#cut -f1,4 的這個4,就是對應的節(jié)點/物種,我想要研究的物種
#grep "-" 就是提取顯著收縮的

#這個Orthogroups.tsv也是OrthoFinder的結果
grep -f judaei.significant.expansion Orthogroups.tsv | cut -f7 | sed "s/ /\n/g" | sed "s/\t/\n/g" | sed "s/,//g" | sort | uniq > sppJu.significant.expansion.gene
#cut -f7 就是就是對應的節(jié)點/物種
#這樣就拿到了顯著收縮的基因

9.如果是提取某個節(jié)點顯著收縮與擴張的基因

awk 'NR!=1 && $13>0 {print $0}' Base_count.tab | cut -f1 > node11.orthogroups
#比如我要提取Node11節(jié)點

grep -f node11.orthogroups Orthogroups.txt |sed "s/ /\n/g" | grep -E "carmeli|galili|golani|judaei" | sort | uniq > node11.genes
#這個就是背景庫

cat Base_p0.05change.tab | cut -f1,13 | grep "+[1-9]" | cut -f1  > node11.significant.expand
grep -f node11.significant.expand Orthogroups.tsv | cut -f7,8,9,10 | sed "s/ /\n/g" | sed "s/\t/\n/g" | sed "s/,//g" | sort | uniq > node11.significant.expand.gene

cat Base_p0.05change.tab | cut -f1,13 | awk '{if($2!="+0") print}' | grep "-" | cut -f1 > node11.significant.expansion
grep -f node11.significant.expansion Orthogroups.tsv | cut -f7,8,9,10 | sed "s/ /\n/g" | sed "s/\t/\n/g" | sed "s/,//g" | sort | uniq > node11.significant.expansion.gene
#Node11顯著擴張的基因
#cut -f1,13 的這個13,就是Node11節(jié)點對應的列

#node11.genes、node11.significant.expand.gene、node11.significant.expansion.gene就是后續(xù)做富集要用到的
最后編輯于
?著作權歸作者所有,轉載或內容合作請聯(lián)系作者
平臺聲明:文章內容(如有圖片或視頻亦包括在內)由作者上傳并發(fā)布,文章內容僅代表作者本人觀點,簡書系信息發(fā)布平臺,僅提供信息存儲服務。

推薦閱讀更多精彩內容