攻略還發在https://github.com/gotouerina/GenomeMethPipeline 這個更全,如果好用可以點個star
需要用GPU,V100或者A100都行, 4090什么的估計不行
下機是FASTA5格式,轉成pod5格式儲存。
pod5軟件應該可以用pip裝
pod5 convert fast5 fast5/*.fast5 --output pod5/ --one-to-one ./fast5
然后用dorado提取修飾,這一步要GPU操作
/groups/lzu_public/home/u220220932211/software/dorado-0.7.1-linux-x64/bin/dorado/dorado basecaller /groups/lzu_public/home/u220220932211/software/dna_r10.4.1_e8.2_400bps_hac@v5.0.0/ /groups/lzu_public/home/u220220932211/work/ --modified-bases-models /groups/lzu_public/home/u220220932211/software/dna_r10.4.1_e8.2_400bps_sup@v5.0.0_5mC_5hmC@v1/ > calls.bam
最后用modkit提取甲基化信息
modkit pileup calls.bam out.bed --log-filepath pileup.log
提取結果長這樣:
官方解釋:
Definitions:
Nmod - Number of calls passing filters that were classified as a residue with a specified base modification.
Ncanonical - Number of calls passing filters were classified as the canonical base rather than modified. The exact base must be inferred by the modification code. For example, if the modification code is m (5mC) then the canonical base is cytosine. If the modification code is a, the canonical base is adenosine.
Nother mod - Number of calls passing filters that were classified as modified, but where the modification is different from the listed base (and the corresponding canonical base is equal). For example, for a given cytosine there may be 3 reads with h calls, 1 with a canonical call, and 2 with m calls. In the bedMethyl row for h Nother_mod would be 2. In the m row Nother_mod would be 3.
Nvalid_cov - the valid coverage. Nvalid_cov = Nmod + Nother_mod + Ncanonical, also used as the score in the bedMethyl
Ndiff - Number of reads with a base other than the canonical base for this modification. For example, in a row for h the canonical base is cytosine, if there are 2 reads with C->A substitutions, Ndiff will be 2.
Ndelete - Number of reads with a deletion at this reference position
Nfail - Number of calls where the probability of the call was below the threshold. The threshold can be set on the command line or computed from the data (usually failing the lowest 10th percentile of calls).
Nnocall - Number of reads aligned to this reference position, with the correct canonical base, but without a base modification call. This can happen, for example, if the model requires a CpG dinucleotide and the read has a CG->CH substitution such that no modification call was produced by the basecaller.
附圖