Data Analysis in Practice | Hi-C Data Format Conversion

Raw contact to .hic → juicer_tools


Software: https://github.com/aidenlab/juicer/wiki/Download
Documentation: https://github.com/aidenlab/juicer/wiki/Pre

About the software

Installation

wget https://s3.amazonaws.com/hicfiles.tc4ga.com/public/juicer/juicer_tools_1.22.01.jar

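Usage

The pre command converts a text file (<infile>) into a .hic file (<outfile>) that stores the contact map at multiple resolutions.
For details of the .hic format, see: https://www.cell.com/cell-systems/fulltext/S2405-4712(16)30219-8

The default resolutions are 2.5M, 1M, 500K, 250K, 100K, 50K, 25K, 10K, and 5K; a different set can be specified with the -r option.
Unless the -n option is used, the output .hic file already includes the VC, VC_SQRT, KR and SCALE normalization results.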
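
The resolutions above are written in shorthand, while -r takes a comma-separated list, which to my knowledge is given in base pairs. Below is a minimal sketch of that conversion. The helper name to_bp_list is hypothetical and not part of juicer_tools.

```shell
#!/bin/sh
# Hypothetical helper (not part of juicer_tools): turn shorthand resolutions
# such as 2.5M or 500K into a comma-separated list of base-pair values.
to_bp_list() {
    printf '%s\n' "$@" | awk '
        {
            value = $1
            unit = toupper(substr(value, length(value), 1))
            number = substr(value, 1, length(value) - 1)
            if (unit == "M")      bp = number * 1000000
            else if (unit == "K") bp = number * 1000
            else                  bp = value + 0      # already in bp
            printf "%s%d", (NR > 1 ? "," : ""), bp
        }
        END { print "" }'
}

# The default resolution set listed above, expressed in bp.
res_list=$(to_bp_list 2.5M 1M 500K 250K 100K 50K 25K 10K 5K)
echo "$res_list"
```

The resulting string could then be passed to pre as -r "$res_list", or trimmed down to just the resolutions you need.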

Usage:   juicer_tools pre [options] <infile> <outfile> <genomeID>
   : -d only calculate intra chromosome (diagonal) [false]
   : -f <restriction site file> calculate fragment map
   : -m <int> only write cells with count above threshold m [0]
   : -q <int> filter by MAPQ score greater than or equal to q [not set]
   : -c <chromosome ID> only calculate map on specific chromosome [not set]
   : -r <comma-separated list of resolutions> Only calculate specific resolutions [not set]
   : -t <tmpDir> Set a temporary directory for writing
   : -s <statistics file> Add the text statistics file to the Hi-C file header
   : -g <graphs file> Add the text graphs file to the Hi-C file header
   : -n Don't normalize the matrices
   : -z <double> scale factor for hic file
   : -a <1, 2, 3, 4, 5> filter based on inner, outer, left-left, right-right, tandem pairs respectively
   : --randomize_position randomize positions between fragment sites
   : --random_seed <long> for seeding random number generator
   : --frag_site_maps <fragment site files> for randomization
   : -k normalizations to include
   : -j number of CPU threads to use
   : --threads <int> number of threads 
   : --mndindex <filepath> to mnd chr block indices

Input format

short format

  • 8 columns: <str1> <chr1> <pos1> <frag1> <str2> <chr2> <pos2> <frag2>
    Notes:
    • str: strand (0 for forward, anything else for reverse; strand information is currently not stored in the .hic file)
    • frag: restriction site fragment (juicer_tools pre automatically discards reads that map to the same restriction fragment, so when no fragment information is available, setting frag1 to 0 and frag2 to 1 is recommended)

In addition, the data must satisfy:

  1. chr1 <= chr2
  2. Sorted by chr1, then chr2 (i.e. all chr3-chr3 reads must be grouped together)
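
Before running pre, it can help to sanity-check a short-format file against the rules above. The sketch below covers only two of them: every line has 8 columns, and each chr1/chr2 combination forms a single contiguous block. The function name check_short_format and the demo coordinates are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: verify that a short-format file has 8 columns per line
# and that every (chr1, chr2) combination appears as one contiguous block.
check_short_format() {
    awk '
        NF != 8 {
            printf "line %d: expected 8 columns, found %d\n", NR, NF
            bad = 1
            next
        }
        {
            key = $2 "_" $6
            if (key != prev) {
                # A combination that was seen earlier and starts again
                # means its block has been split.
                if (key in seen) {
                    printf "line %d: chromosome combination %s appears in multiple blocks\n", NR, key
                    bad = 1
                }
                seen[key] = 1
                prev = key
            }
        }
        END { if (!bad) print "OK"; exit bad }' "$1"
}

# Demo with made-up coordinates: the chr1-chr1 block is interrupted
# by a chr1-chr2 line, so the check reports a problem on line 3.
demo=$(mktemp)
printf '0\tchr1\t100\t0\t1\tchr1\t500\t1\n' >  "$demo"
printf '0\tchr1\t200\t0\t1\tchr2\t900\t1\n' >> "$demo"
printf '0\tchr1\t300\t0\t1\tchr1\t700\t1\n' >> "$demo"
check_short_format "$demo" || echo "file needs re-sorting"
rm -f "$demo"
```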

Example

Raw data format: <seqID> <chr1> <pos1> <chr2> <pos2>


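Step 1. Re-organization of raw data
Convert the raw data to short format and sort by chromosome.

# Note: this awk call reads fields $1-$4 as chr1, pos1, chr2, pos2.
# If your file starts with a <seqID> column as in the raw data format above, use $2-$5 instead.
cat ${raw_contact_file} | \
    awk 'BEGIN{OFS="\t"}{print 0, $1, $2, 0, 1, $3, $4, 1}' | \
    sort -k2,2d -k6,6d \
    >  ${short_format_contact_file}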
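
The pipeline in Step 1 sorts the lines but does not by itself guarantee chr1 <= chr2 on every line. Here is a minimal sketch that swaps the two ends when needed. It assumes the five-column layout <seqID> <chr1> <pos1> <chr2> <pos2> and plain string ordering of chromosome names; both are assumptions, so adjust the field numbers and the comparison to your data. The function name raw_to_short is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: raw 5-column contacts -> short format with chr1 <= chr2.
# Assumes columns <seqID> <chr1> <pos1> <chr2> <pos2> and string ordering of names.
raw_to_short() {
    LC_ALL=C awk 'BEGIN { OFS = "\t" }
        {
            c1 = $2; p1 = $3; c2 = $4; p2 = $5
            # Appending "" forces a string comparison even for numeric names.
            if ((c1 "") > (c2 "")) {
                t = c1; c1 = c2; c2 = t
                t = p1; p1 = p2; p2 = t
            }
            # No strand or fragment information: use 0/0 and 1/1 as placeholders.
            print 0, c1, p1, 0, 1, c2, p2, 1
        }' "$1" |
    LC_ALL=C sort -t "$(printf '\t')" -k2,2 -k6,6
}

# Demo with made-up contacts: the first line has chr2 before chr1 and gets swapped.
raw=$(mktemp)
printf 'r1\tchr2\t500\tchr1\t100\n' >  "$raw"
printf 'r2\tchr1\t300\tchr1\t50\n'  >> "$raw"
raw_to_short "$raw"
rm -f "$raw"
```

Sorting under LC_ALL=C keeps the order used by sort consistent with the byte-wise comparison done in awk.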

Step 2. From short-format txt to .hic

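# Path to the downloaded jar; the same variable name is used in the java call below
juicer_tools=~/Softwares/juicer_tools_1.22.01.jar
infile=${short_format_contact_file}
outfile=${hic_file}
genomeID=mm10
# Options go before the positional arguments, as in the usage line
java -Xmx2g -jar ${juicer_tools} pre --threads 4 ${infile} ${outfile} ${genomeID}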

Troubleshooting Tips

  1. Error: the chromosome combination 1_1 appears in multiple blocks
    Cause: the reads are not sorted by chromosome
    Solution: sort -k2,2d -k6,6d (adjust to the columns that actually hold the chromosomes)

  2. Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    Cause: the Java heap is too small
    Solution: increase the -Xmx value

Raw contact to .cool → cooler


Software: https://github.com/open2c/cooler
Documentation: https://cooler.readthedocs.io/en/latest/cli.html#cooler-cload-pairs

Installation:

pip install cooler

Usage

The cooler cload pairs command bins a contact file into a .cool file at the resolution (bin size) given in BINS.

Usage: cooler cload pairs [OPTIONS] BINS PAIRS_PATH COOL_PATH

  Bin any text file or stream of pairs.

  Pairs data need not be sorted. Accepts compressed files. To pipe input from
  stdin, set PAIRS_PATH to '-'.

  BINS : One of the following

      <TEXT:INTEGER> : 1. Path to a chromsizes file, 2. Bin size in bp

      <TEXT> : Path to BED file defining the genomic bin segmentation.

  PAIRS_PATH : Path to contacts (i.e. read pairs) file.

  COOL_PATH : Output COOL file path or URI.

Options:
  --metadata TEXT                 Path to JSON file containing user metadata.
  --assembly TEXT                 Name of genome assembly (e.g. hg19, mm10)
  -c1, --chrom1 INTEGER           chrom1 field number (one-based)  [required]
  -p1, --pos1 INTEGER             pos1 field number (one-based)  [required]
  -c2, --chrom2 INTEGER           chrom2 field number (one-based)  [required]
  -p2, --pos2 INTEGER             pos2 field number (one-based)  [required]
  --chunksize INTEGER             Number of input lines to load at a time
  -0, --zero-based                Positions are zero-based  [default: False]
  --comment-char TEXT             Comment character that indicates lines to
                                  ignore.  [default: #]
  -N, --no-symmetric-upper        Create a complete square matrix without
                                  implicit symmetry. This allows for distinct
                                  upper- and lower-triangle values
  --input-copy-status [unique|duplex]
                                  Copy status of input data when using
                                  symmetric-upper storage. | `unique`:
                                  Incoming data comes from a unique half of a
                                  symmetric map, regardless of how the
                                  coordinates of a pair are ordered. `duplex`:
                                  Incoming data contains upper- and lower-
                                  triangle duplicates. All input records that
                                  map to the lower triangle will be discarded!
                                  | If you wish to treat lower- and upper-
                                  triangle input data as distinct, use the
                                  ``--no-symmetric-upper`` option.   [default:
                                  unique]
  --field TEXT                    Specify quantitative input fields to
                                  aggregate into value columns using the
                                  syntax ``--field <field-name>=<field-
                                  number>``. Optionally, append ``:`` followed
                                  by ``dtype=<dtype>`` to specify the data
                                  type (e.g. float), and/or ``agg=<agg>`` to
                                  specify an aggregation function different
                                  from sum (e.g. mean). Field numbers are
                                  1-based. Passing 'count' as the target name
                                  will override the default behavior of
                                  storing pair counts. Repeat the ``--field``
                                  option for each additional field.
  --temp-dir DIRECTORY            Create temporary files in a specified
                                  directory. Pass ``-`` to use the platform
                                  default temp dir.
  --no-delete-temp                Do not delete temporary files when finished.
  --max-merge INTEGER             Maximum number of chunks to merge before
                                  invoking recursive merging  [default: 200]
  --storage-options TEXT          Options to modify the data filter pipeline.
                                  Provide as a comma-separated list of key-
                                  value pairs of the form 'k1=v1,k2=v2,...'.
                                  See http://docs.h5py.org/en/stable/high/data
                                  set.html#filter-pipeline for more details.
  -h, --help                      Show this message and exit.

Example

contact -> 1kb .cool file

cooler cload pairs -c1 1 -p1 2 -c2 3 -p2 4 \
    mm10.chrom.sizes:1000 \
    129G1_chr19.contact.bedpe \
    129G1_chr19.1000.cool
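
To make the idea of binning concrete, here is a small sketch of the arithmetic behind a fixed bin size: with 1-based positions, a position falls into the bin that starts at floor((pos - 1) / binsize) * binsize. The awk code below counts contacts per pair of bins for a toy input. It only illustrates the concept and is not how cooler is implemented; the function name bin_pairs and the coordinates are made up.

```shell
#!/bin/sh
# Illustration only: count contacts per pair of fixed-size bins.
# Assumed input columns: chr1 pos1 chr2 pos2, with 1-based positions.
bin_pairs() {
    binsize=$1
    LC_ALL=C awk -v b="$binsize" 'BEGIN { OFS = "\t" }
        {
            start1 = int(($2 - 1) / b) * b   # start of the bin holding pos1
            start2 = int(($4 - 1) / b) * b   # start of the bin holding pos2
            count[$1 OFS start1 OFS $3 OFS start2]++
        }
        END { for (k in count) print k, count[k] }' "$2" |
    LC_ALL=C sort
}

# Demo with made-up contacts on chr19 at 1 kb resolution.
pairs=$(mktemp)
printf 'chr19\t3000500\tchr19\t3001200\n' >  "$pairs"
printf 'chr19\t3000900\tchr19\t3001999\n' >> "$pairs"
printf 'chr19\t3001000\tchr19\t3005000\n' >> "$pairs"
bin_pairs 1000 "$pairs"
rm -f "$pairs"
```

Each output line gives the two bin starts and the number of contacts falling into that bin pair, which is conceptually what a .cool file stores as pixels.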

.cool to multi-resolution .mcool file

cooler zoomify 129G1_chr19.1000.cool
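
zoomify builds the coarser levels by aggregating the base bins, so, as far as I know, each target resolution has to be an integer multiple of the base bin size. The sketch below filters a candidate list accordingly. The helper name valid_zoom_levels is hypothetical, and passing the result to zoomify through its resolutions option is an assumption to verify against cooler zoomify --help for your version.

```shell
#!/bin/sh
# Hypothetical helper: keep only the candidate resolutions that are
# integer multiples of the base bin size, joined with commas.
valid_zoom_levels() {
    base=$1
    shift
    for res in "$@"; do
        if [ "$res" -ge "$base" ] && [ $((res % base)) -eq 0 ]; then
            printf '%s\n' "$res"
        fi
    done | paste -sd, -
}

# Base resolution 1000 bp; 2500 is dropped because it is not a multiple of 1000.
valid_zoom_levels 1000 1000 2000 2500 5000 10000
```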

.hic to .mcool → hic2cool


Software: https://github.com/4dn-dcic/hic2cool

Installation

pip install hic2cool

Usage

hic2cool convert <infile> <outfile> -r <resolution> -p <nproc>

positional arguments:
  infile                hic input file path
  outfile               cooler output file path

optional arguments:
  -h, --help            show this help message and exit
  -r RESOLUTION, --resolution RESOLUTION
                        integer bp resolution desired in cooler file. Setting to 0 (default) will use all resolutions. If all resolutions are
                        used, a multi-res .cool file will be created, which has a different hdf5 structure. See the README for more info
  -p NPROC, --nproc NPROC
                        number of processes to use to parse hic file. default set to 1
  -s, --silent          if used, silence standard program output
  -w, --warnings        if used, print out non-critical WARNING messages, which are hidden by default. Silent mode takes precedence over this

Example

Generate a multi-resolution .mcool file

hic2cool convert ${hic_file} ${mcool_file}

Generate a .cool file at a specific resolution

hic2cool convert ${hic_file} ${cool_50kb_file} -r 50000