Article: grabseqs: simple downloading of reads and metadata from multiple next-generation sequencing data repositories | Bioinformatics | Oxford Academic (oup.com)
GitHub: louiejtaylor/grabseqs: A utility for easy downloading of reads from next-gen sequencing repositories like NCBI SRA (github.com)
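grabseqs is a tool for batch-downloading data from the NCBI SRA, MG-RAST, and iMicrobe databases. It was published in the journal Bioinformatics in 2020, and it can download SRA data and convert it directly to FASTQ files.
The conversion relies on fasterq-dump or fastq-dump, so install sra-tools before installing grabseqs: conda install -c bioconda sra-tools
The other requirements are a Python 3 environment, sra-tools newer than version 2.9, pigz, and wget.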
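The dependency list above can be checked before installing. The following is a minimal sketch, not part of grabseqs: the helper names check_deps and version_gt are hypothetical, and version_gt assumes a sort that supports -V (version sort), as GNU coreutils does.

```shell
# Sketch: verify that the tools grabseqs relies on are available on PATH.
# check_deps and version_gt are hypothetical helpers, not part of grabseqs.

# Print each missing command and return non-zero if any is missing.
check_deps() {
    local missing=0 tool
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Succeed when version $1 is strictly greater than version $2.
# Assumption: sort supports -V (version sort), e.g. GNU coreutils.
version_gt() {
    [ "$1" != "$2" ] &&
        [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | tail -n 1)" = "$1" ]
}
```

For example, check_deps python3 fasterq-dump pigz wget reports any tool that is not installed, and version_gt can compare the installed sra-tools version against 2.9 once that version string has been read from the tool.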
1 Download and installation
Install with conda:
conda install grabseqs -c louiejtaylor -c bioconda -c conda-forge
Or install with pip:
pip install grabseqs
2 Usage
2.1 Full parameter list:
grabseqs sra [-h] [-m METADATA] [-o OUTDIR] [-r RETRIES] [-t THREADS]
             [-f] [-l] [--no_parsing] [--parse_run_ids]
             [--use_fastq_dump]
             id [id ...]

positional arguments:
  id                  One or more BioProject, ERR/SRR or ERP/SRP number(s)

optional arguments:
  -h, --help          show this help message and exit
  -m METADATA         filename in which to save SRA metadata (.csv format,
                      relative to OUTDIR)
  -o OUTDIR           directory in which to save output. created if it doesn't
                      exist
  -r RETRIES          number of times to retry download
  -t THREADS          threads to use (for fasterq-dump/pigz)
  -f                  force re-download of files
  -l                  list (but do not download) samples to be grabbed
  --parse_run_ids     parse SRR/ERR identifers (do not pass straight to
                      fasterq-dump)
  --custom_fqdump_args CUSTOM_FQD_ARGS
                      "string" containing args to pass to fastq-dump
  --use_fastq_dump    use legacy fastq-dump instead of fasterq-dump (no
                      multithreaded downloading)
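To see how these options fit together, here is a small sketch that assembles a grabseqs sra command line and prints it for review instead of running it. The helper build_sra_cmd is hypothetical, and it uses only the -t, -m, -o, and -r flags from the help text above.

```shell
# Sketch: assemble a grabseqs sra command from the options listed in the help text.
# build_sra_cmd is a hypothetical helper; it prints the command rather than running it.
build_sra_cmd() {
    local threads="$1" outdir="$2" retries="$3"
    shift 3
    # -m is interpreted relative to OUTDIR, so metadata.csv ends up inside $outdir.
    echo "grabseqs sra -t $threads -m metadata.csv -o $outdir -r $retries $*"
}
```

Calling build_sra_cmd 10 proj/ 3 followed by one or more accession numbers prints the full command, which can then be checked and run by hand.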
2.2 Examples:
- Use 10 threads, save metadata to proj/metadata.csv, download to the directory proj/, retry failed downloads 3 times, and get all samples from SRP#######:
grabseqs sra -t 10 -m metadata.csv -o proj/ -r 3 SRP*********
- To pass your own arguments to fasterq-dump and get data in a slightly different format:
grabseqs sra SRP******* -r 0 --custom_fqdump_args="--split-spot --progress"
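Because grabseqs sra accepts any number of IDs, a long list of accessions can be kept in a text file and expanded on the command line. The sketch below assumes a hypothetical file with one ID per line, where blank lines and lines starting with # are ignored. The helper read_ids and the file name ids.txt are illustrative only.

```shell
# Sketch: read accession IDs from a text file, one per line.
# read_ids and the file name ids.txt are hypothetical, not part of grabseqs.
read_ids() {
    # Drop comment lines and blank lines, then strip surrounding whitespace.
    grep -v -e '^[[:space:]]*#' -e '^[[:space:]]*$' "$1" |
        sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//'
}
```

With such a file in place, grabseqs sra -o proj/ $(read_ids ids.txt) passes every listed ID as a separate argument.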
Brief examples of other common commands:
- Download all samples from a single SRA project:
grabseqs sra SRP********
- Or download any combination of projects (S/ERP), runs (S/ERR), and BioProjects (PRJNA):
grabseqs sra SRR******** ERP******** PRJNA******** ERR********
- For a dry run that only lists the samples that would be downloaded, pass -l:
grabseqs sra -l SRP********
- Similar syntax works for MG-RAST:
grabseqs mgrast mgp****** mgm*******
- And for iMicrobe (prefix sample numbers with "s" and project numbers with "p"):
grabseqs imicrobe p4 s3
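The three subcommands differ mainly in the ID prefixes they expect, so a mixed list of IDs can be routed by prefix. The following is a rough sketch: repo_for_id is a hypothetical helper, and its patterns are taken only from the examples shown here, so real identifiers may need stricter checks.

```shell
# Sketch: pick the grabseqs subcommand from an identifier's prefix.
# repo_for_id is a hypothetical helper; patterns follow the examples in this post.
repo_for_id() {
    case "$1" in
        SRP*|ERP*|SRR*|ERR*|PRJNA*) echo "sra" ;;       # SRA projects, runs, BioProjects
        mgp*|mgm*)                  echo "mgrast" ;;    # MG-RAST identifiers
        p[0-9]*|s[0-9]*)            echo "imicrobe" ;;  # iMicrobe projects (p) and samples (s)
        *)                          echo "unknown"; return 1 ;;
    esac
}
```

For instance, grabseqs "$(repo_for_id p4)" p4 expands to grabseqs imicrobe p4.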