This article walks through metagenomic taxonomic classification, assembly, binning, gene prediction, and annotation...
A. Software list and installation
A.1 Taxonomic classification
A.1.1 MetaPhlAn 3.0
MetaPhlAn is a tool for profiling the composition of microbial communities from metagenomic shotgun sequencing data.
a. Install the main program
```bash
$ conda create -n metaphlan python=3.7 metaphlan
```
b. Install the database
```bash
$ metaphlan --install
```
c. Install dependencies
- hclust2
```bash
$ conda install -c bioconda hclust2
```
- R, vegan, ape
```bash
$ conda install r-base r-vegan r-ape
```
- rbiom
```bash
$ R
> install.packages("rbiom")
> quit()
```
A.1.2 GTDB-Tk
a. Hardware requirements
Domain | Memory | Storage | Time |
---|---|---|---|
Archaea | ~8 GB | ~27 GB | ~1 hour / 1,000 genomes @ 64 CPUs |
Bacteria | ~150 GB | ~27 GB | ~1 hour / 1,000 genomes @ 64 CPUs |
b. Install GTDB-Tk with conda
```bash
$ conda create -n gtdbtk -c conda-forge -c bioconda gtdbtk
```
c. GTDB-Tk reference data
- Note that different versions of the GTDB release data may not run on all versions of GTDB-Tk; all supported combinations are listed below:
GTDB Release | Minimum version | Maximum version |
---|---|---|
R95 | 1.3.0 | N/A |
R89 | 0.3.0 | 1.2.0 |
R86.2 | 0.2.1 | 0.2.2 |
R86 | 0.1.0 | 0.1.6 |
R83 | 0.0.6 | 0.0.7 |
- Download the reference data
```bash
$ wget https://data.ace.uq.edu.au/public/gtdb/data/releases/release95/95.0/auxillary_files/gtdbtk_r95_data.tar.gz
```
Note: GTDB-Tk requires an environment variable named GTDBTK_DATA_PATH to be set to the directory containing the unarchived GTDB-Tk reference data.
- You can automatically alias GTDBTK_DATA_PATH whenever the environment is activated by editing {gtdbtk environment path}/etc/conda/activate.d/gtdbtk.sh, e.g.:
```bash
# Determine the GTDB-Tk environment path
```
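The snippet above is truncated in the source; a minimal sketch of the remaining setup, assuming the archive is unpacked to /path/to/release95 (both paths here are placeholders):
```bash
# Unpack the reference data
$ mkdir -p /path/to/release95 && tar xvzf gtdbtk_r95_data.tar.gz -C /path/to/release95
# Export GTDBTK_DATA_PATH on every activation of the gtdbtk environment
$ echo "export GTDBTK_DATA_PATH=/path/to/release95" >> ~/miniconda3/envs/gtdbtk/etc/conda/activate.d/gtdbtk.sh
```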
A.1.3 Kraken2
a. Hardware requirements
- Disk space: Construction of a Kraken 2 standard database requires approximately 100 GB of disk space. A test on 01 Jan 2018 of the default installation showed 42 GB of disk space was used to store the genomic library files, 26 GB was used to store the taxonomy information from NCBI, and 29 GB was used to store the Kraken 2 compact hash table.
- Memory: To run efficiently, Kraken 2 requires enough free memory to hold the database (primarily the hash table) in RAM. While this can be accomplished with a ramdisk, Kraken 2 will by default load the database into process-local RAM; the --memory-mapping switch to kraken2 will avoid doing so. The default database size is 29 GB (as of Jan. 2018), and you will need slightly more than that in RAM if you want to build the default database.
b. Install Kraken2 with conda
```bash
$ conda create -n kraken2 kraken2
```
c. Build the database
- Download the library files. cd into a directory with plenty of free space and run the following command; the libraries downloaded here are archaea, bacteria, plasmid, viral, fungi, protozoa, UniVec, and UniVec_Core:
```bash
nohup bash -c 'for i in archaea bacteria plasmid viral fungi protozoa UniVec UniVec_Core; do kraken2-build --download-library $i --threads 24 --db db_name; done' &
```
- Download the taxonomy
```bash
nohup kraken2-build --download-taxonomy --threads 24 --db db_name &
```
- Build the index
```bash
nohup kraken2-build --build --threads 24 --db db_name &
```
d. Sequence classification
```bash
kraken2 --paired --threads 24 --unclassified-out unclassified#.fq --classified-out classified#.fq --output outfile --confidence 0.5 --memory-mapping --use-names --report reportname --report-zero-counts --db $DBNAME reads1 reads2
```
A.2 Assembly, binning, and quality assessment
A.2.1 metawrap
MetaWRAP is a pipeline for genome-resolved metagenomic data analysis.
a. Install the main program
```bash
$ conda create -n metawrap python=2.7 -c ursky -c bioconda -c conda-forge metawrap-mg
```
b. Install additional analysis tools into the metawrap environment
- cd-hit
- coverm: DNA read coverage and relative abundance calculator focused on metagenomics applications
- bamm: metagenomics-focused BAM file manipulation
- unitem: ensemble binning strategies for combining the output of multiple binning methods
- humann2: the HMP Unified Metabolic Analysis Network 2
- graphlan
- export2graphlan
- checkm database
```bash
$ conda activate metawrap
$ conda install cd-hit coverm bamm unitem humann2 graphlan export2graphlan
# cd into a suitable directory in which to store the CheckM database
$ wget https://data.ace.uq.edu.au/public/CheckM_databases/checkm_data_2015_01_16.tar.gz
$ tar zxvf checkm_data_2015_01_16.tar.gz
$ checkm data setRoot
# then enter the full path to the database when prompted
```
A.3 Functional annotation of non-redundant genes
A.3.1 eggNOG-mapper
Functional annotation against UniRef, eggNOG, KEGG, GO, CAZy, VFDB, CARD, TCDB, and PHI.
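The source gives no commands for eggNOG-mapper; a minimal install-and-run sketch consistent with the conda workflow used elsewhere in this guide (input/output names are placeholders):
```bash
# Install eggNOG-mapper and fetch its reference data
$ conda create -n eggnog -c bioconda -c conda-forge eggnog-mapper
$ conda activate eggnog
$ download_eggnog_data.py
# Annotate predicted proteins
$ emapper.py -i proteins.faa -o eggnog_out --cpu 16
```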
A.3.2 EnrichM
EnrichM is a set of comparative genomics tools for large sets of metagenome-assembled genomes (MAGs). The current functionality includes:
- A basic annotation pipeline for MAGs.
- A pipeline to determine the metabolic pathways encoded by MAGs, using KEGG modules as a reference (custom pathways can also be specified).
- A pipeline to identify genes or metabolic pathways that are enriched within and between user-defined groups of genomes (groups can be genomes that are related functionally, phylogenetically, recovered from different environments, etc.).
- A pipeline to construct metabolic networks from annotated population genomes.
- A pipeline to construct random forest machine learning models from the functional composition of MAGs, metagenomes, or transcriptomes.
- A pipeline to apply these random forest models to classify new MAGs or metagenomes.
a. Install the main program
```bash
$ conda create -n enrichm python=3
```
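The conda command above only creates the environment; EnrichM itself is distributed on PyPI, so the missing install step is presumably:
```bash
$ conda activate enrichm
$ pip install enrichm
```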
b. Install the database
```bash
# ~5.7 GB download (the command itself was omitted in the source; `enrichm data` is EnrichM's database installer)
$ enrichm data
```
Error:
```
Traceback (most recent call last):
  File "/home/hualin/miniconda3/envs/enrichm/lib/python3.6/site-packages/enrichm/data.py", line 114, in do
    version_remote = urllib.request.urlopen(self.ftp + self.VERSION).readline().strip().decode("utf-8")
AttributeError: module 'urllib' has no attribute 'request'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/hualin/miniconda3/envs/enrichm/bin/enrichm", line 342, in <module>
    run.run_enrichm(args, sys.argv)
  File "/home/hualin/miniconda3/envs/enrichm/lib/python3.6/site-packages/enrichm/run.py", line 288, in run_enrichm
    d.do(args.uninstall, args.dry)
  File "/home/hualin/miniconda3/envs/enrichm/lib/python3.6/site-packages/enrichm/data.py", line 117, in do
    "Unable to find most current EnrichM database VERSION in ftp. Please complain at https://github.com/geronimp/enrichM")
Exception: Unable to find most current EnrichM database VERSION in ftp. Please complain at https://github.com/geronimp/enrichM
```
Solution: replace 'import urllib' with 'import urllib.request' in data.py:
```bash
$ vim /home/hualin/miniconda3/envs/enrichm/lib/python3.6/site-packages/enrichm/data.py
```
c. Specifying the location of the EnrichM database
If you would like to store the EnrichM database outside of your home directory, you need to tell EnrichM where to look. To do this, export a BASH variable named "ENRICHM_DB":
```bash
$ echo 'export ENRICHM_DB=/path/to/database/' >> ~/.bashrc
```
Note: adjust /path/to/database/ to your actual location!
Error:
```
$ enrichm
Traceback (most recent call last):
  File "/home/hualin/miniconda3/envs/enrichm/bin/enrichm", line 38, in <module>
    from enrichm.run import Run
  File "/home/hualin/miniconda3/envs/enrichm/lib/python3.6/site-packages/enrichm/run.py", line 24, in <module>
    from enrichm.network_analyzer import NetworkAnalyser
  File "/home/hualin/miniconda3/envs/enrichm/lib/python3.6/site-packages/enrichm/network_analyzer.py", line 22, in <module>
    from enrichm.network_builder import NetworkBuilder
  File "/home/hualin/miniconda3/envs/enrichm/lib/python3.6/site-packages/enrichm/network_builder.py", line 24, in <module>
    from enrichm.databases import Databases
  File "/home/hualin/miniconda3/envs/enrichm/lib/python3.6/site-packages/enrichm/databases.py", line 28, in <module>
    class Databases:
  File "/home/hualin/miniconda3/envs/enrichm/lib/python3.6/site-packages/enrichm/databases.py", line 36, in Databases
    PICKLE_VERSION = open(os.path.join(CUR_DATABASE_DIR, 'VERSION')).readline().strip()
FileNotFoundError: [Errno 2] No such file or directory: '/new_data/hualin/db/enrichm_database_v10/26-11-2018/VERSION'
```
Solution: copy all of the downloaded database files into the "26-11-2018" directory as well; otherwise a later annotate run will complain that the database files cannot be found.
B. Data analysis
B.1 Taxonomic profiling of reads with MetaPhlAn
Step 1. Activate the environment
```bash
$ conda activate metaphlan
```
Step 2. Profile the paired-end reads
```bash
$ metaphlan Reads1,Reads2 --input_type fastq --bowtie2out Str1.bowtie2.bz2 --nproc 10 -o Str1_profiled.txt
```
Reads1 and Reads2 are the forward and reverse reads from paired-end sequencing. --input_type specifies the input format (raw data off the sequencer is usually fastq); --bowtie2out saves the intermediate alignment file so that later re-runs can use it as input; --nproc sets the number of threads; -o names the output file.
Step 3. Merge all result files
Run Step 2 on every sample; the resulting output files (*_profiled.txt) can then be merged into a single table (merged_abundance_table.txt) to make cross-sample comparison easier.
```bash
$ merge_metaphlan_tables.py *_profiled.txt > merged_abundance_table.txt
```
Step 4. Extract the species-level clades from the merged table
```bash
$ grep -E "s__|clade" merged_abundance_table.txt | sed 's/^.*s__//g' | cut -f1,3-8 | sed -e 's/clade_name/body_site/g' > merged_abundance_table_species.txt
```
Step 5. Plot a species-level abundance heatmap across samples
```bash
$ hclust2.py -i merged_abundance_table_species.txt -o abundance_heatmap_species.png --ftop 50 --f_dist_f braycurtis --s_dist_f braycurtis --cell_aspect_ratio 0.5 -l --flabel_size 10 --slabel_size 10 --max_flabel_len 100 --max_slabel_len 100 --minv 0.1 --dpi 300
```
Step 6. Compute pairwise UniFrac distances between samples
```bash
# Download the required tree file and script, and place them in the same directory as the files to analyse
```
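The exact commands are omitted in the source; a sketch using MetaPhlAn 3's calculate_unifrac.R utility and the species tree shipped with the mpa_v30 database (file names may differ between database versions):
```bash
# Fetch the helper script from the MetaPhlAn repository
$ wget https://raw.githubusercontent.com/biobakery/MetaPhlAn/master/metaphlan/utils/calculate_unifrac.R
# Compute UniFrac distances from the merged abundance table and the species tree
$ Rscript calculate_unifrac.R merged_abundance_table.txt mpa_v30_CHOCOPhlAn_201901_species_tree.nwk unifrac_matrix.tsv
```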
Step 7. Plot a cladogram
```bash
# Activate the environment containing the required tools
```
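The commands are omitted in the source; a sketch following the standard export2graphlan/GraPhlAn route from the MetaPhlAn tutorial (thresholds are illustrative):
```bash
# Convert the merged table into a GraPhlAn tree + annotation file
$ export2graphlan.py --skip_rows 1,2 -i merged_abundance_table.txt --tree merged_abundance.tree.txt --annotation merged_abundance.annot.txt --most_abundant 100 --abundance_threshold 1 --least_biomarkers 10 --annotations 5,6 --external_annotations 7
# Attach the annotation and render the cladogram
$ graphlan_annotate.py --annot merged_abundance.annot.txt merged_abundance.tree.txt merged_abundance.xml
$ graphlan.py --dpi 300 merged_abundance.xml merged_abundance_cladogram.png --external_legends
```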
B.2 Assembling reads with metaWRAP
Step 1. Activate the environment
```bash
$ conda activate metawrap
```
Step 2. Assembly
```bash
$ metawrap assembly -1 Reads1 -2 Reads2 -o Assemble1 -m 300 -t 15 --metaspades
```
Parameters:
Usage: metaWRAP assembly [options] -1 reads_1.fastq -2 reads_2.fastq -o output_dir
Options:
-1 STR forward fastq reads
-2 STR reverse fastq reads
-o STR output directory
-m INT memory in GB (default=24)
-t INT number of threads (default=1)
-l INT minimum length of assembled contigs (default=1000)
--megahit assemble with megahit (default)
--metaspades assemble with metaspades instead of megahit (better results but slower and higher memory requirement)
Reads1 and Reads2 are the forward and reverse reads from paired-end sequencing. -o sets the output directory; -m sets the maximum memory in GB (the program aborts once this is exceeded, so set it generously: my ~10 GB dataset needed roughly 180 GB of RAM); -t sets the number of threads; --metaspades assembles with metaSPAdes, which is much slower but gives somewhat better assemblies.
B.3 Binning contigs with metaWRAP
Step 1. Activate the environment
```bash
$ conda activate metawrap
```
Step 2. Binning
```bash
$ metawrap binning -o Str1.INITIAL_BINNING -t 20 -m 200 --universal --run-checkm -a <path to contigs> --metabat2 --maxbin2 --concoct <decompressed Reads1> <decompressed Reads2>
```
Parameters:
Usage: metaWRAP binning [options] -a assembly.fa -o output_dir readsA_1.fastq readsA_2.fastq ... [readsX_1.fastq readsX_2.fastq]
Note1: Make sure to provide all your separate replicate read files, not the joined file.
Note2: You may provide single end or interleaved reads as well with the use of the correct option
Note3: If the output already has the .bam alignments files from previous runs, the module will skip re-aligning the reads
Options:
- -a STR metagenomic assembly file
- -o STR output directory
- -t INT number of threads (default=1)
- -m INT amount of RAM available (default=4)
- -l INT minimum contig length to bin (default=1000bp). Note: metaBAT will default to 1500bp minimum
- --metabat2 bin contigs with metaBAT2
- --metabat1 bin contigs with the original metaBAT
- --maxbin2 bin contigs with MaxBin2
- --concoct bin contigs with CONCOCT
- --universal use universal marker genes instead of bacterial markers in MaxBin2 (improves Archaea binning)
- --run-checkm immediately run CheckM on the bin results (requires 40GB+ of memory)
- --single-end non-paired reads mode (provide *.fastq files)
- --interleaved the input read files contain interleaved paired-end reads
Step 3. Consolidate the bin sets produced by the three binners (metabat2, maxbin2, concoct)
```bash
$ metawrap bin_refinement -o F01_BIN_REFINEMENT -t 20 -A str1.INITIAL_BINNING/metabat2_bins/ -B str1.INITIAL_BINNING/maxbin2_bins/ -C str1.INITIAL_BINNING/concoct_bins/ -c 70 -x 5
```
Parameters:
Usage: metaWRAP bin_refinement [options] -o output_dir -A bin_folderA [-B bin_folderB -C bin_folderC]
Note: the contig names in different bin folders must be consistent (must come from the same assembly).
Options:
- -o STR output directory
- -t INT number of threads (default=1)
- -m INT memory available (default=40)
- -c INT minimum % completeness of bins [should be >50%] (default=70)
- -x INT maximum acceptable % contamination of bins (default=10)
- -A STR folder with metagenomic bins (files must have .fa or .fasta extension)
- -B STR another folder with metagenomic bins
- -C STR another folder with metagenomic bins
- --skip-refinement don't use binning_refiner to come up with refined bins based on combinations of binner outputs
- --skip-checkm don't run CheckM to assess bins
- --skip-consolidation choose the best version of each bin from all bin refinement iteration
- --keep-ambiguous for contigs that end up in more than one bin, keep them in all bins (default: keeps them only in the best bin)
- --remove-ambiguous for contigs that end up in more than one bin, remove them in all bins (default: keeps them only in the best bin)
- --quick adds --reduced_tree option to checkm, reducing runtime, especially with low memory
Step 4. Visualize bins with Blobology
A pitfall: metawrap installs BLAST 2.6, which can only use version 4 nt databases, while the current nt database is version 5 and v4 is no longer maintained. You therefore need to upgrade BLAST in the metawrap environment to 2.8.0 or later. This cannot be done with 'conda update blast'; instead, download the new BLAST binaries and overwrite the BLAST programs in the metawrap environment (see the sketch below). If you use the default BLAST with the v5 nt database you will get the error: "BLAST Database error: Error: Not a valid version 4 database".
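A sketch of the BLAST swap described above; the version and paths are placeholders (pick any release >= 2.8.0 from the NCBI FTP):
```bash
# Download and unpack a newer BLAST+ release
$ wget https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/2.10.1/ncbi-blast-2.10.1+-x64-linux.tar.gz
$ tar xvzf ncbi-blast-2.10.1+-x64-linux.tar.gz
# Overwrite the BLAST binaries inside the metawrap conda environment
$ cp ncbi-blast-2.10.1+/bin/* ~/miniconda3/envs/metawrap/bin/
```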
```bash
$ metawrap blobology -a Result/Assemble/F01/final_assembly.fasta -t 20 -o F01.BLOBOLOGY --bins F01_BIN_REFINEMENT/metawrap_70_5_bins reads/F01_clean_1.fastq reads/F01_clean_2.fastq
```
Parameters:
Usage: metaWRAP blobology [options] -a assembly.fasta -o output_dir readsA_1.fastq readsA_2.fastq [readsB_1.fastq readsB_2.fastq ...]
Options:
- -a STR assembly fasta file
- -o STR output directory
- -t INT number of threads
- --subsamble INT Number of contigs to run blobology on. Subsampling is randomized. (default=ALL)
- --bins STR Folder containing bins. Contig names must match those of the assembly file. (default=None)
Step 5. Quantify bins
```bash
$ metawrap quant_bins -b F01_BIN_REFINEMENT/metawrap_70_5_bins -t 8 -o F01.QUANT_BINS -a Result/Assemble/F01/final_assembly.fasta reads/F01_clean_1.fastq reads/F01_clean_2.fastq
```
Parameters:
Usage: metaWRAP quant_bins [options] -b bins_folder -o output_dir -a assembly.fa readsA_1.fastq readsA_2.fastq ... [readsX_1.fastq readsX_2.fastq]
Options:
- -b STR folder containing draft genomes (bins) in fasta format
- -o STR output directory
- -a STR fasta file with entire metagenomic assembly (strongly recommended!)
- -t INT number of threads
Step 6. Reassemble bins
```bash
metawrap reassemble_bins -o F01.BIN_REASSEMBLY -1 reads/F01_clean_1.fastq -2 reads/F01_clean_2.fastq -t 20 -m 400 -c 70 -x 10 -b F01_BIN_REFINEMENT/metawrap_70_5_bins
```
Parameters:
Usage: metaWRAP reassemble_bins [options] -o output_dir -b bin_folder -1 reads_1.fastq -2 reads_2.fastq
Options:
- -b STR folder with metagenomic bins
- -o STR output directory
- -1 STR forward reads to use for reassembly
- -2 STR reverse reads to use for reassembly
- -t INT number of threads (default=1)
- -m INT memory in GB (default=40)
- -c INT minimum desired bin completion % (default=70)
- -x INT maximum desired bin contamination % (default=10)
- -l INT minimum contig length to be included in reassembly (default=500)
- --strict-cut-off maximum allowed SNPs for strict read mapping (default=2)
- --permissive-cut-off maximum allowed SNPs for permissive read mapping (default=5)
- --skip-checkm don't run CheckM to assess bins
- --parallel run spades reassembly in parallel, but only using 1 thread per bin
B.4 MAG annotation
B.4.1 Taxonomic classification, annotation, and phylogenetic tree construction with GTDB-Tk
```bash
$ conda activate gtdbtk
```
classify_wf: the classify workflow (covers Steps 1-3 below)
The classify workflow consists of three steps: identify, align, and classify.
```bash
$ time gtdbtk classify_wf --genome_dir metawrap_70_5_bins/ --out_dir classify_wf_output -x .fa --prefix F --cpus 6 -r --debug
```
Parameters:
usage: gtdbtk classify_wf (--genome_dir GENOME_DIR | --batchfile BATCHFILE) --out_dir OUT_DIR [-x EXTENSION] [--min_perc_aa MIN_PERC_AA] [--prefix PREFIX] [--cpus CPUS] [--pplacer_cpus PPLACER_CPUS] [--force] [--scratch_dir SCRATCH_DIR] [-r] [--min_af MIN_AF] [--debug] [-h]
mutually exclusive required arguments:
- --genome_dir GENOME_DIR  directory containing genome files in FASTA format
- --batchfile BATCHFILE  path to file describing genomes - tab separated in 2 or 3 columns (FASTA file, genome ID, translation table [optional])
required named arguments:
- --out_dir OUT_DIR  directory to output files
optional arguments:
- -x, --extension EXTENSION  extension of files to process, gz = gzipped (default: fna)
- --min_perc_aa MIN_PERC_AA  exclude genomes that do not have at least this percentage of AA in the MSA (inclusive bound) (default: 10)
- --prefix PREFIX  prefix for all output files (default: gtdbtk)
- --cpus CPUS  number of CPUs to use (default: 1)
- --pplacer_cpus PPLACER_CPUS  use pplacer_cpus during placement (default: cpus)
- --force  continue processing if an error occurs on a single genome (default: False)
- --scratch_dir SCRATCH_DIR  reduce pplacer memory usage by writing to disk (slower)
- -r, --recalculate_red  recalculate RED values based on the reference tree and all added user genomes (default: False)
- --min_af MIN_AF  minimum alignment fraction to consider closest genome (default: 0.65)
- --debug  create intermediate files for debugging purposes (default: False)
- -h, --help  show help message
Step 1: identify: search the genomes for marker genes
Choosing a translation table: table 11 is used unless the coding density using table 4 is 5% higher than when using table 11 and the coding density under table 4 is >70%. GTDB-Tk does not differentiate between tables 4 and 25. If you already know which table applies, you can specify it via --batchfile (format sketched after the command below).
```bash
$ time gtdbtk identify --genome_dir metawrap_70_5_bins/ --out_dir identify_output --cpus 6 --prefix F --debug -x .fa --write_single_copy_genes
```
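If you already know the right translation table, the --batchfile is just a tab-separated text file matching the format shown in the classify_wf usage above (paths and IDs here are placeholders):
```
# FASTA file <tab> genome ID <tab> translation table (optional)
metawrap_70_5_bins/bin.1.fa	bin.1	11
metawrap_70_5_bins/bin.2.fa	bin.2	4
```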
Step 2: align
Create a multiple sequence alignment based on the AR122/BAC120 marker set.
Time 3m50.019s
```bash
$ time gtdbtk align --identify_dir identify_output/ --out_dir align_output --cpus 3 --prefix F --debug
```
Step 3: classify
Determine taxonomic classification of genomes.
Time 118m9.386s
```bash
$ time gtdbtk classify --genome_dir metawrap_70_5_bins/ --align_dir align_output/ --out_dir classify_output --cpus 3 --prefix F --debug -x .fa -r
```
Note: if memory is limited, add the --scratch_dir option.
Step 4: export_msa (this step was not run)
The export_msa will export the untrimmed archaeal or bacterial MSA used in the reference data.
```bash
# Archaea: 0m1.503s
```
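The command itself is omitted in the source; per the GTDB-Tk help, export_msa takes a domain and an output path, e.g. (output name is a placeholder):
```bash
# Export the untrimmed archaeal MSA from the reference data
$ time gtdbtk export_msa --domain arc --output arc_untrimmed_msa.faa
```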
Step 5: trim_msa (this step was not run)
The trim_msa command will trim a MSA given a user-specified mask file, or the archaeal/bacterial mask present in the reference data.
```bash
# Archaea: 0m10.675s
```
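Again the command is omitted; a sketch assuming the built-in archaeal reference mask and the MSA exported in Step 4 (check gtdbtk trim_msa --help for the exact flags in your version):
```bash
# Trim the exported MSA with the archaeal mask from the reference data
$ time gtdbtk trim_msa --untrimmed_msa arc_untrimmed_msa.faa --reference_mask arc --output arc_trimmed_msa.faa
```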
Step 6: infer
Infer tree from multiple sequence alignment.
```bash
# Archaea: 44m54.884s
```
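The infer command is omitted in the source; a sketch using the user MSA produced by the align step (the MSA file name is a placeholder; see the parameter list below):
```bash
# Infer a tree from the archaeal user MSA
$ time gtdbtk infer --msa_file align_output/F.ar122.user_msa.fasta --out_dir infer_output --cpus 6 --prefix F --debug
```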
Parameters:
usage: gtdbtk infer --msa_file MSA_FILE --out_dir OUT_DIR [--prot_model {JTT,WAG,LG}] [--no_support] [--gamma] [--prefix PREFIX] [--cpus CPUS] [--debug] [-h]
required named arguments:
- --msa_file MSA_FILE  multiple sequence alignment in FASTA format
- --out_dir OUT_DIR  directory to output files
optional arguments:
- --prot_model {JTT,WAG,LG}  protein substitution model for tree inference (default: WAG)
- --no_support  do not compute local support values using the Shimodaira-Hasegawa test (default: False)
- --gamma  rescale branch lengths to optimize the Gamma20 likelihood (default: False)
- --prefix PREFIX  prefix for all output files (default: gtdbtk)
- --cpus CPUS  number of CPUs to use (default: 1)
- --debug  create intermediate files for debugging purposes (default: False)
- -h, --help  show help message
Step 7: decorate
Decorate a tree with the GTDB-Tk taxonomy.
```bash
# Archaea
```
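The decorate command is omitted; a sketch based on the parameter list below (the tree and classification file names are placeholders):
```bash
# Decorate the inferred tree with the GTDB-Tk taxonomy
$ time gtdbtk decorate --input_tree infer_output/F.unrooted.tree --output_tree F.decorated.tree --gtdbtk_classification_file classify_output/F.ar122.summary.tsv --debug
```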
Parameters:
usage: gtdbtk decorate --input_tree INPUT_TREE --output_tree OUTPUT_TREE [--gtdbtk_classification_file GTDBTK_CLASSIFICATION_FILE] [--custom_taxonomy_file CUSTOM_TAXONOMY_FILE] [--debug] [-h]
required named arguments:
- --input_tree INPUT_TREE  path to the unrooted tree in Newick format
- --output_tree OUTPUT_TREE  path to output the tree
optional arguments:
- --gtdbtk_classification_file GTDBTK_CLASSIFICATION_FILE  file with GTDB-Tk classifications produced by the classify command
- --custom_taxonomy_file CUSTOM_TAXONOMY_FILE  file indicating custom taxonomy strings for user genomes, that should contain any genomes belonging to the outgroup
- --debug  create intermediate files for debugging purposes (default: False)
- -h, --help  show help message
Step 8: root
Root a tree using an outgroup.
```bash
$ time gtdbtk root --input_tree input.tree --outgroup_taxon p__Nanoarchaeota --output_tree output.tree --gtdbtk_classification_file <file with GTDB-Tk classifications produced by the classify command> --debug
```
Parameters:
usage: gtdbtk root --input_tree INPUT_TREE --outgroup_taxon OUTGROUP_TAXON --output_tree OUTPUT_TREE [--gtdbtk_classification_file GTDBTK_CLASSIFICATION_FILE] [--custom_taxonomy_file CUSTOM_TAXONOMY_FILE] [--debug] [-h]
required named arguments:
- --input_tree INPUT_TREE  path to the unrooted tree in Newick format
- --outgroup_taxon OUTGROUP_TAXON  taxon to use as outgroup (e.g., p__Patescibacteria or p__Altarchaeota)
- --output_tree OUTPUT_TREE  path to output the tree
optional arguments:
- --gtdbtk_classification_file GTDBTK_CLASSIFICATION_FILE  file with GTDB-Tk classifications produced by the classify command
- --custom_taxonomy_file CUSTOM_TAXONOMY_FILE  file indicating custom taxonomy strings for user genomes, that should contain any genomes belonging to the outgroup
- --debug  create intermediate files for debugging purposes (default: False)
- -h, --help  show help message
Step 9: ani_rep: compute ANI values
Compute the ANI of input genomes to all GTDB-Tk representative genomes.
```bash
$ time gtdbtk ani_rep --genome_dir metawrap_70_5_bins/ --out_dir ani_rep/ --cpus 6 -x .fa --prefix F --debug
```
The de_novo_wf workflow can also be run on the same bins to infer a de novo tree:
```bash
$ gtdbtk de_novo_wf --genome_dir metawrap_70_5_bins/ --out_dir de_novo_wf --extension .fa --bacteria --outgroup_taxon p__Patescibacteria --prefix F --cpus 6 --debug
```
Parameters:
usage: gtdbtk ani_rep (--genome_dir GENOME_DIR | --batchfile BATCHFILE) --out_dir OUT_DIR [--no_mash] [--mash_k MASH_K] [--mash_s MASH_S] [--mash_d MASH_D] [--mash_v MASH_V] [--mash_db MASH_DB] [--min_af MIN_AF] [-x EXTENSION] [--prefix PREFIX] [--cpus CPUS] [--debug] [-h]
mutually exclusive required arguments:
- --genome_dir GENOME_DIR  directory containing genome files in FASTA format
- --batchfile BATCHFILE  path to file describing genomes - tab separated in 2 or 3 columns (FASTA file, genome ID, translation table [optional])
required named arguments:
- --out_dir OUT_DIR  directory to output files
optional Mash arguments:
- --no_mash  skip pre-filtering of genomes using Mash (default: False)
- --mash_k MASH_K  k-mer size [1-32] (default: 16)
- --mash_s MASH_S  maximum number of non-redundant hashes (default: 5000)
- --mash_d MASH_D  maximum distance to keep [0-1] (default: 0.1)
- --mash_v MASH_V  maximum p-value to keep [0-1] (default: 1.0)
- --mash_db MASH_DB  path to save/read (if exists) the Mash reference sketch database (.msh)
optional FastANI arguments:
- --min_af MIN_AF  minimum alignment fraction to consider closest genome (default: 0.65)
optional arguments:
- -x, --extension EXTENSION  extension of files to process, gz = gzipped (default: fna)
- --prefix PREFIX  prefix for all output files (default: gtdbtk)
- --cpus CPUS  number of CPUs to use (default: 1)
- --debug  create intermediate files for debugging purposes (default: False)
- -h, --help  show help message
B.4.2 Annotation with EnrichM
Step 1. annotate
The genome annotation pipeline; genomes can be searched against the KO, PFAM, TIGRFAM, and CAZy (dbCAN) databases. It produces a .gff file for each genome, plus a frequency matrix for each annotation type in which rows are annotation IDs and columns are genomes.
```bash
$ enrichm annotate --genome_directory metawrap_70_5_bins --output EnrichM_annotate --force --ko --ko_hmm --pfam --tigrfam --clusters --orthologs --cazy --ec --threads 30 --parallel 8 --suffix .fa --count_domains --chunk_number 8
```
Parameters:
Output options:
- --output OUTPUT  Output directory
- --force  Overwrite previous run
Input options (use one):
- --genome_files GENOME_FILES [GENOME_FILES ...]  Space separated list of genomes to annotate
- --genome_directory GENOME_DIRECTORY  Directory containing genomes to annotate
- --protein_files PROTEIN_FILES [PROTEIN_FILES ...]  Space separated list of .faa files of genomes to annotate. Protein files need to be generated by prodigal.
- --protein_directory PROTEIN_DIRECTORY  Directory containing .faa files of genomes to annotate. Protein files need to be generated by prodigal.
Annotation options:
- --ko  Annotate with KO ids (DIAMOND)
- --ko_hmm  Annotate with KO ids (HMMs)
- --pfam  Annotate with Pfam HMMs
- --tigrfam  Annotate with TIGRFAM HMMs
- --clusters  Annotate with cluster ids
- --orthologs  Annotate with ortholog ids
- --cazy  Annotate with dbCAN HMMs
- --ec  Annotate with EC ids
Cutoff options:
- --cut_ga  For PFAM and TIGRfam searches: use profile GA (gathering) cutoffs to set all thresholding
- --cut_nc  For PFAM and TIGRfam searches: use profile NC (noise) cutoffs to set all thresholding
- --cut_tc  For PFAM and TIGRfam searches: use profile TC (trusted) cutoffs to set all thresholding
- --cut_ko  For KO HMM annotation searches: use cutoffs defined by KEGG to maximise the F-score
- --evalue EVALUE  Use this e-value cutoff to filter false positives (default: 1e-05)
- --bit BIT  Use this bit score cutoff to filter false positives (default: 0)
- --id ID  Use this percent identity cutoff to filter false positives (default: 0.3)
- --aln_query ALN_QUERY  This fraction of the query must align to the reference (default: 0.7)
- --aln_reference ALN_REFERENCE  This fraction of the reference must align to the query (default: 0.7)
- --c C  When clustering, use matches above this fraction of aligned (covered) query and target residues (default: 0.7)
Runtime options:
- --threads THREADS  Use this number of threads when annotating with BLAST and HMMsearch (default: 1)
- --parallel PARALLEL  Run this number of jobs in parallel when annotating with HMMsearch (default: 5)
- --inflation INFLATION  Inflation factor to use when calling clusters in ortholog (default: 5.0)
- --suffix SUFFIX  Treat files ending with this suffix within the --genome_directory as genomes (default: .fna for --genome_directory and .faa for --protein_directory)
- --light  Don't store metadata for genome files (can't use enrichm compare downstream; default: False)
- --count_domains  Fill the frequency matrix with the total number of times an annotation was detected (for example, when one domain occurs more than once within a protein), rather than the count of proteins with that annotation
- --chunk_number CHUNK_NUMBER  Split loading of genomes into this many chunks (default: 4)
- --chunk_max CHUNK_MAX  Maximum number of genomes to load per chunk (default: 2500)
Step 2. classify
Determine what pathways a genome encodes. Classify quickly reads in KO annotations in the form of a matrix (KO IDs as rows, genomes as columns) and determines which KEGG modules are complete. Annotation matrices can be generated using the annotate function.
```bash
$ enrichm classify --output EnrichM_classify/ko_hmm_frequency_table --aggregate --genome_and_annotation_matrix EnrichM_annotate/ko_hmm_frequency_table.tsv --cutoff 0
```
Parameters:
Output options:
- --output OUTPUT  Output directory
- --force  Overwrite previous run
Input options:
- --genome_and_annotation_matrix GENOME_AND_ANNOTATION_MATRIX  Path to file containing a genome annotation matrix
- --custom_modules CUSTOM_MODULES  Tab separated file containing module name and definition as the columns
- --module_rules_json MODULE_RULES_JSON  json file specifying rules to interpret the annotation and guide module annotation
- --gff_files GFF_FILES  GFF files for the genomes being classified
Cutoff options:
- --cutoff CUTOFF  Output only modules with greater than this percent of the required KO groups (default: print all modules)
Runtime options:
- --aggregate  Calculate the abundance of each pathway within each genome/sample (column)
Step 3. enrichment
Enrichment will read in KO or PFAM annotations in the form of a matrix (IDs as rows, genomes as columns) and a metadata file that separates genomes into groups to compare, and will run some basic stats to determine the enrichment of modules or pfam clans between and within the groups.
```bash
$ enrichm enrichment --output EnrichM_enrichment/ --metadata genome.list --annotation_matrix EnrichM_annotate/ko_frequency_table.tsv --force
```
Parameters:
Output options:
- --output OUTPUT  Output directory
- --force  Overwrite previous run
Input options:
- --annotate_output ANNOTATE_OUTPUT  Output directory provided by enrichm annotate
- --metadata METADATA  Metadata file with two columns: the first with the genome name, the second with the grouping to compare
- --annotation_matrix ANNOTATION_MATRIX  Annotation matrix to compare
- --gff_files GFF_FILES [GFF_FILES ...]  Gff files for genomes to compare
- --abundance ABUNDANCE  Genome abundance matrix
- --abundance_metadata ABUNDANCE_METADATA  Metadata grouping abundance samples
- --transcriptome TRANSCRIPTOME  Transcriptome abundance matrix
- --transcriptome_metadata TRANSCRIPTOME_METADATA  Metadata grouping transcriptome samples
Genome Taxonomy DataBase (GTDB) options:
- --batchfile BATCHFILE  metadata file to compare with
Runtime options:
- --pval_cutoff PVAL_CUTOFF  Only output results with a p-value below this cutoff (default: 0.05)
- --proportions_cutoff PROPORTIONS_CUTOFF  Proportion enrichment cutoff
- --threshold THRESHOLD  The threshold to control for in false discovery rate or familywise error rate
- --multi_test_correction MULTI_TEST_CORRECTION  The form of multiple test correction to use. Uses the statsmodels module and consequently has all of its options. Default: Benjamini-Hochberg FDR (fdr_bh). Options: Bonferroni (b), Sidak (s), Holm (h), Holm-Sidak (hs), Simes-Hochberg (sh), Hommel (ho), FDR Benjamini-Yekutieli (fdr_by), FDR 2-stage Benjamini-Hochberg (fdr_tsbh), FDR 2-stage Benjamini-Krieger-Yekutieli (fdr_tsbky), FDR adaptive Gavrilov-Benjamini-Sarkar (fdr_gbs)
- --processes PROCESSES  Number of processes to use for the enrichment analysis
- --allow_negative_values  Allow negative values in the input matrix
- --ko  Compare KO ids (annotated with DIAMOND)
- --ko_hmm  Compare KO ids (annotated with HMMs)
- --pfam  Compare Pfam ids
- --tigrfam  Compare TIGRFAM ids
- --cluster  Compare cluster ids
- --ortholog  Compare ortholog ids
- --cazy  Compare dbCAN ids
- --ec  Compare EC ids
- --range RANGE  Base pair range to search for synteny within (default: 2500)
- --subblock_size SUBBLOCK_SIZE  Number of genes clustered in a row to be reported (default: 2)
- --operon_mismatch_cutoff OPERON_MISMATCH_CUTOFF  Number of allowed mismatches when searching for operons across genomes
- --operon_match_score_cutoff OPERON_MATCH_SCORE_CUTOFF  Score cutoff for operon matches
Step 4. pathway
Pathway reads in a KO matrix and generates a Cytoscape-readable metabolic network and metadata file. Only reactions that are possible given the KOs present in the input matrix are shown, and the modules and reactions included in the output can be customized (this step errored for me).
```bash
$ enrichm pathway --matrix EnrichM_annotate/ko_frequency_table.tsv --genome_metadata genome.list --output EnrichM_pathway --abundance EnrichM_enrichment/F01_vs_F02_ivg_results.cdf.tsv --abundance_metadata genome.list --force
```
Parameters:
Input options:
- --matrix MATRIX  KO matrix. Required.
- --genome_metadata GENOME_METADATA  Metadata file with two columns: the first with the genome name, the second with the grouping to compare
- --abundance ABUNDANCE  Abundance matrix
- --abundance_metadata ABUNDANCE_METADATA  Metadata file with two columns: the first with the genome name, the second with the grouping to compare
- --tpm_values TPM_VALUES  TPM values produced by DetectM
- --tpm_metadata TPM_METADATA  Metadata file with two columns: the first with the genome name, the second with the grouping to compare
- --metabolome METABOLOME  Metabolome CID matrix
Logging options:
- --log LOG  Output logging information to this file
- --verbosity VERBOSITY  Level of verbosity (1-5; default: 4; 5 = very verbose, 1 = silent)
Output options:
- --output OUTPUT  Output directory
- --force  Overwrite previous run
Pathway options:
- --limit LIMIT [LIMIT ...]  Use ONLY these reactions, or reactions within this pathway or module (space separated list)
- --filter FILTER [FILTER ...]  Do not use these reactions, or reactions within this pathway or module (space separated list)
- --enrichment_output ENRICHMENT_OUTPUT  Supply an enrichment output to integrate the results into the output network
Step 5. explore
Explore is similar to pathway, but rather than generating a specified pathway it will start from a given query compound ID, and explore the possible reactions that use that compound given the enzymes present in the input KO matrix.
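The command was left blank in the source; a plausible invocation, sketched from the options listed below (file names and the query compound are placeholders):
```bash
# queries.txt would contain KEGG compound ids, e.g. C00022 (pyruvate), one per line
$ enrichm explore --matrix EnrichM_annotate/ko_frequency_table.tsv --genome_metadata genome.list --queries queries.txt --depth 2 --output EnrichM_explore --force
```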
Parameters:
Input options:
- --matrix MATRIX  KO matrix. Required.
- --genome_metadata GENOME_METADATA  Metadata file with two columns: the first with the genome name, the second with the grouping to compare
- --abundance ABUNDANCE  Abundance matrix
- --abundance_metadata ABUNDANCE_METADATA  Metadata file with two columns: the first with the genome name, the second with the grouping to compare
- --tpm_values TPM_VALUES  TPM values produced by DetectM
- --tpm_metadata TPM_METADATA  Metadata file with two columns: the first with the genome name, the second with the grouping to compare
- --metabolome METABOLOME  Metabolome CID matrix
Logging options:
- --log LOG  Output logging information to this file
- --verbosity VERBOSITY  Level of verbosity (1-5; default: 4; 5 = very verbose, 1 = silent)
Output options:
- --output OUTPUT  Output directory
- --force  Overwrite previous run
Query options:
- --queries QUERIES  A file containing the KEGG ids of the compounds from which to start in the metabolic network
- --depth DEPTH  Number of steps to take into the metabolic network
C. Metagenome functional profiling
- Predict genes from the metagenome-assembled contigs with prodigal in -p meta mode (see the sketch after this list)
- Metagenome-assembled genes which were not included in the MAGs were subjected to taxonomic classification using Kaiju
- eggNOG-mapper against the eggNOG database
- HMMER against Pfam
- GhostKOALA and KofamKOALA (v1.0.0) against KEGG
- BLASTP against TCDB
- dbCAN2 (v2.0.1) against CAZy
- BLASTP against MEROPS
- FeGenie to detect iron-related genes
- Fe-containing domains were characterized using Superfamily (v1.75)
- Mining of arsenic respiration and resistance genes: https://github.com/ShadeLab/PAPER_Dunivin_meta_arsenic/tree/master/HMM_search; models: https://github.com/ShadeLab/PAPER_Dunivin_meta_arsenic/tree/master/gene_targeted_assembly/gene_resource
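A minimal sketch of the prodigal step referenced above; all file names are placeholders:
```bash
# Metagenome-mode gene prediction on the assembled contigs
$ prodigal -i final_assembly.fasta -p meta -f gff -o genes.gff -a proteins.faa -d genes.fna
```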