Metagenome Analysis Pipeline and Code
Published: 2021-01-19 | Category: Bioinformatics
Word count: 6.4k | Reading time: 36 min

This post covers metagenomic taxonomic profiling, assembly, binning, gene prediction, and annotation...

A. Software list and installation

A.1 Classification tools

A.1.1 MetaPhlAn 3.0

MetaPhlAn is a tool for profiling the composition of microbial communities from metagenomic shotgun sequencing data.

a. Install the main program

$ conda create -n metaphlan python=3.7 metaphlan
$ conda activate metaphlan

b. Install the database

$ metaphlan --install

c. Install dependencies

  • hclust2
    $ conda install -c bioconda hclust2
  • R, vegan, ape
    $ conda install r-base r-vegan r-ape
  • rbiom
    $ R
    > install.packages("rbiom")
    > quit()

A.1.2 GTDB-Tk

a. Hardware requirements

Domain     Memory     Storage   Time
Archaea    ~8 GB      ~27 GB    ~1 hour / 1,000 genomes @ 64 CPUs
Bacteria   ~150 GB    ~27 GB    ~1 hour / 1,000 genomes @ 64 CPUs

b. Install GTDB-Tk with conda

$ conda create -n gtdbtk -c conda-forge -c bioconda gtdbtk

c. GTDB-Tk reference data

  • Note that different versions of the GTDB release data may not run on all versions of GTDB-Tk; the supported combinations are:

    GTDB Release   Minimum version   Maximum version
    R95            1.3.0             N/A
    R89            0.3.0             1.2.0
    R86.2          0.2.1             0.2.2
    R86            0.1.0             0.1.6
    R83            0.0.6             0.0.7
  • Download the reference data
$ wget https://data.ace.uq.edu.au/public/gtdb/data/releases/release95/95.0/auxillary_files/gtdbtk_r95_data.tar.gz

$ tar xvzf gtdbtk_r95_data.tar.gz

Note: GTDB-Tk requires an environment variable named GTDBTK_DATA_PATH to be set to the directory containing the unarchived GTDB-Tk reference data.

  • You can automatically alias GTDBTK_DATA_PATH whenever the environment is activated by editing {gtdbtk environment path}/etc/conda/activate.d/gtdbtk.sh, e.g.:
# Determine the GTDB-Tk environment path
$ conda activate gtdbtk
$ which gtdbtk
# /miniconda3/envs/gtdbtk-1.3.0/bin/gtdbtk

# Edit the activate file
$ echo "export GTDBTK_DATA_PATH=/path/to/release/package/" > /miniconda3/envs/gtdbtk-1.3.0/etc/conda/activate.d/gtdbtk.sh

A.1.3 Kraken2

a. Hardware requirements

  • Disk space: Construction of a Kraken 2 standard database requires approximately 100 GB of disk space. A test on 01 Jan 2018 of the default installation showed 42 GB of disk space was used to store the genomic library files, 26 GB was used to store the taxonomy information from NCBI, and 29 GB was used to store the Kraken 2 compact hash table.
  • Memory: To run efficiently, Kraken 2 requires enough free memory to hold the database (primarily the hash table) in RAM. While this can be accomplished with a ramdisk, Kraken 2 will by default load the database into process-local RAM; the --memory-mapping switch to kraken2 will avoid doing so. The default database size is 29 GB (as of Jan. 2018), and you will need slightly more than that in RAM if you want to build the default database.

b. Install Kraken2 with conda

$ conda create -n kraken2 kraken2

c. Build the database

  • Download the database. Move into a directory with plenty of free space and run the following command. The databases downloaded here include archaea, bacteria, plasmid, viral, fungi, protozoa, UniVec, and UniVec_Core:
    nohup bash -c 'for i in archaea bacteria plasmid viral fungi protozoa UniVec UniVec_Core; do kraken2-build --download-library $i --threads 24 --db db_name; done' &
  • Download the taxonomy
    nohup kraken2-build --download-taxonomy --threads 24 --db db_name &
  • Build the index
    nohup kraken2-build --build --threads 24 --db db_name &

d. Classify sequences

kraken2 --paired --threads 24 --unclassified-out unclassified#.fq --classified-out classified#.fq --output outfile --confidence 0.5 --memory-mapping --use-names --report reportname --report-zero-counts --db $DBNAME reads1 reads2

A.2 Assembly, binning, and quality assessment

A.2.1 MetaWRAP

MetaWRAP is a pipeline for genome-resolved metagenomic data analysis.

a. Install the main program

$ conda create -n metawrap python=2.7 -c ursky metawrap-mg

b. Install other analysis tools into the metawrap environment

  • cd-hit
  • coverm DNA read coverage and relative abundance calculator focused on metagenomics applications
  • bamm Metagenomics-focused BAM file manipulation
  • unitem Ensemble binning strategies for combining the output of multiple binning methods
  • humann2 The HMP Unified Metabolic Analysis Network 2
  • graphlan
  • export2graphlan
  • checkm database
    $ conda activate metawrap
    $ conda install cd-hit coverm bamm unitem humann2 graphlan export2graphlan

    # cd into a suitable directory in which to store the checkm database
    $ wget https://data.ace.uq.edu.au/public/CheckM_databases/checkm_data_2015_01_16.tar.gz
    $ tar zxvf checkm_data_2015_01_16.tar.gz
    $ checkm data setRoot
    # then enter the full path to the database when prompted
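If you prefer to skip the interactive prompt, newer CheckM releases also accept the path as an argument (a sketch; the directory is wherever you extracted the archive):

$ checkm data setRoot /path/to/checkm_data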

A.3 Functional annotation of non-redundant genes

A.3.1 eggNOG-mapper

Functional annotation against UniRef, eggNOG, KEGG, GO, CAZy, VFDB, CARD, TCDB, and PHI.
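The post gives no commands for this tool; as a minimal sketch, an eggNOG-mapper run on a predicted protein set might look like this (genes.faa and the thread count are placeholders):

# download the eggNOG reference data (run once)
$ download_eggnog_data.py
# annotate predicted proteins
$ emapper.py -i genes.faa -o eggnog_out --cpu 16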

A.3.2 EnrichM

EnrichM is a set of comparative genomics tools for large sets of metagenome-assembled genomes (MAGs). The current functionality includes:

  • A basic annotation pipeline for MAGs.
  • A pipeline to determine the metabolic pathways that are encoded by MAGs, using KEGG modules as a reference (although custom pathways can be specified).
  • A pipeline to identify genes or metabolic pathways that are enriched within and between user-defined groups of genomes (groups can be genomes that are related functionally, phylogenetically, recovered from different environments, etc.).
  • A pipeline to construct metabolic networks from annotated population genomes.
  • A pipeline to construct random forest machine learning models from the functional composition of either MAGs, metagenomes, or transcriptomes.
  • A pipeline to apply these random forest models to classify new MAGs or metagenomes.

a. Install the main program

$ conda create -n enrichm python=3
$ conda activate enrichm
$ conda install enrichm

b. Install the database

# ~5.7 GB
$ enrichm data

Error:

Traceback (most recent call last):
File "/home/hualin/miniconda3/envs/enrichm/lib/python3.6/site-packages/enrichm/data.py", line 114, in do
version_remote = urllib.request.urlopen(self.ftp + self.VERSION).readline().strip().decode("utf-8")
AttributeError: module 'urllib' has no attribute 'request'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/hualin/miniconda3/envs/enrichm/bin/enrichm", line 342, in <module>
run.run_enrichm(args, sys.argv)
File "/home/hualin/miniconda3/envs/enrichm/lib/python3.6/site-packages/enrichm/run.py", line 288, in run_enrichm
d.do(args.uninstall, args.dry)
File "/home/hualin/miniconda3/envs/enrichm/lib/python3.6/site-packages/enrichm/data.py", line 117, in do
"Unable to find most current EnrichM database VERSION in ftp. Please complain at https://github.com/geronimp/enrichM")
Exception: Unable to find most current EnrichM database VERSION in ftp. Please complain at https://github.com/geronimp/enrichM

Fix: replace 'import urllib' with 'import urllib.request' in data.py:

$ vim /home/hualin/miniconda3/envs/enrichm/lib/python3.6/site-packages/enrichm/data.py

c. Specifying the location of the EnrichM database

If you would like to store the EnrichM database outside of your home directory, you need to tell EnrichM where to look. To do this, export a BASH variable named ENRICHM_DB:

$ echo 'export ENRICHM_DB=/path/to/database/' >> ~/.bashrc
$ source ~/.bashrc

Note: set /path/to/database/ according to your actual location!
Error:

$ enrichm
Traceback (most recent call last):
File "/home/hualin/miniconda3/envs/enrichm/bin/enrichm", line 38, in <module>
from enrichm.run import Run
File "/home/hualin/miniconda3/envs/enrichm/lib/python3.6/site-packages/enrichm/run.py", line 24, in <module>
from enrichm.network_analyzer import NetworkAnalyser
File "/home/hualin/miniconda3/envs/enrichm/lib/python3.6/site-packages/enrichm/network_analyzer.py", line 22, in <module>
from enrichm.network_builder import NetworkBuilder
File "/home/hualin/miniconda3/envs/enrichm/lib/python3.6/site-packages/enrichm/network_builder.py", line 24, in <module>
from enrichm.databases import Databases
File "/home/hualin/miniconda3/envs/enrichm/lib/python3.6/site-packages/enrichm/databases.py", line 28, in <module>
class Databases:
File "/home/hualin/miniconda3/envs/enrichm/lib/python3.6/site-packages/enrichm/databases.py", line 36, in Databases
PICKLE_VERSION = open(os.path.join(CUR_DATABASE_DIR, 'VERSION')).readline().strip()
FileNotFoundError: [Errno 2] No such file or directory: '/new_data/hualin/db/enrichm_database_v10/26-11-2018/VERSION'

Solution: copy all of the downloaded database files into the "26-11-2018" directory as well; otherwise later annotate runs will complain that the database files cannot be found.
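For example, following the paths from the error message above (adjust to your installation):

$ cd /new_data/hualin/db/enrichm_database_v10
$ mkdir -p 26-11-2018
$ for f in *; do [ "$f" != "26-11-2018" ] && cp -r "$f" 26-11-2018/; done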

B. Data analysis

B.1 Obtain taxonomic profiles from reads with metaphlan

Step 1. Activate the environment

$ conda activate metaphlan

Step 2. Profile the paired-end reads

$ metaphlan Reads1,Reads2 --input_type fastq --bowtie2out Str1.bowtie2.bz2 --nproc 10 -o Str1_profiled.txt

Reads1 and Reads2 are the forward and reverse reads from paired-end sequencing; --input_type specifies the input format (raw sequencing data is usually fastq); --bowtie2out saves the intermediate alignment file so it can be reused as input when the program is re-run; --nproc sets the number of threads; -o names the output file.
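For example, for a hypothetical sample F01 with paired files F01_1.fastq and F01_2.fastq (note that the two mates are joined with a comma, not a space):

$ metaphlan F01_1.fastq,F01_2.fastq --input_type fastq --bowtie2out F01.bowtie2.bz2 --nproc 10 -o F01_profiled.txt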

Step 3. Merge all result files

After running Step 2 on every sample you will have multiple output files (*_profiled.txt); merge them into a single file (merged_abundance_table.txt) to make cross-sample comparisons easier.

$ merge_metaphlan_tables.py *_profiled.txt > merged_abundance_table.txt

Step 4. Extract species-level taxa from the merged table

$ grep -E "s__|clade" merged_abundance_table.txt | sed 's/^.*s__//g' | cut -f1,3-8 | sed -e 's/clade_name/body_site/g' > merged_abundance_table_species.txt

Step 5. Plot a species-level heatmap across samples

$ hclust2.py -i merged_abundance_table_species.txt -o abundance_heatmap_species.png --ftop 50 --f_dist_f braycurtis --s_dist_f braycurtis --cell_aspect_ratio 0.5 -l --flabel_size 10 --slabel_size 10 --max_flabel_len 100 --max_slabel_len 100 --minv 0.1 --dpi 300

Step 6. Compute UniFrac distances between samples

# Download the required tree file and script into the same directory as the files to be analyzed
$ wget https://raw.githubusercontent.com/biobakery/MetaPhlAn/master/metaphlan/utils/mpa_v30_CHOCOPhlAn_201901_species_tree.nwk
$ wget https://raw.githubusercontent.com/biobakery/MetaPhlAn/master/metaphlan/utils/calculate_unifrac.R

# Compute the distances
$ Rscript calculate_unifrac.R merged_abundance_table.txt mpa_v30_CHOCOPhlAn_201901_species_tree.nwk unifrac_merged_abundance_table.txt

Step 7. Plot a cladogram

# Activate the environment that contains the plotting dependencies
$ conda activate metawrap

# Generate the input files for plotting
$ tail -n +2 merged_abundance_table.txt | cut -f1,3- > merged_abundance_table_reformatted.txt

$ export2graphlan.py --skip_rows 1 -i merged_abundance_table_reformatted.txt --tree merged_abundance.tree.txt --annotation merged_abundance.annot.txt --most_abundant 100 --abundance_threshold 1 --least_biomarkers 10 --annotations 5,6 --external_annotations 7 --min_clade_size 1

# Plot
$ graphlan_annotate.py --annot merged_abundance.annot.txt merged_abundance.tree.txt merged_abundance.xml

$ graphlan.py --dpi 300 merged_abundance.xml merged_abundance.pdf --external_legends

B.2 Assemble reads with metawrap

Step 1. Activate the environment

$ conda activate metawrap

Step 2. Assembly

$ metawrap assembly -1 Reads1 -2 Reads2 -o Assemble1 -m 300 -t 15 --metaspades
Usage: metaWRAP assembly [options] -1 reads_1.fastq -2 reads_2.fastq -o output_dir
  • Options:
    • -1 STR forward fastq reads

    • -2 STR reverse fastq reads

    • -o STR output directory

    • -m INT memory in GB (default=24)

    • -t INT number of threads (default=1)

    • -l INT minimum length of assembled contigs (default=1000)

    • --megahit assemble with megahit (default)

    • --metaspades assemble with metaspades instead of megahit (better results but slower and higher memory requirement)

Reads1 and Reads2 are the forward and reverse reads from paired-end sequencing; -o sets the output directory; -m sets the maximum available memory in GB (the program exits automatically once this is exceeded, so set it generously: my 10 GB dataset needed about 180 GB of RAM); -t sets the number of threads; --metaspades assembles with metaspades, which is very slow but produces somewhat better assemblies.
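To assemble several samples in one pass, a simple loop works (the sample names and reads/ layout are assumptions):

$ for s in F01 F02; do metawrap assembly -1 reads/${s}_1.fastq -2 reads/${s}_2.fastq -o Assemble/${s} -m 300 -t 15 --metaspades; done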

B.3 Bin contigs with metawrap

Step 1. Activate the environment

$ conda activate metawrap

Step 2. Binning

$ metawrap binning -o Str1.INITIAL_BINNING -t 20 -m 200 --universal --run-checkm -a <path of contigs> --metabat2 --maxbin2 --concoct <decompressed Reads1> <decompressed Reads2>
Usage: metaWRAP binning [options] -a assembly.fa -o output_dir readsA_1.fastq readsA_2.fastq ... [readsX_1.fastq readsX_2.fastq]
  • Note1: Make sure to provide all your separately replicate read files, not the joined file.

  • Note2: You may provide single end or interleaved reads as well with the use of the correct option

  • Note3: If the output already has the .bam alignments files from previous runs, the module will skip re-aligning the reads

  • Options:

    • -a STR metagenomic assembly file
    • -o STR output directory
    • -t INT number of threads (default=1)
    • -m INT amount of RAM available (default=4)
    • -l INT minimum contig length to bin (default=1000bp). Note: metaBAT will default to 1500bp minimum
    • --metabat2 bin contigs with metaBAT2
    • --metabat1 bin contigs with the original metaBAT
    • --maxbin2 bin contigs with MaxBin2
    • --concoct bin contigs with CONCOCT
    • --universal use universal marker genes instead of bacterial markers in MaxBin2 (improves Archaea binning)
    • --run-checkm immediately run CheckM on the bin results (requires 40GB+ of memory)
    • --single-end non-paired reads mode (provide *.fastq files)
    • --interleaved the input read files contain interleaved paired-end reads
**Pitfall:** the Reads1 and Reads2 given here must be decompressed Reads files, and decompression alone is not enough: they must also be renamed so that they end in "\_clean\_1.fastq" and "\_clean\_2.fastq", otherwise the software will not run (see the sketch below). -o sets the output directory; -t the number of threads; -m the memory limit; --run-checkm checks binning quality immediately; --metabat2 --maxbin2 --concoct runs all three binning tools at once; --universal makes MaxBin2 use universal marker genes instead of bacterial markers (improves Archaea binning).
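A minimal sketch of the decompress-and-rename step (the gzipped input names are hypothetical):

$ gunzip -c F01_R1.fq.gz > reads/F01_clean_1.fastq
$ gunzip -c F01_R2.fq.gz > reads/F01_clean_2.fastq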

Step 3. Consolidate the bins from the three binners (metabat2, maxbin2, concoct)

$ metawrap bin_refinement -o F01_BIN_REFINEMENT -t 20 -A str1.INITIAL_BINNING/metabat2_bins/ -B str1.INITIAL_BINNING/maxbin2_bins/ -C str1.INITIAL_BINNING/concoct_bins/ -c 70 -x 5
Usage: metaWRAP bin_refinement [options] -o output_dir -A bin_folderA [-B bin_folderB -C bin_folderC]
  • Note: the contig names in different bin folders must be consistent (must come from the same assembly).

  • Options:

    • -o STR output directory
    • -t INT number of threads (default=1)
    • -m INT memory available (default=40)
    • -c INT minimum % completeness of bins [should be >50%] (default=70)
    • -x INT maximum % contamination of bins that is acceptable (default=10)
    • -A STR folder with metagenomic bins (files must have .fa or .fasta extension)
    • -B STR another folder with metagenomic bins
    • -C STR another folder with metagenomic bins
    • --skip-refinement dont use binning_refiner to come up with refined bins based on combinations of binner outputs
    • --skip-checkm dont run CheckM to assess bins
    • --skip-consolidation choose the best version of each bin from all bin refinement iteration
    • --keep-ambiguous for contigs that end up in more than one bin, keep them in all bins (default: keeps them only in the best bin)
    • --remove-ambiguous for contigs that end up in more than one bin, remove them in all bins (default: keeps them only in the best bin)
    • --quick adds --reduced_tree option to checkm, reducing runtime, especially with low memory

Step 4. Visualize bins with Blobology

A pitfall: the blast installed with metawrap is version 2.6, which can only use Version 4 of the nt database, while the latest nt database is Version 5 and v4 is no longer maintained. You therefore need to upgrade blast in the metawrap environment to 2.8.0 or later. This cannot be done with 'conda update blast'; instead, download the new blast binaries and overwrite the ones in the metawrap environment, as sketched below. If you run the default blast against the v5 nt database, you will get the error: "BLAST Database error: Error: Not a valid version 4 database".
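One way to do the overwrite, assuming BLAST 2.10.0 and a default miniconda layout (version and paths are assumptions; substitute your own):

$ wget https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/2.10.0/ncbi-blast-2.10.0+-x64-linux.tar.gz
$ tar zxvf ncbi-blast-2.10.0+-x64-linux.tar.gz
# overwrite the old BLAST binaries inside the metawrap environment
$ cp ncbi-blast-2.10.0+/bin/* /miniconda3/envs/metawrap/bin/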

$ metawrap blobology -a Result/Assemble/F01/final_assembly.fasta -t 20 -o F01.BLOBOLOGY --bins F01_BIN_REFINEMENT/metawrap_70_5_bins reads/F01_clean_1.fastq reads/F01_clean_2.fastq
Usage: metaWRAP blobology [options] -a assembly.fasta -o output_dir readsA_1.fastq readsA_2.fastq [readsB_1.fastq readsB_2.fastq ... ]
  • Options:

    • -a STR assembly fasta file
    • -o STR output directory
    • -t INT number of threads
    • --subsamble INT Number of contigs to run blobology on. Subsampling is randomized. (default=ALL)
    • --bins STR Folder containing bins. Contig names must match those of the assembly file. (default=None)

Step 5. Quantify bins

$ metawrap quant_bins -b F01_BIN_REFINEMENT/metawrap_70_5_bins -t 8 -o F01.QUANT_BINS -a Result/Assemble/F01/final_assembly.fasta reads/F01_clean_1.fastq reads/F01_clean_2.fastq
Usage: metaWRAP quant_bins [options] -b bins_folder -o output_dir -a assembly.fa readsA_1.fastq readsA_2.fastq ... [readsX_1.fastq readsX_2.fastq]
  • Options:

    • -b STR folder containing draft genomes (bins) in fasta format
    • -o STR output directory
    • -a STR fasta file with entire metagenomic assembly (strongly recommended!)
    • -t INT number of threads

Step 6. Reassemble the bins

$ metawrap reassemble_bins -o F01.BIN_REASSEMBLY -1 reads/F01_clean_1.fastq -2 reads/F01_clean_2.fastq -t 20 -m 400 -c 70 -x 10 -b F01_BIN_REFINEMENT/metawrap_70_5_bins
Usage: metaWRAP reassemble_bins [options] -o output_dir -b bin_folder -1 reads_1.fastq -2 reads_2.fastq
  • Options:

    • -b STR folder with metagenomic bins
    • -o STR output directory
    • -1 STR forward reads to use for reassembly
    • -2 STR reverse reads to use for reassembly
    • -t INT number of threads (default=1)
    • -m INT memory in GB (default=40)
    • -c INT minimum desired bin completion % (default=70)
    • -x INT maximum desired bin contamination % (default=10)
    • -l INT minimum contig length to be included in reassembly (default=500)
    • --strict-cut-off maximum allowed SNPs for strict read mapping (default=2)
    • --permissive-cut-off maximum allowed SNPs for permissive read mapping (default=5)
    • --skip-checkm dont run CheckM to assess bins
    • --parallel run spades reassembly in parallel, but only using 1 thread per bin

B.4 MAG annotation

B.4.1 Taxonomic classification, annotation, and phylogenetic tree construction with GTDB-Tk

$ conda activate gtdbtk

classify_wf: the classify workflow (covers Steps 1-3)

The classify workflow consists of three steps: identify, align, and classify.

$ time gtdbtk classify_wf --genome_dir metawrap_70_5_bins/ --out_dir classify_wf_output -x .fa --prefix F --cpus 6 -r --debug
usage: gtdbtk classify_wf (--genome_dir GENOME_DIR | --batchfile BATCHFILE) --out_dir OUT_DIR [-x EXTENSION] [--min_perc_aa MIN_PERC_AA] [--prefix PREFIX] [--cpus CPUS] [--pplacer_cpus PPLACER_CPUS] [--force] [--scratch_dir SCRATCH_DIR] [-r] [--min_af MIN_AF] [--debug] [-h]
  • mutually exclusive required arguments:

    • --genome_dir GENOME_DIR
      directory containing genome files in FASTA format
    • --batchfile BATCHFILE
      path to file describing genomes - tab separated in 2 or 3 columns (FASTA file, genome ID, translation table [optional])
  • required named arguments:

    • --out_dir OUT_DIR directory to output files
  • optional arguments:

    • -x, --extension EXTENSION
      extension of files to process, gz = gzipped (default: fna)
    • --min_perc_aa MIN_PERC_AA
      exclude genomes that do not have at least this percentage of AA in the MSA (inclusive bound) (default: 10)
    • --prefix PREFIX prefix for all output files (default: gtdbtk)
    • --cpus CPUS number of CPUs to use (default: 1)
    • --pplacer_cpus PPLACER_CPUS
      use pplacer_cpus during placement (default: cpus)
    • --force continue processing if an error occurs on a single genome (default: False)
    • --scratch_dir SCRATCH_DIR
      Reduce pplacer memory usage by writing to disk (slower).
    • -r, --recalculate_red
      recalculate RED values based on the reference tree and all added user genomes (default: False)
    • --min_af MIN_AF minimum alignment fraction to consider closest genome (default: 0.65)
    • --debug create intermediate files for debugging purposes (default: False)
    • -h, --help show help message

Step 1: identify (find marker genes in the genomes)

Choosing the translation table: use table 11 unless the coding density using table 4 is 5% higher than when using table 11 and the coding density under table 4 is >70%. GTDB-Tk does not distinguish between tables 4 and 25. If you know which table applies, you can specify it via --batchfile.
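A batchfile is plain tab-separated text (two or three columns, as described under classify_wf above); a hypothetical example forcing table 11 for two bins:

$ cat batchfile.tsv
metawrap_70_5_bins/bin.1.fa    bin_1    11
metawrap_70_5_bins/bin.2.fa    bin_2    11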

$ time gtdbtk identify --genome_dir metawrap_70_5_bins/ --out_dir identify_output --cpus 6 --prefix F --debug -x .fa --write_single_copy_genes

Step 2: align

Create a multiple sequence alignment based on the AR122/BAC120 marker set.
Time 3m50.019s

$ time gtdbtk align --identify_dir identify_output/ --out_dir align_output --cpus 3 --prefix F --debug

Step 3: classify

Determine taxonomic classification of genomes.
Time 118m9.386s

$ time gtdbtk classify --genome_dir metawrap_70_5_bins/ --align_dir align_output/ --out_dir classify_output --cpus 3 --prefix F --debug -x .fa -r

Note: if memory is limited, add the --scratch_dir option.
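For example (the scratch path is arbitrary):

$ time gtdbtk classify --genome_dir metawrap_70_5_bins/ --align_dir align_output/ --out_dir classify_output --cpus 3 --prefix F -x .fa --scratch_dir /tmp/gtdbtk_scratch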


Step 4: export_msa (not run here)

The export_msa will export the untrimmed archaeal or bacterial MSA used in the reference data.

# Archaea 0m1.503s
$ time gtdbtk export_msa --domain arc --output msa_arc.faa --debug

# Bacteria 0m16.679s
$ time gtdbtk export_msa --domain bac --output msa_bac.faa --debug

Step 5: trim_msa (not run here)

The trim_msa command will trim a MSA given a user-specified mask file, or the archaeal/bacterial mask present in the reference data.

# Archaea 0m10.675s
$ time gtdbtk trim_msa --untrimmed_msa msa_arc.faa --output msa_arc_trim.faa --reference_mask arc --debug

# Bacteria 3m28.793s
$ time gtdbtk trim_msa --untrimmed_msa msa_bac.faa --output msa_bac_trim.faa --reference_mask bac --debug

Step 6: infer

Infer tree from multiple sequence alignment.

# Archaea 44m54.884s
$ time gtdbtk infer --msa_file align_output/F.ar122.msa.fasta --out_dir infer_out_arc_F --prefix F --cpus 12 --debug

# Bacteria
$ time gtdbtk infer --msa_file align_output/F.bac120.msa.fasta --out_dir infer_out_bac_F --prefix F --cpus 12 --debug
usage: gtdbtk infer --msa_file MSA_FILE --out_dir OUT_DIR [--prot_model {JTT,WAG,LG}] [--no_support] [--gamma] [--prefix PREFIX] [--cpus CPUS] [--debug] [-h]
  • required named arguments:

    • --msa_file MSA_FILE multiple sequence alignment in FASTA format
    • --out_dir OUT_DIR directory to output files
  • optional arguments:

    • --prot_model {JTT,WAG,LG}
      protein substitution model for tree inference (default: WAG)
    • --no_support do not compute local support values using the Shimodaira-Hasegawa test (default: False)
    • --gamma rescale branch lengths to optimize the Gamma20 likelihood (default: False)
    • --prefix PREFIX prefix for all output files (default: gtdbtk)
    • --cpus CPUS number of CPUs to use (default: 1)
    • --debug create intermediate files for debugging purposes (default: False)
    • -h, --help show help message

Step 7: decorate

Decorate a tree with the GTDB-Tk taxonomy.

# Archaea
$ time gtdbtk decorate --input_tree infer_out_arc_F/F.unrooted.tree --output_tree F.decorate_unrooted_arc.tree --gtdbtk_classification_file classify_output/classify/F.ar122.summary.tsv --debug

# Bacteria
$ time gtdbtk decorate --input_tree infer_out_bac_F/F.unrooted.tree --output_tree F.decorate_unrooted_bac.tree --gtdbtk_classification_file classify_output/classify/F.bac120.summary.tsv --debug
usage: gtdbtk decorate --input_tree INPUT_TREE --output_tree OUTPUT_TREE [--gtdbtk_classification_file GTDBTK_CLASSIFICATION_FILE] [--custom_taxonomy_file CUSTOM_TAXONOMY_FILE] [--debug] [-h]
  • required named arguments:

    • --input_tree INPUT_TREE
      path to the unrooted tree in Newick format
    • --output_tree OUTPUT_TREE
      path to output the tree
  • optional arguments:

    • --gtdbtk_classification_file GTDBTK_CLASSIFICATION_FILE
      file with GTDB-Tk classifications produced by the classify command
    • --custom_taxonomy_file CUSTOM_TAXONOMY_FILE
      file indicating custom taxonomy strings for user genomes, that should contain any genomes belonging to the outgroup
    • --debug create intermediate files for debugging purposes (default: False)
    • -h, --help show help message

Step 8: root

Root a tree using an outgroup.

$ time gtdbtk root --input_tree input.tree --outgroup_taxon p__Nanoarchaeota --output_tree output.tree --gtdbtk_classification_file <file with GTDB-Tk classifications produced by the classify command> --debug
usage: gtdbtk root --input_tree INPUT_TREE --outgroup_taxon OUTGROUP_TAXON --output_tree OUTPUT_TREE [--gtdbtk_classification_file GTDBTK_CLASSIFICATION_FILE] [--custom_taxonomy_file CUSTOM_TAXONOMY_FILE] [--debug] [-h]
  • required named arguments:

    • --input_tree INPUT_TREE
      path to the unrooted tree in Newick format
    • --outgroup_taxon OUTGROUP_TAXON
      taxon to use as outgroup (e.g., p__Patescibacteria or p__Altarchaeota)
    • --output_tree OUTPUT_TREE
      path to output the tree
  • optional arguments:

    • --gtdbtk_classification_file GTDBTK_CLASSIFICATION_FILE
      file with GTDB-Tk classifications produced by the classify command
    • --custom_taxonomy_file CUSTOM_TAXONOMY_FILE
      file indicating custom taxonomy strings for user genomes, that should contain any genomes belonging to the outgroup
    • --debug create intermediate files for debugging purposes (default: False)
    • -h, --help show help message

Step 9: ani_rep (compute ANI values)

Compute the ANI of input genomes to all GTDB-Tk representative genomes.

$ time gtdbtk ani_rep --genome_dir metawrap_70_5_bins/ --out_dir ani_rep/ --cpus 6 -x .fa --prefix F --debug

# de_novo_wf: alternative workflow that infers a de novo tree including the user genomes and decorates it with the GTDB taxonomy
$ gtdbtk de_novo_wf --genome_dir metawrap_70_5_bins/ --out_dir de_novo_wf --extension .fa --bacteria --outgroup_taxon p__Patescibacteria --prefix F --cpus 6 --debug
usage: gtdbtk ani_rep (--genome_dir GENOME_DIR | --batchfile BATCHFILE) --out_dir OUT_DIR [--no_mash] [--mash_k MASH_K] [--mash_s MASH_S] [--mash_d MASH_D] [--mash_v MASH_V] [--mash_db MASH_DB] [--min_af MIN_AF] [-x EXTENSION] [--prefix PREFIX] [--cpus CPUS] [--debug] [-h]
  • mutually exclusive required arguments:

    • --genome_dir GENOME_DIR
      directory containing genome files in FASTA format

    • --batchfile BATCHFILE
      path to file describing genomes - tab separated in 2 or 3 columns (FASTA file, genome ID, translation table [optional])

  • required named arguments:

    • --out_dir OUT_DIR directory to output files

  • optional Mash arguments:

    • --no_mash skip pre-filtering of genomes using Mash (default: False)
    • --mash_k MASH_K k-mer size [1-32] (default: 16)
    • --mash_s MASH_S maximum number of non-redundant hashes (default: 5000)
    • --mash_d MASH_D maximum distance to keep [0-1] (default: 0.1)
    • --mash_v MASH_V maximum p-value to keep [0-1] (default: 1.0)
    • --mash_db MASH_DB path to save/read (if exists) the Mash reference sketch database (.msh)
  • optional FastANI arguments:

    • --min_af MIN_AF minimum alignment fraction to consider closest genome (default: 0.65)
  • optional arguments:

    • -x, --extension EXTENSION
      extension of files to process, gz = gzipped (default: fna)
    • --prefix PREFIX prefix for all output files (default: gtdbtk)
    • --cpus CPUS number of CPUs to use (default: 1)
    • --debug create intermediate files for debugging purposes (default: False)
    • -h, --help show help message

B.4.2 Annotation with EnrichM

Step 1. annotate

The genome annotation pipeline. Genomes can be searched against the KO, PFAM, TIGRFAM, and CAZy (dbCAN) databases. The run produces a .gff file for each genome and a frequency matrix for each annotation type, where rows are annotation IDs and columns are genomes.

$ enrichm annotate --genome_directory metawrap_70_5_bins --output EnrichM_annotate --force --ko --ko_hmm --pfam --tigrfam --clusters --orthologs --cazy --ec --threads 30 --parallel 8 --suffix .fa --count_domains --chunk_number 8
Parameters:
  • Output options:

    • --output OUTPUT Output directory
    • --force Overwrite previous run
  • Input options (Use one):

    • --genome_files GENOME_FILES [GENOME_FILES …]
      Space separated list of genomes to annotate
    • --genome_directory GENOME_DIRECTORY
      Directory containing genomes to annotate
    • --protein_files PROTEIN_FILES [PROTEIN_FILES …]
      Space separated list of .faa files of genomes to annotate. Protein files need to be generated by prodigal.
    • --protein_directory PROTEIN_DIRECTORY
      Directory containing .faa files of genomes to annotate. Protein files need to be generated by prodigal.
  • Annotations options:

    • --ko Annotate with KO ids
    • --ko_hmm Annotate with KO ids
    • --pfam Annotate with Pfam HMMs
    • --tigrfam Annotate with TIGRFAM HMMs
    • --clusters Annotate with cluster ids
    • --orthologs Annotate with ortholog ids
    • --cazy Annotate with dbCAN HMMs
    • --ec Annotate with EC ids
  • Cutoff options:

    • --cut_ga For PFAM and TIGRfam searches: use profiles GA gathering cutoffs to set all thresholding
    • --cut_nc For PFAM and TIGRfam searches: use profiles NC noise cutoffs to set all thresholding
    • --cut_tc For PFAM and TIGRfam searches: use profiles TC trusted cutoffs to set all thresholding
    • --cut_ko For KO HMM annotation searches: use cutoffs defined by KEGG to maximise F-score.
    • --evalue EVALUE Use this evalue cutoff to filter false positives (default: 1e-05)
    • --bit BIT Use this bit score cutoff to filter false positives (default: 0)
    • --id ID Use this percent identity cutoff to filter false positives (default: 0.3)
    • --aln_query ALN_QUERY
      This fraction of the query must align to the reference (default: 0.7)
    • --aln_reference ALN_REFERENCE
      This fraction of the reference must align to the query (default: 0.7)
    • --c C When clustering, use matches above this fraction of aligned (covered) query and target residues (default: 0.7)
  • Runtime options:

    • --threads THREADS Use this number of threads when annotating with BLAST and HMMsearch (default: 1)
    • --parallel PARALLEL Run this number of jobs in parallel when annotating with HMMsearch (default: 5)
    • --inflation INFLATION
      Inflation factor to use when calling clusters in ortholog (default = 5.0)
    • --suffix SUFFIX Treat files ending with this suffix within the --genome_directory as genomes (default: .fna for --genome_directory and .faa for --protein_directory)
    • --light Don't store metadata for genome files (can't use enrichm compare downstream, default=False)
    • --count_domains Fill the frequency matrix with the total number of times an annotation was detected (for example, when one domain occurs more than once within a protein), rather than the count of proteins with that annotation
    • --chunk_number CHUNK_NUMBER
      Split loading of genomes into this many chunks (default = 4)
    • --chunk_max CHUNK_MAX
      Maximum number of genomes to load per chunk (default = 2500)

Step 2. classify

Determine what pathways a genome encodes. Classify quickly reads in KO annotations in the form of a matrix (KO IDs as rows, genomes as columns) and determines which KEGG modules are complete. Annotation matrices can be generated using the annotate function.

$ enrichm classify --output EnrichM_classify/ko_hmm_frequency_table --aggregate --genome_and_annotation_matrix EnrichM_annotate/ko_hmm_frequency_table.tsv --cutoff 0

$ enrichm classify --output EnrichM_classify/ko_frequency_table --aggregate --genome_and_annotation_matrix EnrichM_annotate/ko_frequency_table.tsv --cutoff 0

# The following six commands are optional
$ enrichm classify --output EnrichM_classify/cazy_frequency_table --aggregate --genome_and_annotation_matrix EnrichM_annotate/cazy_frequency_table.tsv --cutoff 0

$ enrichm classify --output EnrichM_classify/cluster_frequency_table --aggregate --genome_and_annotation_matrix EnrichM_annotate/cluster_frequency_table.tsv --cutoff 0

$ enrichm classify --output EnrichM_classify/ec_frequency_table --aggregate --genome_and_annotation_matrix EnrichM_annotate/ec_frequency_table.tsv --cutoff 0

$ enrichm classify --output EnrichM_classify/ortholog_frequency_table --aggregate --genome_and_annotation_matrix EnrichM_annotate/ortholog_frequency_table.tsv --cutoff 0

$ enrichm classify --output EnrichM_classify/pfam_frequency_table --aggregate --genome_and_annotation_matrix EnrichM_annotate/pfam_frequency_table.tsv --cutoff 0

$ enrichm classify --output EnrichM_classify/tigrfam_frequency_table --aggregate --genome_and_annotation_matrix EnrichM_annotate/tigrfam_frequency_table.tsv --cutoff 0
Parameters:
  • Output options:

    • --output OUTPUT Output directory
    • --force Overwrite previous run
  • Input options:

    • --genome_and_annotation_matrix GENOME_AND_ANNOTATION_MATRIX
      Path to file containing a genome annotation matrix
    • --custom_modules CUSTOM_MODULES
      Tab separated file containing module name, definition as the columns
    • --module_rules_json MODULE_RULES_JSON
      json file specifying rules to interpret the annotation and guide module annotation
    • --gff_files GFF_FILES
      GFF files for the genomes being classified.
  • Cutoff options:

    • --cutoff CUTOFF Output only modules with greater than this percent of the required KO groups (default = print all modules)
  • Runtime options:

    • --aggregate Calculate the abundance of each pathway within each genome/sample (column)

Step 3. enrichment

Enrichment will read in KO or PFAM annotations in the form of a matrix (IDs as rows, genomes as columns) and a metadata file that separates genomes into groups to compare, and will run some basic stats to determine the enrichment of modules or pfam clans between and within the groups.

$ enrichm enrichment --output EnrichM_enrichment/ --metadata genome.list --annotation_matrix EnrichM_annotate/ko_frequency_table.tsv --force
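The metadata file (genome.list above) has two tab-separated columns, genome name and group; a hypothetical example using the F01/F02 groups that appear later in the pathway step:

$ cat genome.list
bin_1    F01
bin_2    F01
bin_3    F02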
Parameters:
  • Output options:

    • --output OUTPUT Output directory
    • --force Overwrite previous run
  • Input options:

    • --annotate_output ANNOTATE_OUTPUT
      Output directory provided by enrichm annotate
    • --metadata METADATA Metadata file with two columns, the first with the genome name, the second with the groupings to compare.
    • --annotation_matrix ANNOTATION_MATRIX
      Annotation matrix to compare.
    • --gff_files GFF_FILES [GFF_FILES …]
      Gff files for genomes to compare.
    • --abundance ABUNDANCE Genome abundance matrix
    • --abundance_metadata ABUNDANCE_METADATA
      Metadata grouping abundance samples.
    • --transcriptome TRANSCRIPTOME Transcriptome abundance matrix
    • --transcriptome_metadata TRANSCRIPTOME_METADATA
      Metadata grouping abundance samples.
  • Genome Taxonomy DataBase (GTDB) options:

    • --batchfile BATCHFILE
      metadata file to compare with.
  • Runtime options:

    • --pval_cutoff PVAL_CUTOFF
      Only output results with a p-value below this cutoff (default=0.05).
    • --proportions_cutoff PROPORTIONS_CUTOFF
      Proportion enrichment cutoff.
    • --threshold THRESHOLD
      The threshold to control for in false discovery rate or familywise error rate.
    • --multi_test_correction MULTI_TEST_CORRECTION
      The form of multiple test correction to use. Uses the statsmodels module and consequently has all of its options.
      Default: Benjamini-Hochberg FDR (fdr_bh)
      Options: Bonferroni (b)
      Sidak (s)
      Holm (h)
      Holm-Sidak (hs)
      Simes-Hochberg (sh)
      Hommel (ho)
      FDR Benjamini-Yekutieli (fdr_by)
      FDR 2-stage Benjamini-Hochberg (fdr_tsbh)
      FDR 2-stage Benjamini-Krieger-Yekutieli (fdr_tsbky)
      FDR adaptive Gavrilov-Benjamini-Sarkar (fdr_gbs))
    • --processes PROCESSES Number of processes to use for the enrichment analysis
    • --allow_negative_values Allow negative values in the input matrix
    • --ko Compare KO ids (annotated with DIAMOND)
    • --ko_hmm Compare KO ids (annotated with HMMs)
    • --pfam Compare Pfam ids
    • --tigrfam Compare TIGRFAM ids
    • --cluster Compare cluster ids
    • --ortholog Compare ortholog ids
    • --cazy Compare dbCAN ids
    • --ec Compare EC ids
    • --range RANGE Base pair range to search for synteny within. Default = 2500.
    • --subblock_size SUBBLOCK_SIZE
      Number of genes clustered in a row to be reported. Default = 2.
    • --operon_mismatch_cutoff OPERON_MISMATCH_CUTOFF
      Number of allowed mismatches when searching for operons across genomes.
    • --operon_match_score_cutoff OPERON_MATCH_SCORE_CUTOFF
      Score cutoff for operon matches

Step 4. pathway

Pathway reads in a KO matrix and generates a Cytoscape-readable metabolic network and metadata file. Only reactions that are possible given the KOs present in the input matrix are shown, and the modules and reactions that are included in the output can be customized. (Note: this step errored in my run.)

$ enrichm pathway --matrix EnrichM_annotate/ko_frequency_table.tsv --genome_metadata genome.list --output EnrichM_pathway --abundance EnrichM_enrichment/F01_vs_F02_ivg_results.cdf.tsv --abundance_metadata genome.list --force
Parameters:
  • Input options:

    • --matrix MATRIX KO matrix. Required.
    • --genome_metadata GENOME_METADATA
      Metadata file with two columns: the first is the genome name, the second the groupings to compare.
    • --abundance ABUNDANCE Abundance matrix
    • --abundance_metadata ABUNDANCE_METADATA
      Metadata file with two columns: the first is the genome name, the second the groupings to compare.
    • --tpm_values TPM_VALUES
      TPM values produced by DetectM
    • --tpm_metadata TPM_METADATA
      Metadata file with two columns: the first is the genome name, the second the groupings to compare.
    • --metabolome METABOLOME
      Metabolome CID matrix.
  • Logging options:

    • --log LOG Output logging information to this file.
    • --verbosity VERBOSITY
      Level of verbosity (1 - 5 - default = 4) 5 = Very verbose, 1 = Silent
  • Output options:

    • --output OUTPUT Output path
    • --force Overwrite previous results
  • Pathway options:

    • --limit LIMIT [LIMIT …]
      USE ONLY these reactions, or reactions within this pathway or module (space separated list).
    • --filter FILTER [FILTER …]
      Do not use these reactions, or reactions within this pathway or module (space separated list).
    • --enrichment_output ENRICHMENT_OUTPUT
      Supply an enrichment output to integrate the results into the output network.

Step 5. explore

Explore is similar to pathway, but rather than generating a specified pathway it will start from a given query compound ID, and explore the possible reactions that use that compound given the enzymes present in the input KO matrix. A sketched command follows the parameter list below.

$ 
Parameters:
  • Input options:

    • --matrix MATRIX KO matrix. Required.
    • --genome_metadata GENOME_METADATA
      Metadata file with two columns: the first is the genome name, the second the groupings to compare.
    • --abundance ABUNDANCE Abundance matrix
    • --abundance_metadata ABUNDANCE_METADATA
      Metadata file with two columns: the first is the genome name, the second the groupings to compare.
    • --tpm_values TPM_VALUES
      TPM values produced by DetectM
    • --tpm_metadata TPM_METADATA
      Metadata file with two columns: the first is the genome name, the second the groupings to compare.
    • --metabolome METABOLOME
      Metabolome CID matrix.
  • Logging options:

    • --log LOG Output logging information to this file.
    • --verbosity VERBOSITY
      Level of verbosity (1 - 5 - default = 4) 5 = Very verbose, 1 = Silent
  • Output options:

    • --output OUTPUT Output path
    • --force Overwrite previous results
  • Query options:

    • --queries QUERIES A file containing the KEGG ids of the compounds from which to start in the metabolic network
    • --depth DEPTH Number of steps to take into the metabolic network
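Based on the options above, a hedged sketch of an explore run (queries.txt containing KEGG compound IDs, and the depth value, are hypothetical):

$ enrichm explore --matrix EnrichM_annotate/ko_frequency_table.tsv --queries queries.txt --depth 2 --output EnrichM_explore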

C. Metagenome functional profiling
