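Introduction
DeepARG is a machine learning tool that uses deep learning to characterize and annotate antibiotic resistance genes (ARGs) in metagenomes. It provides two models, one per input type: short sequence reads and gene-like sequences.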
Installing the software
Installing via conda
# Create the environment
conda create -n deeparg_env python=2.7.18
conda activate deeparg_env
# Install diamond
conda install -c bioconda diamond==0.9.24
# Install the other dependencies
conda install trimmomatic vsearch bedtools bowtie2 samtools
# Install DeepARG
pip install deeparg==1.0.2
# Download the databases and other data files; -o sets the download directory
deeparg download_data -o tools/deeparg
Installing by other methods
Follow the official documentation.
Using the software
Predicting ARGs in reads
Input files
Paired-end reads.
Command
deeparg short_reads_pipeline --forward_pe_file Reads/LD201221-0003_S20210104-0015_F01_clean.R1.fq.gz --reverse_pe_file Reads/LD201221-0003_S20210104-0015_F01_clean.R2.fq.gz --output_file F01.deeparg -d ~/tools/deeparg/
Parameters
* -h, --help: show this help message and exit
* --forward_pe_file FORWARD_PE_FILE: forward mate from paired end library
* --reverse_pe_file REVERSE_PE_FILE: reverse mate from paired end library
* --output_file OUTPUT_FILE: save results to this file prefix
* -d DEEPARG_DATA_PATH: Path where data was downloaded [see deeparg download-data --help for details]
* --deeparg_identity DEEPARG_IDENTITY: minimum identity for ARG alignments [default 80]
* --deeparg_probability DEEPARG_PROBABILITY: minimum probability for considering a reads as ARG-like [default 0.8]
* --deeparg_evalue DEEPARG_EVALUE: minimum e-value for ARG alignments [default 1e-10]
* --gene_coverage GENE_COVERAGE: minimum coverage required for considering a full gene in percentage. This parameter looks at the full gene and all hits that align to the gene. If the overlap of all hits is below the threshold the gene is discarded. Use with caution [default 1]
Predicting ARGs in FASTA sequences
Input files
Either nucleotide sequences or amino acid sequences.
Command
# 1) Annotate gene-like sequences when the input is a nucleotide FASTA file:
deeparg predict --model LS --type nucl --input /path/file.fasta --out /path/to/out/file.out
# 2) Annotate gene-like sequences when the input is an amino acid FASTA file:
deeparg predict --model LS --type prot --input /path/file.fasta --out /path/to/out/file.out
# 3) Annotate short sequence reads when the input is a nucleotide FASTA file:
deeparg predict --model SS --type nucl --input /path/file.fasta --out /path/to/out/file.out
# 4) Annotate short sequence reads when the input is a protein FASTA file (unusual case):
deeparg predict --model SS --type prot --input /path/file.fasta --out /path/to/out/file.out
Parameters
usage: deeparg predict
* -h, --help: show this help message and exit
* --model MODEL: Select model to use (short sequences for reads | long sequences for genes) SS|LS [No default]
* -i, --input-file INPUT_FILE: Input file (Fasta input file)
* -o, --output-file OUTPUT_FILE: Output file where to store results
* -d, --data-path DATA_PATH: Path where data was downloaded [see deeparg download-data --help for details]
* --type TYPE: Molecular data type prot/nucl [Default: nucl]
* --min-prob MIN_PROB: Minimum probability cutoff [Default: 0.8]
* --arg-alignment-identity ARG_ALIGNMENT_IDENTITY: Identity cutoff for sequence alignment [Default: 50]
* --arg-alignment-evalue ARG_ALIGNMENT_EVALUE: Evalue cutoff [Default: 1e-10]
* --arg-alignment-overlap ARG_ALIGNMENT_OVERLAP: Alignment read overlap [Default: 0.8]
* --arg-num-alignments-per-entry ARG_NUM_ALIGNMENTS_PER_ENTRY: Diamond, minimum number of alignments per entry [Default: 1000]
* --model-version MODEL_VERSION: Model deepARG version [Default: v2]
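The four commands above differ only in --model and --type. As a rough sketch of how that choice could be automated, the Python 3 snippet below guesses whether a FASTA record is nucleotide or protein and assembles the matching command line. The helper names (guess_type, first_sequence, predict_command) and the 90% threshold are illustrative assumptions, not part of DeepARG; the flags are the ones used in the commands above. Since the DeepARG environment is Python 2.7, this would run in a separate Python 3 interpreter and only print the command.

```python
import shlex

# Letters treated as nucleotide codes (assumption: A, C, G, T, U and N only).
NUCLEOTIDE_LETTERS = set("ACGTUN")


def guess_type(sequence, threshold=0.9):
    """Heuristic: return "nucl" if at least `threshold` of the letters are
    nucleotide codes, otherwise "prot"."""
    letters = [c for c in sequence.upper() if c.isalpha()]
    if not letters:
        raise ValueError("empty sequence")
    nucl_fraction = sum(c in NUCLEOTIDE_LETTERS for c in letters) / len(letters)
    return "nucl" if nucl_fraction >= threshold else "prot"


def first_sequence(fasta_text):
    """Return the sequence of the first record in FASTA-formatted text."""
    parts = []
    seen_header = False
    for line in fasta_text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if seen_header:
                break  # second record starts here; stop reading
            seen_header = True
        elif seen_header and line:
            parts.append(line)
    return "".join(parts)


def predict_command(fasta_path, out_path, model, seq_type):
    """Build a `deeparg predict` command line as a shell-quoted string."""
    if model not in ("SS", "LS"):
        raise ValueError("model must be SS or LS")
    if seq_type not in ("nucl", "prot"):
        raise ValueError("seq_type must be nucl or prot")
    argv = ["deeparg", "predict", "--model", model, "--type", seq_type,
            "--input", fasta_path, "--out", out_path]
    return " ".join(shlex.quote(a) for a in argv)


# Made-up FASTA text for illustration only.
demo_fasta = ">gene1 example\nATGCGTACGT\nTAGCGGTA\n>gene2\nATGAAA\n"
seq_type = guess_type(first_sequence(demo_fasta))
print(predict_command("/path/file.fasta", "/path/to/out/file.out", "LS", seq_type))
```

The guess only looks at the first record, so a mixed or very short file could be misclassified; checking the printed command before running it is advisable.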
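Output
DeepARG produces two files: *.ARG, which contains sequences with probability >= --min-prob (default 0.8), and *.potential.ARG, which contains sequences with probability < --min-prob (default 0.8). The *.potential.ARG file may still contain ARG-like sequences, but those sequences need to be checked manually.
Both files have the following header columns: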
* ARG_NAME
* QUERY_START
* QUERY_END
* QUERY_ID
* PREDICTED_ARG_CLASS
* BEST_HIT_FROM_DATABASE
* PREDICTION_PROBABILITY
* ALIGNMENT_BESTHIT_IDENTITY (%)
* ALIGNMENT_BESTHIT_LENGTH
* ALIGNMENT_BESTHIT_BITSCORE
* ALIGNMENT_BESTHIT_EVALUE
* COUNTS
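To illustrate how these columns can be used, here is a minimal Python 3 sketch that counts rows per predicted ARG class above a probability cutoff. It assumes the file is tab-delimited, starts with one header line, and has its columns in the order listed above; none of that is verified here, so the positions should be checked against a real output file. The function name count_by_class and the demo rows are made up for illustration.

```python
import csv
import io
from collections import Counter

# Zero-based column positions, assuming the column order listed above.
COL_CLASS = 4        # PREDICTED_ARG_CLASS
COL_PROBABILITY = 6  # PREDICTION_PROBABILITY


def count_by_class(handle, min_prob=0.8):
    """Count rows per predicted ARG class, keeping rows whose
    prediction probability is >= min_prob."""
    reader = csv.reader(handle, delimiter="\t")
    next(reader, None)  # skip the header line (assumed to be present)
    counts = Counter()
    for row in reader:
        if len(row) <= COL_PROBABILITY:
            continue  # skip blank or truncated lines
        if float(row[COL_PROBABILITY]) >= min_prob:
            counts[row[COL_CLASS]] += 1
    return counts


# Made-up rows for illustration only; this is not real DeepARG output.
DEMO = (
    "ARG\tstart\tend\tquery\tclass\tbest_hit\tprob\tidentity\tlength\tbitscore\tevalue\tcounts\n"
    "geneA\t1\t100\tread1\ttetracycline\thitA\t0.99\t95.0\t33\t70.1\t1e-12\t1\n"
    "geneB\t5\t98\tread2\tbeta-lactam\thitB\t0.85\t88.0\t31\t60.2\t1e-11\t1\n"
    "geneA\t2\t99\tread3\ttetracycline\thitA\t0.92\t91.0\t32\t65.0\t1e-12\t1\n"
    "geneC\t1\t90\tread4\tmultidrug\thitC\t0.40\t55.0\t30\t40.0\t1e-10\t1\n"
)

for arg_class, n in sorted(count_by_class(io.StringIO(DEMO)).items()):
    print(arg_class, n)
```

For a real file, the same function can be called with an open file handle, for example `with open(path, newline="") as fh: count_by_class(fh)`.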
Batch processing
- Run_deepARG_reads.pl
- Run_deepARG_scafs.pl
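The contents of these Perl scripts are not shown here, so the following is not a reimplementation of them. It is only a minimal Python 3 sketch of the general idea for reads: pair each R1 file with its R2 mate by file name and emit one short_reads_pipeline command per sample. The suffixes follow the file names in the read-prediction command above; the function name build_read_commands and the choice of output prefix are assumptions.

```python
import os
import shlex


def build_read_commands(r1_files, data_path,
                        r1_suffix="_clean.R1.fq.gz",
                        r2_suffix="_clean.R2.fq.gz"):
    """Return one `deeparg short_reads_pipeline` command per R1 file.

    The R2 path is derived by swapping the suffix; files that do not end
    with r1_suffix are ignored. data_path should already be expanded,
    because a quoted "~" is not expanded by the shell.
    """
    commands = []
    for r1 in sorted(r1_files):
        if not r1.endswith(r1_suffix):
            continue
        stem = r1[: -len(r1_suffix)]
        r2 = stem + r2_suffix
        sample = os.path.basename(stem)  # assumption: use the full stem as sample name
        argv = ["deeparg", "short_reads_pipeline",
                "--forward_pe_file", r1,
                "--reverse_pe_file", r2,
                "--output_file", sample + ".deeparg",
                "-d", data_path]
        commands.append(" ".join(shlex.quote(a) for a in argv))
    return commands


if __name__ == "__main__":
    demo_files = ["Reads/LD201221-0003_S20210104-0015_F01_clean.R1.fq.gz"]
    data_dir = os.path.expanduser("~/tools/deeparg/")
    for command in build_read_commands(demo_files, data_dir):
        print(command)
```

In practice the file list could come from `glob.glob("Reads/*_clean.R1.fq.gz")`, and the printed commands could be reviewed and then run inside the deeparg_env environment. Printing rather than executing keeps the sketch independent of the Python 2.7 environment that DeepARG itself uses.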
Comparing samples
Generating the sample vs. ARGs matrix
- co_type.pl
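How co_type.pl builds its matrix is not shown, so the snippet below is only a generic Python 3 sketch of the idea: given per-sample counts, produce a table with one row per ARG (or ARG class) and one column per sample, filling in 0 where a sample has no hits. The layout (ARGs as rows, samples as columns), the function names, and the demo numbers are all assumptions.

```python
import csv
import sys


def build_matrix(per_sample_counts):
    """Turn {sample: {arg: count}} into (header, rows).

    Rows are ARGs in sorted order, columns are samples in sorted order,
    and missing sample/ARG combinations are filled with 0.
    """
    samples = sorted(per_sample_counts)
    arg_names = sorted({name
                        for counts in per_sample_counts.values()
                        for name in counts})
    header = ["ARG"] + samples
    rows = [[name] + [per_sample_counts[s].get(name, 0) for s in samples]
            for name in arg_names]
    return header, rows


def write_matrix(header, rows, handle):
    """Write the matrix as tab-separated text."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)


# Made-up counts for illustration only.
DEMO_COUNTS = {
    "F01": {"tetracycline": 12, "beta-lactam": 3},
    "F02": {"tetracycline": 7, "multidrug": 5},
}

header, rows = build_matrix(DEMO_COUNTS)
write_matrix(header, rows, sys.stdout)
```

A tab-separated table of this shape is a common input for plotting scripts in R, though whether it matches what Bubble_plot.R and Plot_heatmap.R expect would need to be checked against those scripts.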
Drawing a bubble plot
- Bubble_plot.R
Drawing a heat map
- Plot_heatmap.R
References
Getting the code
Follow the WeChat official account "生信之巅" and reply "92eb" in the chat window to get the download link.
Notice: if you use the scripts from this post, please cite this page's URL and respect the author's work. Thank you!