预测前噬菌体prophages
发表于:2021-10-08 |
字数统计: 1.2k | 阅读时长: 6分钟 | 阅读量:

本文介绍如何从细菌/古菌中预测前噬菌体。

软件选择

软件安装

安装主程序

  • 推荐使用conda

    1
    2
    conda install phispy
    conda install perl-bioperl

下载模型

1
2
3
4
5
wget https://ftp.ncbi.nlm.nih.gov/pub/kristensen/pVOGs/downloads/All/AllvogHMMprofiles.tar.gz

tar -zxvf AllvogHMMprofiles.tar.gz

cat AllvogHMMprofiles/* > pVOGs.hmm

输入文件

运行软件

1
PhiSpy.py genbank_file -o output_directory --phage_genes 1 --color --threads 6 --phmms pVOGs.hmm --min_contig_size 5000 --output_choice 1
  • genbank file: 输入文件
  • output directory: 输出目录
  • --phage_genes: 区域内含有的被鉴定为噬菌体基因的最小数量。默认采用严格模式,即在每个前噬菌体区域必须鉴定得到两个或更多个phage基因。调高该参数的值将会减少预测到的前噬菌体的数量,反之,减小参数值,将会得到更多的移动元件。如果该参数设置为0,将会预测plasmids, integrons, and pathogenicity islands. Somewhat unexpectedly, it will also identify the ribosomal RNA operons as likely being mobile since they are unlike the host’s backbone!
  • --color: 根据CDs的功能对其着色,使用Artemis查看
  • --threads: 线程数
  • --min_contig_size: 低于阈值的序列将被忽略,不对其进行预测
  • --output_choice: 控制输出的文件类型,详见官网

报错处理

  • 由于序列ID引发的错误

    • 错误信息如下:
    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    21
    Traceback (most recent call last):
    File "/home/liu/miniconda3/envs/component/bin/PhiSpy.py", line 10, in <module>
    sys.exit(run())
    File "/home/liu/miniconda3/envs/component/lib/python3.7/site-packages/PhiSpyModules/main.py", line 122, in run
    main(sys.argv)
    File "/home/liu/miniconda3/envs/component/lib/python3.7/site-packages/PhiSpyModules/main.py", line 44, in main
    args_parser.record = PhiSpyModules.SeqioFilter(filter(lambda x: len(x.seq) > args_parser.min_contig_size, SeqIO.parse(handle, "genbank")))
    File "/home/liu/miniconda3/envs/component/lib/python3.7/site-packages/PhiSpyModules/seqio_filter.py", line 33, in __init__
    for n, item in enumerate(content):
    File "/home/liu/miniconda3/envs/component/lib/python3.7/site-packages/Bio/SeqIO/Interfaces.py", line 74, in __next__
    return next(self.records)
    File "/home/liu/miniconda3/envs/component/lib/python3.7/site-packages/Bio/GenBank/Scanner.py", line 516, in parse_records
    record = self.parse(handle, do_features)
    File "/home/liu/miniconda3/envs/component/lib/python3.7/site-packages/Bio/GenBank/Scanner.py", line 499, in parse
    if self.feed(handle, consumer, do_features):
    File "/home/liu/miniconda3/envs/component/lib/python3.7/site-packages/Bio/GenBank/Scanner.py", line 465, in feed
    self._feed_first_line(consumer, self.line)
    File "/home/liu/miniconda3/envs/component/lib/python3.7/site-packages/Bio/GenBank/Scanner.py", line 1571, in _feed_first_line
    raise ValueError("Did not recognise the LOCUS line layout:\n" + line)
    ValueError: Did not recognise the LOCUS line layout:
    LOCUS NODE_52_length_15591_cov_14.37480715591 bp
    • 解决方案
      • 将sequence ID缩短

批处理并整合结果

撰写脚本“run_PhiSpy.pl”。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
#!/usr/bin/perl
use strict;
use warnings;
use Bio::SeqIO;
# Author: Liu hualin
# Date: Oct 12, 2021

my @gbk = glob("*.gbk");
foreach (@gbk) {
$_=~/(.+).gbk/;
my $out = $1 . "_prophage";
system("PhiSpy.py $_ -o $out --phage_genes 1 --min_contig_size 5000 --output_choice 1 --color --phmms pVOGs.hmm --threads 8");
}

open OUT, ">All.prophages.txt" || die;# print prophages informations
print OUT "Strain\tPhage ID\tContig ID\tStart location of the prophage\tStop location of the prophage\tStart of attL\tEnd of attL\tStart of attR\tEnd of attR\tsequence of attL\tSequence of attR\tWhy this att site was chosen for this prophage\n";
# attachment (att) sites
open SEQ, ">All.prophages.seq" || die;# print prophages sequences
print SEQ "Strain\tPhage ID\tContig ID\tGene Start\tGene End\tStrand\tAnnotation\tProtein sequences\n";
my @result = glob("*_prophage");
foreach (@result) {
$_=~/(.+)_prophage/;
my $str = $1;
my $prophage = $_ . "/prophage_coordinates.tsv";
if (! -z $prophage) {
open IN, "$prophage" || die;
while (<IN>) {
chomp;
my @lines = split /\t/;
my $contig = $lines[1];
my $gbk = $str . ".gbk";
my $seqin = Bio::SeqIO->new( -format => 'genbank', -file => "$gbk");#需要在gbk文件所在的目录中运行命令!
while( (my $seq = $seqin->next_seq) ) {
foreach my $sf ( $seq->get_SeqFeatures ) {
if ($seq->display_name eq $contig) {
if( $sf->primary_tag eq 'CDS' ) {
#print SEQ $str, "\t", $lines[0], "\t", $seq->display_name, "\t", $sf->start, "\t", $sf->end, "\t", $sf->strand, "\t", $sf->get_tag_values('product'), "\t", $sf->get_tag_values('translation'), $seq->seq,"\n";# Also print contig sequences
print SEQ $str, "\t", $lines[0], "\t", $seq->display_name, "\t", $sf->start, "\t", $sf->end, "\t", $sf->strand, "\t", $sf->get_tag_values('product'), "\t", $sf->get_tag_values('translation'),"\n";# Only print Protein sequences
}
}
}
}
print OUT $str . "\t" . $_ . "\n";
}
close IN;
}
}
close OUT;
close SEQ;

将“run_PhiSpy.pl”和后缀名为“.gbk”的GenBank格式的文件,以及“pVOGs.hmm”放在同一目录下,运行如下命令:

1
perl run_PhiSpy.pl

得到两个汇总文件:

  • All.prophages.txt: 包含prophage信息的文件
  • All.prophages.seq: 包含prophage序列的文件

All.prophages.txt

All.prophages.seq

脚本获取

关注公众号“生信之巅”,聊天窗口回复“53f4”获取下载链接。

生信之巅微信公众号 生信之巅小程序码

敬告:使用文中脚本请引用本文网址,请尊重本人的劳动成果,谢谢!Notice: When you use the scripts in this article, please cite the link of this webpage. Thank you!

参考

上一篇:
原核生物基因岛预测
下一篇:
Hexo渲染异常