Predicting phenotypes from genomes: installing and using Traitar

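Traitar derives phenotypes from genome sequences. It ships phenotype classifiers that predict 67 traits related to carbon and energy source utilization, oxygen requirement, morphology, antibiotic susceptibility, proteolysis, and enzymatic activities.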

1. Installing Traitar

Install the basic dependencies:

sudo apt-get install python-scipy python-matplotlib python-pip python-pandas

Change to the directory where you keep installed software; mine is tools under my home directory:

cd ~/tools

=======================================================

Install the main program under the home directory (a per-user install):

pip install traitar --user

Add the install location to the PATH environment variable (I use zsh; with bash, edit ~/.bashrc instead):

vim ~/.zshrc

Press i to enter insert mode and append the following line at the end of the file:

PATH=$PATH:$HOME/.local/bin/

Press Esc, then type :wq! and press Enter to save and quit. Then reload the shell configuration:

source ~/.zshrc

Install the external dependencies (parallel, prodigal, hmmer):

sudo apt-get install parallel prodigal hmmer
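Before moving on, it can help to confirm that the installed programs are actually reachable from the shell. The following is a minimal sketch, assuming it is run with Python 3 (for shutil.which) and that HMMER provides an executable named hmmsearch; the tool list and the helper name missing_tools are illustrative choices, not part of Traitar.

```python
import shutil

# Executable names are assumptions: "hmmsearch" is one of the programs
# shipped with HMMER; adjust the list to match your installation.
REQUIRED_TOOLS = ["traitar", "parallel", "prodigal", "hmmsearch"]


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools that cannot be found on the current PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("All required tools were found.")
```

If traitar itself is reported as missing, the PATH entry for ~/.local/bin is the first thing to recheck.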

Download the Pfam database into the home directory and build it:

traitar pfam ~/

Alternatively, download the Pfam database manually (if the previous command succeeded, skip the commands below):

cd ~/

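The official instructions use Pfam 27.0; I downloaded Pfam 32.0, the latest release at the time of writing (run only one of the two commands below). Traitar's phenotype models were built against Pfam 27.0, so results obtained with a newer release may differ: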

wget ftp://ftp.ebi.ac.uk/pub/databases/Pfam/releases/Pfam27.0/Pfam-A.hmm.gz

wget ftp://ftp.ebi.ac.uk/pub/databases/Pfam/releases/Pfam32.0/Pfam-A.hmm.gz

Decompress Pfam-A.hmm.gz (for example with gunzip Pfam-A.hmm.gz), then run the following command to build the database:

traitar pfam --local ~/
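The decompression step for the manual download can also be done from Python, which is convenient on machines without gunzip. This is only a sketch using the standard library; the function name gunzip and the default location ~/Pfam-A.hmm.gz are assumptions taken from the commands above.

```python
import gzip
import os
import shutil


def gunzip(src_path, dest_path=None):
    """Decompress a .gz file and return the path of the decompressed file."""
    if dest_path is None:
        # Strip the trailing ".gz": Pfam-A.hmm.gz -> Pfam-A.hmm
        if src_path.endswith(".gz"):
            dest_path = src_path[:-3]
        else:
            dest_path = src_path + ".out"
    # Stream the data so that the large HMM file never sits fully in memory.
    with gzip.open(src_path, "rb") as src, open(dest_path, "wb") as dest:
        shutil.copyfileobj(src, dest)
    return dest_path


if __name__ == "__main__":
    archive = os.path.expanduser("~/Pfam-A.hmm.gz")
    if os.path.exists(archive):
        print("Wrote " + gunzip(archive))
```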

If the software fails to run, check for the following errors:

Error: ImportError: C extension: numpy.core.multiarray failed to import not built. If you want to import pandas from the source directory, you may need to run 'python setup.py build_ext --inplace --force' to build the C extensions first.

Fix: run

conda install -c conda-forge numpy

Error: AttributeError: 'DataFrame' object has no attribute 'sort'

Fix: downgrade pandas (DataFrame.sort was removed in pandas 0.20):

conda install pandas=0.19.2
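To see whether an environment is affected by this error before running Traitar, the installed pandas version can be compared against 0.20, the release in which DataFrame.sort() was removed in favour of sort_values() and sort_index(). A minimal sketch; the helper name has_dataframe_sort is made up for illustration.

```python
def has_dataframe_sort(pandas_version):
    """Return True if this pandas version still provides DataFrame.sort().

    DataFrame.sort() was removed in pandas 0.20.0.
    """
    # Only major and minor matter, so suffixes such as "rc1" in the patch
    # component are never parsed.
    major, minor = (int(part) for part in pandas_version.split(".")[:2])
    return (major, minor) < (0, 20)


if __name__ == "__main__":
    try:
        import pandas
    except ImportError:
        print("pandas is not installed in this environment")
    else:
        ok = has_dataframe_sort(pandas.__version__)
        print(pandas.__version__, "has DataFrame.sort()" if ok else "lacks DataFrame.sort()")
```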

Error: Python's maximum recursion depth is exceeded: "maximum recursion depth exceeded while calling a Python object"

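Fix: edit the script ~/miniconda3/lib/python2.7/site-packages/scipy/cluster/hierarchy.py and insert the lines below at line 183 to raise the limit from the default of 1000 to a larger value such as 2000 (I needed this because I had more than 1,000 genomes):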

import sys

#print sys.getrecursionlimit()

sys.setrecursionlimit(2000)
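A change made inside site-packages is lost whenever scipy is reinstalled or upgraded. If the deeply recursive call is made from your own Python code rather than through the traitar command, the limit can instead be raised only around that call. The sketch below shows the idea with a toy recursive function; recursion_limit and nesting_depth are hypothetical names, and this is an alternative pattern, not what Traitar does internally.

```python
import sys
from contextlib import contextmanager


@contextmanager
def recursion_limit(limit):
    """Temporarily raise the interpreter's recursion limit to at least `limit`."""
    previous = sys.getrecursionlimit()
    sys.setrecursionlimit(max(previous, limit))
    try:
        yield
    finally:
        # Restore the original limit even if the wrapped code raises.
        sys.setrecursionlimit(previous)


def nesting_depth(n):
    """Toy recursive function standing in for a deeply recursive library call."""
    return 0 if n == 0 else 1 + nesting_depth(n - 1)


if __name__ == "__main__":
    with recursion_limit(5000):
        print(nesting_depth(3000))
```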

Error: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility

Fix: reinstall scipy and scikit-learn:

pip uninstall -y scipy scikit-learn

pip install --no-binary scipy scikit-learn

2. Running Traitar

First change to the parent directory of the directory that contains the genome files, then run:

traitar phenotype <in dir> <sample file> from_nucleotides <out_dir> -c 2

<in dir>: the input directory containing the genomes

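<sample file>: the sample description file, placed in the parent directory of <in dir>. It has three tab-separated columns: the first is the full file name of each genome (including the extension), the second is the strain name (usually the first column without the extension, but any name will do), and the third is the group label (used to assign strains to different groups). The header row is "sample_file_name sample_name category", as shown below. Note that the example rows list .faa protein files, which match Traitar's from_genes mode; with from_nucleotides, list the nucleotide FASTA files instead.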

sample_file_name sample_name category

1457190.3.RefSeq.faa Listeria_ivanovii_WSLC3009 environment_1

525367.9.RefSeq.faa Listeria_grayi_DSM_20601 environment_2

<out_dir>: the output directory for the results

-c 2: use two CPUs in parallel to speed up prediction
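With many genomes, writing the sample file by hand is tedious. The sketch below builds the three-column, tab-separated table described above from the contents of the input directory. The extension list, the placeholder category "NA", and the function names are assumptions for illustration; edit the category column afterwards to reflect your real groups.

```python
import os

HEADER = ("sample_file_name", "sample_name", "category")

# Assumed set of FASTA extensions; adjust to the files you actually have.
FASTA_EXTENSIONS = (".fna", ".fasta", ".fa", ".faa")


def build_sample_rows(in_dir, extensions=FASTA_EXTENSIONS, category="NA"):
    """Return one (file name, sample name, category) row per genome file."""
    rows = []
    for file_name in sorted(os.listdir(in_dir)):
        stem, ext = os.path.splitext(file_name)
        if ext.lower() in extensions:
            # The sample name is the file name without its extension.
            rows.append((file_name, stem, category))
    return rows


def write_sample_file(rows, out_path):
    """Write the header and rows as a tab-separated text file."""
    with open(out_path, "w") as handle:
        handle.write("\t".join(HEADER) + "\n")
        for row in rows:
            handle.write("\t".join(row) + "\n")


if __name__ == "__main__":
    # Hypothetical layout: ./genomes holds the FASTA files and the sample
    # file is written next to it, in the parent directory.
    if os.path.isdir("genomes"):
        write_sample_file(build_sample_rows("genomes"), "samples.txt")
```

The resulting file name (samples.txt here) is then passed as <sample file> on the traitar phenotype command line.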