PGCGAP - the Prokaryotic Genomics and Comparative Genomics Analysis Pipeline v1.0.33

      ____       ____      ____     ____       _        ____    
    U|  _"\ u U /"___|u U /"___| U /"___|u U  /"\  u  U|  _"\ u 
    \| |_) |/ \| |  _ / \| | u   \| |  _ /  \/ _ \/   \| |_) |/ 
     |  __/    | |_| |   | |/__   | |_| |   / ___ \    |  __/   
     |_|        \____|    \____|   \____|  /_/   \_\   |_|      
     ||>>_      _)(|_    _// \\    _)(|_    \\    >>   ||>>_    
    (__)__)    (__)__)  (__)(__)  (__)__)  (__)  (__) (__)__)

Multi-version instructions (This one for the latest version)

Introduction

PGCGAP is a pipeline for prokaryotic comparative genomics analysis. It accepts paired-end reads, ONT reads or PacBio reads as input. In addition to genome assembly, gene prediction and annotation, it can produce common comparative genomics results with a single command: phylogenetic trees of single-copy core proteins and core SNPs, the pan-genome, whole-genome Average Nucleotide Identity (ANI), orthogroups and orthologs, COG annotations, substitutions (SNPs) and insertions/deletions (indels), and mining of antimicrobial resistance and virulence genes. To follow this document, please upgrade PGCGAP to v1.0.33 or later.

Installation

The software was tested successfully on Windows WSL, Linux x64, and macOS. Because it relies on a large number of other programs, installing it with Bioconda is recommended.

Step1: Install PGCGAP

Method 1: use mamba to install PGCGAP
# Install mamba first
conda install mamba

# Specify the version of PGCGAP to install, usually the latest one (v1.0.33 here)
mamba create -n pgcgap pgcgap=1.0.33
Notice: When I submitted the Bioconda recipe for the latest version of PGCGAP (v1.0.33), Bioconda had moved to a new test server that allocates very little memory, which caused the recipe test to fail. I had to remove some dependencies from the recipe to pass the test, so after installing the main program of PGCGAP v1.0.33 through Conda, users still need to install those dependencies separately (v1.0.32 and earlier versions do not require this extra step). This should be resolved soon after the release of Conda v4.12, when Bioconda will switch to the less memory-consuming Mamba for recipe testing; after that, the commands above will again install PGCGAP together with its dependencies. For now, after the previous step, run the following commands to complete the installation of the dependency packages:

conda activate pgcgap
mamba install -y abricate canu roary orthofinder fastani fastp snippy sickle-trim unicycler perl-file-copy-recursive prokka pal2nal mash trimal

Method 2: use "environment.yaml". Run the following commands to download the latest environment file and install PGCGAP:

# Install mamba first
conda install mamba

# download pgcgap_latest_env.yml
wget --no-check-certificate https://bcam.hzau.edu.cn/PGCGAP/conda/pgcgap_latest_env.yml

# create a conda environment named as pgcgap and install the latest version of PGCGAP
mamba env create -f pgcgap_latest_env.yml

Step2: Set up the COG database (users should run this after the first installation of pgcgap)

conda activate pgcgap
pgcgap --setup-COGdb
conda deactivate

Users with Docker installed have another option for installing PGCGAP:

docker pull quay.io/biocontainers/pgcgap:<tag>

(see pgcgap/tags for valid values for <tag>)

Required dependencies

Abricate
ABySS
Canu
CD-HIT
Coreutils
Diamond
FastANI
Fastme
Fastp
FastTree
Htslib
IQ-TREE
Mafft
Mash
Mmseqs2
Muscle
NCBI-blast+
OrthoFinder
OpenJDK8
PAL2NAL v14
trimAL
Perl & the modules
- perl-bioperl
- perl-data-dumper
- perl-file-tee
- perl-getopt-long
- perl-pod-usage
- perl-parallel-forkmanager
Prokka
Python & the modules
- biopython
- matplotlib
- numpy
- pandas
- seaborn
R & the packages
- corrplot
- ggplot2
- gplots
- pheatmap
- plotrix
Roary
Sickle-trim
Snippy
Snp-sites
unicycler
wget

Usage

Print the help messages:

pgcgap --help

Check for update:

pgcgap --check-update

General usage:

pgcgap [modules] [options]

Show parameters for each module:

pgcgap [Assemble|Annotate|ANI|AntiRes|CoreTree|MASH|OrthoF|Pan|pCOG|VAR|STREE|ACC]

Show examples of each module:

pgcgap Examples

Setup COG database: (Users should execute this after the first installation of pgcgap)

pgcgap --setup-COGdb

Modules:
- [--All] Perform Assemble, Annotate, CoreTree, Pan, OrthoF, ANI, MASH, AntiRes and pCOG functions with one command
- [--Assemble] Assemble reads (short, long or hybrid) into contigs
- [--Annotate] Genome annotation
- [--CoreTree] Construct a tree of single-copy core proteins and a SNPs tree of single-copy core genes
- [--Pan] Run the "roary" pan-genome pipeline with gff3 files, and construct a phylogenetic tree with the single-copy core proteins called by roary
- [--OrthoF] Identify orthologous protein sequence families with "OrthoFinder", and construct a phylogenetic tree with the single-copy core orthologues
- [--ANI] Compute whole-genome Average Nucleotide Identity ( ANI )
- [--MASH] Genome and metagenome similarity estimation using MinHash
- [--pCOG] Run COG annotation for each strain (*.faa), and generate a table containing the relative abundance of each flag for all strains
- [--VAR] Rapid haploid variant calling and core genome alignment with "Snippy"
- [--AntiRes] Screening of contigs for antimicrobial and virulence genes
- [--STREE] Construct a phylogenetic tree based on multiple sequences in one file
- [--ACC] Other useful gadgets (now includes 'Assess' for filtering short sequences in the genome and assessing the statistics of the genome only)
Global Options:
- [--strain_num (INT)] [Required by "--All", "--CoreTree", "--Pan", "--VAR" and "--pCOG"] The total number of strains used for analysis, not including the reference genome
- [--ReadsPath (PATH)] [Required by "--All", "--Assemble" and "--VAR"] Reads of all strains as file paths ( Default ./Reads/Illumina )
- [--scafPath (PATH)] [Required by "--All", "--Assess", "--Annotate", "--MASH" and "--AntiRes"] Path for contigs/scaffolds (Default "Results/Assembles/Scaf/Illumina")
- [--AAsPath (PATH)] [Required by "--All", "--Pan", "--OrthoF" and "--pCOG"] Amino acids of all strains as fasta file paths, ( Default "./Results/Annotations/AAs" )
- [--reads1 (STRING)] [Required by "--All", "--Assemble" and "--VAR"] The suffix name of reads 1 ( for example: if the name of reads 1 is "YBT-1520_L1_I050.R1.clean.fastq.gz", "YBT-1520" is the strain name, so the suffix name should be ".R1.clean.fastq.gz" )
- [--reads2 (STRING)] [Required by "--All", "--Assemble" and "--VAR"] The suffix name of reads 2 ( for example: if the name of reads 2 is "YBT-1520_2.fq", the suffix name should be "_2.fq" )
- [--Scaf_suffix (STRING)] [Required by "--All", "--Assess", "--Annotate", "--MASH", "--ANI" and "--AntiRes"] The suffix of scaffolds or genome files. This is an important parameter that must be set ( Default -8.fa )
- [--filter_length (INT)] [Required by "--All", "--Assemble" and "--Assess"] Sequences shorter than the 'filter_length' will be removed from the assembled genomes. ( Default 200 )
- [--codon (INT)] [Required by "--All", "--Annotate", "--CoreTree" and "--Pan"] Translation table ( Default 11 )
```
- 1 Universal code
- 2 Vertebrate mitochondrial code
- 3 Yeast mitochondrial code
- 4 Mold, Protozoan, and Coelenterate Mitochondrial code and Mycoplasma/Spiroplasma code
- 5 Invertebrate mitochondrial
- 6 Ciliate, Dasycladacean and Hexamita nuclear code
- 9 Echinoderm and Flatworm mitochondrial code
- 10 Euplotid nuclear code
- 11 Bacterial, archaeal and plant plastid code ( Default )
- 12 Alternative yeast nuclear code
- 13 Ascidian mitochondrial code
- 14 Alternative flatworm mitochondrial code
- 15 Blepharisma nuclear code
- 16 Chlorophycean mitochondrial code
- 21 Trematode mitochondrial code
- 22 Scenedesmus obliquus mitochondrial code
- 23 Thraustochytrium mitochondrial code
```
- [--suffix_len (INT)] [Required by "--All", "--Assemble" and "--VAR"] (Strongly recommended) The suffix length of the reads, that is the length of your reads name minus the length of your strain name. For example the --suffix_len of "YBT-1520_L1_I050.R1.clean.fastq.gz" is 26 ( "YBT-1520" is the strain name ) ( Default 0 )
- [--fasttree] [Can be used with "CoreTree", "Pan" and "OrthoF"] Use FastTree to construct phylogenetic tree quickly instead of IQ-TREE
- [--bsnum (INT)] [Required by "CoreTree", "Pan", "OrthoF", "STREE", and "VAR"] Replicates for bootstrap of IQ-TREE ( Default 500 )
- [--fastboot (INT)] [Required by "CoreTree", "Pan", "OrthoF", "STREE", and "VAR"] Replicates for ultrafast bootstrap of IQ-TREE. ( must be >= 1000, Default 1000 )
- [--logs (STRING)] Name of the log file ( Default Logs.txt )
- [--threads (INT)] Number of threads to be used ( Default 4 )
Local Options:
- --Assemble
  - [--platform (STRING)] [Required] Sequencing Platform, "illumina", "pacbio", "oxford" and "hybrid" available ( Default illumina )
  - [--assembler (STRING)] [Required] Software used for illumina reads assembly, "abyss", "spades" and "auto" available ( Default auto )
  - [--kmmer (INT)] [Required] k-mer size for genome assembly of Illumina data ( Default 81 )
  - [--genomeSize (STRING)] [Required] An estimate of the size of the genome. Common suffixes are allowed, for example, 3.7m or 2.8g. Needed by PacBio data and ONT data ( Default Unset )
  - [--short1 (STRING)] [Required] FASTQ file of first short reads in each pair. Needed by hybrid assembly ( Default Unset )
  - [--short2 (STRING)] [Required] FASTQ file of second short reads in each pair. Needed by hybrid assembly ( Default Unset )
  - [--long (STRING)] [Required] FASTQ or FASTA file of long reads. Needed by hybrid assembly ( Default Unset )
  - [--hout (STRING)] [Required] Output directory for hybrid assembly ( Default ../../Results/Assembles/Hybrid )
- --Annotate
  - [--genus (STRING)] Genus name of your strain ( Default "NA" )
  - [--species (STRING)] Species name of your strain ( Default "NA" )
- --CoreTree
  - [--CDsPath (PATH)] [Required] CDs of all strains as fasta file paths ( Default "./Results/Annotations/CDs" ), if set to "NO", the SNPs of single-copy core genes will not be called
  - [-c (FLOAT)] Sequence identity threshold, ( Default 0.5)
  - [-n (INT)] Word_length, -n 2 for thresholds 0.4-0.5, -n 3 for thresholds 0.5-0.6, -n 4 for thresholds 0.6-0.7, -n 5 for thresholds 0.7-1.0 ( Default 2 )
  - [-G (INT)] Use global (set to 1) or local (set to 0) sequence identity, ( Default 0 )
  - [-t (INT)] Tolerance for redundance ( Default 0 )
  - [-aL (FLOAT)] Alignment coverage for the longer sequence. If set to 0.9, the alignment must cover 90% of the sequence ( Default 0.5 )
  - [-aS (FLOAT)] Alignment coverage for the shorter sequence. If set to 0.9, the alignment must cover 90% of the sequence ( Default 0.7 )
  - [-g (INT)] If set to 0, a sequence is clustered to the first cluster that meets the threshold (fast cluster). If set to 1, the program will cluster it into the most similar cluster that meets the threshold (accurate but slow mode, Default 1)
  - [-d (INT)] length of description in .clstr file. if set to 0, it takes the fasta defline and stops at first space ( Default 0 )
- --Pan
  - [--GffPath (PATH)] [Required] Gff files of all strains as paths ( Default "./Results/Annotations/GFF" )
  - [--PanTree] Construct a phylogenetic tree of single-copy core proteins called by roary
  - [--identi (INT)] Minimum percentage identity for blastp ( Default 95 )
- --OrthoF
  - [--Sprogram (STRING)] Sequence search program, Options: blast, mmseqs, blast_gz, diamond ( Default diamond)
- --ANI
  - [--queryL (FILE)] [Required] The file containing paths to query genomes, one per line ( Default scaf.list )
  - [--refL (FILE)] [Required] The file containing paths to reference genomes, one per line. ( Default scaf.list )
- --VAR
  - [--refgbk (FILE)] [Required] The full path and name of reference genome in GENBANK format ( recommended ), fasta format is also OK. For example: "/mnt/g/test/ref.gbk"
  - [--qualtype (STRING)] [Required] Type of quality values (solexa (CASAVA < 1.3), illumina (CASAVA 1.3 to 1.7), sanger (which is CASAVA >= 1.8)). ( Default sanger )
  - [--qual (INT)] Threshold for trimming based on average quality in a window. ( Default 20 )
  - [--length (INT)] Threshold to keep a read based on length after trimming. ( Default 20 )
  - [--mincov (INT)] The minimum number of reads covering a site to be considered ( Default 10 )
  - [--minfrac (FLOAT)] The minimum proportion of those reads which must differ from the reference ( Default 0.9 )
  - [--minqual (INT)] The minimum VCF variant call "quality" ( Default 100 )
  - [--ram (INT)] Try and keep RAM under this many GB ( Default 8 )
- --AntiRes
  - [--db (STRING)] [Required] The database to use, options: all, argannot, card, ecoh, ecoli_vf, megares, ncbi, plasmidfinder, resfinder and vfdb. ( Default all )
  - [--identity (INT)] [Required] Minimum %identity to keep the result, should be a number between 1 and 100. ( Default 75 )
  - [--coverage (INT)] [Required] Minimum %coverage to keep the result, should be a number between 0 and 100. ( Default 50 )
- --STREE
  - [--seqfile (STRING)] [Required] Path of the sequence file for analysis.
  - [--seqtype (STRING)] [Required] Type of sequence (p, d, c for Protein, DNA, Codons, respectively). ( Default p )
- --pCOG
  - [--evalue (FLOAT)] [Required] Maximum e-value to report alignments, ( Default 1e-3 )
  - [--id (INT)] [Required] Minimum identity% to report an alignment, ( Default 40 )
  - [--query_cover (INT)] [Required] Minimum query cover% to report an alignment, ( Default 70 )
  - [--subject_cover (INT)] [Required] Minimum subject cover% to report an alignment, ( Default 50 )
- --ACC
  - [--Assess (STRING)] Filter short sequences in the genome and assess the status of the genome
Paths of external programs

These options are not needed if the programs are already in the PATH environment variable. Users can check the essential programs with the "--check-external-programs" option.
- [--abricate-bin (PATH)] Path to the abricate binary file. By default, abricate is searched for in PATH.
- [--abyss-bin (PATH)] Path to the abyss binary file. By default, abyss is searched for in PATH.
- [--canu-bin (PATH)] Path to the canu binary file. By default, canu is searched for in PATH.
- [--cd-hit-bin (PATH)] Path to the cd-hit binary file. By default, cd-hit is searched for in PATH.
- [--fastANI-bin (PATH)] Path to the fastANI binary file. By default, fastANI is searched for in PATH.
- [--iqtree-bin (PATH)] Path to the iqtree binary file. By default, iqtree is searched for in PATH.
- [--mafft-bin (PATH)] Path to the mafft binary file. By default, mafft is searched for in PATH.
- [--mash-bin (PATH)] Path to the mash binary file. By default, mash is searched for in PATH.
- [--muscle-bin (PATH)] Path to the muscle binary file. By default, muscle is searched for in PATH.
- [--orthofinder-bin (PATH)] Path to the orthofinder binary file. By default, orthofinder is searched for in PATH.
- [--pal2nal-bin (PATH)] Path to the pal2nal.pl binary file. By default, pal2nal.pl is searched for in PATH.
- [--prodigal-bin (PATH)] Path to the prodigal binary file. By default, prodigal is searched for in PATH.
- [--prokka-bin (PATH)] Path to the prokka binary file. By default, prokka is searched for in PATH.
- [--roary-bin (PATH)] Path to the roary binary file. By default, roary is searched for in PATH.
- [--sickle-bin (PATH)] Path to the sickle-trim binary file. By default, sickle is searched for in PATH.
- [--snippy-bin (PATH)] Path to the snippy binary file. By default, snippy is searched for in PATH.
- [--snp-sites-bin (PATH)] Path to the snp-sites binary file. By default, snp-sites is searched for in PATH.
- [--trimAL-bin (PATH)] Path to the trimAL binary file. By default, trimAL is searched for in PATH.
- [--unicycler-bin (PATH)] Path to the unicycler binary file. By default, unicycler is searched for in PATH.
Setup COG database
- [--setup-COGdb] Users should execute this after first installation of pgcgap.
Check the required external programs (It is strongly recommended that this step be performed after the installation of PGCGAP):

pgcgap --check-external-programs
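
Independently of the built-in check, a quick look at which program names resolve in PATH can help when deciding whether any of the "--*-bin" options are needed. The following is a minimal sketch and not how PGCGAP performs its own check: the function name is hypothetical, and the program names are taken from the option descriptions above, so the actual executable names on a given system may differ.

```shell
# Hypothetical helper: print each given program name that is NOT found in PATH.
missing_programs() {
    for prog in "$@"; do
        # command -v succeeds only if the shell can resolve the name as a command
        if ! command -v "$prog" >/dev/null 2>&1; then
            printf '%s\n' "$prog"
        fi
    done
}

# Names as they appear in the option descriptions (assumed; may differ per install).
missing=$(missing_programs abricate abyss canu cd-hit fastANI iqtree mafft mash \
    muscle orthofinder pal2nal.pl prodigal prokka roary sickle snippy snp-sites \
    trimAL unicycler)

if [ -n "$missing" ]; then
    echo "Not found in PATH:"
    echo "$missing"
else
    echo "All listed programs were found in PATH."
fi
```

Any program reported as missing either needs to be installed or passed explicitly through the matching "--*-bin" option.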

Examples

The example dataset can be downloaded here.

Example 1: Perform all functions, taking Escherichia coli as an example, with 6 strains in total.

Notice: For the sake of flexibility, the "VAR" function is not included in "--All" and has to be added explicitly.

pgcgap --All --platform illumina --filter_length 200 --ReadsPath Reads/Illumina --reads1 _1.fastq.gz --reads2 _2.fastq.gz --suffix_len 11 --kmmer 81 --genus Escherichia --species "Escherichia coli" --codon 11 --PanTree --strain_num 6 --threads 4 --VAR --refgbk /mnt/h/PGCGAP_Examples/Reads/MG1655.gbff --qualtype sanger

Example 2: Genome assembly.

Illumina reads assembly

In this dataset, the reads files are named "strain_1.fastq.gz" and "strain_2.fastq.gz". The string after the strain name is "_1.fastq.gz", and its length is 11, so "--suffix_len" was set to 11.
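
The suffix length can also be computed in the shell instead of being counted by hand. This is a minimal sketch of the arithmetic only; the variable names are illustrative and are not PGCGAP options.

```shell
# Illustrative variables (not PGCGAP options): one reads file name and its strain name.
reads_name="strain_1.fastq.gz"
strain_name="strain"

# --suffix_len = length of the reads file name minus the length of the strain name.
suffix_len=$(( ${#reads_name} - ${#strain_name} ))

# The suffix itself is what remains after removing the strain name from the front.
suffix="${reads_name#"$strain_name"}"

echo "suffix: $suffix  suffix_len: $suffix_len"
```

Replacing the two variables with a real file name and strain name from your own data gives the value to pass to "--suffix_len".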

# Assemble with ABySS
pgcgap --Assemble --platform illumina --assembler abyss --filter_length 200 --ReadsPath Reads/Illumina --reads1 _1.fastq.gz --reads2 _2.fastq.gz --kmmer 81 --threads 4 --suffix_len 11

# Assemble with SPAdes
pgcgap --Assemble --platform illumina --assembler spades --filter_length 200 --ReadsPath Reads/Illumina --reads1 _1.fastq.gz --reads2 _2.fastq.gz --threads 4 --suffix_len 11

# Assemble with AUTO
pgcgap --Assemble --platform illumina --assembler auto --filter_length 200 --ReadsPath Reads/Illumina --reads1 _1.fastq.gz --reads2 _2.fastq.gz --kmmer 81 --threads 4 --suffix_len 11

ONT reads assembly

Oxford Nanopore sequencing usually produces a single reads file, so only "--reads1" needs to be set; here its value is ".fasta". "--genomeSize" is the estimated genome size, which users can look up for similar strains in the NCBI database; it was set to "4.8m" here. The suffix of the reads file is ".fasta" and its length is 6, so "--suffix_len" was set to 6.

pgcgap --Assemble --platform oxford --filter_length 200 --ReadsPath Reads/Oxford --reads1 .fasta --genomeSize 4.8m --threads 4 --suffix_len 6

PacBio reads assembly

PacBio sequencing also usually produces a single reads file, here "pacbio.fastq", and the parameter settings are similar to those for Oxford Nanopore. The suffix of the reads file is ".fastq" and its length is 6, so "--suffix_len" was set to 6.

pgcgap --Assemble --platform pacbio --filter_length 200 --ReadsPath Reads/PacBio --reads1 .fastq --genomeSize 4.8m --threads 4 --suffix_len 6

Hybrid assembly of short reads and long reads

Paired-end short reads and long reads in the directory "Reads/Hybrid/" were used as inputs. The Illumina reads and the long reads must come from the same isolate.

pgcgap --Assemble --platform hybrid --ReadsPath Reads/Hybrid --short1 short_reads_1.fastq.gz --short2 short_reads_2.fastq.gz --long long_reads_high_depth.fastq.gz --threads 4

Example 3: Gene prediction and annotation

pgcgap --Annotate --scafPath Results/Assembles/Scaf/Illumina --Scaf_suffix -8.fa --genus Escherichia --species "Escherichia coli" --codon 11 --threads 4

Example 4: Constructing single-copy core protein tree and core SNPs tree

# Construct phylogenetic tree with FastTree (Quick without best fit model testing)
pgcgap --CoreTree --CDsPath Results/Annotations/CDs --AAsPath Results/Annotations/AAs --codon 11 --strain_num 6 --threads 4 --fasttree

# Construct phylogenetic tree with IQ-TREE (Very slow with best fit model testing, traditional bootstrap, DEFAULT)
pgcgap --CoreTree --CDsPath Results/Annotations/CDs --AAsPath Results/Annotations/AAs --codon 11 --strain_num 6 --threads 4 --bsnum 500

# Construct phylogenetic tree with IQ-TREE (Slow with best fit model testing, ultrafast bootstrap)
pgcgap --CoreTree --CDsPath Results/Annotations/CDs --AAsPath Results/Annotations/AAs --codon 11 --strain_num 6 --threads 4 --fastboot 1000

Example 5: Constructing single-copy core protein tree only.

# Construct phylogenetic tree with FastTree (Quick without best fit model testing)
pgcgap --CoreTree --CDsPath NO --AAsPath Results/Annotations/AAs --codon 11 --strain_num 6 --threads 4 --fasttree

# Construct phylogenetic tree with IQ-TREE (Very slow with best fit model testing, traditional bootstrap, DEFAULT)
pgcgap --CoreTree --CDsPath NO --AAsPath Results/Annotations/AAs --codon 11 --strain_num 6 --threads 4 --bsnum 500

# Construct phylogenetic tree with IQ-TREE (Slow with best fit model testing, ultrafast bootstrap)
pgcgap --CoreTree --CDsPath NO --AAsPath Results/Annotations/AAs --codon 11 --strain_num 6 --threads 4 --fastboot 1000

Example 6: Conduct pan-genome analysis and construct a phylogenetic tree of single-copy core proteins called by roary. Applicable to v1.0.27 and later.

# Construct phylogenetic tree with FastTree (Quick without best fit model testing)
pgcgap --Pan --codon 11 --identi 95 --strain_num 6 --threads 4 --GffPath Results/Annotations/GFF --PanTree --fasttree

# Construct phylogenetic tree with IQ-TREE (Very slow with best fit model testing, traditional bootstrap, DEFAULT)
pgcgap --Pan --codon 11 --identi 95 --strain_num 6 --threads 4 --GffPath Results/Annotations/GFF --PanTree --bsnum 500

# Construct phylogenetic tree with IQ-TREE (Slow with best fit model testing, ultrafast bootstrap)
pgcgap --Pan --codon 11 --identi 95 --strain_num 6 --threads 4 --GffPath Results/Annotations/GFF --PanTree --fastboot 1000

Example 7: Infer orthologous gene groups and construct a phylogenetic tree of single-copy orthologue proteins. Applicable to v1.0.29 and later.

# Construct phylogenetic tree with FastTree (Quick without best fit model testing)
pgcgap --OrthoF --threads 4 --AAsPath Results/Annotations/AAs --fasttree

# Construct phylogenetic tree with IQ-TREE (Very slow with best fit model testing, traditional bootstrap, DEFAULT)
pgcgap --OrthoF --threads 4 --AAsPath Results/Annotations/AAs --bsnum 500

# Construct phylogenetic tree with IQ-TREE (Slow with best fit model testing, ultrafast bootstrap)
pgcgap --OrthoF --threads 4 --AAsPath Results/Annotations/AAs --fastboot 1000

Example 8: Compute whole-genome Average Nucleotide Identity (ANI).

pgcgap --ANI --threads 4 --queryL scaf.list --refL scaf.list --Scaf_suffix .fa

Example 9: Genome and metagenome similarity estimation using MinHash.

pgcgap --MASH --scafPath Results/Assembles/Scaf/Illumina --Scaf_suffix -8.fa

Example 10: Run COG annotation for each strain.

pgcgap --pCOG --threads 4 --strain_num 6 --id 40 --query_cover 70 --subject_cover 50 --AAsPath Results/Annotations/AAs

Example 11: Variant calling and phylogenetic tree construction based on the reference genome.

# Construct phylogenetic tree with IQ-TREE (Very slow with best fit model testing, traditional bootstrap, DEFAULT)
pgcgap --VAR --threads 4 --refgbk /mnt/h/PGCGAP_Examples/Reads/MG1655.gbff --ReadsPath Reads/Illumina --reads1 _1.fastq.gz --reads2 _2.fastq.gz --suffix_len 11 --strain_num 6 --qualtype sanger --bsnum 500

# Construct phylogenetic tree with IQ-TREE (Slow with best fit model testing, ultrafast bootstrap)
pgcgap --VAR --threads 4 --refgbk /mnt/h/PGCGAP_Examples/Reads/MG1655.gbff --ReadsPath Reads/Illumina --reads1 _1.fastq.gz --reads2 _2.fastq.gz --suffix_len 11 --strain_num 6 --qualtype sanger --fastboot 1000

Example 12: Screening of contigs for antimicrobial and virulence genes.

pgcgap --AntiRes --scafPath Results/Assembles/Scaf/Illumina --Scaf_suffix -8.fa --threads 6 --db all --identity 75 --coverage 50

Example 13: Filter short sequences in the genome and assess the status of the genome.

pgcgap --ACC --Assess --scafPath Results/Assembles/Scaf/Illumina --Scaf_suffix -8.fa --filter_length 200

Example 14: Construct a phylogenetic tree based on multiple sequences in one file.

# Construct phylogenetic tree with IQ-TREE (Very slow with best fit model testing, traditional bootstrap, DEFAULT)
pgcgap --STREE --seqfile proteins.fas --seqtype p --bsnum 500 --threads 4

# Construct phylogenetic tree with IQ-TREE (Slow with best fit model testing, ultrafast bootstrap)
pgcgap --STREE --seqfile proteins.fas --seqtype p --fastboot 1000 --threads 4

Generating Input files

Working directory

The directory where the PGCGAP software runs.

Assemble

Paired-end reads of all strains in a directory, or PacBio reads or ONT reads (Default: ./Reads/Illumina/ under the working directory).

Annotate

Genomes files (complete or draft) in a directory (Default: Results/Assembles/Scaf/Illumina under the working directory).

ANI

QUERY_LIST and REFERENCE_LIST files containing full paths to genomes, one per line (default: scaf.list under the working directory). If the “--Assemble” function was run first, the list file will be generated automatically.
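
When the genomes were not produced by the "--Assemble" function, the list file has to be written by hand. The following is a minimal sketch, assuming that the genome files sit in one directory and share a common suffix; the function name is hypothetical and not part of PGCGAP.

```shell
# Hypothetical helper: write the full paths of all genome files in a directory
# that end with the given suffix to a list file, one path per line.
make_scaf_list() {
    scaf_dir=$1     # directory holding the genome files
    scaf_suffix=$2  # e.g. ".fa"
    list_file=$3    # e.g. "scaf.list"

    # Resolve the directory to an absolute path, since full paths are expected.
    abs_dir=$(cd "$scaf_dir" && pwd) || return 1

    : > "$list_file"
    for genome in "$abs_dir"/*"$scaf_suffix"; do
        # Skip the unexpanded pattern when nothing matches.
        [ -e "$genome" ] || continue
        printf '%s\n' "$genome" >> "$list_file"
    done
}

# Usage (the paths are examples):
# make_scaf_list Results/Assembles/Scaf/Illumina .fa scaf.list
```

The same file can then be passed to both "--queryL" and "--refL" for an all-against-all comparison.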

MASH

Genomes files (complete or draft) in a directory (Default: Results/Assembles/Scaf/Illumina under the working directory).

CoreTree

Amino acid files (with ".faa" as the suffix) and nucleotide files (with ".ffn" as the suffix) of each strain, placed in two directories (default: "./Results/Annotations/AAs/" and "./Results/Annotations/CDs/"). The ".faa" and ".ffn" files of the same strain should have the same prefix name. Protein IDs and gene IDs should start with the strain name. Prokka is recommended for generating the input files. If the "--Annotate" function was run first, the files will be generated automatically. If "--CDsPath" was set to "NO", the nucleotide files are not needed.
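
The requirement that each strain has matching ".faa" and ".ffn" files is easy to verify before starting a long run. A minimal sketch, assuming the directory layout described above; the function name is hypothetical and not part of PGCGAP.

```shell
# Hypothetical helper: for every <prefix>.faa in the amino acid directory,
# report the prefix if <prefix>.ffn is missing from the nucleotide directory.
# Returns 0 when every .faa file has a partner, 1 otherwise.
check_faa_ffn_pairs() {
    aas_dir=$1
    cds_dir=$2
    status=0
    for faa in "$aas_dir"/*.faa; do
        [ -e "$faa" ] || continue          # nothing matched the pattern
        prefix=$(basename "$faa" .faa)
        if [ ! -e "$cds_dir/$prefix.ffn" ]; then
            echo "missing nucleotide file for: $prefix"
            status=1
        fi
    done
    return $status
}

# Usage (default paths from this README):
# check_faa_ffn_pairs Results/Annotations/AAs Results/Annotations/CDs
```

The check is only relevant when "--CDsPath" points to a directory; with "--CDsPath NO" the nucleotide files are not read.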

OrthoF

A set of protein sequence files (one per species) in FASTA format under a directory (default: "./Results/Annotations/AAs/"). If the "--Annotate" function was run first, the files will be generated automatically.

Pan

GFF3 files (With “.gff” as the suffix) of each strain placed into a directory. They must contain the nucleotide sequence at the end of the file. All GFF3 files created by Prokka are valid (default: ./Results/Annotations/GFF/). If the “--Annotate” function was run first, the files will be generated automatically.

pCOG

Amino acids file (With “.faa” as the suffix) of each strain placed into a directory (default: ./Results/Annotations/AAs/). If the “--Annotate” function was run first, the files will be generated automatically.

VAR

Paired-end reads of all strains in a directory (default: ./Reads/Over/ under the working directory).
The full path of the reference genome in FASTA or GenBank format (must be provided).

AntiRes

Genomes files (complete or draft) in a directory (Default: Results/Assembles/Scaf/Illumina under the working directory).

STREE

A file containing multiple sequences in FASTA format; the sequences can be protein, DNA or codon sequences.

Output Files

Assemble

Results/Assembles/Illumina/ : Directories containing the Illumina assembly files and information of each strain.
Results/Assembles/PacBio/ : Directories containing the PacBio assembly files and information of each strain.
Results/Assembles/Oxford/ : Directories containing the ONT assembly files and information of each strain.
Results/Assembles/Hybrid/ : Directory containing the hybrid assembly files of the short reads and long reads of the same strain.
Results/Assembles/Scaf/Illumina/ : Directory containing the Illumina contigs/scaffolds of all strains. "*.filtered.fas" is the genome after excluding short sequences. "*.prefilter.stats" describes the stats of the genome before filtering, and "*.filtered.stats" describes the stats of the genome after filtering.
Results/Assembles/Scaf/Oxford/ : Directory containing the ONT contigs/scaffolds of all strains.
Results/Assembles/Scaf/PacBio/ : Directory containing the PacBio contigs/scaffolds of all strains.

Annotate

Results/Annotations/*_annotation/ : Directories containing the annotation files of each strain.
Results/Annotations/AAs/ : Directory containing the amino acid sequences of all strains.
Results/Annotations/CDs/ : Directory containing the nucleotide sequences of all strains.
Results/Annotations/GFF/ : Directory containing the master annotation of all strains in GFF3 format.

ANI

Results/ANI/ANIs : File containing the comparison information of genome pairs. It has five columns: query genome, reference genome, ANI value, count of bidirectional fragment mappings, and total query fragments.
Results/ANI/ANIs.matrix : File with identity values arranged in a phylip-formatted lower triangular matrix.
Results/ANI/ANIs.heatmap : An ANI matrix of all strains.
Results/ANI/ANI_matrix.pdf : The heatmap plot of "ANIs.heatmap".
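
Because "Results/ANI/ANIs" is a plain five-column table, it can be filtered with standard tools. The sketch below lists genome pairs whose ANI value reaches a chosen threshold and skips self-comparisons. It assumes that the columns are separated by whitespace and that the genome paths contain no spaces; the function name is hypothetical, and the threshold of 95 in the usage line is only an example value.

```shell
# Hypothetical helper: print query, reference and ANI value for every pair whose
# ANI (third column) is at least the given threshold, ignoring self-comparisons.
ani_pairs_at_least() {
    ani_file=$1
    threshold=$2
    awk -v min="$threshold" \
        '$1 != $2 && ($3 + 0) >= (min + 0) { print $1, $2, $3 }' "$ani_file"
}

# Usage (the threshold is an example value):
# ani_pairs_at_least Results/ANI/ANIs 95
```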

MASH

Results/MASH/MASH : The pairwise distances between genomes. The columns are Reference-ID, Query-ID, Mash-distance, P-value, and Matching-hashes.
Results/MASH/MASH2 : The pairwise similarities between genomes. The columns are Reference-ID, Query-ID, similarity, P-value, and Matching-hashes.
Results/MASH/MASH.heatmap : A similarity matrix of all genomes.
Results/MASH/MASH_matrix.pdf : A heatmap plot of "MASH.heatmap".

CoreTree

Results/CoreTrees/ALL.core.protein.fasta : Concatenated and aligned sequences file of single-copy core proteins.
Results/CoreTrees/ALL.core.protein.nwk : The phylogenetic tree file of single-copy core proteins for all strains constructed by FastTree.
Results/CoreTrees/ALL.core.protein.fasta.gb.treefile : The phylogenetic tree file of single-copy core proteins for all strains constructed by IQ-TREE.
Results/CoreTrees/faa2ffn/ALL.core.nucl.fasta : Concatenated and aligned sequences file of single-copy core genes.
Results/CoreTrees/ALL.core.snp.fasta : Core SNPs of single-copy core genes in fasta format.
Results/CoreTrees/ALL.core.snp.fasta.treefile : The phylogenetic tree file of SNPs of single-copy core genes for all strains constructed by IQ-TREE.
Results/CoreTrees/"Other_files" : Intermediate directories and files.

OrthoF

Results/OrthoFinder/Results_orthoF : Same as the OrthoFinder outputs.
Results/OrthoFinder/Results_orthoF/Single_Copy_Orthologue_Tree/ : Directory containing the phylogenetic tree files based on single-copy orthologue sequences.
Results/OrthoFinder/Results_orthoF/Single_Copy_Orthologue_Tree/Single.Copy.Orthologue.nwk : Phylogenetic tree constructed by FastTree.
Results/OrthoFinder/Results_orthoF/Single_Copy_Orthologue_Tree/Single.Copy.Orthologue.fasta.gb.treefile : Phylogenetic tree constructed by IQ-TREE.

Pan

Results/PanGenome/Pangenome_Pie.pdf : A 3D pie chart and a fan chart of the breakdown of genes and the number of isolates they are present in.
Results/PanGenome/pangenome_frequency.pdf : A graph with the frequency of genes versus the number of genomes.
Results/PanGenome/Pangenome_matrix.pdf : A figure showing the tree compared to a matrix with the presence and absence of core and accessory genes.
Results/PanGenome/Core/Roary.core.protein.fasta : Alignments of single-copy core proteins called by roary.
Results/PanGenome/Core/Roary.core.protein.nwk : A phylogenetic tree of Roary.core.protein.fasta constructed by FastTree.
Results/PanGenome/Core/Roary.core.protein.fasta.gb.treefile : A phylogenetic tree of Roary.core.protein.fasta constructed by IQ-TREE.
Results/PanGenome/Other_files : See the roary outputs.

pCOG

*.COG.xml, *.2gi.table, *.2id.table, *.2Sid.table : Intermediate files.
*.2Scog.table : The super COG table of each strain.
*.2Scog.table.pdf : A plot of the super COG table in pdf format.
All_flags_relative_abundances.table : A table containing the relative abundance of each flag for all strains.

VAR

Results/Variants/directory-named-in-strains

directories containing substitutions (snps) and insertions/deletions (indels) of each strain. See Snippy outputs for detail.
Results/Variants/Core

The directory containing SNP phylogeny files.
- core.aln : A core SNP alignment includes only SNP sites.
- core.full.aln : A whole genome SNP alignment (includes invariant sites).
- core.aln.treefile : Phylogenetic tree of the core SNP alignment based on the best-fit model of evolution selected using IQ-TREE (ignoring possible recombination).
- core.aln.iqtree : The best-fit model of evolution selected using IQ-TREE can be found in this file.
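
The difference between core.aln and core.full.aln can be made concrete with a small sketch. The Python below is illustrative only and is not PGCGAP's own code: it takes a toy whole-genome alignment, keeps the variable columns (roughly what SNP-sites does), and counts the invariant A, C, G, and T columns, which is the kind of information IQ-TREE's -fconst option takes. The function name split_alignment and the toy sequences are hypothetical, and columns containing gaps or ambiguous bases are simply skipped, which is a simplification.

```python
from collections import Counter

DNA = set("ACGT")


def split_alignment(seqs):
    """Return (snp_columns, constant_counts) for equal-length aligned sequences.

    snp_columns: one string per input sequence, holding only variable columns.
    constant_counts: Counter of bases seen in invariant A/C/G/T columns.
    """
    if len({len(s) for s in seqs}) != 1:
        raise ValueError("all sequences must have the same aligned length")

    variable = []
    constant = Counter()
    for i, column in enumerate(zip(*seqs)):
        bases = {b.upper() for b in column}
        if len(bases) == 1:
            base = next(iter(bases))
            if base in DNA:
                constant[base] += 1  # invariant site
        elif bases <= DNA:
            variable.append(i)  # SNP site with unambiguous bases only
    snp_columns = ["".join(s[i] for i in variable) for s in seqs]
    return snp_columns, constant


# Toy "full" alignment (hypothetical data, three strains, six sites)
full = ["ACGTAC", "ACGTTC", "ACCTAC"]
snps, const = split_alignment(full)
print(snps)                                      # ['GA', 'GT', 'CA']
print(",".join(str(const[b]) for b in "ACGT"))   # 1,2,0,1
```

The comma-separated counts are printed in A, C, G, T order. Check the IQ-TREE documentation for the exact -fconst syntax before using such counts in a real run.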

AntiRes

Results/AntiRes/*.tab : Screening results of each strain.
Results/AntiRes/summary.txt : A matrix of gene presence/absence for all strains.
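
As a rough illustration of how the presence/absence matrix might be consumed downstream, the sketch below reads a summary table into a dictionary of strain to detected genes. It assumes the file follows the layout of an Abricate summary (tab-separated, a header beginning with #FILE and NUM_FOUND, and "." marking an absent gene); that layout, the function name read_presence_absence, and the example gene names are assumptions made for this example, so verify them against your own summary.txt.

```python
import csv
import io


def read_presence_absence(handle):
    """Map each strain (first column) to the set of genes reported as present.

    Assumes an Abricate-style summary: tab-separated, header row starting
    with '#FILE' and 'NUM_FOUND', and '.' marking an absent gene.
    """
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    genes = header[2:]  # columns after #FILE and NUM_FOUND are gene names
    presence = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        presence[row[0]] = {g for g, v in zip(genes, row[2:]) if v != "."}
    return presence


# Hypothetical two-strain summary used only to exercise the parser
example = (
    "#FILE\tNUM_FOUND\tgeneA\tgeneB\n"
    "strain1.tab\t1\t100.00\t.\n"
    "strain2.tab\t2\t98.50\t100.00\n"
)
table = read_presence_absence(io.StringIO(example))
print(sorted(table["strain2.tab"]))  # ['geneA', 'geneB']
```

For a real run, open Results/AntiRes/summary.txt and pass the file handle instead of the in-memory example.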

STREE

Results/STREE/*.aln : Aligned sequences.
Results/STREE/*.aln.gb : Conserved blocks of the aligned sequences.
Results/STREE/*.aln.gb.treefile : The final phylogenetic tree.
Results/STREE/*.aln.gb.iqtree : Log of IQ-TREE.

License

PGCGAP is free software, licensed under GPLv3.

Feedback and Issues

Please report any issues to the issues page or email us at liaochenlanruo@webmail.hzau.edu.cn.

Citation

If you use this software please cite: Liu H, Xin B, Zheng J, Zhong H, Yu Y, Peng D, Sun M. Build a bioinformatics analysis platform and apply it to routine analysis of microbial genomics and comparative genomics. Protocol Exchange, 2022. DOI: 10.21203/rs.2.21224/v6
If you use "--Assemble", please also cite one or two of Fastp, ABySS, SPAdes, Canu, or Unicycler.
If you use "--Annotate", please also cite Prokka.
If you use "--CoreTree", please also cite CD-HIT, MAFFT, PAL2NAL, trimAL, IQ-TREE or FastTree, and SNP-sites.
If you use "--Pan", please also cite Roary, MAFFT, trimAL, IQ-TREE or FastTree.
If you use "--OrthoF", please also cite OrthoFinder, MAFFT, trimAL, IQ-TREE or FastTree.
If you use "--ANI", please also cite fastANI.
If you use "--MASH", please also cite Mash.
If you use "--VAR", please also cite Sickle, Snippy, Gubbins, IQ-TREE, and SnpEff.
If you use "--AntiRes", please also cite Abricate and the corresponding database you used: NCBI AMRFinderPlus, CARD, Resfinder, ARG-ANNOT, VFDB, PlasmidFinder, EcOH, or MEGARES 2.00.
If you use "--STREE", please also cite Muscle, trimAL, and IQ-TREE.

FAQ

Q1 VAR function failed to produce annotated VCFs and Core results

Check the log file named "strain_name.log" in the Results/Variants/<strain_name>/ directory. If it contains a sentence like "WARNING: All frames are zero! This seems rather odd, please check that 'frame' information in your 'genes' file is accurate.", the failure is a snpEff error. Installing JDK 8 solves this problem:

conda install java-jdk=8.0.112

Click here for more solutions.

Q2 Could not determine version of minced please install version 2 or higher

When running the Annotate function, the following error may occur:

Error: A JNI error has occurred, please check your installation and try again
Exception in thread "main" java.lang.UnsupportedClassVersionError: minced has been compiled by a more recent version of the Java Runtime (class file version 55.0), this version of the Java Runtime only recognizes class file versions up to 52.0
    at java.lang.ClassLoader.defineClass1(Native Method)
    at java.lang.ClassLoader.defineClass(ClassLoader.java:763)
    at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
    at java.net.URLClassLoader.defineClass(URLClassLoader.java:468)
    at java.net.URLClassLoader.access$100(URLClassLoader.java:74)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:369)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:363)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:362)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    at sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:495)
[01:09:40] Could not determine version of minced - please install version 2.0 or higher

Downgrading minced to version 0.3 solves this problem:

conda install minced=0.3

Click here for details.
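
To read the error message above: Java class file major versions map to Java releases by subtracting 44, so class file version 55.0 means minced was built for Java 11, while a runtime that only accepts versions up to 52.0 is Java 8. The sketch below extracts both numbers from such a message. The function names are hypothetical and the regular expressions are written only against the wording shown in this FAQ entry.

```python
import re


def java_release(class_major):
    """Java release corresponding to a class file major version (major - 44)."""
    if class_major < 45:
        raise ValueError("class file major versions start at 45")
    return class_major - 44


def explain_class_version_error(message):
    """Return (release needed by the tool, release of the active runtime).

    Returns None when the message does not match the expected wording.
    """
    needed = re.search(r"class file version (\d+)\.\d+", message)
    supported = re.search(r"versions up to (\d+)\.\d+", message)
    if not (needed and supported):
        return None
    return (
        java_release(int(needed.group(1))),
        java_release(int(supported.group(1))),
    )


msg = (
    "minced has been compiled by a more recent version of the Java Runtime "
    "(class file version 55.0), this version of the Java Runtime only "
    "recognizes class file versions up to 52.0"
)
print(explain_class_version_error(msg))  # (11, 8)
```

This explains why the newer minced build fails under JDK 8 and why installing the older minced release, as described above, avoids the mismatch.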

Q3 dyld: Library not loaded: @rpath/libcrypto.1.0.0.dylib

This error may occur when running the "VAR" function on macOS. It is an OpenSSL error and can be resolved as follows:

# First, install Homebrew if it is not already installed
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

# Install openssl with brew
brew install openssl

# Create symbolic links for the libraries
ln -s /usr/local/opt/openssl/lib/libcrypto.1.0.0.dylib /usr/local/lib/

ln -s /usr/local/opt/openssl/lib/libssl.1.0.0.dylib /usr/local/lib/

Click here for more information.

Q4 Use of uninitialized value in require at Encode.pm line 61

This warning may appear when running the "Pan" function. It comes from Roary: line 61 of Encode.pm reads "require Encode::ConfigLocal;". Users can ignore the warning. Click here for details.

Updates

V1.0.3
- Updated ANI function.
V1.0.4
- Add parallel for function "pCOG".
- Optimized drawing of ANI heat map.
V1.0.5
- Fixed a bug in the input of Gubbins.
V1.0.6
- Modified CoreTree to construct the protein tree and the SNP tree separately.
V1.0.7
- Split Assemble and Annotate into two functions.
- Added third-generation genome assembly function.
- Changed the default parameters of the CoreTree function (aS 0.8 to 0.7 and aL 0.8 to 0.5).
- Changed the name of function "COG" to "pCOG".
- Fixed the sorting bug for ANI heat map.
V1.0.8
- Add the "MASH" function to compute genome distance and similarity using MinHash.
V1.0.9
- The function of constructing a single-copy core protein phylogenetic tree was added to "Pan".
- Fixed a bug of plot_3Dpie.R, optimized image display, and added a fan chart.
- Fixed a bug for plotting the ANI matrix.
V1.0.10
- Added the "AntiRes" function to screen contigs for antimicrobial and virulence genes.
V1.0.11
- Users can now choose "abyss" or "spades" for Illumina read assembly.
- New support for hybrid assembly of paired-end short reads and long reads.
- Added selection of the best-fit model of evolution for DNA and protein alignments before constructing a phylogenetic tree.
- Optimized display of help information. Users can check the parameters of each module with the command "pgcgap [Assemble|Annotate|ANI|AntiRes|CoreTree|MASH|OrthoF|Pan|pCOG|VAR]", and can look up examples for each module with the command "pgcgap Examples".
V1.0.12
- Added automatic mode for illumina genome assembly. First, PGCGAP calls "ABySS" for genome assembly. When the assembled N50 is less than 50,000, it automatically calls "SPAdes" to try multiple parameters for assembly.
- Added ability to filter short sequences of assembled genomes.
- Added function of genome assembly status assessment.
- Modified the drawing script of ANI and MASH modules so that it can automatically adjust the font size according to the number of samples.
V1.0.13
- Fixed the "running error" bug of function "Assess" in module "ACC".
- Added module "STREE" for constructing a phylogenetic tree based on multiple sequences in one file.
V1.0.14
- The relative_abundances of flags among strains will not be called while the strain number is less than two.
- Fixed the error of function "Assess" in module "ACC".
V1.0.15
- When the number of threads set by the user exceeds the number of threads owned by the system, PGCGAP will automatically adjust the number of threads to avoid program crash.
- Added a FASTQ preprocessor before Illumina genome assembly: adapter trimming, polyG tail trimming of Illumina NextSeq/NovaSeq reads, quality filtering (Q value filtering, N base filtering, sliding window filtering), and length filtering.
V1.0.16
- Reduced the number of Racon polishing rounds for better speed when performing genome assembly.
- Force overwriting existing output folder when running "Annotate" analysis to avoid program crash.
V1.0.17
- Fixed a bug where the program could not return to the working directory after genome annotation.
- Added scripts to check if there were single-copy core proteins found while running module "CoreTree".
- Modified the help message.
V1.0.18
- Updated the downloading link of COG database.
- Users can choose the number of threads used for running module "STREE".
V1.0.19
- Can resume from break-point when downloading the COG database.
- Fixed a bug that failed to create multi-level directories.
V1.0.20
- Fixed a little bug (path error) of module "VAR".
- Fixed a little bug of module "CoreTree" to avoid the interference of special characters in sequence ID to the program.
V1.0.21
- Changed the default search program of module "OrthoF" from "blast" to "diamond".
- Fixed a bug of module "pCOG" to output the right figure.
V1.0.22
- The drawing function of module "ANI" and module "MASH" has been enhanced, including automatic adjustment of font size and legend according to the size of the picture.
- Fixed a bug in module "ANI" where, in previous versions, no heatmap was drawn when the ANI matrix contained "NA".
- When the ANI value or genome similarity is greater than 95%, an asterisk (*) will be drawn in the corresponding cell of the heatmap.
V1.0.23
- The "--Assess" function of module "ACC" was enhanced to (1) generate a summary file containing the status of all genomes (before and after the short sequence filtering), (2) auto move the low-quality genomes (that is genomes with N50 length less than 50 k) to a directory, and others to another directory.
V1.0.24
- Fixed a little bug of module "Pan" to avoid the interference of special characters (>) in sequence ID to the program.
V1.0.25
- Gblocks was used to eliminate poorly aligned positions and divergent regions of an alignment of DNA or protein sequences in module "CoreTree" and "Pan".
- The parameter "--identi" was added into module "Pan" to allow users to set the minimum percentage identity for blastp.
V1.0.26
- Adjusted the font size with the variation of genome number and the string length of the genome name when plotting the heat map of module "ANI" and "MASH".
- Two heat maps are provided when performing the "ANI" and "MASH" analyses: one with stars (a star means the similarity of the two genomes is larger than 95%) and one without.
V1.0.27
- The Amino Acid files are no longer needed when performing the Pan-genome analysis with module Pan.
V1.0.28
- Users can check and install the latest version of PGCGAP by the command "pgcgap --check-update".
- Update module Assemble to allow polish after the assembly of PacBio and ONT data.
- Update module pCOG to adjust the latest database of COG 2020.
- Optimized the drawing and color scheme of the module pCOG.
- Fixed the parameter "CoreTree" in the module Pan to avoid program termination caused by the ">" in non-sequence lines.
V1.0.29
- Function added to module OrthoF: Phylogenetic tree can be constructed automatically with the Single Copy Orthologue Sequences called by module OrthoF.
- Fixed the "permission denied" error when moving directories on the WSL platform.
V1.0.30
- Replace Gblocks with trimAL to trim MSA (module CoreTree, Pan, STREE, and OrthoF).
- Replaced Modeltest-ng and Raxml-ng with IQ-TREE (module CoreTree, Pan, OrthoF, and VAR).
- Added the option of using fasttree to build phylogenetic tree (module CoreTree, Pan, and OrthoF).
V1.0.31
- The default replicates for bootstrap testing of IQ-TREE was set to 500.
- Add the method for phylogenetic tree constructing with ultrafast bootstrap of IQ-TREE.
- Prevent the log from being written to the tree file generated by FastTree.
V1.0.32
- A more colorful version, try "pgcgap Examples" to have a look.
- Updated module AntiRes: the parameter --db had been modified to add choices of "all" and "megares".
- A little optimization of module VAR.
- Replaced conda with mamba to update PGCGAP more quickly.
V1.0.33
- Updated module CoreTree: Run IQ-TREE with the correct number of constant sites when constructing the single-copy core SNPs tree.
- Updated module VAR: Use "SNP-sites" and "IQ-TREE -fconst" to generate SNP sites from the "core.full.aln" and construct the phylogenetic tree.
- Updated module pCOG: Replace blast with diamond to speed up analysis.
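
The N50 threshold of 50,000 mentioned in the V1.0.12 and V1.0.23 notes can be illustrated with a short sketch. The code below uses the common definition of N50 (the length of the contig at which the running total of contigs, sorted from longest to shortest, first reaches half of the assembly size). How PGCGAP computes N50 internally is not shown in this document, so the function n50, the constant name, and the contig lengths are assumptions for illustration.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths (common definition)."""
    if not lengths:
        raise ValueError("at least one contig length is required")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:  # reached at least half of the assembly
            return length


N50_THRESHOLD = 50_000  # value quoted in the update notes above

contigs = [100_000, 40_000, 10_000]  # hypothetical contig lengths
value = n50(contigs)
print(value, value >= N50_THRESHOLD)  # 100000 True
```

Under this definition, an assembly made of many short contigs yields a low N50 and would fall below the threshold.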
