____ ____ ____ ____ _ ____
U| _"\ u U /"___|u U /"___| U /"___|u U /"\ u U| _"\ u
\| |_) |/ \| | _ / \| | u \| | _ / \/ _ \/ \| |_) |/
| __/ | |_| | | |/__ | |_| | / ___ \ | __/
|_| \____| \____| \____| /_/ \_\ |_|
||>>_ _)(|_ _// \\ _)(|_ \\ >> ||>>_
(__)__) (__)__) (__)(__) (__)__) (__) (__) (__)__)
PGCGAP is a pipeline for prokaryotic comparative genomics analysis. It can take the pair-end reads as input. In addition to genome assembly, gene prediction and annotation, it can also get common comparative genomics analysis results such as phylogenetic trees of single-core proteins and core SNPs, pan-genome, whole-genome Average Nucleotide Identity (ANI), orthogroups and orthologs, COG annotations, substitutions (snps) and insertions/deletions (indels) with only one line of commands.
The software was only tested on Linux64 platform, The MacOS may also be good in theory. However, Windows could not be supported. Because this software relies on a large number of other softwares, so it is recommended to install with Bioconda. The main program and most of other dependencies can be installed with one command as shown below, but the "Gubbins" should be installed separately because of it relys on python3, while PGCGAP relys on python2.
Step1: Install Gubbins
$conda install gubbins
Step2: Install PGCGAP
$conda create -n pgcgap python=2.7 anaconda
$conda activate pgcgap
$conda install pgcgap # By command "whereis pgcgap", users can find the installation path of pgcgap, and then you should add it into your environment variable.
$conda deactivate
Step3: Setup COG database (Users should execute this after first installation of pgcgap)
$pgcgap --setup-COGdb
$pgcgap --help
$pgcgap [Fuctions] [options]
$pgcgap --setup-COGdb
FUCTIONS:
[–All] Perform all functions with one command
[–Assemble] Assemble reads ( paired-end only ) into contigs, predict genes and annotate them
[–CoreTree] Construct single-core proteins tree and SNPs tree of single core genes
[–Pan] Run "roary" pan genome pipeline with gff3 files
[–OrthoF] Identify orthologous protein sequence families with "OrthoFinder"
[–ANI] Compute whole-genome Average Nucleotide Identity ( ANI )
[–COG] Run COG annotation for each strain (*.faa), and generate a table containing the relative abundance of each flag for all strains
[–VAR] Rapid haploid variant calling and core genome alignment with "Snippy"
Global Options:
[–strain_num (INT)] [Required by "–All", "–CoreTree" and "–VAR"] The total number of strains used for analysis, including reference genomes
[–ReadsPath (PATH)] [Required by "–All", "–Assemble" and "–VAR"] Reads of all strains as file paths ( Default ./Reads )
[–AAsPath (PATH)] [Required by "–CoreTree" and "–COG"] Amino acids of all strains as fasta file paths, ( Default "./Results/Annotations/AAs" )
[–reads1 (STRING)] [Required by "–All", "–Assemble" and "–VAR"] The suffix name of reads 1 ( for example: if the name of reads 1 is "YBT-1520_L1_I050.R1.clean.fastq.gz", "YBT-1520" is the strain same, so the suffix name should be ".R1.clean.fastq.gz")
[–reads2 (STRING)] [Required by "–All", "–Assemble" and "–VAR"] The suffix name of reads 2( for example: if the name of reads 2 is "YBT-1520_2.fq", the suffix name should be "_2.fq" )
[–codon (INT)] [Required by "–All", "–Assemble", "–CoreTree" and "–Pan"] Translation table ( Default 11 )
[–suffix_len (INT)] [Required by "–All", "–Assemble" and "–VAR"] (Strongly recommended) The suffix length of the reads, that is the length of your reads name minus the length of your strain name. For example the –suffix_len of "YBT-1520_L1_I050.R1.clean.fastq.gz" is 26 ( "YBT-1520" is the strain name ) ( Default 0 )
[–logs (STRING)] Name of the log file ( Default Logs.txt )
[–threads (INT)] Number of threads to be used ( Default 4 )
Local Options:
[–kmmer (INT)] [Required] k-mer size for genome assembly ( Default 81 )
[–genus (STRING)] Genus name of your strain ( Default "NA" )
[–species (STRING)] Species name of your strain ( Default "NA")
–CoreTree
[–CDsPath (PATH)] [Required] CDs of all strains as fasta file paths, ( Default "./Results/Annotations/CDs" )
[-c (FLOAT)] Sequence identity threshold, ( Default 0.5)
[-n (INT)] Word_length, see user's guide of CD-HIT for choosing it ( Default 2 )
[-G (INT)] Use global (set to 1) or local (set to 0) sequence identity, ( Default 0 )
[-t (INT)] Tolerance for redundance ( Default 0 )
[-aL (FLOAT)] Alignment coverage for the longer sequence. If set to 0.9, the alignment must covers 90% of the sequence ( Default 0.9 )
[-aS (FLOAT)] Alignment coverage for the shorter sequence. If set to 0.9, the alignment must covers 90% of the sequence ( Default 0.9 )
[-g (INT)] If set to 0, a sequence is clustered to the first cluster that meet the threshold (fast cluster). If set to 1, the program will cluster it into the most similar cluster that meet the threshold (accurate but slow mode, Default 1)
[-d (INT)] length of description in .clstr file. if set to 0, it takes the fasta defline and stops at first space ( Default 0 )
–Pan
–orthoF
–ANI
[–queryL (FILE)] [Required] The file containing paths to query genomes, one per line ( Default scaf.list )
[–refL (FILE)] [Required] The file containing paths to reference genomes, one per line. ( Default scaf.list )
[–ANIO (FILE)] The name of output file ( Default "Results/ANI/ANIs" )
–VAR
[–refgbk (FILE)] [Required] The full path and name of reference genome in GENBANK format ( recommended ), fasta format is also OK. For example: "/mnt/g/test/ref.gbk"
[–qualtype (STRING)] [Required] Type of quality values (solexa (CASAVA < 1.3), illumina (CASAVA 1.3 to 1.7), sanger (which is CASAVA >= 1.8)). ( Default sanger )
[–qual (INT)] Threshold for trimming based on average quality in a window. ( Default 20 )
[–length (INT)] Threshold to keep a read based on length after trimming. ( Default 20 )
[–mincov (INT)] The minimum number of reads covering a site to be considered ( Default 10 )
[–minfrac (FLOAT)] The minimum proportion of those reads which must differ from the reference ( Default 0.9 )
[–minqual (INT)] The minimum VCF variant call "quality" ( Default 100 )
[–ram (INT)] Try and keep RAM under this many GB ( Default 8 )
[–tree_builder (STRING)] Application to use for tree building [raxml|fasttree|hybrid] ( Default fasttree)
[–iterations (INT)] Maximum No. of iterations for gubbins ( Default 5 )
Paths of external programs
Not needed if they were in the environment variables path. Users can check with the "–check-external-programs" option for the essential programs.
[–abyss-bin (PATH)] Path to abyss binary file. Default tries if abyss is in PATH;
[–prodigal-bin (PATH)] Path to prodigal binary file. Default tries if prodigal is in PATH;
[–prokka-bin (PATH)] Path to prokka binary file. Default tries if prokka is in PATH;
[–cd-hit-bin (PATH)] Path to cd-hit binary file. Default tries if cd-hit is in PATH;
[–mafft-bin (PATH)] Path to mafft binary file. Default tries if mafft is in PATH;
[–fasttreeMP-bin (PATH)] Path to the fasttreeMP binary file. Default tries if fasttreeMP is in PATH;
[–pal2nal-bin (PATH)] Path to the pal2nal.pl binary file. Default tries if pal2nal.pl is in PATH;
[–snp-sites-bin (PATH)] Path to the snp-sites binary file. Default tries if snp-sites is in PATH;
[–roary-bin (PATH)] Path to the roary binary file. Default tries if roary is in PATH;
[–orthofinder-bin (PATH)] Path to the orthofinder binary file. Default tries if orthofinder is in PATH;
[–fastANI-bin (PATH)] Path to the fastANI binary file. Default tries if fastANI is in PATH;
[–gubbins-bin (PATH)] Path to the run_gubbins.py binary file. Default tries if run_gubbins.py is in PATH;
[–snippy-bin (PATH)] Path to the snippy binary file. Default tries if snippy is in PATH;
[–sickle-bin (PATH)] Path to the sickle-trim binary file. Default tries if sickle is in PATH.
Setup COG database
$pgcgap --check-external-programs
Example 1: Perform all functions, take the bacillus thuringiensis as an example, total 4 strains for analysis.
Notice: For the sake of flexibility, The "VAR" function needs to be added additionally.
$pgcgap --All --ReadsPath <PATH> --reads1 .R1.clean.fastq.gz --reads2 .R2.clean.fastq.gz --suffix_len <INT> --kmmer 81 --genus bacillus --species thuringiensis --codon 11 --strain_num 4 --threads 4 --VAR --refgbk <FILE> --qualtype <STRING>
Example 2: Conduct reads assembly, gene predition and annotation.
$pgcgap --Assemble --ReadsPath <PATH> --reads1 .R1.clean.fastq.gz --reads2 .R2.clean.fastq.gz --kmmer 81 --genus bacillus --species thuringiensis --codon 11 --threads 4
Example 3: Constructing single core protein tree and core SNPs tree.
$pgcgap --CoreTree --CDsPath <PATH> --AAsPath <PATH> --codon 11 --strain_num 3 --threads 4
Example 4: Conduct pan-genome analysis.
$pgcgap --Pan --codon 11 --threads 4 --GffPath <PATH>
Example 5: Inference of orthologous gene groups.
$pgcgap --orthoF --threads 4 --AAsPath <PATH>
Example 6: Compute whole-genome Average Nucleotide Identity (ANI).
$pgcgap --ANI --threads 4 --queryL <FILE> --refL <FILE> --ANIO <FILE>
Example 7: Run COG annotation for each strain.
$pgcgap --COG --threads 4 --AAsPath <PATH>
Example 8: Varients calling and phylogenetic tree construction based on reference genome.
$pgcgap --VAR --threads 4 --refgbk <FILE> --ReadsPath <PATH> --reads1 .R1.clean.fastq.gz --reads2 .R2.clean.fastq.gz --suffix_len <INT> --strain_num <INT> --qualtype <STRING>
The directory where the PGCGAP software runs.
Pair-end reads of all strains in a directory (default: ./Reads/ under the working directory).
QUERY_LIST and REFERENCE_LIST files containing full paths to genomes, one per line (default: scaf.list under the working directory). If the "–Assemble" fuction was run first, the list file will be generated automatically.
Amino acids file (With ".faa" as the suffix) and nucleotide (With ".ffn" as the suffix) file of each strain placed into two directorys (default: ./Results/Annotations/AAs/ and ./Results/Annotations/CDs/). If the "–Assemble" fuction was run first, the files will be generated automatically.
A set of protein sequence files (one per species) in FASTA format under a directory (default: ./Results/Annotations/AAs/). If the "–Assemble" fuction was run first, the files will be generated automatically.
GFF3 files (With ".gff" as the suffix) of each strain placed into a directory. They must contain the nucleotide sequence at the end of the file. All GFF3 files created by Prokka are valid (default: ./Results/Annotations/GFF/). If the "–Assemble" fuction was run first, the files will be generated automatically.
Amino acids file (With ".faa" as the suffix) of each strain placed into a directory (default: ./Results/Annotations/AAs/). If the "–Assemble" fuction was run first, the files will be generated automatically.
Pair-end reads of all strains in a directory (default: ./Reads/Over/ under the working directory).
The full path of reference genome in fasta format or genebank format (must be provided).
Results/Assembles/*_assembly
Directorys contain assembly files and information of each strain.
Results/Assembles/Scaf
Directory contain contig/scaffold of all strains.
Results/Annotations/*_annotation
Directorys contain annotation files of each strain.
Results/Annotations/AAs
Directory contain amino acids sequences of all strains.
Results/Annotations/CDs
Directory contain nucleotide sequences of all strains.
Results/Annotations/GFF
Directory contain the master annotation of all strains in GFF3 format.
Results/ANI/ANIs
query genome, reference genome, ANI value, count of bidirectional fragment mappings, total query fragments
Results/ANI/ANIs.matrix
file with identity values arranged in a phylip-formatted lower triangular matrix
Results/ANI/ANIs.heatmap
An ANI matrix of all strains
Results/ANI/ANI_matrix.pdf
The heatmap plot of "ANIs.heatmap"
Results/CoreTrees/faa/ALL.core.protein.fast
Concatenated and aligned sequences file of single-core proteins.
Results/CoreTrees/faa2ffn/ALL.core.nucl.fasta
Concatenated and aligned sequences file of single-core genes.
Results/CoreTrees/faa2ffn/ALL.core.snp.fasta
Core SNPs of single-core genes in fasta format.
Results/CoreTrees/ALL.core.protein.nwk
The phylogenetic tree file of single-copy proteins for all strains.
Results/CoreTrees/faa2ffn/ALL.core.snp.nwk (or gubbins.core.snp.final_tree.tre)
The phylogenetic tree file of SNPs of single-copy genes for all strains.
Results/CoreTrees/faa2ffn/gubbins.*
See gubbins outputs
Results/CoreTrees/"Other_files"
Intermediate directorys and files.
Results/PanGenome/Pangenome_Pie.pdf
An 3D pie chart of the breakdown of genes and the number of isolate they are present in
Results/PanGenome/pangenome_frequency.pdf
A graph with the frequency of genes versus the number of genomes
Results/PanGenome/Pangenome_matrix.pdf
A figure showing the tree compared to a matrix with the presence and absence of core and accessory genes
Results/PanGenome/Other_files
see roary outputs
*.COG.xml, *.2gi.table, *.2id.table, *.2Sid.table
Intermediate files
*.2Scog.table
The super COG table of each strain
*.2Scog.table.pdf
A plot of super COG table in pdf format
All_flags_relative_abundances.table A table containing the relative abundance of each flag for all strains
Results/Varients/directory-named-in-strains
Directorys containing substitutions (snps) and insertions/deletions (indels) of each strain. See Snippy outputs for detail.
Results/Varients/Core
The directory containing Core SNP phylogeny files
PGCGAP is free software, licensed under GPLv3.
Please report any issues to the issues page or email us at liaochenlanruo@webmail.hzau.edu.cn.
If you use this software please cite: Liu H, Xin B, Zheng J, Zhong H, Yu Y, Peng D, Sun M. Build a bioinformatics analysis platform and apply it to routine analysis of microbial genomics and comparative genomics. Protocol exchange, 2020. DOI: 10.21203/rs.2.21224/v1