This step removes artifacts and contaminant sequences from your data to speed up assembly and improve the quality of the resulting assemblies.
The script generates a makefile called PREFIX_preprocess.mk in the specified output directory.
This step uses the SSAKE short read assembler to build contigs from the preprocessed reads. It does this by running SSAKE with various parameter settings and combining the results.
TODO - Additional Detail Needed
The contigs produced by the previous SSAKE assemblies are often redundant and can be stitched together using a long-read assembler. We have had success using Gap4 from the Staden package.
TODO - Additional Detail Needed
Once you have generated the best contigs you can with SSAKE and Gap4, the next step is to put them together into a scaffold. To do this, VAMP uses the reference genome of a similar strain to determine how to order the contigs.
The first step is to align the contigs to a reference genome and write the result to a MAF-formatted file. Many alignment tools are available; we have had success with Mugsy, a very fast multiple whole-genome alignment tool.
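To make the MAF input concrete, here is a minimal sketch of reading alignment blocks from a MAF file using only the Python standard library. It is not the parser VAMP itself uses; the names `MafRow`, `MafBlock`, and `parse_maf` are hypothetical, and the sketch assumes the plain MAF layout of an `a` line (carrying `score=`) followed by `s` lines with seven whitespace-separated fields.

```python
from dataclasses import dataclass, field


@dataclass
class MafRow:
    src: str       # sequence name, e.g. "ref.chr1"
    start: int     # zero-based start on the source strand
    size: int      # number of non-gap bases in this row
    strand: str    # "+" or "-"
    src_size: int  # total length of the source sequence
    text: str      # aligned text, including "-" gap characters


@dataclass
class MafBlock:
    score: float
    rows: list = field(default_factory=list)


def parse_maf(lines):
    """Collect MAF alignment blocks from an iterable of text lines."""
    blocks = []
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            current = None  # a blank line ends the current block
            continue
        if line.startswith("#"):
            continue  # header or comment line
        fields = line.split()
        if fields[0] == "a":
            # Block header: key=value pairs such as score=100.0
            attrs = dict(f.split("=", 1) for f in fields[1:] if "=" in f)
            current = MafBlock(score=float(attrs.get("score", 0.0)))
            blocks.append(current)
        elif fields[0] == "s" and current is not None:
            _, src, start, size, strand, src_size, text = fields
            current.rows.append(
                MafRow(src, int(start), int(size), strand, int(src_size), text)
            )
    return blocks


# Made-up example data, for illustration only.
example = """##maf version=1
a score=100.0
s ref.chr1 0 5 + 20 ACG-TA
s strainA.contig1 3 6 + 10 ACGGTA

a score=42
s ref.chr1 10 4 + 20 TTGA
""".splitlines()

blocks = parse_maf(example)
for block in blocks:
    print(block.score, [row.src for row in block.rows])
```

Other MAF line types (such as `i`, `e`, and `q` lines) are ignored in this sketch.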
Once you have aligned the contigs to the reference, the next step is to stitch the alignment blocks together into a scaffold. The maf_net.py utility does this by reassembling the reference sequence from the MAF blocks, using the highest-scoring block at each location in the genome to assemble a scaffold genome:
    Usage: maf_net.py [options] maf_file

    Determine the best MAF block (determined by score) that cover a specified
    genome

    Options:
      --version             show program's version number and exit
      -h, --help            show this help message and exit
      -r REFERENCE, --reference=REFERENCE
                            Reference species
      -s SPECIES, --species=SPECIES
                            List of species to include
      -c CHROMOSOME, --chromosome=CHROMOSOME
                            Sequence ID of the chromosome for which to generate
                            the alignment net (e.g. NC_001806)
      -o OUTPUT_DIR, --output_dir=OUTPUT_DIR
                            Directory to store output file, default is maf file
                            directory
      --consensus_sequence  Output "consensus sequence" for each species in files
                            named [species].[chromosome].consensus.fasta
      --reference_fasta=REFERENCE_FASTA
                            Check MAF file against this fasta (for
                            troubleshooting, debugging)
      -v, --verbose         verbose output
An aligned FASTA file created by stitching together MAF blocks along the reference sequence. Where two blocks overlap, the higher-scoring block is used.
A FASTA file containing the consensus sequence for this species. Runs of N in the sequence mark regions of the reference to which no contig mapped (i.e., potential gaps in the scaffold).
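The selection rule described above (highest-scoring block wins at each reference position, N where nothing maps) can be sketched as follows. This is a simplified illustration, not the maf_net.py implementation: the function name `net_consensus` and the block tuple layout are assumptions, and columns where the reference has a gap (insertions in the contig) are skipped, whereas the real aligned output would need to keep them.

```python
def net_consensus(ref_length, blocks):
    """Build a per-position consensus for one species.

    blocks: iterable of (score, ref_start, ref_text, species_text), where
    ref_start is zero-based and the two texts are aligned rows of equal
    length (this tuple layout is a hypothetical simplification).
    """
    best_score = [None] * ref_length
    consensus = ["N"] * ref_length  # N = no block covers this position
    for score, ref_start, ref_text, species_text in blocks:
        pos = ref_start
        for ref_base, species_base in zip(ref_text, species_text):
            if ref_base == "-":
                # Insertion relative to the reference: ignored in this sketch.
                continue
            if best_score[pos] is None or score > best_score[pos]:
                best_score[pos] = score
                consensus[pos] = species_base
            pos += 1
    return "".join(consensus)


# Made-up blocks: the second one overlaps the first at reference
# positions 4-5 and has the higher score, so it wins there.
demo_blocks = [
    (50.0, 0, "ACGTAC", "ACGTTC"),
    (80.0, 4, "ACG-TT", "ACGATT"),
]
scaffold = net_consensus(12, demo_blocks)
print(scaffold)  # ACGTACGTTNNN
```

The trailing `NNN` corresponds to reference positions 9 to 11, which no block covers. Because each position keeps the best score seen so far, the result does not depend on the order in which blocks are processed.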
Once a draft of a genome has been completed, it can be useful to migrate annotations from an annotated reference to the new genome. In addition, this step generates a summary of the changes at the nucleic acid as well as amino acid level.
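As an illustration of what a nucleotide-level difference summary involves, the sketch below walks two aligned rows column by column and records each mismatch, insertion, and deletion against the reference coordinate. The function name `list_differences` and the choice to skip N in the non-reference row (treating it as a scaffold gap rather than a change) are assumptions, not the behavior of compare_genomes.py.

```python
def list_differences(ref_aligned, other_aligned):
    """Return (ref_position, ref_base, other_base) for each differing column.

    ref_position is the 1-based position of the most recent reference base,
    so an insertion is reported at the reference base that precedes it.
    """
    if len(ref_aligned) != len(other_aligned):
        raise ValueError("aligned sequences must have the same length")
    differences = []
    ref_pos = 0
    for ref_base, other_base in zip(ref_aligned.upper(), other_aligned.upper()):
        if ref_base != "-":
            ref_pos += 1
        if ref_base != other_base and other_base != "N":
            differences.append((ref_pos, ref_base, other_base))
    return differences


# Made-up aligned rows: one substitution, one insertion, one deletion.
diffs = list_differences("ACGT-ACGT", "ACCTTAC-T")
print(diffs)  # [(3, 'G', 'C'), (4, '-', 'T'), (7, 'G', '-')]
```

Amino-acid-level changes would additionally require the migrated CDS coordinates and a translation step, which this sketch does not attempt.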
Run compare_genomes.py to migrate annotations and generate a list of differences between two species. The script requires an aligned FASTA file (typically the one generated by the previous scaffold stitching step) and a GFF file of the features (genes, exons, etc.) to migrate:
    Usage: compare_genomes.py [options] aligned_fasta gene_gff

    Compares genomes using multiple alignment Input: Aligned Fasta and GFF
    Output: For each non-reference record: gff, feature sequences, diff summary,
    and vcf file. Currently limited to single sequence (chromosome) at a time.

    Options:
      --version             show program's version number and exit
      -h, --help            show this help message and exit
      -r REFERENCE, --reference=REFERENCE
                            Sequence id of reference sequence in aligned fasta
                            file
      --align_format=FORMAT
                            Alignment format (default: fasta)
      -o PREFIX, --output=PREFIX
                            Output prefix (default: compare_genomes_output/)
      --gff_feature_types=GFF_FEATURE_TYPES
                            Comma separated list of gff feature types to parse
                            (default: CDS,exon,gene,mRNA,stem_loop)
      --gff_attributes=GFF_ATTRIBUTES
                            Comma separated list of feature attributes to carry
                            over (default: ID,Parent,Note,gene,function,product)
      -v, --verbose         verbose output
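Migrating a feature from the reference to the new genome comes down to translating coordinates through the alignment. The sketch below shows one way to do that; it is an illustration under stated assumptions, not the compare_genomes.py implementation. The names `build_coordinate_map` and `migrate_feature` are hypothetical, coordinates are 1-based and inclusive as in GFF, and a feature whose ends fall in a deletion is trimmed to the nearest bases that do map.

```python
def build_coordinate_map(ref_aligned, other_aligned):
    """Map each 1-based reference position to a 1-based position in the
    other sequence, or to None where the other sequence has a gap."""
    mapping = {}
    ref_pos = 0
    other_pos = 0
    for ref_base, other_base in zip(ref_aligned, other_aligned):
        if other_base != "-":
            other_pos += 1
        if ref_base != "-":
            ref_pos += 1
            mapping[ref_pos] = other_pos if other_base != "-" else None
    return mapping


def migrate_feature(start, end, mapping):
    """Lift a 1-based inclusive reference interval onto the other sequence.

    Returns (new_start, new_end), or None if no base of the feature maps.
    """
    mapped = [
        mapping[pos]
        for pos in range(start, end + 1)
        if mapping.get(pos) is not None
    ]
    if not mapped:
        return None
    return mapped[0], mapped[-1]


# Made-up aligned rows: the other sequence has an insertion after
# reference position 4 and a deletion of reference position 7.
coord_map = build_coordinate_map("ACGT-ACGT", "ACCTTAC-T")
print(migrate_feature(3, 6, coord_map))  # (3, 7)
print(migrate_feature(6, 8, coord_map))  # (7, 8)
print(migrate_feature(7, 7, coord_map))  # None
```

A feature spanning the insertion grows by one base in the new genome (3 to 6 becomes 3 to 7), while a feature lying entirely inside a deletion cannot be migrated and is reported as `None`.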