Usage

PreProcessing

This step removes artifacts and contaminat sequences from you data in an effort to speed up and improve the quality of the resulting assemblies.

  1. Run setup_preprocessing.py
    • You will be asked to provide
      • Forward read file - The filename of a fastq file (gzipped is ok) containing the forward reads from sequencing
      • Reverse read file - If you performed paired-end sequencing, this is the filename of the fastq file containing the reverse read. Note, that the reads must be in matching order to the forward read file and the read ids should be the same (characters after a space in the read id are ignored for matching purposes).
      • Output directory - The location where the processed files will be stored
      • Output prefix - The basename of the output files (typically the project or strain name)
      • Host reference - Either a FASTA file or the basename of a Bowtie index. Sequences will be aligned to this reference and any that match will be removed from the analysis. This is typically used to remove host contaminiation.
    • The script will generate a makefile in the specified output directory called PREFIX_preprocess.mk.

  2. Execute resulting makefile
    • You can view the resulting makefile to see the various parameters that will be used and edit any configuration options.
    • Running make -f PREFIX_preprocess.mk will execute the various preprocessing steps. You make take advantage of multiple processors by using the -j N options to specify the number of threads for make to use.
    • To cleanup intermediate files: “make -f analysis.mk clean”
The PreProcessing consists of the following steps. After each step, FastQC is run to generate a summary of the result.
  1. Run Cutadapt to trim adapter sequences from the ends of reads. (Intermediate, output is discarded).
  2. Run Fastx Toolkit‘s artifact filter to remove low complexity sequences.
  3. If paired reads are used, match up paired reads and set unpaired reads aside.
  4. Map reads using Bowtie to the specified host reference, keeping only those reads (or pairs) that did not map.

Short Read de novo Assembly

This step uses the SSAKE short read assembler to build contigs from the preprocessed reads. It does this by running SSAKE with various parameter settings and combining the results.

  1. multi_ssake.py
  2. Run clean_multi_ssake.sh to cleanup intermediate fasta files (interactive, do NOT use qsub)

TODO - Additional Detail Needed

Contig Assembly

The resulting contigs from the previous SSAKE assemblies are often redundant and can be stitched together using a long-read assembler. We have had sucess using Gap4 from the Staden package.

  1. Assemble ssake_combined.contigs (using Gap4 or other long read assembler)

TODO - Additional Detail Needed

Reference Guided Scaffold Assembly

Once you have generated the best contigs you can with SSAKE and Gap4, the next step is to put them together into a scaffold. To do this, VAMP utilizes the reference genome of a similar strain to determien how to order the contigs.

Align Contigs to Reference

The first step is to align contigs to a reference genome and output the result in a MAF formatted file. There are many options for alignment tools, however, we have had sucess with Mugsy, a very fast multiple whole genome alignment tool.

Stitch Together Scaffold

Once you have aligned the contigs to the reference, the next step is to stitch together the various alignment blocks into a scaffold. The maf_net.py utility does this by reassembling the reference sequence from the MAF blocks and using the highest scoring block for each location in the genome to assemble a scaffold genome.:

Usage: maf_net.py [options] maf_file

Determine the best MAF block (determined by score) that cover a specified
genome

Options:
  --version             show program's version number and exit
  -h, --help            show this help message and exit
  -r REFERENCE, --reference=REFERENCE
                        Reference species
  -s SPECIES, --species=SPECIES
                        List of species to include
  -c CHROMOSOME, --chromosome=CHROMOSOME
                        Sequence ID of the chromosome for which to generate
                        the alignment net (e.g. NC_001806)
  -o OUTPUT_DIR, --output_dir=OUTPUT_DIR
                        Directory to store output file, default is maf file
                        directory
  --consensus_sequence  Output "consensus sequence" for each species in files
                        named [species].[chromosome].consensus.fasta
  --reference_fasta=REFERENCE_FASTA
                        Check MAF file against this fasta (for
                        troubleshooting, debugging)
  -v, --verbose         verbose output
Output:
  • Aligned Fasta File - BASENAME.net.afa

    This file contains an aligned fasta file created by stitching together MAF blocks based on the reference sequence. Where two blocks overlap, the higher scoriing block is used.

Optional Output (one per species):
  • Consusus Sequence - SPECIES.consensus.fasta

    A FASTA file containing the consensus sequence for this species. N’s in the sequence represent sections where no contigs mapped to a section of the reference (i.e. potential gaps in the scaffold).

  • Consensus Contig Composition GFF - SPECIES.consensus_contig_composition.gff
    GFF formatted file descibes intervals in the SPECIES genome. The attributes contain information about the contigs used to determine the sequence in this interval. The attribues are:
    • src_seq
    • src_seq_start
    • src_seq_end
    • src_strand
    • src_size
    • maf_block
    • block_start
    • block_end
    • ref_src
    • ref_start
    • ref_end
    • ref_strand
  • Consensus Contig Composition Summary - SPECIES.consensus_contig_composition_summary.txt
    Tab delimited file with the following columns that descibes intervals in the SPECIES genome and the contigs that were used for the sequence
    • seq - sequence id of the interval in the SPECIES genome
    • start - start postion of the interval
    • end - end position of the interval
    • contig - contig id that was used to “build” this interval. If None, that means no contig was found for the analagous region in the reference.
    • contig_start - the start postion of the contig that aligned to this start interval
    • contig_end - the end position of the contig that aligned to the end position of this interval
    • contig_strand - the direction that the contig aligned to the reference (if ‘-‘, the reverse complement of the contig aligned to the reference in this interval)
    • contig_size - the full size of the contig (including those bases that did not aligned to this interval)

Genome Annotation and Comparison

Once a draft of a genome has been completed, it can be useful to migrate annotations from an annotated reference to the new genome. In addition, this step generates a summary of the changes at the nucleic acid as well as amino acid level.

Run compare_genomes.py to migrate annotations and generate a list of differences between two speices. The script requres an aligned fasta file (typically use the one generated from the previous scaffold stitching step) and a GFF file of features (genes, exons, etc.) to migrate.:

Usage: compare_genomes.py [options] aligned_fasta gene_gff

Compares genomes using multiple alignment Input: Aligned Fasta and GFF

Output: For each non-reference record: gff, feature sequences, diff summary,
    and vcf file.  Currently limited to single sequence (chromosome) at a time.

Options:
  --version             show program's version number and exit
  -h, --help            show this help message and exit
  -r REFERENCE, --reference=REFERENCE
                        Sequence id of reference sequence in aligned fasta
                        file
  --align_format=FORMAT
                        Alignment format (default: fasta)
  -o PREFIX, --output=PREFIX
                        Output prefix (default: compare_genomes_output/)
  --gff_feature_types=GFF_FEATURE_TYPES
                        Comma separated list of gff feature types to parse
                        (default: CDS,exon,gene,mRNA,stem_loop)
  --gff_attributes=GFF_ATTRIBUTES
                        Comma separated list of feature attributes to carry
                        over (default: ID,Parent,Note,gene,function,product)
  -v, --verbose         verbose output