NPG-explorer, Nucleotide PanGenome explorer

Instruction

Download NPG-explorer

Download prebuilt static executables for Windows and Linux.

Prepare fasta files with genomes

Input file: table of genomes

Create file of the form:

CP003176 BRUAO chr1 c Brucella abortus A13334 chr 1
CP003177 BRUAO chr2 c Brucella abortus A13334 chr 2
CP003174 BRUCA chr1 c Brucella canis HSK A52141 chr 1
CP003175 BRUCA chr2 c Brucella canis HSK A52141 chr 2

Fields are:

String CP003175 BRUCA chr2 c corresponds to EMBL entry CP003175 which is represented by short genome name BRUCA, chromosome name chr2 and is circular.

You can use contigs instead of chromosomes if a genome is not fully assembled. Set circularity to 'l' (linear) in this case.
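The row layout described above can be illustrated with a small standalone Python sketch. It is not part of npge: the names GenomeRow and parse_genome_table are hypothetical, and the sketch assumes that fields are separated by whitespace and that everything after the fourth field is a free-text description.

```python
from typing import List, NamedTuple


class GenomeRow(NamedTuple):
    accession: str    # database entry, e.g. an EMBL accession
    genome: str       # short genome name
    chromosome: str   # chromosome (or contig) name
    circular: bool    # True for 'c', False for 'l'
    description: str  # text after the fourth field (assumed to be optional)


def parse_genome_table(text: str) -> List[GenomeRow]:
    """Parse a table of genomes; whitespace-separated fields are an assumption."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        parts = line.split(None, 4)  # at most 5 fields, the last keeps its spaces
        if len(parts) < 4:
            raise ValueError(f"expected at least 4 fields: {line!r}")
        accession, genome, chromosome, circularity = parts[:4]
        if circularity not in ("c", "l"):
            raise ValueError(f"circularity must be 'c' or 'l': {line!r}")
        description = parts[4] if len(parts) > 4 else ""
        rows.append(GenomeRow(accession, genome, chromosome,
                              circularity == "c", description))
    return rows


table = """\
CP003176 BRUAO chr1 c Brucella abortus A13334 chr 1
CP003175 BRUCA chr2 c Brucella canis HSK A52141 chr 2
"""
rows = parse_genome_table(table)
print(rows[1].genome, rows[1].chromosome, rows[1].circular)
```

A check like this can catch a malformed row (for example a missing circularity flag) before the table is passed to npge.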

Such a table for 17 genomes of Brucella can be found in file brucella/17genomes.tsv. This example is used below.

Create an empty directory and, in it, a file genomes.tsv containing the table of genomes to be used to build the pangenome. npge creates files and sub-folders in the current directory. You can change the location of output files using command line options. To see all options, add -h to a command. To set the path to the table file (instead of genomes.tsv), pass option --table to commands GetFasta, GetGenes and Rename.

Prepare sequences and genes

Run the following command:

$ npge Prepare

The following files are created by this command:

Files genomes-raw.fasta and features.embl contain unprocessed data from a database. You can safely remove them.

Build nucleotide pangenome

$ npge Pangenome

This command creates file pangenome.bs. The file is in Block Set format.

Check nucleotide pangenome (optional)

$ npge CheckPangenome

This command makes sure that nucleotide pangenome in file pangenome.bs satisfies pangenome criteria.

The command prints whether the pangenome is OK and may print some comments about the pangenome.

Post-processing performs this check as well and saves the result to file check/isgood.

Run post-processing of nucleotide pangenome

$ npge PostProcessing

This command produces many files, some of them are located in sub-folders.

Files *.bs contain blocksets, *.bi contain tables of blocks' properties, *.bsa contain block set alignments.
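The commands above (Prepare, Pangenome, CheckPangenome, PostProcessing) are run one after another in the same directory. A minimal Python sketch of such a driver follows. The helper names pipeline_commands and run_pipeline are hypothetical, the sketch assumes npge is on PATH, and it uses only the sub-commands named in this document.

```python
import subprocess

# Sub-commands in the order they appear in this instruction.
STEPS = ["Prepare", "Pangenome", "CheckPangenome", "PostProcessing"]


def pipeline_commands(npge="npge"):
    """Return the argument lists for each step, without running anything."""
    return [[npge, step] for step in STEPS]


def run_pipeline(workdir, npge="npge", dry_run=False):
    """Run every step inside workdir; stop at the first failing command."""
    for command in pipeline_commands(npge):
        print("$", " ".join(command))
        if not dry_run:
            # check=True raises CalledProcessError on a non-zero exit code
            subprocess.run(command, cwd=workdir, check=True)


# Dry run: only prints the commands that would be executed.
run_pipeline(".", dry_run=True)
```

With dry_run=False the driver stops as soon as one step fails, so later steps never run on incomplete output.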

How to view .tre files using FigTree: open a file with FigTree, set branch label to "Diagnostic positions" in pop-up window, go to "Branch Labels" section of left menu, enable the section's checkbox. Abstract distances between nodes are shown under branches. To show number of diagnostic positions between corresponding clades, select "Diagnostic positions" in drop-down list.

View results in graphical user interface

$ qnpge
Graphical User Interface of NPG-explorer

This command uses pangenome.bs and some of files created by PostProcessing.

The program window is split into 3 parts:

Columns of blocks table:

You can filter blocks by block name, gene name or their sequence using the input located above the blocks table. To hide blocks of one fragment, click the checkbox "only blocks of >= 2 fragments". The blocks table can be sorted by any column.

The blockset alignment table shows the alignment of fragments along genomes. A chromosome can be selected using the drop-down list located above the blockset table. Each sequence is represented as a row of the blockset table. The name of a sequence and its orientation against the alignment are written in the first column. Fragments of a sequence are represented by cells of the blockset table. Fragments of one block are coloured similarly. Orientation of a fragment against the alignment is indicated by '<' and '>'.

When you navigate in the blocks table and the blockset alignment, the alignment of the corresponding block is shown in the bottom part of the program window. Fragment names are shown to the left of the alignment itself. Background colors in the alignment correspond to nucleotide types. The name of the selected gene is shown in the read-only input located above the alignment. You can disable gene representation completely by unchecking the checkbox "show genes". Genes are shown with a white foreground color. Genes on the reverse strand (relative to the fragment orientation) are marked with an underscore. Overlapping genes are coloured purple. Start codons are coloured black, stop codons are coloured gray. The consensus of the block is shown above the alignment. Identical columns without gaps are coloured black, identical columns with gaps are coloured gray, non-identical columns are white. Column numbers are shown above the consensus. Column numbers of low similarity regions are coloured red.
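The colouring rule for consensus columns can be written down as a short standalone Python sketch. This is an illustration, not code from qnpge. It assumes that "identical with gaps" means all non-gap letters in the column agree and at least one row has a gap, and it treats a column made only of gaps as non-identical.

```python
def column_color(column):
    """Classify one alignment column given as a string of letters and '-'."""
    letters = {ch for ch in column if ch != "-"}
    if len(letters) != 1:
        return "white"  # non-identical (or, by assumption, gaps only)
    return "gray" if "-" in column else "black"


# Three aligned rows; zip(*rows) walks the alignment column by column.
rows = ["ATG-", "ATGC", "A-CC"]
colors = [column_color("".join(col)) for col in zip(*rows)]
print(colors)  # ['black', 'gray', 'white', 'gray']
```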

You can use the arrow keys to navigate through the alignment. The corresponding fragment is selected in the blockset alignment. Use keys "Home" and "End" to go to the first and last columns of the alignment respectively. If you "go away" from the alignment, the program switches to the corresponding block. Press Ctrl + Arrow Right or Ctrl + Arrow Left to go to the next gene boundary. Press Shift + Arrow Right or Shift + Arrow Left to go to the next low similarity region.

To change order of sequences in blockset alignment and block alignment, select some rows (you can use Ctrl to select multiple rows) and press Ctrl + Arrow Up or Ctrl + Arrow Down.

Requirements of a good pangenome

Build and Install

Main executables are command line tool src/tool/npge (or src/tool/npge.exe) and GUI tool src/gui/qnpge (src/gui/qnpge.exe).

To change compiled-in default settings, run ccmake . in build directory.

To generate config file, run npge -g npge.conf and change generated file npge.conf.

Requirements

Optional:

Linux

Install build requirements (on Debian):

% ./linux/requirements.sh

Build the program as static executables (Qt 4 is not static!):

$ ./linux/build.sh

The program is built in the directory npge-build-linux.

To create a distribution .tar.gz file, go into npge-build-linux and run:

$ ./linux/package.sh

How to build manually:

$ mkdir build
$ cd build
$ cmake ..
$ make

Pass argument -DNPGE_STATIC_LINUX:BOOL=1 to cmake to get static executables (on Debian, Qt 4 is not static!).

Build README.html:

$ pandoc -s -o README.html README.md

Run tests:

$ make test

Windows

Windows executables are cross-compiled from Linux using MinGW cross-compiler.

Install build requirements (on Debian):

% ./windows/requirements.sh

Build the program as static executables:

$ ./windows/build.sh

The program is built in the directory npge-build-windows.

To create a ZIP file and an Installation Wizard for Windows, go into npge-build-windows and run:

$ ./windows/package.sh

Model

Sorry, this section is incomplete.

Lua in 15 minutes

If you do not see results of expressions in Lua terminal, use function print.

There are several classes used to represent the state of a nucleotide pangenome (or, more generally, a block set):

Objects of the classes listed above are long-living. This means in particular that they can persist across several Processors (see below). Objects of long-living classes are created with the static method new:

> block_set = BlockSet.new()
> sequence = Sequence.new()

Instances of BSA can't be created directly, so the only way to get instance of BSA in Lua script is through methods of BlockSet.

Deletion of objects

Objects of Fragment, Block and AlignmentRow require manual deletion. If you create instance of one of these classes, you are responsible for deleting it manually or transferring ownership to other object which in turn is deleted automatically. Fragment owns corresponding AlignmentRow. Block (if it is not weak) owns all Fragments it has. BlockSet owns all Blocks it has. To delete an object manually, use AlignmentRow.delete, Fragment.delete or Block.delete.

> fragment = Fragment.new()
> Fragment.delete(fragment)
-- do not use fragment after this line
-- unless you want program crash

If you create temporary object, you can create a deleter, which owns this object. When deleter object is garbage collected, it deletes corresponding object.

> f2 = fragment:clone()
> df = Fragment.deleter(f2)
-- df deletes f2

If you do not delete objects of AlignmentRow, Fragment or Block, this results in memory leak. If you use an object after deletion, this results in program crash.

Sequence

Sequence stores a string representing a genome sequence. Moreover, it stores the name and description of the sequence. NPGe recognizes the following sequence name format: GENOME&CHROMOSOME&CIRCULAR (3 values joined by '&'). GENOME is the short name of a genome. CHROMOSOME is the short name of a chromosome. CIRCULAR can be 'c' (circular) or 'l' (linear). This information is used by some algorithms of NPGe. Example of a sequence name: BRUO2&chr1&c (chromosome 1 of Brucella ovis ATCC 25840, which is circular).

Description of a sequence can be any string. Optionally it can include the following value: ac=XXXX. NPGe recognizes this entry as Accession Number of the sequence in databank.
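The name and description conventions can be modelled outside NPGe with a few lines of Python. This is a sketch under assumptions: the function names parse_sequence_name and parse_accession are hypothetical, and the accession value is assumed to run up to the next whitespace.

```python
import re


def parse_sequence_name(name):
    """Split 'GENOME&CHROMOSOME&CIRCULAR' into (genome, chromosome, circular)."""
    parts = name.split("&")
    if len(parts) != 3 or parts[2] not in ("c", "l"):
        raise ValueError(f"not in GENOME&CHROMOSOME&CIRCULAR form: {name!r}")
    genome, chromosome, circularity = parts
    return genome, chromosome, circularity == "c"


def parse_accession(description):
    """Return the value of 'ac=XXXX' in a description, or None if absent."""
    match = re.search(r"(?:^|\s)ac=(\S+)", description)
    return match.group(1) if match else None


print(parse_sequence_name("BRUO2&chr1&c"))  # ('BRUO2', 'chr1', True)
print(parse_accession("ac=ABC123"))         # ABC123
```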

Create new sequence and set its name and description:

> seq = Sequence.new()
> seq:set_name("TEST&chr1&c")
> seq:set_description("ac=ABC123")

Get name, description, genome, chromosome, circularity and accession number of the sequence:

> seq:name()
"TEST&chr1&c"
> seq:description()
"ac=ABC123"
> seq:genome()
"TEST"
> seq:chromosome()
"chr1"
> seq:circular()
true
> seq:ac()
"ABC123"

Sequence stores string consisting of letters 'A', 'T', 'G', 'C' and 'N'. 'N' is used for unknown letter. To append new part to a Sequence, use method push_back:

> seq:push_back("AT")
> seq:push_back("GC")

Create Sequence from string:

> sequence = Sequence.new('ATGC')

To get contents of a Sequence, use method contents:

> seq:contents()
"ATGC"

To get size of a Sequence, use method size:

> seq:size()
4

If you convert Sequence to string (using function tostring or just by entering its name), you get FASTA representation of this sequence:

> tostring(seq)
[[
> TEST&chr1&c ac=ABC123
ATGC
]]

To get a letter from a sequence by its index (0-based), use method char_at:

> seq:char_at(0)
"A"

To get a substring from a sequence, use method substr. Arguments are start position, length and orientation (1 is direct orientation, -1 is reverse-complement orientation).

> seq:substr(1, 2, 1)
"TG"
> seq:substr(2, 2, -1)
"CA"

Note the first argument of seq:substr(2, 2, -1): because the substring is reversed, the start position is 2, not 1.

To hash a substring of a sequence, use method hash. Arguments are the same as for substr. Free function make_hash can be applied to a string to get its hash.
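The role of the start position in a reversed substring can be illustrated with a standalone Python model. This is a sketch of the behavior described above, not NPGe's implementation. The complement table and the bounds check are assumptions of the sketch.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G", "N": "N"}


def substr(sequence, start, length, ori):
    """Read `length` letters from `start`, walking in direction `ori` (1 or -1)."""
    if ori not in (1, -1):
        raise ValueError("ori must be 1 or -1")
    # Positions visited, in reading order: rightwards for 1, leftwards for -1.
    positions = range(start, start + ori * length, ori)
    if any(p < 0 or p >= len(sequence) for p in positions):
        raise IndexError("substring leaves the sequence")
    letters = [sequence[p] for p in positions]
    if ori == -1:
        # The reverse strand carries the complementary letters.
        letters = [COMPLEMENT[ch] for ch in letters]
    return "".join(letters)


print(substr("ATGC", 1, 2, 1))   # TG
print(substr("ATGC", 2, 2, -1))  # CA
```

Both calls cover the same two positions (1 and 2). The reversed call starts at position 2 and walks left, which is why its first argument differs.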

> seq:hash(1, 2, 1)
"9"
> make_hash("TG")
"9"
> seq:hash(2, 2, -1)
"3"
> make_hash("CA")
"3"

To filter out any character but 'A', 'T', 'G', 'C' and 'N' from a string, use Sequence.to_atgcn:

> Sequence.to_atgcn("AT-G-C-A#T$N .C")
"ATGCATNC"

There are several ways to store string in Sequence:

By default, one of compact sequences is used. You can specify sequence type directly:

> seq = Sequence.new(Sequence.ASIS_SEQUENCE)
> seq = Sequence.new('ATGC', Sequence.ASIS_SEQUENCE)

DummySequence

DummySequence is a sequence in which all letters are the same.

> seq = DummySequence.new('A', 20)
> seq:contents()
"AAAAAAAAAAAAAAAAAAAA"

You can skip letter and/or size of DummySequence and set them by methods set_letter and set_size. Letter of DummySequence can be retrieved using method letter.

> seq = DummySequence.new()
> seq:set_letter('A')
> seq:set_size(20)
> seq:contents()
"AAAAAAAAAAAAAAAAAAAA"

FragmentSequence

FragmentSequence points to content of a Fragment.

> fragment = ...
> fs = FragmentSequence.new(fragment)
> assert(fs:size() == fragment:size())
> assert(fs:contents() == fragment:str())

The fragment can be set later by method set_fragment.

Fragment

Fragment represents part of sequence. Fragment is defined by Sequence, minimum position in sequence, maximum position in sequence (0-based) and orientation (a.k.a. ori, can be 1 and -1).

Fragments created on the Lua side must be deleted manually.

Note: Fragment and Sequence do not own each other, though Fragment points to Sequence. If you use a fragment whose sequence has been deleted, the program crashes.

Method Fragment.new accepts 4 arguments (seq, min_pos, max_pos, ori), all are optional. They can be accessed by getters and setters named same as arguments:

-- create a fragment which covers whole sequence
> fragment = Fragment.new(sequence, 0, sequence:size() - 1)
> assert(fragment:length() == sequence:size())
> fragment:set_min_pos(2)
> fragment:set_max_pos(7)
> assert(fragment:length() == 7 - 2 + 1)
> fragment:ori()
1
> fragment:set_ori(-1)
> fragment:ori()
-1

Use method inverse to change ori to the opposite.

> fragment:ori()
-1
> fragment:inverse()
> fragment:ori()
1

A reversed fragment points to the reverse strand of the sequence, so it contains the reverse-complement sequence.

Method begin_pos returns the position in the sequence where the fragment starts. Method last_pos returns the position in the sequence where the fragment ends. Method end_pos returns the position next to the one where the fragment ends, which is not part of the fragment. end_pos may even lie outside the sequence.

> f = Fragment.new(s, 10, 20)
> f:begin_pos()
10
> f:last_pos()
20
> f:end_pos()
21
> f:inverse()
> f:begin_pos()
20
> f:last_pos()
10
> f:end_pos()
9
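The relation between min_pos, max_pos, ori and the three position getters can be captured in a small Python model. The class name FragmentModel is hypothetical. The sketch mirrors the values shown in the session above and does not touch any sequence data.

```python
from dataclasses import dataclass


@dataclass
class FragmentModel:
    min_pos: int
    max_pos: int
    ori: int = 1  # 1 is direct, -1 is reversed

    def begin_pos(self):
        # A reversed fragment starts at its maximum position.
        return self.min_pos if self.ori == 1 else self.max_pos

    def last_pos(self):
        return self.max_pos if self.ori == 1 else self.min_pos

    def end_pos(self):
        # One step past last_pos in reading direction; may be outside the sequence.
        return self.last_pos() + self.ori

    def length(self):
        return self.max_pos - self.min_pos + 1

    def inverse(self):
        self.ori = -self.ori


f = FragmentModel(10, 20)
print(f.begin_pos(), f.last_pos(), f.end_pos())  # 10 20 21
f.inverse()
print(f.begin_pos(), f.last_pos(), f.end_pos())  # 20 10 9
```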

To set begin_pos and last_pos at once, use method set_begin_last. Orientation is changed to satisfy both begin and last positions.

> f = Fragment.new(s)
> f:set_begin_last(10, 20)
> f:ori()
1
> f:set_begin_last(20, 10)
> f:ori()
-1
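How an orientation follows from a (begin, last) pair can be sketched in Python. The function name begin_last_to_fragment is hypothetical, and choosing ori = 1 when begin equals last is an assumption, since the text above does not cover that case.

```python
def begin_last_to_fragment(begin, last):
    """Return (min_pos, max_pos, ori) for the given begin and last positions."""
    # Assumption: a single-position fragment (begin == last) is treated as direct.
    ori = 1 if begin <= last else -1
    return min(begin, last), max(begin, last), ori


print(begin_last_to_fragment(10, 20))  # (10, 20, 1)
print(begin_last_to_fragment(20, 10))  # (10, 20, -1)
```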

Method seq returns sequence of the fragment, method block returns block of the fragment.

> seq = fragment:seq()
> block = fragment:block()

Use method str to get the part of Sequence occupied by the fragment. Use method contents to get the string contents of Fragment (it can include gaps ('-') if the fragment has an AlignmentRow). Method length returns the length of the part of the sequence occupied by the fragment. Method alignment_length returns the length of the contents of the fragment (including gaps).

> sequence = Sequence.new('ATGC')
> fragment = Fragment.new(sequence, 1, 2)
> fragment:contents()
"TG"
> fragment:str()
"TG"
> fragment:length()
2
> fragment:alignment_length()
2
> fragment:set_row(AlignmentRow.new('T-G'))
> fragment:contents()
"T-G"
> fragment:str()
"TG"
> fragment:length()
2
> fragment:alignment_length()
3
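The difference between the four getters comes down to whether gaps are counted. A minimal Python sketch, with a hypothetical helper row_summary that takes the gapped row as a plain string:

```python
def row_summary(row):
    """Derive the four values from a gapped row such as 'T-G'."""
    ungapped = row.replace("-", "")  # letters actually taken from the sequence
    return {
        "contents": row,                # with gaps
        "str": ungapped,                # without gaps
        "length": len(ungapped),        # positions occupied in the sequence
        "alignment_length": len(row),   # columns occupied in the alignment
    }


print(row_summary("T-G"))
```

For a fragment without an alignment row the two strings coincide, so length and alignment_length are equal.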

Changelog

Corresponding author: Boris Nagaev, email: bnagaev@gmail.com