GFOLD V1.0.7


NAME

gfold - Generalized fold change for ranking differentially expressed genes from RNA-seq data.

GFOLD is especially useful when no replicate is available. GFOLD generalizes the fold change by considering the posterior distribution of the log fold change, so that each gene is assigned a reliable fold change. It overcomes a shortcoming of the p-value, which measures the significance of whether a gene is differentially expressed between conditions rather than the relative expression change, which is more interesting in many studies. It also overcomes a shortcoming of the raw fold change: the fold change of a gene with a low read count is less reliable than that of a gene with a high read count, even when the two genes show the same fold change.


CITATION

Feng J, Meyer CA, Wang Q, Liu JS, Liu XS, Zhang Y. GFOLD: a generalized fold change for ranking differentially expressed genes from RNA-seq data. Bioinformatics 2012


SYNOPSIS

gfold JOBS OPTIONS


EXAMPLES

Example 1: Count reads and rank genes

In the following example, hg19Ref.gtf is the UCSC knownGene table for hg19; sample1.sam and sample2.sam contain the mapped reads in SAM format.

gfold count -ann hg19Ref.gtf -tag sample1.sam -o sample1.read_cnt
gfold count -ann hg19Ref.gtf -tag sample2.sam -o sample2.read_cnt
gfold diff -s1 sample1 -s2 sample2 -suf .read_cnt -o sample1VSsample2.diff

Example 2: Count reads

This example uses samtools to convert mapped reads from BAM format to SAM format and pipes them to gfold through standard input.

samtools view sample1.bam | gfold count -ann hg19Ref.gtf -tag stdin -o sample1.read_cnt

Example 3: Identify differentially expressed genes without replicates

Suppose there are two samples, sample1 and sample2, with corresponding read count files sample1.read_cnt and sample2.read_cnt. This example finds differentially expressed genes between the two samples using default parameters.

gfold diff -s1 sample1 -s2 sample2 -suf .read_cnt -o sample1VSsample2.diff

Example 4: Identify differentially expressed genes with replicates

This example finds differentially expressed genes using default parameters on two groups of samples.

gfold diff -s1 sample1,sample2,sample3 -s2 sample4,sample5,sample6 -suf .read_cnt -o 123VS456.diff

Example 5: Identify differentially expressed genes with replicates only in one condition

This example finds differentially expressed genes using default parameters on two groups of samples. Only the first group contains replicates. In this case, the variance estimated from the first group is also used as the variance of the second group.

gfold diff -s1 sample1,sample2 -s2 sample3 -suf .read_cnt -o 12VS3.diff
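
When many comparisons are run, the diff command lines shown in the examples above can be assembled by a small script. The following is a minimal sketch in Python; the helper name gfold_diff_command and the output file name are made up for illustration, and only flags documented in this manual are used.

```python
import shlex


def gfold_diff_command(group1, group2, suffix, output):
    """Build the argument list for a `gfold diff` call (hypothetical helper).

    Only flags documented in this manual are used: -s1, -s2, -suf, -o.
    """
    if not group1 or not group2:
        raise ValueError("each group needs at least one sample prefix")
    return [
        "gfold", "diff",
        "-s1", ",".join(group1),  # replicates are separated by commas
        "-s2", ",".join(group2),
        "-suf", suffix,
        "-o", output,
    ]


# Replicates only in the first condition, as in Example 5.
cmd = gfold_diff_command(
    ["sample1", "sample2"], ["sample3"], ".read_cnt", "group1VSgroup2.diff"
)
command_line = shlex.join(cmd)
print(command_line)
```

The list in cmd can be passed to subprocess.run(cmd, check=True) on a machine where gfold is installed.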


JOBS

-h

Print help information

count

Given the gene annotation in GTF/GPF/BED format and mapped short reads in SAM/BED format, count the number of reads mapped to each gene. Because genes can overlap, a read may map to a region shared by multiple genes. In this case, the read is counted once for each of those genes. Furthermore, if a gene is on multiple chromosomes or on different strands of the same chromosome, only exons on one strand of one chromosome (the one that appears first in the annotation file) are assigned to this gene. Exons not on this strand of this chromosome are discarded.
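
The handling of overlapping genes described above can be illustrated with a simplified sketch. It is not the implementation used by gfold: each gene is treated as a single half-open interval, exon structure and strand are ignored, and the gene names and coordinates are invented.

```python
def count_reads(genes, reads):
    """Count reads per gene; a read overlapping several genes counts once for each.

    genes: dict of gene name -> (chrom, start, end), half-open intervals
    reads: iterable of (chrom, start, end), half-open intervals
    """
    counts = {name: 0 for name in genes}
    for r_chrom, r_start, r_end in reads:
        for name, (g_chrom, g_start, g_end) in genes.items():
            # Two half-open intervals overlap if each starts before the other ends.
            if r_chrom == g_chrom and r_start < g_end and g_start < r_end:
                counts[name] += 1
    return counts


# Invented example: geneA and geneB share the region 400-500 on chr1.
genes = {"geneA": ("chr1", 100, 500), "geneB": ("chr1", 400, 900)}
reads = [
    ("chr1", 120, 170),  # geneA only
    ("chr1", 420, 470),  # shared region: counted for geneA and geneB
    ("chr1", 800, 850),  # geneB only
    ("chr2", 420, 470),  # different chromosome: not counted
]
counts = count_reads(genes, reads)
print(counts)
```

Because the read in the shared region is counted twice, the per-gene counts can add up to more than the number of mapped reads.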

diff

For each gene, calculate the GFOLD value and other statistics. diff accepts the output of count as input. Please refer to the output format of count for more information about the input format. If you are not satisfied with the strategy adopted by count, you can generate gene read counts yourself. The input for diff must contain at least the first two columns of the output of count, and the order of gene names must be the same for all samples. The third column of the output of count (gene length) only influences the RPKM in the output of diff. If it is missing, diff does not generate RPKM. diff does not use the fourth column of the output of count.
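
If read counts come from another tool, a file for diff can be written with a few lines of code. The sketch below assumes only what is stated above: TAB-separated columns with gene name, read count and, optionally, gene length, in the same gene order for every sample. The function name and the data are invented.

```python
import io


def write_read_counts(handle, rows):
    """Write TAB-separated lines: gene name, read count, optional gene length."""
    for row in rows:
        if len(row) not in (2, 3):
            raise ValueError(f"expected 2 or 3 fields, got {len(row)}: {row!r}")
        handle.write("\t".join(str(field) for field in row) + "\n")


# Use one fixed gene order for every sample that will be passed to diff.
gene_order = ["geneA", "geneB", "geneC"]
counts = {"geneB": 40, "geneA": 12, "geneC": 0}
lengths = {"geneA": 1500, "geneB": 2300, "geneC": 800}

buffer = io.StringIO()  # stands in for open("sample1.read_cnt", "w")
write_read_counts(buffer, [(g, counts[g], lengths[g]) for g in gene_order])
print(buffer.getvalue(), end="")
```

Leaving out the third field in every row gives the two-column form, in which case diff reports no RPKM.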


OPTIONS

-ann <file>

Gene annotation file in GTF/GPF/BED format. Note that the knownGene table downloaded from UCSC is in GPF format. For job count only.

-annf <GTF/GPF/BED>

The format of the gene annotation file. Default GTF (Gene Transfer Format). For job count only. In short, from the UCSC Table Browser, the 'knownGene' table with all fields is in GPF (Gene Prediction Format), and the 'refGene' table without its first column is also in GPF. Note that for either table, the downloaded file contains a header, which should be removed before calling GFOLD. More specifically, a file in GPF format contains 12 columns separated by TABs (adapted from UCSC):

name        Name of gene
chrom       Reference sequence chromosome or scaffold
strand      + or - for strand
txStart     Transcription start position
txEnd       Transcription end position
cdsStart    Coding region start
cdsEnd      Coding region end
exonCount   Number of exons
exonStarts  Exon start positions separated by commas
exonEnds    Exon end positions separated by commas
proteinID   UniProt display ID for Known Genes, UniProt accession or RefSeq protein ID for UCSC Genes
alignID     Unique identifier for each (known gene, alignment position) pair
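
The 12-column layout above can be checked with a small parser before a file is passed to gfold. This is a sketch under assumptions: the record below is invented, the class and function names are hypothetical, exon lists are allowed to end with a trailing comma as is common in UCSC tables, and coordinates are taken to be 0-based and half-open.

```python
from typing import List, NamedTuple


class GpfRecord(NamedTuple):
    name: str
    chrom: str
    strand: str
    tx_start: int
    tx_end: int
    cds_start: int
    cds_end: int
    exon_starts: List[int]
    exon_ends: List[int]
    protein_id: str
    align_id: str


def parse_gpf_line(line):
    """Parse one TAB-separated GPF line with the 12 columns listed above."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 12:
        raise ValueError(f"expected 12 columns, got {len(fields)}")
    starts = [int(x) for x in fields[8].split(",") if x]  # tolerate a trailing comma
    ends = [int(x) for x in fields[9].split(",") if x]
    if not (len(starts) == len(ends) == int(fields[7])):
        raise ValueError("exonCount does not match exonStarts/exonEnds")
    return GpfRecord(fields[0], fields[1], fields[2], int(fields[3]), int(fields[4]),
                     int(fields[5]), int(fields[6]), starts, ends, fields[10], fields[11])


# Invented record, for illustration only.
line = "geneX\tchr1\t+\t100\t900\t150\t850\t2\t100,600,\t400,900,\tP00000\tgeneX.1\n"
record = parse_gpf_line(line)
exon_length = sum(e - s for s, e in zip(record.exon_starts, record.exon_ends))
print(record.name, record.chrom, record.strand, exon_length)
```

A header line, which must be removed before calling GFOLD, typically fails this check because its numeric columns contain field names.
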

-tag <file>

Mapped short reads in SAM or BED format (see -tagf). 'stdin' stands for the standard input stream. For job count only.

-tagf <SAM/BED>

The format of short reads. Default SAM. For job count only.

-s <T/F>

Whether the sequencing data is strand specific. T stands for strand specific. Default F. If you are unsure, the default should be fine even for strand-specific data. For job count only.

-acc <T/F>

When no replicate is available, whether to use the accurate method to calculate the GFOLD value. T stands for the accurate method, which is slower and whose running time depends on sequencing depth; F stands for MCMC. Default T. For job diff only.

-o <file>

The output file. For all jobs.

-s1 <file>

The prefixes of the gene read count files of the 1st group, as output by count. Multiple prefixes are separated by commas. For job diff only. If your gene read counts were generated by means other than job count, make sure that the format is the same for all files. Each file contains two columns, gene name and read count, separated by a TAB. All files are sorted by gene name and have the same number of lines.

-s2 <file>

The prefixes of the gene read count files of the 2nd group, as output by count. Multiple prefixes are separated by commas. For job diff only.

-suf <string>

The suffix for gene read count file specified by -s1 and -s2. For job diff only.

-sc <string>

The significance cutoff for fold change. Default 0.05. For job diff only.

-bi <string>

For MCMC, the iterations for burn-in phase. Default 1000. For job diff only.

-si <string>

For MCMC, the iterations for sampling phase. Default 1000. For job diff only.

-r <num>

The maximum number of selected pairs for calculating empirical FDR. Default 20. For job diff only.

-v <num>

Verbosity level. A larger value gives more information about the running process. Default 2.

-norm <Count/DESeq>

The normalization method. 'Count' stands for normalization by the total number of mapped reads. 'DESeq' stands for the normalization proposed by DESeq. Default 'DESeq'.
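
To make the difference between the two -norm choices concrete, the sketch below computes a total-count factor and a median-of-ratios size factor, the approach described for DESeq. It is an illustration, not the code used by gfold: the function names and counts are invented, and the sum of gene counts stands in for the total number of mapped reads.

```python
import math
from statistics import median


def total_count_factors(counts):
    """'Count' style: one factor per sample, the total of its gene counts."""
    return {sample: float(sum(values)) for sample, values in counts.items()}


def median_ratio_factors(counts):
    """'DESeq' style: median over genes of count / geometric mean across samples."""
    samples = list(counts)
    n_genes = len(counts[samples[0]])
    # Genes with a zero count in any sample have no usable geometric mean.
    usable = [g for g in range(n_genes) if all(counts[s][g] > 0 for s in samples)]
    if not usable:
        raise ValueError("no gene has a positive count in every sample")
    log_geo_mean = {
        g: sum(math.log(counts[s][g]) for s in samples) / len(samples) for g in usable
    }
    return {
        s: math.exp(median(math.log(counts[s][g]) - log_geo_mean[g] for g in usable))
        for s in samples
    }


# Invented counts: sample2 is sequenced twice as deep, plus one gene seen only there.
counts = {"sample1": [10, 20, 30, 0], "sample2": [20, 40, 60, 5]}
totals = total_count_factors(counts)
size = median_ratio_factors(counts)
print(totals["sample2"] / totals["sample1"], size["sample2"] / size["sample1"])
```

In this example the median-of-ratios factors differ by exactly the twofold depth difference, while the totals are also shifted by the gene that has reads in only one sample.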


OUTPUT FORMAT

All fields in an output file are separated by TABs.

For JOB count:

The output file contains 4 columns:

  1. GeneSymbol:

    GeneSymbol. The order of gene symbol is the same as that appearing in the read count file.

  2. Read Count:

    The number of reads mapped to this gene.

  3. Gene exon length:

    The length sum of all the exons of this gene.

  4. RPKM:

    The expression level of this gene in RPKM.
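
For reference, the usual definition of RPKM (reads per kilobase of exon per million mapped reads) can be written as a short function. This sketch assumes the standard definition; which read total gfold uses as the denominator is not stated in this manual, and the numbers are invented.

```python
def rpkm(read_count, exon_length, total_mapped_reads):
    """Reads per kilobase of exon per million mapped reads (standard definition)."""
    if exon_length <= 0 or total_mapped_reads <= 0:
        raise ValueError("exon_length and total_mapped_reads must be positive")
    return read_count * 1e9 / (exon_length * total_mapped_reads)


# Invented example: 500 reads on 2000 bp of exons, 10 million mapped reads in total.
value = rpkm(500, 2000, 10_000_000)
print(value)
```

The inputs correspond to columns 2 and 3 of the count output plus a library-wide read total.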

For JOB diff:

The output file contains 6 columns:

  1. #GeneSymbol:

    Gene symbols. The order of the gene symbols is the same as in the read count file.

  2. GFOLD:

    GFOLD value for every gene. The GFOLD value can be considered a reliable log2 fold change. It is positive if the gene is up-regulated and negative if it is down-regulated. The main use of GFOLD is to provide a biologically meaningful ranking of the genes. The GFOLD value is zero if the gene does not show differential expression. If the log2 fold change is treated as a random variable, a positive GFOLD value x means that the probability of the log2 fold change (2nd/1st) being larger than x is (1 - the parameter specified by -sc); a negative GFOLD value x means that the probability of the log2 fold change (2nd/1st) being smaller than x is (1 - the parameter specified by -sc). If this file is sorted by this column in descending order, genes ranked at the top are differentially up-regulated and genes ranked at the bottom are differentially down-regulated. Note that a gene with GFOLD value 0 should never be considered differentially expressed. However, this does not mean that all genes with a non-zero GFOLD value are differentially expressed. To select the top differentially expressed genes, the user is responsible for choosing the cutoff.

  3. E-FDR:

    Empirical FDR based on replicates. It is always 1 when no replicates are available.

  4. log2fdc:

    log2 fold change. If no replicate is available, and -acc is T, log2 fold change is based on read counts and normalization constants. Otherwise, log2 fold change is based on the sampled expression level from the posterior distribution.

  5. 1stRPKM:

    The RPKM for the first condition. It is available only if gene length is available. If multiple replicates are available, the RPKM is calculated simply by summing over replicates. Because RPKM actually uses sequencing depth as the normalization constant, the log2 fold change based on RPKM can differ from the log2fdc field.

  6. 2ndRPKM:

    The RPKM for the second condition. It is available only if gene length is available. Please refer to 1stRPKM for more information.
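
The definition of the GFOLD column given above can be read as a quantile rule on the distribution of the log2 fold change. The sketch below applies that rule to sampled values and then ranks the genes. It is an illustration, not the algorithm implemented in gfold: normally distributed samples stand in for the posterior, and the gene names and function names are invented.

```python
import random


def empirical_quantile(ordered, q):
    """Return the value at fraction q of an ascending list (simple index rule)."""
    index = min(len(ordered) - 1, max(0, int(q * len(ordered))))
    return ordered[index]


def gfold_from_samples(samples, sc=0.05):
    """Summarize sampled log2 fold changes (2nd/1st) as a GFOLD-like value."""
    ordered = sorted(samples)
    lower = empirical_quantile(ordered, sc)        # P(log2fdc > lower) is about 1 - sc
    upper = empirical_quantile(ordered, 1.0 - sc)  # P(log2fdc < upper) is about 1 - sc
    if lower > 0:
        return lower  # confidently up-regulated
    if upper < 0:
        return upper  # confidently down-regulated
    return 0.0        # the interval contains 0: no reliable change


rng = random.Random(0)


def draw(mean, sd, n=10000):
    return [rng.gauss(mean, sd) for _ in range(n)]


# Two genes share the expected log2 fold change 1.0 but differ in uncertainty,
# as for a gene with a high read count versus one with a low read count.
genes = {
    "high_count_up": draw(1.0, 0.1),
    "low_count_up": draw(1.0, 1.0),
    "high_count_down": draw(-1.0, 0.1),
}
gfold = {name: gfold_from_samples(values) for name, values in genes.items()}
ranking = sorted(gfold, key=gfold.get, reverse=True)
print(gfold, ranking)
```

The gene with the wide distribution receives 0 despite its expected twofold change, so sorting in descending order places it between the confidently up-regulated and the confidently down-regulated gene.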


AUTHOR

Jianxing Feng (jianxing.tongji@gmail.com)
