V1.0.2


NAME

gfold - Generalized fold change for ranking differentially expressed genes from RNA-seq data.


DESCRIPTION

gfold generalizes the fold change by considering the posterior distribution of the log fold change, so that each gene is assigned a reliable fold change. It overcomes a shortcoming of the p-value, which measures the significance of whether a gene is differentially expressed between conditions rather than the relative expression change, which is of greater interest in many studies. It also overcomes a shortcoming of the raw fold change: the fold change of a gene with a low read count is less reliable than that of a gene with a high read count, even when the two genes show the same fold change.

The source code is freely available as gfold.V1.0.2.tar.gz.


SYNOPSIS

gfold JOBS OPTIONS


EXAMPLES

Example 1: Count reads and rank genes

In the following example, hg19Ref.gpf is the UCSC knownGene table for hg19; sample1.sam and sample2.sam contain the mapped reads in SAM format.

gfold count -ann hg19Ref.gpf -tag sample1.sam -o sample1.read_cnt
gfold count -ann hg19Ref.gpf -tag sample2.sam -o sample2.read_cnt
gfold diff -s1 sample1 -s2 sample2 -suf .read_cnt -o sample1VSsample2.diff

Example 2: Count reads

This example uses samtools to convert mapped reads from BAM format to SAM format and pipes them to gfold.

samtools view sample1.bam | gfold count -ann hg19Ref.gpf -tag stdin -o sample1.read_cnt
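The same pipe can be driven from a script. The following is a minimal Python sketch, assuming that samtools and gfold are both on the PATH; the function names count_commands and count_from_bam are hypothetical helpers introduced only for illustration, and the arguments are exactly those of the command above.

```python
import subprocess


def count_commands(bam, annotation, output):
    """Return the argument lists for `samtools view BAM | gfold count ...`."""
    view = ["samtools", "view", bam]
    count = ["gfold", "count", "-ann", annotation, "-tag", "stdin", "-o", output]
    return view, count


def count_from_bam(bam, annotation, output):
    """Stream SAM records from samtools into gfold without a temporary file."""
    view_cmd, count_cmd = count_commands(bam, annotation, output)
    view = subprocess.Popen(view_cmd, stdout=subprocess.PIPE)
    count = subprocess.Popen(count_cmd, stdin=view.stdout)
    # Close our copy of the pipe so samtools sees SIGPIPE if gfold exits early.
    view.stdout.close()
    count_status = count.wait()
    view_status = view.wait()
    if view_status != 0 or count_status != 0:
        raise RuntimeError(
            f"pipeline failed: samtools={view_status}, gfold={count_status}"
        )
```

Calling count_from_bam("sample1.bam", "hg19Ref.gpf", "sample1.read_cnt") would run the pipeline shown in this example; nothing is executed until that function is called.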

Example 3: Identify differentially expressed genes without replicates

Suppose there are two samples, sample1 and sample2, with corresponding read count files sample1.read_cnt and sample2.read_cnt. This example finds differentially expressed genes between the two samples using default parameters.

gfold diff -s1 sample1 -s2 sample2 -suf .read_cnt -o sample1VSsample2.diff

Example 4: Identify differentially expressed genes with replicates

This example finds differentially expressed genes using default parameters on two groups of samples.

gfold diff -s1 sample1,sample2,sample3 -s2 sample4,sample5,sample6 -suf .read_cnt -o 123VS456.diff

Example 5: Identify differentially expressed genes with replicates only in one condition

This example finds differentially expressed genes using default parameters on two groups of samples. Only the first group contains replicates. In this case, the variance estimated from the first group is also used as the variance of the second group.

gfold diff -s1 sample1,sample2 -s2 sample3 -suf .read_cnt -o 12VS3.diff
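When many comparisons are scripted, the diff command line can be assembled from two lists of sample prefixes. The sketch below is one hedged way to do this in Python. The helper names diff_command and input_files are hypothetical, and the assumption that each input file is named prefix + suffix is inferred from the examples above (sample1 with .read_cnt gives sample1.read_cnt).

```python
def input_files(group, suffix=".read_cnt"):
    """Read count files that gfold diff is expected to open for one group."""
    return [prefix + suffix for prefix in group]


def diff_command(group1, group2, output, suffix=".read_cnt"):
    """Build the argument list for `gfold diff` from two groups of prefixes."""
    if not group1 or not group2:
        raise ValueError("each group needs at least one sample prefix")
    return [
        "gfold", "diff",
        "-s1", ",".join(group1),  # replicates are joined by commas
        "-s2", ",".join(group2),
        "-suf", suffix,
        "-o", output,
    ]
```

The returned list can be passed to subprocess.run(..., check=True). A group holding a single prefix corresponds to a condition without replicates.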


JOBS

-h

Print help information

count

Given the gene annotation in GPF/BED format and mapped short reads in SAM/BED format, count the number of reads mapped to each gene.

diff

For each gene, calculate the GFOLD value and other statistics.


OPTIONS

-ann <file>

Gene annotation file in GPF/BED format. Note that the knownGene table downloaded from UCSC is in GPF format. For job count only.

-annf <GPF/BED>

The format of the gene annotation file. Default GPF. For job count only.

-tag <file>

Mapped short reads in SAM/BED format (the format is set by -tagf). 'stdin' stands for the standard input stream. For job count only.

-tagf <SAM/BED>

The format of short reads. Default SAM. For job count only.

-s <T/F>

Whether the sequencing data is strand specific. T stands for strand specific. Default F. If you are unsure, the default should be fine even for strand-specific data. For job count only.

-o <file>

The output file. For all jobs.

-s1 <file>

The prefixes of the gene read count files of the 1st group, as output by job count. Multiple prefixes are separated by commas. For job diff only. If your gene read counts were generated by some means other than job count, make sure that all files have the same format. Each file contains two columns, gene name and read count, separated by a TAB. All files must be sorted by gene name and have the same number of lines.

-s2 <file>

The prefixes of the gene read count files of the 2nd group, as output by job count. Multiple prefixes are separated by commas. For job diff only.

-suf <string>

The suffix of the gene read count files specified by -s1 and -s2. For job diff only.

-sc <string>

The significance cutoff for fold change. Default 0.05. For job diff only.

-bi <string>

For MCMC, the number of iterations in the burn-in phase. Default 1000. For job diff only.

-si <string>

For MCMC, the number of iterations in the sampling phase. Default 1000. For job diff only.

-r <num>

The maximum number of selected pairs for calculating empirical FDR. Default 20. For job diff only.

-v <num>

Verbosity level. A larger value gives more information about the running process. Default 2.

-norm <Count/DESeq>

The normalization method. 'Count' stands for normalization by the total number of mapped reads. 'DESeq' stands for the normalization proposed by DESeq. Default 'DESeq'.
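To illustrate the two choices for -norm, the following Python sketch computes per-sample scaling factors both ways. It is not gfold's implementation. The 'Count' variant assumes that the total number of mapped reads can be approximated by the sum of the gene read counts and scales it relative to the mean total, and the 'DESeq' variant is the median-of-ratios estimator as published for DESeq. Both function names are hypothetical.

```python
import math
from statistics import median


def count_size_factors(samples):
    """Total-count factors: each sample's read total relative to the mean total.

    `samples` is a list of per-sample count lists, all in the same gene order.
    """
    totals = [sum(counts) for counts in samples]
    mean_total = sum(totals) / len(totals)
    return [total / mean_total for total in totals]


def deseq_size_factors(samples):
    """Median-of-ratios factors (the estimator published for DESeq)."""
    log_ratios = [[] for _ in samples]
    for gene_counts in zip(*samples):
        if min(gene_counts) <= 0:
            continue  # geometric mean would be zero; such genes are skipped
        log_geo_mean = sum(math.log(c) for c in gene_counts) / len(gene_counts)
        for i, c in enumerate(gene_counts):
            log_ratios[i].append(math.log(c) - log_geo_mean)
    if not log_ratios[0]:
        raise ValueError("no gene has a positive count in every sample")
    return [math.exp(median(ratios)) for ratios in log_ratios]
```

Dividing each sample's counts by its factor puts the samples on a common scale. The median-based estimator is less sensitive to a few very highly expressed genes than the total count.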


OUTPUT FORMAT

All fields in an output file are separated by TABs.

For JOB count:

The output file contains 3 columns:

  1. GeneSymbol:

    Gene symbol. The order of gene symbols is the same as in the read count file.

  2. Read Count:

    The number of reads mapped to this gene.

  3. Gene exon length:

    The length sum of all the exons of this gene.
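Before passing several read count files to job diff, it can help to check that they list the same genes in the same order, as required by -s1. The sketch below is a minimal Python example under stated assumptions: only the first two TAB-separated columns (gene name, read count) are used, read counts are integers, and blank lines or lines starting with '#' are skipped. The function names are hypothetical.

```python
def parse_counts(lines):
    """Parse TAB-separated read count lines into [(gene, count), ...].

    `lines` may be an open file or any iterable of strings.
    """
    rows = []
    for line_no, line in enumerate(lines, start=1):
        if not line.strip() or line.startswith("#"):
            continue  # assumption: blank and '#' lines carry no counts
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            raise ValueError(f"line {line_no}: expected at least 2 columns")
        rows.append((fields[0], int(fields[1])))
    return rows


def check_same_genes(tables):
    """Raise ValueError unless all tables list the same genes in the same order."""
    reference = [gene for gene, _ in tables[0]]
    for index, table in enumerate(tables[1:], start=2):
        genes = [gene for gene, _ in table]
        if len(genes) != len(reference):
            raise ValueError(f"table {index}: different number of lines")
        if genes != reference:
            raise ValueError(f"table {index}: gene names or order differ")
    return len(reference)
```

A typical use is check_same_genes([parse_counts(open(path)) for path in paths]), which returns the number of genes when all files agree.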

For JOB diff:

The output file contains 12 columns:

  1. #GeneSymbol:

    Gene symbols. The order of gene symbols is the same as in the read count file.

  2. GFOLD:

    GFOLD value for every gene. The GFOLD value can be regarded as a reliable log2 fold change. It is positive if the gene is up-regulated and negative if it is down-regulated. The main use of GFOLD is to provide a biologically meaningful ranking of the genes. The GFOLD value is zero if the gene does not show differential expression. If the log2 fold change is treated as a random variable, a positive GFOLD value x means that the probability of the log2 fold change (2nd/1st) being larger than x is (1 - the parameter specified by -sc); a negative GFOLD value x means that the probability of the log2 fold change (2nd/1st) being smaller than x is (1 - the parameter specified by -sc). If this file is sorted by this column in descending order, genes ranked at the top are differentially up-regulated and genes ranked at the bottom are differentially down-regulated. Note that a gene with GFOLD value 0 should never be considered differentially expressed. However, this does not mean that all genes with a nonzero GFOLD value are differentially expressed. When taking the top differentially expressed genes, the user is responsible for selecting the cutoff.

  3. E-FDR:

    Empirical FDR based on replicates. It is always 1 when no replicates are available.

  4. pval:

    p-value for every gene. The formula is: if log2fdcMean[i] > 0, pval = P(X > log2fdcMean[i]), otherwise pval = P(X < log2fdcMean[i]), where X follows a normal distribution with mean 0 and variance logvar_first + logvar_second + pow(log2fdc_sd,2). Here, logvar_first and logvar_second are the estimated biological variances for the two conditions. When no replicate is available, the calculation assumes that the biological variances are zero. Note that the p-value is not calculated from the GFOLD value, and ranking by this column is not the same as ranking by GFOLD. The p-value calculation assumes that the posterior distribution of the log fold change is normal. This assumption is approximately true, especially for genes with large read counts. When read counts are small, the posterior distribution of the log fold change is skewed to the left if its mean is positive and skewed to the right if its mean is negative. Therefore, the significance reflected by the p-value is underestimated for genes with low read counts, which is not a serious problem because such genes would not be reliably called significant anyway.

  5. padj:

    BH corrected p-value for every gene.

  6. log2fdcMean:

    The mean for the posterior distribution of log2 fold change.

  7. log2fdc_low(c):

    The (c*100)th percentile of the posterior distribution of log2 fold change. c is the parameter specified by -sc.

  8. log2fdc_high(c):

    The ((1-c)*100)th percentile of the posterior distribution of log2 fold change. c is the parameter specified by -sc.

  9. log2fdc_sd:

    The standard deviation for the posterior distribution of log2 fold change.

  10. log2fdc_skewness:

    The skewness for the posterior distribution of log2 fold change.

  11. log2expMeanFirst:

    The mean of the posterior distribution of log2 expression of genes in the first group.

  12. log2expMeanSecond:

    The mean of the posterior distribution of log2 expression of genes in the second group.
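The column definitions above can be made concrete with a short Python sketch. It is an illustration under stated assumptions, not gfold's source code. gfold_from_quantiles is one reading of the GFOLD description in terms of the log2fdc_low(c) and log2fdc_high(c) columns, pval follows the formula given for the pval column, and rank_by_gfold assumes that lines starting with '#' are header lines. All three function names are hypothetical.

```python
import math


def gfold_from_quantiles(low, high):
    """GFOLD from the c and (1-c) posterior quantiles of the log2 fold change.

    If even the lower quantile is positive the gene is called up-regulated,
    if even the upper quantile is negative it is called down-regulated,
    otherwise the value is zero.
    """
    if low > 0:
        return low
    if high < 0:
        return high
    return 0.0


def pval(log2fdc_mean, log2fdc_sd, logvar_first=0.0, logvar_second=0.0):
    """Tail probability of log2fdc_mean under N(0, variance).

    The biological variances default to zero, as in the no-replicate case.
    """
    variance = logvar_first + logvar_second + log2fdc_sd ** 2
    if variance <= 0:
        raise ValueError("variance must be positive")
    # By symmetry, P(X > m) for m > 0 and P(X < m) for m <= 0 both equal
    # the upper tail probability at |m|.
    z = abs(log2fdc_mean) / math.sqrt(variance)
    return 0.5 * math.erfc(z / math.sqrt(2))


def rank_by_gfold(lines):
    """Return (gene, GFOLD) pairs from diff output lines, largest GFOLD first."""
    rows = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # assumption: '#' marks header lines
        fields = line.rstrip("\n").split("\t")
        rows.append((fields[0], float(fields[1])))
    return sorted(rows, key=lambda row: row[1], reverse=True)
```

With rank_by_gfold(open("sample1VSsample2.diff")), the up-regulated genes appear at the top of the returned list and the down-regulated genes at the bottom; the cutoff for calling a gene differentially expressed remains the user's choice.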


AUTHOR

Jianxing Feng (jianxing.tongji@gmail.com)
