How good are the ENCODE RNA-Seq guidelines?

The ENCODE consortium has released its first set of data-standards guidelines and, interestingly, they are for RNA-Seq. ChIP-Seq guidelines will follow later, which is a little surprising considering most of the ENCODE data so far is ChIP-Seq (see below). In some ways I’d have preferred to see the ChIP-Seq document first. As ChIP-Seq is pretty mature, it would have been clear how much ENCODE had taken into account the different lab and analytical methods and distilled what was important in an experiment.

There is a bit of a hole in the guidelines from my point of view as there is no comparison or recommendation on methods. When I first looked at the site this is exactly the information I was hoping to get. There is none, zip, zilch! I was also surprised that there are no references in these guidelines. I think this is a significant shortcoming from the ENCODE consortium and one that needs to be fixed. I would very much hope that there are protocol recommendations for the more mature ChIP-Seq methods when those guidelines are written.

A lot of the guideline recommendations come from experience of microarrays. This document is nowhere near as comprehensive as MIAME but I think it will be easier for users to adopt because of this. The Metadata section is a nice concise list of information to collect for an experiment, RNA-Seq or otherwise. I’d encourage anyone doing a sequencing or array experiment to read this list and think about other factors they might need to collect in their own experiments.

Whilst these guidelines are a reasonable start and outline many of the issues RNA-Seq users need to consider, they fall a long way short of being truly useful to someone considering where to start with an RNA-Seq experiment.

ENCODE data so far: About 20 labs have submitted data to ENCODE according to their data summary. When I looked there was no ChIP-Chip data in the summary; almost 85% of the data is from sequencing experiments, with 63% ChIP-Seq, 8% RNA-Seq, 7% DNase-Seq, 4% Methyl-Seq and 2% FAIRE-Seq.

The Guidelines
Methods: The RNA-Seq methods mentioned include transcript quantification, differential gene expression, discovery and splicing analysis. They don’t mention allele-specific expression. Many types of input can be used in these methods: total RNA (including miRNA of course), single-cell RNA, small RNA, polyA+ RNA, polysomal RNA, etc. The authors do state how immature RNA-Seq is and that the applications are evolving incredibly rapidly in almost every part of an experiment: sample prep, sequencing and analysis. They say they don’t aim to cover every possible application but instead focus on the major ones, and they also provide recommendations on metadata, something too many scientists still don’t collect before and during an experiment, let alone submit with the data for analysis.

Metadata: Recommendations include the usual suspects. For cell lines: accession number, passage number, culture conditions, STR and Mycoplasma test results. For tissue: the source and genotype if this is an animal, sample collection and processing methods, and cellularity scores. And for the final RNA: the method used for extraction and the QC results (bioanalyser database anyone?).
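As a rough illustration of how that list could be captured up front, here is a minimal Python sketch. The class and field names are my own invention based on the items listed above, not a schema from the ENCODE document, and the example values are made up.

```python
from dataclasses import dataclass, fields
from typing import Optional


# Hypothetical record types: field names mirror the list above,
# they are NOT an official ENCODE schema.
@dataclass
class CellLineMetadata:
    accession_number: Optional[str] = None
    passage_number: Optional[int] = None
    culture_conditions: Optional[str] = None
    str_test_result: Optional[str] = None
    mycoplasma_test_result: Optional[str] = None


@dataclass
class TissueMetadata:
    source: Optional[str] = None
    genotype: Optional[str] = None  # only relevant for animal samples
    collection_method: Optional[str] = None
    processing_method: Optional[str] = None
    cellularity_score: Optional[float] = None


@dataclass
class RnaMetadata:
    extraction_method: Optional[str] = None
    qc_result: Optional[str] = None


def missing_fields(record):
    """Return the names of fields that have not been filled in yet."""
    return [f.name for f in fields(record) if getattr(record, f.name) is None]


if __name__ == "__main__":
    # Made-up example: a cell line with only two items recorded so far.
    sample = CellLineMetadata(accession_number="EXAMPLE-001", passage_number=12)
    print("Still to collect:", missing_fields(sample))
```

Running a check like `missing_fields` before samples go anywhere near a sequencer is one cheap way to make sure the metadata gets collected during the experiment rather than reconstructed afterwards.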

Replication: They say that RNA-Seq experiments should be replicated (biological rather than technical), although ENCODE recommend a minimum of two replicates, which is very low. I defy anyone to find a statistician involved in microarray experiments who would settle for anything less than three, and probably four, replicates today. However, they do give a get-out clause for those who can’t replicate by stating “unless there is a compelling reason why this is impractical or wasteful”. An interesting point is that these guidelines suggest an RPKM correlation of at least 0.92 between replicates is required, otherwise an experiment should be repeated or explained. I would have thought anyone publishing their experiments would already be explaining this and that reviewers would pick up on such poor correlations.
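To make the replicate check concrete, here is a minimal Python sketch. It assumes the usual RPKM definition (reads per kilobase of gene length per million mapped reads) and a Pearson correlation; the summary above does not say which correlation coefficient is meant, so treat that as my assumption. It also takes the total mapped reads to be the sum of the counts supplied, and the counts and gene lengths in the example are invented for illustration.

```python
import math

RPKM_CORRELATION_THRESHOLD = 0.92  # figure quoted from the guidelines


def rpkm(counts, lengths_bp):
    """RPKM per gene: reads / (gene length in kb * total mapped reads in millions).

    Assumption: total mapped reads is taken as the sum of `counts`.
    """
    total_millions = sum(counts) / 1e6
    return [c / ((length / 1e3) * total_millions)
            for c, length in zip(counts, lengths_bp)]


def pearson(xs, ys):
    """Plain Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def replicates_agree(counts_a, counts_b, lengths_bp,
                     threshold=RPKM_CORRELATION_THRESHOLD):
    """Return (correlation, passes) for two replicates of the same gene set."""
    r = pearson(rpkm(counts_a, lengths_bp), rpkm(counts_b, lengths_bp))
    return r, r >= threshold


if __name__ == "__main__":
    # Invented toy data: four genes, two biological replicates.
    lengths = [1000, 2000, 500, 1500]
    rep_a = [100, 400, 50, 300]
    rep_b = [110, 380, 60, 290]
    r, ok = replicates_agree(rep_a, rep_b, lengths)
    print(f"correlation = {r:.3f}, passes threshold: {ok}")
```

In practice a real analysis would work on thousands of genes and would probably compare log-transformed values, but the pass/fail logic is no more complicated than this.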

Read-depth: This is one of the hottest topics for RNA-Seq. It makes a massive difference to the final cost of the experiment and is a major determinant in the “microarrays vs sequencing” thought process. ENCODE suggest around 30M paired-end reads for differential gene expression; however, Illumina are suggesting you can use as few as 2M reads per sample today if you want the same sensitivity as Affy arrays. That’s a 15-fold difference and I suspect this will be revised in the next version of the guidelines. They do say that other methods will require more reads, up to 200M.
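The arithmetic behind that comparison is easy to sketch. The read depths below are the ones quoted above; the lane yield of 150M reads is a purely hypothetical number I have picked for illustration, so substitute whatever your own instrument delivers.

```python
ENCODE_READS = 30_000_000       # ENCODE suggestion for differential expression
ILLUMINA_READS = 2_000_000      # Illumina's array-equivalent figure
HIGH_DEPTH_READS = 200_000_000  # upper figure quoted for other methods

# Hypothetical lane yield, for illustration only.
ASSUMED_LANE_YIELD = 150_000_000


def fold_difference(higher, lower):
    """How many times deeper one recommendation is than another."""
    return higher / lower


def samples_per_lane(lane_yield_reads, reads_per_sample):
    """Whole samples that fit in one lane at a given depth."""
    return lane_yield_reads // reads_per_sample


if __name__ == "__main__":
    print("ENCODE vs Illumina:", fold_difference(ENCODE_READS, ILLUMINA_READS))
    for label, depth in [("ENCODE", ENCODE_READS),
                         ("Illumina", ILLUMINA_READS),
                         ("high depth", HIGH_DEPTH_READS)]:
        print(label, "samples per lane:",
              samples_per_lane(ASSUMED_LANE_YIELD, depth))
```

Since the per-sample cost scales roughly inversely with the number of samples you can multiplex in a lane, the choice of depth feeds straight into the arrays-versus-sequencing decision.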

ENCODE aim to update this document annually; I am sure many will be encouraged by this as a useful endeavor. What about going a step further with an open-access genomics journal that only covers annual reviews of methods, compares the variations and makes recommendations for a consensus protocol?

“Genome Methods Reviews” perhaps?