RNA-seq advice from Illumina

This article was commissioned by Illumina Inc.

The most common NGS method we discuss in our weekly experimental design meeting is RNA-seq. Nearly all projects will use it at some point to delve deeply into hypothesis-driven questions, or simply as a tool to go fishing for new biological insights. It is amazing how far a project can progress in just 30 minutes of discussion: methodology, replication, controls, analysis, and all sorts of bias get covered as we try to come up with an optimal design. However, many users don’t have the luxury of in-house Bioinformatics and/or Genomics core facilities, so they have to work out the right sort of experiment to do for themselves. Fortunately, people have been hard at work creating resources that can really help, and most recently Illumina released an RNA-seq “Buyer’s Guide” with lots of helpful information, including how to keep costs down.

Illumina’s “Buyer’s Guide”: the guide offers advice on common RNA-sequencing methods and should help new users evaluate the many options available for next-generation sequencing of RNA. Anyone considering a differential gene expression analysis experiment should have RNA-seq as their platform of choice, and the guide presents three simple steps that walk users through the different aspects of their experiments.

1) First of all make sure you understand what your scientific question is! This sounds simple, but all too often people want to get too much out of one experiment and end up in a bit of a mess. Better to answer one question well than two questions badly. Once you’ve thought about this it should be clear whether you want to analyse mRNAs for a simple differential gene expression experiment or are after something else, e.g. splicing, and also whether you’ll need to look at more than just poly-adenylated mRNAs. If possible, try to determine ahead of time whether the genes you’re interested in studying are highly expressed or very rare.

2) Once you’ve thought about this you can consider what sort of samples you have: are they low quality and/or low quantity? You should also consider who’s going to do the work in the lab and who’s going to analyse the sequence data.

3) Now you can really think about the final experimental design: what type of library preparation kit to use, replicate numbers, proper controls, depth of sequencing, etc. Illumina’s RNA-seq Buyer’s Guide describes some of the things you’ll need to consider in choosing the read-depth and run-type, and also includes some tips for keeping the costs of your experiment down.

What do people mean when they say “RNA-seq”: When people say “RNA-seq” most of them are talking about differential gene expression (DGE) by sequence analysis of reverse-transcribed poly-adenylated mRNAs, but by changing the depth or type of sequencing, and/or choosing a different library prep kit, you can investigate so much more. The guide includes three different scenarios for RNA-seq experiments: basic differential gene expression; DGE and allele-specific expression plus isoforms, SNVs and fusions; and finally whole transcriptome analysis. These show the breadth of experiments you can consider once you’ve mastered this method.

The first two scenarios showcase the power of RNA-seq and demonstrate how using a single library prep method, but varying the sequencing, allows very different questions to be asked of your samples. The guide recommends Illumina’s TruSeq Stranded mRNA-seq kits (these are the ones we use most in my lab and we have done so ever since beta-testing the original RNA-seq kit many years ago). Scenario #1 is a simple DGE experiment and Illumina recommends you generate ≥ 10 million reads per sample, using single-end 50 bp reads (SE50). Scenario #2 allows a full mRNA analysis by simply changing read depth to ≥ 25 million reads per sample, and using paired-end 75 bp reads (PE75).

If you are interested in more than poly-adenylated mRNAs then changing the RNA-seq library prep kit to Illumina’s TruSeq Stranded Total RNA gets rid of ribosomal RNAs, letting you analyse both coding and non-coding RNA. Much greater read depth is needed and Illumina recommends ≥ 50 million PE75 reads per sample. Completing the RNA-seq line-up are the TruSeq small RNA kits, which allow you to analyse microRNAs and other smaller transcripts; usually this requires only ≥ 1-2 million SE50 reads per sample.
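To make these numbers a little more concrete, here is a minimal Python sketch that tabulates the per-sample minimums quoted above and totals them for a whole experiment. The dictionary keys, the function name and the example design (2 groups of 6 replicates) are illustrative assumptions, not anything taken from the guide, and for small RNA the lower end of the 1-2 million range is used.

```python
# Minimum reads per sample (in millions) and read type, as quoted above.
# The application labels are made up for this sketch.
GUIDE_MIN_READS_M = {
    "mRNA DGE": (10, "SE50"),
    "mRNA full analysis": (25, "PE75"),
    "total RNA": (50, "PE75"),
    "small RNA": (1, "SE50"),  # lower end of the quoted 1-2 million range
}


def total_reads_millions(application, groups, replicates_per_group):
    """Return the minimum total reads (millions) for the whole experiment."""
    min_reads, _read_type = GUIDE_MIN_READS_M[application]
    return min_reads * groups * replicates_per_group


if __name__ == "__main__":
    # Hypothetical design: 2 groups with 6 biological replicates each.
    for application, (min_reads, read_type) in GUIDE_MIN_READS_M.items():
        total = total_reads_millions(application, groups=2, replicates_per_group=6)
        print(f"{application}: {min_reads}M {read_type} per sample, {total}M in total")
```

These are minimums, so a real design would usually leave some headroom for uneven pooling between samples.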

How do Illumina’s recommendations stack up: The guide is pretty good in the suggestions it makes for common RNA-seq methods. I’d aim a bit higher for DGE and suggest ≥ 20 million reads per sample to allow profiling of high, medium and lowly expressed genes. I’m really not keen on the suggestion that MiSeq or NextSeq mid-output are good tools for RNA-seq as, from my experience, most experiments with sufficient replication will be too large to fit into a single sequencing run. I’d argue that the cheapest way to get your RNA-seq data is going to be on HiSeq 4000, until of course we can run RNA-seq on X Ten. Of course not everyone should buy a HiSeq, and a MiniSeq, MiSeq or NextSeq may be a good fit for your own laboratory, but I’d encourage you to consider the benefits of using your local core lab first, especially if you are planning on doing experiments bigger than 12-24 samples. I’m not sure I’d argue quite as strongly for paired-end data and would prefer splicing, ASE and fusion detection to come from higher depth sequencing instead (50M SE50 reads cost about the same as 25M paired-end 75 bp reads).
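The point about experiments outgrowing a single run is simple arithmetic, sketched below in Python. The run output used in the example (100 million reads) is a made-up placeholder, not the specification of any instrument or kit, and the function names are hypothetical; substitute the figures for your own sequencer.

```python
import math


def samples_per_run(reads_per_run_millions, reads_per_sample_millions):
    """How many samples fit in one run at the target depth (rounded down)."""
    return int(reads_per_run_millions // reads_per_sample_millions)


def runs_needed(n_samples, reads_per_run_millions, reads_per_sample_millions):
    """How many runs are needed to sequence every sample at the target depth."""
    per_run = samples_per_run(reads_per_run_millions, reads_per_sample_millions)
    if per_run == 0:
        raise ValueError("a single run cannot reach the target depth for one sample")
    return math.ceil(n_samples / per_run)


if __name__ == "__main__":
    # Placeholder run output of 100M reads; 24 samples at 20M reads each.
    print(samples_per_run(100, 20))   # samples that fit in one run
    print(runs_needed(24, 100, 20))   # runs required for the whole experiment
```

With these placeholder numbers only five samples fit in a run, so a 24-sample experiment would need five runs.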

Why does my lab focus on mRNA-seq DGE: My own choices for RNA-seq are primarily informed by the questions people say they want to answer in experimental design discussions, and nearly all of these are differential gene expression questions. As such my lab runs lots and lots of Illumina’s stranded mRNA-seq kits. We only run some form of ribosomal reduction when the experiment warrants it, as these methods generally require deeper sequencing for the same differential gene expression analysis power. We’ve very few users who need to run FFPE RNA, so although we tested the RNA Access kit, we’ve yet to really use it in a significant project. This is partly because the research groups coming to my lab understand the limitations of FFPE samples, and work hard to procure fresh frozen material wherever possible.

A brief bit about informatics: This article is focussed on the wetlab, but without a good analysis pipeline you’ll be stuck with some big but unusable FASTQ files. The analysis requirements are heavily influenced by the biological questions being asked, by the samples available, and by the library preparation and sequencing performed. I’d always recommend that users make sure they know what analysis is likely to be performed before generating data.

Many others have weighed in on how to use and design RNA-seq experiments (see the list of my favourite references at the bottom of this post). Nearly everyone agrees that replication is key, with most people suggesting 4-6 biological replicates. Most papers agree on read-depth being kept to under 20M reads per sample. The ENCODE RNA-seq guidelines are very different, recommending just two biological replicates and 30M paired-end reads per sample. I’ve never agreed with this, even when it was published in 2011, and have steered people to other resources. The Blogosphere also offers lots of help; a 2013 post by GKNO (Marth lab, U. Utah) and the RNA-seqlopedia (U. Oregon) are two great reads for people who want to know more.

All Illumina products listed are for research use only. Not for use in diagnostic procedures (except as specifically noted).

Further reading:

How many biological replicates are needed in an RNA-seq experiment and which differential expression tool should you use? RNA. 2016. This paper really pushes towards answering the question most people want answered. The authors present a very highly replicated study and show that as many as 20 biological replicates were required to detect 85% of DGE accurately. They recommend using 6 biological replicates in RNA-seq experiments as a minimum, and edgeR or DESeq2 as the best tools. They used single-end sequencing and generated 0.8-2.6 million reads per technical replicate, equivalent to about 10M per biological sample.
Experimental Design and Power Calculation for RNA-seq Experiments. Methods Mol Biol. 2016. This book chapter reviews the major factors that influence the statistical power of detecting DGE.
Designing alternative splicing RNA-seq studies. Beyond generic guidelines. Bioinformatics. 2015. This paper describes how sequencing depth and length, library preparation and the level of replication affect the cost-effectiveness of single-sample and group comparison studies. They present data showing how short reads outperformed long reads for most analyses.
Power analysis and sample size estimation for RNA-Seq differential expression. RNA. 2014. In this paper the authors compare and evaluate five differential expression analysis packages: DESeq, edgeR, DESeq2, sSeq, and EBSeq. They show that increasing sample size is preferable to increasing sequencing depth past 20 million reads.
RNA-seq differential expression studies: more sequence or more replication? Bioinformatics. 2014. This paper describes the explicit trade-off between numbers of biological replicates and depth of sequencing in increasing the power to detect DGE. They suggested that greater than 10M reads was unnecessary and that more replicates should be the strategy of choice to increase power and accuracy in RNA-seq studies.
Accounting for technical noise in single-cell RNA-seq experiments. Nat Methods. 2013. This paper presents a quantitative statistical method to distinguish biological variability from technical noise in single-cell RNA-seq.
Scotty: a web tool for designing RNA-Seq experiments to measure differential gene expression. Bioinformatics. 2013. This paper presents a web-based tool, Scotty, to assist in the design of RNA-seq experiments with appropriate sample size and read depth.
RNA-SeQC: RNA-seq metrics for quality control and process optimisation. Bioinformatics. 2012. Authors from the Broad Institute present the RNA-SeQC tool for quality control of data before DGE analysis. They provide metrics including yield, alignment and duplication rates; GC bias, rRNA content, regions of alignment (exon, intron and intragenic), continuity of coverage, 3′/5′ bias and count of detectable transcripts.
Design and validation issues in RNA-seq experiments. Brief Bioinform. 2011. This paper reviews the experimental design issues pertinent to RNA-seq.
RNA-seq: technical variability and sampling. BMC Genomics. 2011. This paper analysed technical bias in 3 replicated RNA-seq experiments and showed that low coverage (less than 5 reads per base) leads to a significant increase in technical noise, and that understanding sampling bias is an issue that needs to be considered.
Transcriptome genetics using second generation sequencing in a Caucasian population. Nature. 2010. One of the first papers to suggest that a relatively low read-depth for RNA-seq of just 10 million reads “gave the same dynamic range as microarrays, with better quantification of alternate and highly abundant transcripts”. However they used paired-end reads in their analysis.
RNA-seq: an assessment of technical reproducibility and comparison with gene expression arrays. Genome Res. 2008. In this paper the authors estimated the technical variance in RNA-seq and compared it to arrays for detecting differentially expressed genes.

8 Comments

Anonymous 2016-07-26 at 7:26 am - Reply

"This article was commissioned by Illumina Inc."

No comments indeed.
Chris Cole 2016-07-26 at 8:53 am - Reply

Nice summary.

Just a quick comment. In our paper "How many biological replicates are needed in an RNA-seq experiment and which differential expression tool should you use?" we recommend EdgeR and DESeq2 not so much DESeq. Please can you correct that? Thanks.
Phil Chapman 2016-07-26 at 9:58 am - Reply

Really nice article. On the bioinformatics side for simple DGE at least you won't go far wrong following the Rsubread/EdgeR or DESeq2 workflows described in the papers below:
http://f1000research.com/articles/5-1438/v1
http://f1000research.com/articles/4-1070/v1
James@cancer 2016-07-26 at 7:28 pm - Reply

Hi Chris, happy to make the revision. In the paper you say that with "higher replicate numbers, minimising false positives is more important and DESeq marginally outperforms the other tools" – which was my reason for pointing to both tools.
Anonymous 2016-07-29 at 3:42 pm - Reply

We've found the NextSeq to be the best option for DGE of mRNA. Sure, it's a bit more expensive than the HiSeq 4000, but there's no waiting to fill up 8 lanes. 12 or 24 samples can be run overnight. The 75 cycle kit is also paired-end capable for no extra cost. With the additional unused cycles that come from not doing dual indexing, you can squeeze out 42bp paired end (43|6|0|43) which we've found the best for the money. We've found paired gives a small improvement in mapping, and you lose less data if you deduplicate.
Chris Cole 2016-08-03 at 2:20 pm - Reply

Hi James. Thanks for the revision.
DESeq is best with >12 reps, but at that level of replication there isn't a huge amount of difference. It's also an unreasonable no. of reps for a typical expt. Best to focus on 6 or fewer reps as you have done, which is where edgeR and DESeq2 outperform the others.
James@cancer 2016-08-26 at 11:50 am - Reply

I have a strong preference towards HiSeq but this is because I have three instruments in the lab and we don't need to wait for flowcells to fill up. I agree that where you are dealing with lower throughput NextSeq is a good choice. However I really don't think paired-reads bring any extra benefit, but if they are "free" then why not…did you consider going for 84bp reads to increase the number of spliced-reads?


About the Author: James
