What’s the right way to sequence an exome? We’ve been looking at Illumina’s v4 chemistry for HiSeq 2500 and wondering whether we should jump to PE125bp or not, or should we try to reconfigure our exome capture for shorter or longer fragments.
Exome-seq: Exomes have been a big hit, there are currently over 3000 publications in PubMed with the search term “exome”. Given that the first in-solution exome paper was only published in 2009 that’s pretty amazing, but then again the exome is an amazing research tool.
Note to readers: This post started out as a writing down my thoughts about whether we should move to longer reads for exomes. But it has become a bit more rambling as I started to find out I need to do some mroe digging. I may well come back to this post with an update or version two…
There are many ways to prepare an exome for sequencing and in my lab we’re currently using Illumina’s rapid exome kit. We’re also about to compare this to Agilent’s new SureSelect QXT kit which is a direct competitor to Illumina’s Nextera-based offering. But we’ve never tried Nimblegen or AmpliSeq, however this post is more about how to sequence the exome than prepare it so enough of kit comparisons.
The standard exome: Their are two things you need to consider when sequencing exomes: read depth and read-length. I’m not going to worry about depth in this post, and instead I’m going to focus on read-length. Today most labs appear to be running exomes at PE75bp, a standard which I am not sure has ever been agreed by anyone, but it has been accepted as being good-enough for most projects (Illumina recommend PE75-100). I know of some groups that moved over to PE100 to simplify lab logistics as much as anything else, but I am not clear that there are significant benefits to increasing length so we’ve stuck at PE75 for the time being.
Are longer reads better: With the advent of v4 chemistry on HiSeq 2500 we should be able to generate high-quality paired-end 125bp reads, albeit with a slightly higher error rate at the end of the read. At first glance this additional data seems too good to ignore, especially when Illumina do not sell a 150cycle SBS kit, and 3x50cycle SBS would not be that much cheaper (and more hassle for my lab staff!) By my reckoning PE75 costs Â£900 per lane whilst PE125 is Â£1200, or Â£300 for an extra 100bp of coverage. So if cost does not prevent us using PE125, should we simply switch?
Insert size vs read-length: As you can see below the average distribution of exome fragments size spans the read-length of the sequencer. The solid black line indicates 150bp (PE75): everything to the left of this will be fragments sequenced with an overlapping reads (opes), whilst everything to the right is sequenced with non-overlapping reads (nopes). As read-length increases the percentage of fragments sequenced with an overlap also increases, at PE100 (dashed line) this is over 50% of reads, and at PE125 (dotted line) it’s about 75% of all fragments. An overlapping read creates some issues as the two reads are not independent, tools need to take the overlap into account when calculating on-target coverage, etc; but it also offers the opportunity to increase variant calling quality by increasing Q-scores in the overlap region.
Exome libraries may not be the best size for sequencing: If a non-overlapping read is the best kind to generate then we may need to reconfigure library prep in the light of v4 chemistry. An interesting comparison can be made to the Agilent Bioanalyser trace below the computed insert size distribution. If you overlay and rescale the two images, then the Agilent trace appears to be peak-shifted to larger fragments, and the right-hand fragment distribution is much broader. This appears to demonstrate the preference of clustering:sequencing for shorter fragments.
Exome libraries are probably the best size for capturing exons: The average exon length in the Human genome is 170 bp with 80â€“85% exons less than 200bp (Zhu et al & Sakharkar et al) so the 185bp average fragment length seems almost ideal.
|Table reproduced from Shkharkar et al 2004
So what’s the sweetspot for Exome capture and sequencing: The simple answer is I don’t know, and several factors are likely to affect this. As we increase read-length we’ll get more fragments with overlapping reads that could be wasteful; the same happens if we decrease fragment size so longer reads give us more and more overlap with higher quality. But unless there are tools to make use of this the data are redundant. So fragments should not be longer than reads.
But fragments are captured by probes of 95bp so we should probably not make fragments shorter than probes.
Exome capture kits contain blocking oligos to prevent adapter:adapter hybridisation and off-target pull-down. As fragment length increases then the amount of near-target sequence captured may increase meaning we should not make fragments too long. A long fragment risks too much off-target enrichment by the secondary capture of off-target fragments.
Lastly (for now) we’d like to be able to use independent fragments for our analysis so read-pairs might be better replaced with longer single-reads, but twice as many. So perhaps the answer is probes that efficiently capture exons with little or no fragment:fragment hybridisation, coupled to single-end 185bp sequencing with low error-rate across the reads.