Illumina’s patterned flowcells were a major change to their core technology. It was a change that was a long time coming, with patterned flowcells being discussed many years ago at AGBT. Any major change can throw up nasty surprises, and unfortunately the new ExAmp clustering chemistry appears to have a problem some users will need to think carefully about – swapping of indexes between samples in a multiplex pool. This simply means that the reads from sample A in an A:B mix will contain a small percentage of reads from sample B and vice versa.

For many experiments this really does not matter, so your RNA-seq results from last week are likely to be fine. But for low-frequency allele detection this is likely to be an issue users need to be aware of, and one they should understand.

This blog post describes the problem in some detail (sorry but it’s a long one…). It also tries to explain how ExAmp really works with a lot of the information coming from one of Illumina’s patents (…but it is worth reading it all).

loss-of-snvs-at-different-varian-thresholds
A previous post described the problem of Illumina’s non-indexed PhiX control being given an index read due to signal-bleeding from neighbouring clusters. We’re just implementing a fix for this: an additional index-read sequencing primer (AAGAGCGGTTCAGCAGGAATGCCGAGACCGATCTC) is spiked-into the run chemistry to generate a signal from the PhiX adapters. While investigating this problem we noticed that sample to sample index swapping was also occurring, and this was confirmed by Illumina as being reported in other laboratories.
The low-level “index-swapping” on the patterned flowcells means that the reads you THINK are coming from your sample MAY be coming from one of the others in the pool. And depending on your experimental design this may be impossible to discern for low-variant mutant or low-frequency SNP alleles.

In the figure above I’ve shown some early data from our lab. The project is one hunting for cancer mutations after chemically induced mutagenesis. We’d expect most mutations in this context to be of high allele frequency given the rapid rate of tumour development. What we see is varying numbers of low-frequency variant alleles detected in exome sequencing and a lower rate in WGS. This is presumably due to the different depths of sequencing “low” for WGS and “high” for exomes. But what is also apparent is the rate does not drop to zero in the exome datasets – and I am interpreting this as partly due to index mis-assignment. We’ve not been able to confirm this, or work out what else is contributing to the background rate, but at 0.1-0.2% it may stop us calling variants at below 1%.

For now the major bullet points are:

  • Single-plex single-sample per lane sequencing is unaffected.
  • Multiplexed sequencing with unique-at-both-ends dual-indexed adapters is very likely to be unaffected.
  • Multiplexed sequencing with standard dual-indexed adapters (where one end is shared) is likely to be affected – but at a very low level!

The main mitigation we can apply is to design experiments such that dual-indexes within a pool are unique; however this will complicate our experimental design and lab workflows so should only be considered for projects we think are at risk of being significantly affected. We’re running lots of exome and RNA-seq projects and have favoured pooling the whole experiment for running on multiple lanes to remove sequencing batch effects. I believe this is the best way to design a sequencing experiment. Unfortunately where we are pooling 96 samples today using a TruSeq HT kit with 96 dual-indexes, we could only pool 8 samples with unique-at-both-ends dual-indexed adapters. This would increase sequencing costs  for many experiments, and also increase the complexity of pooling design too. I am a firm believer that highly-multiplexed pooling is important for the best designed and most economical research sequencing experiments.

However I am now starting to think “is this issue (index mis-assignment) more important to resolve than sequencing batch effects, which are the main reason for multiplexed sequencing”? – but I/we need to consider this in the context of each experiment planned.

How to fix the problem: This very much depends on what’s causing it, and since no-one appears to know for sure I’ve left that discussion to later in the post. It is a difficult problem for a non-Illumina R&D scientist to figure out. But we don’t necessarily need to know what’s causing it to come up with some solutions, or at least mitigations:

    1. The simplest fix is to go back to running one sample per lane as per most, but not all HiSeq X Ten projects: This will remove much of the sample contamination in the sequencing process (but not library prep). However currently multiplexed HiSeq 4000 methods run as single-plex pools would be inconceivably expensive!
    2. Carefully choose the indexes being used in experiments: This would be a good solution, but only for methods that require upwards for 40 million reads per sample (8-plex pooling) on HiSeq 4000.
    3. Use uniquely dual-indexed libraries: My preferred solution, although one that requires making, or buying, adapters to make libraries that are uniquely-dual-indexed-at–both-ends. Index mis-assignment is at a low-level, so one end will be affected at <1%, making the likelihood of both ends being mis-assigned vanishingly small. All mis-assigned reads would fall nicely into the “lost reads” file.

These solutions will work for some projects but for patterned flowcells to become mainstream outside of HiSeq X, and for HiSeq 4000 to replace 2500 we need to fix the problem at source. That requires understanding what’s causing it, and as ExAmp itself seems such a likely candidate then understanding how it works is likely to be key.

How does ExAmp work: The clever bit is the use of a process that allows cluster growth to proceed at a faster rate than cluster seeding. This should mean that most clusters are empty or contain a single molecule, and a few will contain two molecules i.e Poisson distributed. The two molecule clusters are “Chastity filtered” and most likely this is what causes the 70%PF rates of patterned flowcells. Illumina talk about achieving a Super Poisson distribution – many clusters are likely to contain more than one molecule but most of the polyclonal clusters will have such low amounts of secondary molecules as to be undetectable.

To really get a handle on how ExAmp works requires a little bit of effort as Illumina have not published this work. However their patent pool is often a gold-mine of information (assuming you can make sense of the patent legalese and obfuscation). One patent in particular has formed the basis for much of this post: WO2013188582A1 “Kinetic exclusion amplification of nucleic acid libraries” it describes…

How does ExAmp really work: The amplification reagent mentioned above is where all the action is taking place. The rest of the process is pretty much the same as standard clustering; library prep is unchanged (but could possibly be tweaked to improve ExAmp’s issues); loading the flowcell requires essentially the same denatured and diluted library; and the bridge-amplification we’ve all come to love is still the backbone of clustering. But the new amplification reagent is very different from the cyclical, bridge-PCR of random clustering (HiSeq 2000, Miseq, NextSeq, MiniSeq).

The patent describes several methods for library amplification PCR, RCA, MDA, RPA, but also focuses on two others less commonly used; recombinase (TwistDX) and helicase (BioHelix) amplification. Given the close proximity of TwistDX to Illumina’s UK headquarters in Chesterford I’d be pretty willing to bet that the recombinase method is being used in the ExAmp kits. The patent describes the RPA (Recombinase Polymerase Amplification – PubMed) technology sold by TwistDx as TwistAmp. But their website provides a much clearer picture of recombinase amplification, albeit applied to in-solution chemistry (it also points to two other patents US 5,223,414 and US 7,399,590).

RPA’s really clever bit is that the mix of a recombinase, a single-stranded DNA binding protein, and a polymerase. This mix allows targeting and insertion of oligonucleotide primers into the dsDNA duplex, initiation of strand-displacement DNA synthesis and amplification by polymerase. Amplification proceeds with no thermal or chemical melting required. The reaction is also very rapid taking as little as 5-10 minutes (in solution – it is not clear how quickly a cluster is filled). It is this very rapid amplification of library molecules once they reach the surface of the flowcell that is key to the patterned flowcell clustering process. Amplification by RPA of a single library molecule happens at a rate that fills the nanowell on the flowcell with a cluster, far more quickly that the rate of diffusion of library molecules to the flowcell’s surface and seeding of a cluster.

Controlling cluster seeding rates: Two methods are described in the patent to slow down the seeding, or initial extension of secondary seeding molecules. The first is the use of an electric field that can be used to pull negatively charged DNA down to the flowcell surface, or to repel it, i.e. to control the diffusion rate. The same electric field might be used by Illumina to direct library molecules to nanowells rather than the interstitial space between them. The patent also mentions P5 & P7 primer re-grafting; suggesting that flowcells could be recycled – anyone still got their cluster station? The second method describe is the use of isocytosine (isoC) or isoguanine (isoG) in the DNA adapters. By including these as part of the amplification reagent, but at low concentration relative to their normal nucleotides, the rate of initial template extension can be slowed, allowing a run-away chain-reaction from the first library molecule to seed a cluster once initial extension has been completed.

Illumina have also been working on Recombinase mutants, see patent WO2016054088A1 by Illumina Cambridge Limited. These mutant enzymes “have substantially improved characteristics in the seeding nucleic acids onto a patterned flow cell surface”. They also improved seeding of PCR-free libraries with single-stranded adapter regions onto a patterned flow cell surface, Whether Illumina have tested loading of dsDNA libraries as opposed to the usual denatured and diluted libraries is unclear.

The improvements over the standard Bacteriophage T4 UvsX recombinase are clearly shown in the clustering images presented in the patent. The P256K mutant recombinase (RB49) substantially improved clustering. The patent describes the ExAmp clustering on MiSeq…is this a hint of what might be around the corner!

ExAmp itself may be the problem: It may be that the problem of index mis-assignment is caused by the use of amplification reagent itself. This only seemed obvious to me once I’d thought about our PhiX problem – my hypothesis is that the rapid amplification is happening in the clustering mastermix. Library molecules are being amplified and creating mis-assignment of indexes as they are “swapped” by the ExAmp reaction; and this is happening before anything gets to the flowcell surface.

Most likely it is free adapter that is acting as a primer for the recombinase/ploymerase to amplify from. As adapters are likely to come from all libraries in the pool they will randomly extend library molecules creating the swapped indexes. This could be tested by mixing excess free adapters of known ratios and seeing if that same ratio was apparent in swapped indexes.

Testing to confirm that ExAmp is the cause of the index mis-assignment is going to be important. But if it proves this is what is happening it is unlikely to be easy to fix as it is an inherent part of the ExAmp process. However if we know in advance this is going to happen, and we know what the root cause is, then we can take steps to mitigate the problem…and I’m focusing on the use uniquely dual-indexed adapters

Other things can happen to cause swapping of indexes; oligonucleotide cross-contamination and PCR chimeras from recombining template molecules (a process often referred to as ‘jumping PCR’). Consequently, multiplex capture may introduce significant levels of sample cross-contamination. 

The impact on clinical work: Much of the clinical work being undertaken today is on Illumina’s MiSeq and NextSeq platforms and both are unaffected by this issue as far as I know. However anyone wanting to use HiSeq 4000 or X Ten for clinical work (e.g. Grail) will need to pay careful attention to the experimental design and controls to unequivocally demonstrate that variants called are from the correct patient. I guess this will delay the release of patterned flowcells on other platforms until this is fixed.

Bigger pools are probably better than smaller ones – assuming index mis-assignment is random, and variants are also random then only 1% of the data is from mis-assigned indexes, and only 1/96th of that will come from a single individual. This is something that can be modelled.

Issues like this have happened in the past e.g. MiSeq cross-contamination, and Illumina have shown they are resolvable. It is likely that there are other things happening in the clustering chemistry I am unaware of, so Illumina may have other fixes up their sleeve. We’ll be focusing on testing new adaptors. But I am unaware of anyone selling unique-at-both-ends dual-indexed adapters in 96-well plate configuration. I’m currently considering the impact of having to make these ourselves and would prefer not to. So if you know of a provider please do let me know.

PS: A comment on the PhiX post I referred to above queried the prevalence of “sample-to-sample bleeding” on the NextSeq/MiniSeq two-color chemistry. They gave the example of a cluster with no signal that is going to look the same as GGGGGG, which is only two bases off from TruSeq LT index 23 (GAGTGG). They wondered if this makes that index unusable on two-color chemistry?