Index mis-assignment between samples on HiSeq 4000 and X-Ten

Illumina’s patterned flowcells were a major change to their core technology. It was a change that was a long time coming, with patterned flowcells being discussed many years ago at AGBT. Any major change can throw up nasty surprises, and unfortunately the new ExAmp clustering chemistry appears to have a problem some users will need to think carefully about – swapping of indexes between samples in a multiplex pool. This simply means that the reads from sample A in an A:B mix will contain a small percentage of reads from sample B and vice versa.

For many experiments this really does not matter, so your RNA-seq results from last week are likely to be fine. But for low-frequency allele detection this is likely to be an issue users need to be aware of, and one they should understand.

This blog post describes the problem in some detail (sorry but it’s a long one…). It also tries to explain how ExAmp really works with a lot of the information coming from one of Illumina’s patents (…but it is worth reading it all).

loss-of-snvs-at-different-varian-thresholds

A previous post described the problem of Illumina’s non-indexed PhiX control being given an index read due to signal-bleeding from neighbouring clusters. We’re just implementing a fix for this: an additional index-read sequencing primer (AAGAGCGGTTCAGCAGGAATGCCGAGACCGATCTC) is spiked-into the run chemistry to generate a signal from the PhiX adapters. While investigating this problem we noticed that sample to sample index swapping was also occurring, and this was confirmed by Illumina as being reported in other laboratories.

The low-level “index-swapping” on the patterned flowcells means that the reads you THINK are coming from your sample MAY be coming from one of the others in the pool. And depending on your experimental design this may be impossible to discern for low-variant mutant or low-frequency SNP alleles.

In the figure above I’ve shown some early data from our lab. The project is one hunting for cancer mutations after chemically induced mutagenesis. We’d expect most mutations in this context to be of high allele frequency given the rapid rate of tumour development. What we see is varying numbers of low-frequency variant alleles detected in exome sequencing and a lower rate in WGS. This is presumably due to the different depths of sequencing “low” for WGS and “high” for exomes. But what is also apparent is the rate does not drop to zero in the exome datasets – and I am interpreting this as partly due to index mis-assignment. We’ve not been able to confirm this, or work out what else is contributing to the background rate, but at 0.1-0.2% it may stop us calling variants at below 1%.

For now the major bullet points are:

Single-plex single-sample per lane sequencing is unaffected.
Multiplexed sequencing with unique-at-both-ends dual-indexed adapters is very likely to be unaffected.
Multiplexed sequencing with standard dual-indexed adapters (where one end is shared) is likely to be affected – but at a very low level!

The main mitigation we can apply is to design experiments such that dual-indexes within a pool are unique; however this will complicate our experimental design and lab workflows so should only be considered for projects we think are at risk of being significantly affected. We’re running lots of exome and RNA-seq projects and have favoured pooling the whole experiment for running on multiple lanes to remove sequencing batch effects. I believe this is the best way to design a sequencing experiment. Unfortunately where we are pooling 96 samples today using a TruSeq HT kit with 96 dual-indexes, we could only pool 8 samples with unique-at-both-ends dual-indexed adapters. This would increase sequencing costs for many experiments, and also increase the complexity of pooling design too. I am a firm believer that highly-multiplexed pooling is important for the best designed and most economical research sequencing experiments.

However I am now starting to think “is this issue (index mis-assignment) more important to resolve than sequencing batch effects, which are the main reason for multiplexed sequencing”? – but I/we need to consider this in the context of each experiment planned.

How to fix the problem: This very much depends on what’s causing it, and since no-one appears to know for sure I’ve left that discussion to later in the post. It is a difficult problem for a non-Illumina R&D scientist to figure out. But we don’t necessarily need to know what’s causing it to come up with some solutions, or at least mitigations:

The simplest fix is to go back to running one sample per lane as per most, but not all HiSeq X Ten projects: This will remove much of the sample contamination in the sequencing process (but not library prep). However currently multiplexed HiSeq 4000 methods run as single-plex pools would be inconceivably expensive!
Carefully choose the indexes being used in experiments: This would be a good solution, but only for methods that require upwards for 40 million reads per sample (8-plex pooling) on HiSeq 4000.
Use uniquely dual-indexed libraries: My preferred solution, although one that requires making, or buying, adapters to make libraries that are uniquely-dual-indexed-at–both-ends. Index mis-assignment is at a low-level, so one end will be affected at <1%, making the likelihood of both ends being mis-assigned vanishingly small. All mis-assigned reads would fall nicely into the “lost reads” file.

These solutions will work for some projects but for patterned flowcells to become mainstream outside of HiSeq X, and for HiSeq 4000 to replace 2500 we need to fix the problem at source. That requires understanding what’s causing it, and as ExAmp itself seems such a likely candidate then understanding how it works is likely to be key.

“is index mis-assignment more important to resolve than sequencing batch effects”?

How does ExAmp work: The clever bit is the use of a process that allows cluster growth to proceed at a faster rate than cluster seeding. This should mean that most clusters are empty or contain a single molecule, and a few will contain two molecules i.e Poisson distributed. The two molecule clusters are “Chastity filtered” and most likely this is what causes the 70%PF rates of patterned flowcells. Illumina talk about achieving a Super Poisson distribution – many clusters are likely to contain more than one molecule but most of the polyclonal clusters will have such low amounts of secondary molecules as to be undetectable.

To really get a handle on how ExAmp works requires a little bit of effort as Illumina have not published this work. However their patent pool is often a gold-mine of information (assuming you can make sense of the patent legalese and obfuscation). One patent in particular has formed the basis for much of this post: WO2013188582A1 “Kinetic exclusion amplification of nucleic acid libraries” it describes…

How does ExAmp really work: The amplification reagent mentioned above is where all the action is taking place. The rest of the process is pretty much the same as standard clustering; library prep is unchanged (but could possibly be tweaked to improve ExAmp’s issues); loading the flowcell requires essentially the same denatured and diluted library; and the bridge-amplification we’ve all come to love is still the backbone of clustering. But the new amplification reagent is very different from the cyclical, bridge-PCR of random clustering (HiSeq 2000, Miseq, NextSeq, MiniSeq).

The patent describes several methods for library amplification PCR, RCA, MDA, RPA, but also focuses on two others less commonly used; recombinase (TwistDX) and helicase (BioHelix) amplification. Given the close proximity of TwistDX to Illumina’s UK headquarters in Chesterford I’d be pretty willing to bet that the recombinase method is being used in the ExAmp kits. The patent describes the RPA (Recombinase Polymerase Amplification – PubMed) technology sold by TwistDx as TwistAmp. But their website provides a much clearer picture of recombinase amplification, albeit applied to in-solution chemistry (it also points to two other patents US 5,223,414 and US 7,399,590).

RPA’s really clever bit is that the mix of a recombinase, a single-stranded DNA binding protein, and a polymerase. This mix allows targeting and insertion of oligonucleotide primers into the dsDNA duplex, initiation of strand-displacement DNA synthesis and amplification by polymerase. Amplification proceeds with no thermal or chemical melting required. The reaction is also very rapid taking as little as 5-10 minutes (in solution – it is not clear how quickly a cluster is filled). It is this very rapid amplification of library molecules once they reach the surface of the flowcell that is key to the patterned flowcell clustering process. Amplification by RPA of a single library molecule happens at a rate that fills the nanowell on the flowcell with a cluster, far more quickly that the rate of diffusion of library molecules to the flowcell’s surface and seeding of a cluster.

Controlling cluster seeding rates: Two methods are described in the patent to slow down the seeding, or initial extension of secondary seeding molecules. The first is the use of an electric field that can be used to pull negatively charged DNA down to the flowcell surface, or to repel it, i.e. to control the diffusion rate. The same electric field might be used by Illumina to direct library molecules to nanowells rather than the interstitial space between them. The patent also mentions P5 & P7 primer re-grafting; suggesting that flowcells could be recycled – anyone still got their cluster station? The second method describe is the use of isocytosine (isoC) or isoguanine (isoG) in the DNA adapters. By including these as part of the amplification reagent, but at low concentration relative to their normal nucleotides, the rate of initial template extension can be slowed, allowing a run-away chain-reaction from the first library molecule to seed a cluster once initial extension has been completed.

Illumina have also been working on Recombinase mutants, see patent WO2016054088A1 by Illumina Cambridge Limited. These mutant enzymes “have substantially improved characteristics in the seeding nucleic acids onto a patterned flow cell surface”. They also improved seeding of PCR-free libraries with single-stranded adapter regions onto a patterned flow cell surface, Whether Illumina have tested loading of dsDNA libraries as opposed to the usual denatured and diluted libraries is unclear.

The improvements over the standard Bacteriophage T4 UvsX recombinase are clearly shown in the clustering images presented in the patent. The P256K mutant recombinase (RB49) substantially improved clustering. The patent describes the ExAmp clustering on MiSeq…is this a hint of what might be around the corner!

ExAmp itself may be the problem: It may be that the problem of index mis-assignment is caused by the use of amplification reagent itself. This only seemed obvious to me once I’d thought about our PhiX problem – my hypothesis is that the rapid amplification is happening in the clustering mastermix. Library molecules are being amplified and creating mis-assignment of indexes as they are “swapped” by the ExAmp reaction; and this is happening before anything gets to the flowcell surface.

Most likely it is free adapter that is acting as a primer for the recombinase/ploymerase to amplify from. As adapters are likely to come from all libraries in the pool they will randomly extend library molecules creating the swapped indexes. This could be tested by mixing excess free adapters of known ratios and seeing if that same ratio was apparent in swapped indexes.

Testing to confirm that ExAmp is the cause of the index mis-assignment is going to be important. But if it proves this is what is happening it is unlikely to be easy to fix as it is an inherent part of the ExAmp process. However if we know in advance this is going to happen, and we know what the root cause is, then we can take steps to mitigate the problem…and I’m focusing on the use uniquely dual-indexed adapters

Other things can happen to cause swapping of indexes; oligonucleotide cross-contamination and PCR chimeras from recombining template molecules (a process often referred to as ‘jumping PCR’). Consequently, multiplex capture may introduce significant levels of sample cross-contamination.

The impact on clinical work: Much of the clinical work being undertaken today is on Illumina’s MiSeq and NextSeq platforms and both are unaffected by this issue as far as I know. However anyone wanting to use HiSeq 4000 or X Ten for clinical work (e.g. Grail) will need to pay careful attention to the experimental design and controls to unequivocally demonstrate that variants called are from the correct patient. I guess this will delay the release of patterned flowcells on other platforms until this is fixed.

Bigger pools are probably better than smaller ones – assuming index mis-assignment is random, and variants are also random then only 1% of the data is from mis-assigned indexes, and only 1/96th of that will come from a single individual. This is something that can be modelled.

Issues like this have happened in the past e.g. MiSeq cross-contamination, and Illumina have shown they are resolvable. It is likely that there are other things happening in the clustering chemistry I am unaware of, so Illumina may have other fixes up their sleeve. We’ll be focusing on testing new adaptors. But I am unaware of anyone selling unique-at-both-ends dual-indexed adapters in 96-well plate configuration. I’m currently considering the impact of having to make these ourselves and would prefer not to. So if you know of a provider please do let me know.

PS: A comment on the PhiX post I referred to above queried the prevalence of “sample-to-sample bleeding” on the NextSeq/MiniSeq two-color chemistry. They gave the example of a cluster with no signal that is going to look the same as GGGGGG, which is only two bases off from TruSeq LT index 23 (GAGTGG). They wondered if this makes that index unusable on two-color chemistry?

By James| December 16th, 2016|Categories: Next-generation sequencing|20 Comments

About the Author: James

Permalink
Gallery

Roche emerges from the Lazarus pit

February 13th, 2025 | 0 Comments
Permalink
Gallery

Proteomics. Proteomics. Proteomics.

January 13th, 2025 | 0 Comments
Permalink
Gallery

@tagomics epigenomic profiling of non-methylated CpG’s

April 19th, 2024 | 0 Comments
Permalink
Gallery

2024 @novonordiskfond prize to Solexa: This is why we need a pub at Addenbrookes

March 15th, 2024 | 0 Comments
Permalink
Gallery

Mathias Mann Makes Multiomics Mega

August 17th, 2023 | 0 Comments

20 Comments

Keith E Robison 2016-12-16 at 7:21 pm - Reply

Glad you’re digging into this. It is frustrating how rarely problems with Illumina reagents are reported. One vendor couldn’t get NextSeq high output kits reliably this spring and in summer 2015 the MiSeq 2×300 kits went utterly to hell in terms of data quality. Knowing that issues exist is an important component to picking the right sequencing strategy and the right downstream analytics.
- James 2016-12-21 at 8:37 am - Reply
  
  Hi Keith,
  It has been frustrating trying to work out what sort of projects would be affected by this issue. Most are probably OK. But getting the data so we can determine the impact on our analysis is proving to be tough. We’re trying to determine the background noise in many experiments!
  
  The MiSeq long read issue is one my lab has been relatively unaffected by as most users do not need the paired-end 300bp reads. However I do know labs where projects have been derailed, and finally moved over to a special configuration of long reads on HiSeq 2500 Rapid Runs.
  
  James.
Tony 2017-02-02 at 4:58 pm - Reply

This phenomenon is not just limited to sequencers with ExAmp cluster gen.

https://sequencing.qcfail.com/articles/mixing-sample-types-in-a-flowcell-lane-generates-cross-contamination-artefacts/

We’ve also found that taking into account Q-scores of indexes when demultiplexing can reduce the amount of cross contamination. As fas as I’m aware, bcl2fastq just uses index sequence to demultiplex. Basically, throw out any data with an index read containing one or more bases with a Q-score less than 30. You loose around 5% of your data, but we estimate you reduce mis-assignment by an order of magnitude.
- James 2017-02-06 at 8:25 pm - Reply
  
  Hi Andrew,
  You wrote in your post that the mitigation is “[to mix libraries so that it won’t matter if a very small percent of reads leak from one library to another]”. However the current problems with PCR-free libraries on HiSeq 4000, X-Ten and probably NovaSeq are as bad as 1% contamination. Even if this is random and you’ve got 96 samples in the lane that’s still 0.01% which many circulating tumour groups are aiming for…including Illumina’s Grail.
  
  Decrease the number of samples to 8 and the contamination is at 0.125% – a much bigger problem.
  
  A wet lab mitigation would be to use unique-at-both-ends barcodes as the swapping is happening randomly at either end of the library molecule, the chances of both ends being affected is REALLY low.
Index-swapping with @illumina ExAmp clustering - Enseqlopedia 2017-04-10 at 9:30 pm - Reply

[…] This is an issue I’ve been writing about for a while but we’ve not seen any problems quite as obvious as the Stanford group reported – however they were running a single-cell RNA-Seq experiment where some genes were very highly expressed so their “signal” is amplified. An internal exome analysis showed that we were getting the <1% index-swapping Illumina suggested would be present (data shown in my Dec 16th 2016 post). […]
Geneticists Fear Illumina’s Sequencers May Distort Results - LI Tech News 2017-04-20 at 11:54 am - Reply

[…] since Illumina introduced the ExAmp technology. A genomics core manager at Cambridge University blogged about the problem, as did a Swedish bioinformatician in Stockholm. They used Illumina’s patents to hypothesize some […]
Geneticists Fear Illumina’s Sequencers May Distort Results – Newz Corner 2017-04-20 at 12:14 pm - Reply

[…] since Illumina introduced the ExAmp technology. A genomics core manager at Cambridge University blogged about the problem, as did a Swedish bioinformatician in Stockholm. They used Illumina’s patents to hypothesize some […]
Geneticists Fear Illumina’s Sequencers May Distort Results – Trending News 2017-04-20 at 12:15 pm - Reply

[…] since Illumina introduced the ExAmp technology. A genomics core manager at Cambridge University blogged about the problem, as did a Swedish bioinformatician in Stockholm. They used Illumina’s patents to hypothesize some […]
Geneticists Fear Illumina’s Sequencers May Distort Results… | Startup Wide 2017-04-20 at 12:43 pm - Reply

[…] since Illumina introduced the ExAmp technology. A genomics core manager at Cambridge University blogged about the problem, as did a Swedish bioinformatician in Stockholm. They used Illumina’s patents to hypothesize some […]
The Go-To Gene Sequencing Machine With Very Strange Results 2017-04-20 at 6:19 pm - Reply

[…] since Illumina introduced the ExAmp technology. A genomics core manager at Cambridge University blogged about the problem, as did a Swedish bioinformatician in Stockholm. They used Illumina’s patents to hypothesize […]
binay panda 2017-04-21 at 3:18 am - Reply

james, v useful post. i am late in catching up w the index mis-assignment because we stick to our old hiseq 2500 instrument and it appears that staying w the old instrument has it benefits. illumina has come up w a white paper a couple of days back on this topic (https://www.illumina.com/content/dam/illumina-marketing/documents/products/whitepapers/index-hopping-white-paper-770-2017-004.pdf?linkId=36607862). did you get a chance to look at it and what do you think? thank you. binay
- James 2017-04-21 at 6:54 am - Reply
  
  I’m writing up my thoughts after reading the white paper and the other blogs that have been covering this issue….check back soon.
Update on @illumina index-swapping - Enseqlopedia 2017-04-21 at 2:48 pm - Reply

[…] UCDavis first alerted their users to the issue in early April (thanks for including a link to Enseqlopedia), and most recently pointed users to Illumina white paper and attempt to reassure users that […]
Jessica Davies 2017-08-08 at 9:15 am - Reply

James, your posts are always brilliant. This was a really good and helpful read. Thank you!
From your previous sequencing trainee 😉
- James 2017-10-16 at 3:39 pm - Reply
  
  Hey Jess, thanks for the feedback and great to hear from you.
Christina Tsai 2017-09-21 at 10:19 pm - Reply

I recently found a company selling this “TailorMix Dual-Indexed PhiX Control Library (https://www.seqmatic.com/products/tailormix-dual-indexed-phix/)”. Is it something you was thinking about making?
- James 2017-10-16 at 3:40 pm - Reply
  
  We have thought about using this in the lab. There is a primer you can add to allow the normal PhiX to be read without buying this product. Ask your Illumina FAS.
The Go-To Gene Sequencing Machine With Very Strange Results - Tech News Archive 2019-04-22 at 5:23 am - Reply

[…] since Illumina introduced the ExAmp technology. A genomics core manager at Cambridge University blogged about the problem, as did a Swedish bioinformatician in Stockholm. They used Illumina’s patents to hypothesize some […]
The Go-To Gene Sequencing Machine With Very Strange Results – ganardinero.click 2021-02-28 at 7:44 am - Reply

[…] since Illumina introduced the ExAmp technology. A genomics core manager at Cambridge University blogged about the problem, as did a Swedish bioinformatician in Stockholm. They used Illumina’s patents to hypothesize some […]
@SingularGenomic is coming to a core facility near you in 2022 - Enseqlopedia 2022-01-10 at 5:04 pm - Reply

[…] flow cells. This probably means no ExAmp and therefore no significant problems with index hopping (first described by me in Dec 2016 and by Rahul Sinha in early 2017), which may carve the Singular technology a niche in ctDNA […]