A paper uploaded to BioRxiv yesterday is the first (almost) published report of index-swapping on Illumina’s Exclusion-Amplification (ExAmp) chemistry. The paper: Index Switching Causes “Spreading-Of-Signal” Among Multiplexed Samples In Illumina HiSeq 4000 DNA Sequencing, by a group from Stanford Functional Genomics Facility published a re-analysis of single-cell RNA-Seq data from SMART-Seq2 plate-based scRNA-Seq library prep. They show that up to 5-10% of sequencing reads are incorrectly assigned from a given sample to other samples in a multiplexed pool. Their paper presents the first (almost) published data supporting a hypothesis I proposed at the end of last year (probably when Stanford were running their experiments) that index-swapping is a result of the ExAmp clustering chemistry, and that free adapters/primers are the primary cause of the issue, but that mitigation strategies using unique-at-both-ends index combinations are likely to be required.
This is an issue I’ve been writing about for a while but we’ve not seen any problems quite as obvious as the Stanford group reported – however they were running a single-cell RNA-Seq experiment where some genes were very highly expressed so their “signal” is amplified. An internal exome analysis showed that we were getting the <1% index-swapping Illumina suggested would be present (data shown in my Dec 16th 2016 post).
Back in January last year (just after we’d bought our HiSeq 4000’s) I posted about the ExAmp chemistry – it was cool and I did not really think about the impact of a reaction that could proceed not only on the surface of the flowcell, but also in-solution (5). We saw an increase in read duplication which was an issue we spent some time looking into, but ultimately this was not a problem for most of our studies (4). Then in the Summer of 2016 we started to see PhiX in our demultiplexed FASTQ (it does not carry and index so should not be there) and started to think about index swapping between samples (3). As we were investigating this issue I started to do the same digging into Illumina’s patents (WO2013188582 A1 and US9169513 B2) that the Stanford group did to try and work out how ExAmp actually worked (2). My last post on the issue asked the question “should you buy a NovaSeq” and one of the reasons users need to consider is this index swapping (1).
- Should you buy a NovaSeq for your core lab (Jan 30th 2017)
- Index mis-assignment between samples on HiSeq 4000 and X-Ten (Dec 16th 2016)
- Index mis-assignment to Illumina’s PhiX control (Oct 7th 2016)
- Increased read duplication on patterned flowcells- understanding the impact of Exclusion Amplification (May 23rd 2016)
- (almost) everything you wanted to know about @illumina HiSeq 4000…and some stuff you didn’t (Jan 18th 2016)
Twitter storm?
Whilst “Twitter storm” may be an exaggeration there has certainly been quite a lot of activity around the paper…
If it's on the 4000, it's on the X and NovaSeq https://t.co/ckZQbwj7nc
— Mick Watson (@BioMickWatson) April 10, 2017
Also explains bizarre limiting of X10 platform to WGS only: to avoid multiplexing. Suggests Illumina have known about ExAmp issues for ages https://t.co/CE7zyj8uhf
— Daniel Gaffney (@danjgaffney) April 10, 2017
I think this is a genuine cluster fuck https://t.co/95GJuztQNN
— Gavin Sherlock (@gsherloc) April 9, 2017
https://twitter.com/marasawr/status/851514389329059844
Take home for microbial ecologists (et al): safest to stick w/ MiSeq or HiSeq 2500 – new chemistry on HiSeq 3000 & 4000 bad for multiplexing https://t.co/dQOyD7X6ZU
— Naupaka Zimmerman (@naupakaz) April 10, 2017
and perhaps most importantly…
Wow . I never fail to be amazed how long effects like these can go unnoticed despite large number of users @center4bfg https://t.co/H3XdgROvwe
— ben berman (@benbfly) April 9, 2017
What now:
Illumina have been working on this issue since the Summer of last year – almost 12 months. Unfortunately they have delayed any public announcement to users (similarly to MiSeq carry-over and long-read sequencing error rate issues) even though customers have been talking about it (at least one question was asked at #AGBT17). IMHO the time to discuss more openly this was several moths ago. Illumina missed an opportunity to explain the issue to users at their most recent UGMs, at AGBT or even pre-Christmas 2016 – bad form!
The Stanford group noticed the problem because they were running single-cell sequencing using SMART-Seq2 using dual-indexed adapters in 384-well plates. When analysing the same library run on either NextSeq or HiSeq 4000 they saw distinct “cross” patterns (anyone else think it somewhat ironic that it’s nearly Easter) where signal was bleeding from a source well to the samples in the row or column with either an index from the P5 or P7 end (see Fig 12 opposite). Their re-analysis of the intial results showed that “the 41 putative subpopulations of mHSCs identified by our subsequent computation analysis of the single cell expression data was not biologically meaningful, but was the result of the artifact or systematic technical error.”
- How many other Fluidigm C1 or plate-based SMART-Seq papers have been published? All of those will need checking.
- How many datasets have been generated that may need to be re run on Next-Seq or HiSeq 2500 v4?
- How much will this cost Illumina?
What next
First off it should be obvious that this paper need to be peer-reviewed and replicated by other groups (I’ll be speaking to our Bioinformatics core tomorrow about what we can do to more carefully assess data generated on our HiSeq 4000’s). If it is replicated then Illumina may have a big issue on their hands – potentially of the magnitude of the 2009ish problems with Paired-End kits. Whether this will have any impact on the investment communities bullish outlook for Illumina, or on sales of NextSeq remains to be seen. But if Illumina need to repeat the sequencing for the probably 100+ SMART-Seq studies already published, let alone being prepped right now, and repeat this on NextSeq there’ll be a big bill.
- NextSeq costs £5.71 per million reads compared to HiSeq 4000’s £1.69.
What about other methods
The Sinha et al paper reported on single-cell RNA-seq with the SMART-Seq method which uses plate-based library prep with dual-indexes. Any dual-index prep where indexes are not unique-at-both-ends is likely to suffer to some extent. The safest bet is to use unique index combinations, but that’s not great for people running highly-multiplexed RNA-Seq, Exome-Seq or similar.
I’m still thinking about the impact on experiments but for much of our bulk RNA-Seq, ChIP-seq, sWGS, and as many other experiments as we can manage, we run 3, 4 or even up to 6 replicates, or we’re running tens or hundreds of clinical samples as biological replicates. In these cases index swapping should be massaged out in the data analysis – we’re only going to proceed with variants that are robust.
I’d like to see Illumina, and other library-prep kit manufacturers, produce unique-at-both-ends adapter sets. For me this is still the most powerful way to remove index-switched samples before downstream analysis. Once this is broadly available (NEB, Agilent, Bioo, IDT, etc, etc, etc) this might just be a headache that goes away.
I’ll follow up on this post in a couple of weeks. I’m sure this story is just going get hotter until there are some clear mitigations in place.
[…] where a handful of erroneously indexed reads could have a big impact on inferences. (Blog posts here and here offer additional perspective on how big a deal this might turn out to be.) We’ll follow […]
Thanks for pointing this post out (Ethan?) another one to read which covers the BioRxiv paper is from UCDavis: http://dnatech.genomecenter.ucdavis.edu/2017/04/11/update-on-barcode-mis-assignment-issue
If the mechanism as described in Sinha is correct, then this would likely only be a problem with libraries amplified using indexed primers such as Nextera. Unamplified “TruSeq” libraries (unless they contain some unligated Y-adapters, which could serve as primers during the index swapping reaction) would probably suffer less from this. Also, libraries amplified using non-indexed primers such as amplified “TruSeq” libraries would likely not be affected by this mechanism.
How about an Exonuclease I treatment of the amplified library just before the final SPRI cleanup to destroy any remaining primers?
You’re right to be thinking along the lines of Exo treatment to remove excess free primer – remember ExoSAP for Sanger cleanups! I’ve been thinking how we might incorporate UNG into our NGS applications for a while. This was initially to remove all library after prep from the lab by starting each new reaction with a UNG step (similar to ABI AmpErase chemistry), or by ordering index primers/adapters with Uracils so they could be degraded before pooling (extended molecules would not contain Uracil).
To be honest, my tweet was more a play on words, rather than intending to indicate that I’d done any kind of analysis of our data, or that my lab is affected (we might be, but we’re still looking into it. Guess I should be more careful about what I tweet!
Thanks for commenting Gavin.
I loved the Tweet and it certainly stood out from some of the others! Don’t stop playing with those words…set them free!
James.
The extent of this effect should be quite easy to figure out empirically with one’s own runs with the appropriate controls. Like you said, PhiX contamination in libraries is the obvious choice for checking the extent of this effect.
The big question is: how much contamination is large enough to worry about? For variant calling it would obviously set a lower threshold on somatic variants. But for applications like RNA-Seq people are interested in large effects — even in single cell RNA-Seq experiments people are looking for huge changes because of the inherent lack of sensitivity.
It would really help if PhiX was uniquely indexed at both ends as then it would be simple to measure the extent of index-swapping in almost any experiment. Unfortunately it is not.