Index-swapping with @illumina ExAmp clustering

A paper uploaded to BioRxiv yesterday is the first (almost) published report of index-swapping on Illumina’s Exclusion-Amplification (ExAmp) chemistry. The paper: Index Switching Causes “Spreading-Of-Signal” Among Multiplexed Samples In Illumina HiSeq 4000 DNA Sequencing, by a group from Stanford Functional Genomics Facility published a re-analysis of single-cell RNA-Seq data from SMART-Seq2 plate-based scRNA-Seq library prep. They show that up to 5-10% of sequencing reads are incorrectly assigned from a given sample to other samples in a multiplexed pool. Their paper presents the first (almost) published data supporting a hypothesis I proposed at the end of last year (probably when Stanford were running their experiments) that index-swapping is a result of the ExAmp clustering chemistry, and that free adapters/primers are the primary cause of the issue, but that mitigation strategies using unique-at-both-ends index combinations are likely to be required.

This is an issue I’ve been writing about for a while but we’ve not seen any problems quite as obvious as the Stanford group reported – however they were running a single-cell RNA-Seq experiment where some genes were very highly expressed so their “signal” is amplified. An internal exome analysis showed that we were getting the <1% index-swapping Illumina suggested would be present (data shown in my Dec 16th 2016 post).

Back in January last year (just after we’d bought our HiSeq 4000’s) I posted about the ExAmp chemistry – it was cool and I did not really think about the impact of a reaction that could proceed not only on the surface of the flowcell, but also in-solution (5). We saw an increase in read duplication which was an issue we spent some time looking into, but ultimately this was not a problem for most of our studies (4). Then in the Summer of 2016 we started to see PhiX in our demultiplexed FASTQ (it does not carry and index so should not be there) and started to think about index swapping between samples (3). As we were investigating this issue I started to do the same digging into Illumina’s patents (WO2013188582 A1  and US9169513 B2) that the Stanford group did to try and work out how ExAmp actually worked (2). My last post on the issue asked the question “should you buy a NovaSeq” and one of the reasons users need to consider is this index swapping (1).

Should you buy a NovaSeq for your core lab (Jan 30th 2017)
Index mis-assignment between samples on HiSeq 4000 and X-Ten (Dec 16th 2016)
Index mis-assignment to Illumina’s PhiX control (Oct 7th 2016)
Increased read duplication on patterned flowcells- understanding the impact of Exclusion Amplification (May 23rd 2016)
(almost) everything you wanted to know about @illumina HiSeq 4000…and some stuff you didn’t (Jan 18th 2016)

Twitter storm?

Whilst “Twitter storm” may be an exaggeration there has certainly been quite a lot of activity around the paper…

If it's on the 4000, it's on the X and NovaSeq https://t.co/ckZQbwj7nc

— Mick Watson (@BioMickWatson) April 10, 2017

Also explains bizarre limiting of X10 platform to WGS only: to avoid multiplexing. Suggests Illumina have known about ExAmp issues for ages https://t.co/CE7zyj8uhf

— Daniel Gaffney (@danjgaffney) April 10, 2017

I think this is a genuine cluster fuck https://t.co/95GJuztQNN

— Gavin Sherlock (@gsherloc) April 9, 2017

https://twitter.com/marasawr/status/851514389329059844

Take home for microbial ecologists (et al): safest to stick w/ MiSeq or HiSeq 2500 – new chemistry on HiSeq 3000 & 4000 bad for multiplexing https://t.co/dQOyD7X6ZU

— Naupaka Zimmerman (@naupakaz) April 10, 2017

and perhaps most importantly…

Wow . I never fail to be amazed how long effects like these can go unnoticed despite large number of users @center4bfg https://t.co/H3XdgROvwe

— ben berman (@benbfly) April 9, 2017

What now:

Illumina have been working on this issue since the Summer of last year – almost 12 months. Unfortunately they have delayed any public announcement to users (similarly to MiSeq carry-over and long-read sequencing error rate issues) even though customers have been talking about it (at least one question was asked at #AGBT17). IMHO the time to discuss more openly this was several moths ago. Illumina missed an opportunity to explain the issue to users at their most recent UGMs, at AGBT or even pre-Christmas 2016 – bad form!

Comparison of Next-Seq and HiSeq 4000 index-swap rates.

The Stanford group noticed the problem because they were running single-cell sequencing using SMART-Seq2 using dual-indexed adapters in 384-well plates. When analysing the same library run on either NextSeq or HiSeq 4000 they saw distinct “cross” patterns (anyone else think it somewhat ironic that it’s nearly Easter) where signal was bleeding from a source well to the samples in the row or column with either an index from the P5 or P7 end (see Fig 12 opposite). Their re-analysis of the intial results showed that “the 41 putative subpopulations of mHSCs identified by our subsequent computation analysis of the single cell expression data was not biologically meaningful, but was the result of the artifact or systematic technical error.”

How many other Fluidigm C1 or plate-based SMART-Seq papers have been published? All of those will need checking.

How many datasets have been generated that may need to be re run on Next-Seq or HiSeq 2500 v4?

How much will this cost Illumina?

What next

First off it should be obvious that this paper need to be peer-reviewed and replicated by other groups (I’ll be speaking to our Bioinformatics core tomorrow about what we can do to more carefully assess data generated on our HiSeq 4000’s). If it is replicated then Illumina may have a big issue on their hands – potentially of the magnitude of the 2009ish problems with Paired-End kits. Whether this will have any impact on the investment communities bullish outlook for Illumina, or on sales of NextSeq remains to be seen. But if Illumina need to repeat the sequencing for the probably 100+ SMART-Seq studies already published, let alone being prepped right now, and repeat this on NextSeq there’ll be a big bill.

NextSeq costs £5.71 per million reads compared to HiSeq 4000’s £1.69.

What about other methods

The Sinha et al paper reported on single-cell RNA-seq with the SMART-Seq method which uses plate-based library prep with dual-indexes. Any dual-index prep where indexes are not unique-at-both-ends is likely to suffer to some extent. The safest bet is to use unique index combinations, but that’s not great for people running highly-multiplexed RNA-Seq, Exome-Seq or similar.

I’m still thinking about the impact on experiments but for much of our bulk RNA-Seq, ChIP-seq, sWGS, and as many other experiments as we can manage, we run 3, 4 or even up to 6 replicates, or we’re running tens or hundreds of clinical samples as biological replicates. In these cases index swapping should be massaged out in the data analysis – we’re only going to proceed with variants that are robust.

I’d like to see Illumina, and other library-prep kit manufacturers, produce unique-at-both-ends adapter sets. For me this is still the most powerful way to remove index-switched samples before downstream analysis. Once this is broadly available (NEB, Agilent, Bioo, IDT, etc, etc, etc) this might just be a headache that goes away.

I’ll follow up on this post in a couple of weeks. I’m sure this story is just going get hotter until there are some clear mitigations in place.

By James| April 10th, 2017|Categories: Uncategorized|8 Comments

About the Author: James

Permalink
Gallery

#AGBT23: what was hot in the hottest NGS tech conference?

February 16th, 2023 | 0 Comments
Permalink
Gallery

@Illumina and @Qiagen’s IVD deal: what does it mean for NGS?

October 8th, 2019 | 1 Comment
Permalink
Gallery

Will @Illumina succeed in buying @PacBio?

July 25th, 2019 | 8 Comments
Permalink
Gallery

@Illumina moving into baseball?

June 19th, 2019 | 1 Comment
Permalink
Gallery

#CRISPR Diagnostics part 2

April 18th, 2019 | 0 Comments

8 Comments

Right reads, wrong index? Concerns with data from Illumina’s HiSeq 4000 | 2017-04-11 at 8:41 pm - Reply

[…] where a handful of erroneously indexed reads could have a big impact on inferences. (Blog posts here and here offer additional perspective on how big a deal this might turn out to be.) We’ll follow […]
- James 2017-04-13 at 3:15 pm - Reply
  
  Thanks for pointing this post out (Ethan?) another one to read which covers the BioRxiv paper is from UCDavis: http://dnatech.genomecenter.ucdavis.edu/2017/04/11/update-on-barcode-mis-assignment-issue
- art vandelay 2017-04-14 at 8:59 pm - Reply
  
  If the mechanism as described in Sinha is correct, then this would likely only be a problem with libraries amplified using indexed primers such as Nextera. Unamplified “TruSeq” libraries (unless they contain some unligated Y-adapters, which could serve as primers during the index swapping reaction) would probably suffer less from this. Also, libraries amplified using non-indexed primers such as amplified “TruSeq” libraries would likely not be affected by this mechanism.
  
  How about an Exonuclease I treatment of the amplified library just before the final SPRI cleanup to destroy any remaining primers?
  - James 2017-04-21 at 6:51 am - Reply
    
    You’re right to be thinking along the lines of Exo treatment to remove excess free primer – remember ExoSAP for Sanger cleanups! I’ve been thinking how we might incorporate UNG into our NGS applications for a while. This was initially to remove all library after prep from the lab by starting each new reaction with a UNG step (similar to ABI AmpErase chemistry), or by ordering index primers/adapters with Uracils so they could be degraded before pooling (extended molecules would not contain Uracil).
Gavin Sherlock 2017-04-12 at 9:53 pm - Reply

To be honest, my tweet was more a play on words, rather than intending to indicate that I’d done any kind of analysis of our data, or that my lab is affected (we might be, but we’re still looking into it. Guess I should be more careful about what I tweet!
- James 2017-04-13 at 3:17 pm - Reply
  
  Thanks for commenting Gavin.
  
  I loved the Tweet and it certainly stood out from some of the others! Don’t stop playing with those words…set them free!
  
  James.
BTCL 2017-04-14 at 12:17 am - Reply

The extent of this effect should be quite easy to figure out empirically with one’s own runs with the appropriate controls. Like you said, PhiX contamination in libraries is the obvious choice for checking the extent of this effect.

The big question is: how much contamination is large enough to worry about? For variant calling it would obviously set a lower threshold on somatic variants. But for applications like RNA-Seq people are interested in large effects — even in single cell RNA-Seq experiments people are looking for huge changes because of the inherent lack of sensitivity.
- James 2017-04-21 at 6:52 am - Reply
  
  It would really help if PhiX was uniquely indexed at both ends as then it would be simple to measure the extent of index-swapping in almost any experiment. Unfortunately it is not.

Twitter storm?

What now:

What next

What about other methods

Like this:

Related

About the Author: James

8 Comments

Leave A Comment Cancel reply

Archive

Categories