Next-generation sequencing is fantastic technology and its use has revolutionised our understanding of biology, but it is not perfect, multiple issues occur in every lab from sample extraction through to the actual sequencing. Not all of these are well enough understood to be safely ignored and in this post I’m going to talk about one that I’m trying to better understand right now – duplication of sequences in datasets, and in particular Exclusion Amplicifation duplication on HiSeq 4000.

Sources of read duplicates in Illumina data – courtesy Illumina 2016

What are duplicates in NGS data: There are several types of duplication that can occur in an NGS experiment (see graphic from Illumina above) and understanding the difference between these should concern us if we are going to “fix” the problem in the lab or bioinformatically:

  1. true biological duplication – a biological sample contains millions of cells and therefore millions of copies of the genome, so in a PCR-free whole genome sequencing library virtually no sequences should occur more than once in any lane. Even for PCR-amplified whole genome sequencing libraries most sequences will occur appear once in the lane. Low level duplication may be due to repeats, but high levels of duplication would be suggestive of a technical issue e.g. PCR over amplification or target enrichment.
  2. PCR and enrichment duplicates – the easiest for everyone to get their head around. PCR amplifies DNA by copying the original molecules so you get duplicates. Start with a high enough input and you should not notice these PCR copies, but if you start with a very low input then duplicate reads in your final sequencing lane are much more likely and may be a problem. If there are biases in your PCR this will also over-amplify some regions leading to duplication that can affect biological interpretation. And if you are targeting a specific region of the genome then you will expect to see more duplication…as you have fewer unique regions in your sample to be analysed.
  3. Optical duplicates – these are only a problem for HiSeq 2500/MiSeq/NextSeq data. They come from large clusters being called as two separate clusters, or from local re-clustering of the original library molecule once it is released from polymerase copying. A read is called as an optical duplicate if the pair of reads are both on the same tile, and the distance between reads is less than the distance set in Picard’s “Mark Duplicates”.
  4. Exclusion Amplification duplicates (ExAmp) – these are only a problem for HiSeq 4000/X patterned flowcell data. They come from local re-clustering of the original library molecule once it is released from polymerase copying. Picard’s “Mark Duplicates” can still be used to call optical duplicate‘s, and it would be great if these could be marked in teh SAM file separately from PCR duplicates.
A bit more on where ExAmp duplicates come from: The clustering on Illumina’s patterned flowcells is very different from the original Solexa/Manteia developed method. Rather than random clustering with spacing determined by the concentration of library molecules, clustering takes place in controlled nanowells. A library molecule “lands” and is clustered beofre a second molecule can arrive in that location, but if a second molecule arrive early enough then a polyclonal cluster is formed which will be discarded in the chastity filtering. See this video on Illumina’s website – but look carefully at 1 minute into the video, you can see the original library molecule leaving the clustering area and floating off to create havoc elsewhere on the flowcell!

The ExAmp duplication is happening because library molecules that have made one cluster are free to go back into solution and create a second cluster, and they tend to do this in close proximity to the original cluster. As such they can be counted in much the same way as optical duplicates.

How to reduce ExAmp duplication: There is a balance to be struck in clustering on patterned flowcells between high %PF data i.e. no polyclonal clusters, and low % duplication. Ideally we’d like both but patterend flowcells don’t allow this. As you increase the loading concentration of an Illumina library onto a patterened flowcell you increase the rate at which molecules land for clustering, this means you get more polyclonal clusters. But if you load your libraries at too low a concentration you give those library molecules that have clustered an opportunity to recluster in unoccupied nanowells and this means you get higher %duplicates. In our HiSeq 4000 training we discussed this in some depth but the simple fact is users need to make a choice between fewer duplicates or more polyclonal clusters.

Not all duplicates are created equal: But all is not lost as duplicates can be understood and removed or accounted for in your downstream analysis. Although tools like FASTQC presents a graph showing sequence duplication levels understanding and interpreting duplication is complicated for many users by the fact they are working with single-end sequence data, e.g. most RNA-seq or ChIP-seq datasets. In the worst cases up to 70% of your RNA-seq reads can be called as PCR duplicates. After you’ve paid for 20M reads per sample, getting 6M unique  reads might leave you feeling a little short changed to say the least! But are those 14M duplicates really something you need to worry about?
Duplication presented in FastQC
FASTQC only analyses the first 100,000 reads in your FASTQ and as at a sequence level there is no way to distinguish between biological or technical and both are be reported as duplicates in the chart above. A FASTQC warning or error is context dependant. For PCR-free gneomes it should be taken seriously, but for RNA-seq and similar applications it may be an unecessary worry. In fact the FASTQC team specificall mention this in their documentation “In RNA-Seq libraries sequences from different transcripts will be present at wildly different levels in the starting population. In order to be able to observe lowly expressed transcripts it is therefore common to greatly over-sequence high expressed transcripts, and this will potentially create large set of duplicates. This will result in high overall duplication in this test, and will often produce peaks in the higher duplication bins. This duplication will come from physically connected regions, and an examination of the distribution of duplicates in a specific genomic region will allow the distinction between over-sequencing and general technical duplication, but these distinctions are not possible from raw fastq files. A similar situation can arise in highly enriched ChIP-Seq libraries although the duplication there is less pronounced. Finally, if you have a library where the sequence start points are constrained (a library constructed around restriction sites for example, or an unfragmented small RNA library) then the constrained start sites will generate huge dupliction levels which should not be treated as a problem, nor removed by deduplication. In these types of library you should consider using a system such as random barcoding to allow the distinction of technical and biological duplicates. 
Identifying and dealing with duplicates: Fortunately most NGS analysis pipelines have options (usually recommended) to remove or mark duplicates them in downstream analysis e.g. the widely used Picard’s MarkDuplicates. This marks reads as duplicates based on their genome co-ordinates after comparing the alignment of their 5′ ends, reads with identical 5′ ends are duplicates and only the highest quality read is unmarked. Duplicate inserts are marked in the SAM file, allowing downstream GATK tools to exclude duplicates from analyses (most do this by default). But you should understand the impact of duplication, and/or the impact of removing duplicates on your experiments before you decide to use this information. Tere are lots of conversations out there to help – on SEQanswers Malachai Griffith and Heng Li made several comments on the thread Removing duplicates is it really necessary? And in a 2010 Genome Biology exome paper the authors showed data from both single end and paired end sequencing. Marking duplicates in both showed very different results of ~28% and ~8% duplicates respectively. But when they reanalysed paired-end data as single-end the % duplication jumped back up. This is becasue of how the duplicates are being marked (see Picard explanation below), and as such users running single-end experiments e.g. RNA-seq should be very careful not to throw out lots of useable data in the mistaken belief they have too much PCR, or other dupliction.

Picard “MarkDuplicates”: Picard MarkDuplicates does not distinguish between different types of biological or PCR duplication. But it can help identify if a sequencing dataset has a problem with optical, and probably ExAmp, duplicates allowing the user to proceed with caution. Ideally I’d like to be able to remove both optical and ExAmp duplicates for downstream analysis, by marking them separately in the SAM format.

You would not be expecting a highly diverse library if you’re sequencing targeted amplicons, or might expect a reduction in diversity if sequencing a low-input genome library. And both RNA-seq and CHIP-seq involve a lot of PCR amplification that may not necessarily impact downstream analysis and interpretation. But unless you are certain that duplicate marking is a waste of time, I’d recommend you do it anyway.

However whether you should remove those apparent duplicates is not necessarily clear. We don’t for our standard RNA-seq pipeline. And I’m going to focus my next post on how we decided not to remove duplicates in most RNA-seq experiments.