Indexed sequencing is vital to delivering cost-effective and statistically robust experiments. Nearly all non-WGS projects are indexed to some degree, so understanding how the indexing works is useful; fortunately, Illumina produced this handy guide for users: Indexed sequencing overview. After indexed sequencing, our reads are demultiplexed into sample-specific FASTQ files, and the reads that could not be assigned to an index in the samplesheet are dumped into a “lost reads” file. However, users often ask us what these “lost” reads actually are.
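
To make the demultiplexing step concrete, here is a minimal Python sketch of how a read's index pair might be matched against the samplesheet. It is an illustration only, not the logic of any real demultiplexer: the function names, the sample names, the example index sequences, and the choice to count mismatches separately for each index read are all assumptions.

```python
def hamming(a, b):
    """Count the positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def assign_sample(i7, i5, samplesheet, max_mismatches=1):
    """Return the sample name for an observed (i7, i5) pair, or None if "lost".

    samplesheet maps a sample name to the (i7, i5) pair entered for it.
    Mismatches are counted per index read (an assumption of this sketch).
    """
    hits = [
        name
        for name, (expected_i7, expected_i5) in samplesheet.items()
        if hamming(i7, expected_i7) <= max_mismatches
        and hamming(i5, expected_i5) <= max_mismatches
    ]
    # A read that matches no sample, or more than one, cannot be assigned.
    return hits[0] if len(hits) == 1 else None


# Hypothetical samplesheet with illustrative index sequences.
example_samplesheet = {
    "sampleA": ("ATTACTCG", "TATAGCCT"),
    "sampleB": ("TCCGGAGA", "ATAGAGGC"),
}

print(assign_sample("ATTACTCG", "TATAGCCT", example_samplesheet))  # exact match
print(assign_sample("ATTACTCC", "TATAGCCT", example_samplesheet))  # one i7 mismatch
print(assign_sample("AAAAAAAA", "AGATCTCG", example_samplesheet))  # no match: "lost"
```

Any read for which this kind of lookup returns nothing is what ends up in the “lost reads” file.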

Usually only a small percentage of reads are “lost”, but occasionally this can be 10% or higher. Remember that the “lost” reads are ones that could not be assigned to an index in the samplesheet, so looking carefully at the index sequences reported as lost usually enables us to work out what went wrong. There are three main issues we watch for:

  1. Samplesheet error: the most common problem is simply that the wrong index was entered into the samplesheet. Because of this, one of the indexes reported as being present will have zero reads, and one of the “lost” indexes will have the expected number of reads for the sample. It is usually obvious to the user once pointed out. It is easily fixed before the event (get your samplesheet right) but a little tricky to sort out afterwards.
  2. Unexpected indexes: usually the unexpected indexes are at very low levels compared to the sample and are caused by sequencing errors and the like. However, two cases can generate relatively high numbers of “lost” reads:
    1. PhiX: The PhiX control library does not have any index sequences and uses older adapters (Illumina should fix this for so many reasons; Seqmatic have an indexed version). The i7 indexing primer does not bind, and the absence of signal generates the AAAAAAAA i7 read on most (all?) sequencers. The HiSeq 4000 uses the grafted P5 index primer, which will typically read the AGATCTCG sequence from the PhiX adapter in the i5 read. The “lost reads” file contains a percentage of sequences, proportional to the PhiX loaded, with the AAAAAAAA,AGATCTCG index.
    2. Single-index contamination: Similarly to PhiX, but much less frequently, we can see indexes with the NNNNNNNN i5 and TCTTTCCC i7 reads. These have previously been identified as “contamination” of a dual-indexed library with one that is single-indexed (which wouldn’t have any i5 index, hence the NNNNNNNN sequence).
  3. Low-quality index sequencing: If something happens during index sequencing and a base is lost, or quality drops for some other reason, then more reads can appear in the “lost reads” file. If this goes above 10%, we’ll talk to the user first to find out whether their analysis will be affected either by lower yield or by increasing the number of index mismatches allowed. If there are likely to be problems, then we would ask Illumina to replace the run due to the lower than expected yield.
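
The checks above can be sketched in a few lines of Python: tally the index pairs found in the “lost reads” file, then label each pair. This is a rough sketch under stated assumptions, not our actual pipeline. It assumes each FASTQ header ends with the index as `i7+i5` after the last colon, the example headers are made up, and the 0.5 threshold for flagging a possible samplesheet error is arbitrary.

```python
from collections import Counter

# Index pairs as (i7, i5), taken from the descriptions in the list above.
PHIX_PAIR = ("AAAAAAAA", "AGATCTCG")
SINGLE_INDEX_PAIR = ("TCTTTCCC", "NNNNNNNN")


def tally_indexes(fastq_lines):
    """Count (i7, i5) pairs from the header line of each four-line FASTQ record.

    Assumes the header ends with ':i7+i5' (an assumption of this sketch).
    """
    counts = Counter()
    for n, line in enumerate(fastq_lines):
        if n % 4 != 0:  # only the first line of each record is a header
            continue
        index_field = line.strip().rsplit(":", 1)[-1]
        i7, _, i5 = index_field.partition("+")
        counts[(i7, i5)] += 1
    return counts


def classify(pair, count, expected_reads_per_sample, fraction=0.5):
    """Give a rough label for one "lost" index pair.

    fraction is an arbitrary cut-off: a lost pair with at least this share
    of a typical sample's reads is flagged as a possible samplesheet error.
    """
    if pair == PHIX_PAIR:
        return "PhiX"
    if pair == SINGLE_INDEX_PAIR:
        return "single-index contamination"
    if count >= fraction * expected_reads_per_sample:
        return "possible samplesheet error"
    return "low-level, likely sequencing error"


# Made-up records for illustration only.
example_lines = [
    "@SIM:1:FCX:1:1101:1000:2000 1:N:0:AAAAAAAA+AGATCTCG", "ACGT", "+", "IIII",
    "@SIM:1:FCX:1:1101:1001:2001 1:N:0:AAAAAAAA+AGATCTCG", "ACGT", "+", "IIII",
    "@SIM:1:FCX:1:1101:1002:2002 1:N:0:TCTTTCCC+NNNNNNNN", "ACGT", "+", "IIII",
]
example_counts = tally_indexes(example_lines)

for pair, count in example_counts.most_common():
    print(pair, count, classify(pair, count, expected_reads_per_sample=1000))
```

On a real run you would sort the tally by count and look at the top few pairs first. A high proportion of reads spread thinly across many different pairs, rather than concentrated in one or two, points towards the low-quality index sequencing case.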

PS: most of what I have written here is for HiSeq 4000 paired-end dual-indexed sequencing; read Illumina’s guide if you are doing something else!

PPS: Illumina’s adapter sequences are often useful to refer to when you find something unusual.