In this post I describe an idea for using SNP fingerprinting for single-cell RNA-Seq to identify which sample each individual cell comes from in a multiplexed library prep. The reason for my thinking about this is that single-cell experiments are expensive. Although the cost-per-cell can be very low, most sample prep for single-cell RNA-Seq is around £1000. This is prohibitive for many labs so methods that reduce the cost per sample are likely to be attractive.
In a 2012 paper Andy Lynch in our Directors research group published a tool called BADGER. The tool was developed for a 2000 sample microarray project, which used Affy SNP6.0 and Illumina HT12 arrays for copy-number and gene-expression analysis. BADGER uses gene expression data to predict SNPs, based on eQTLS, and these predicted SNP calls were used to confirm that array data (and therefore DNA and RNA) came from the same patient sample. This method helped identify sample mix-ups. Today there are many genotyping-by-sequencing applications (see here, here and on Illumina’s website) which use amplicons, or enrichment, to identify a sample based on SNP calls. Novel methods to fingerprint cells using magnets in flow, or to assign cell stage show how much more can be got from single cell analysis.
Single-cell RNA-Seq fingerprinting
My idea would be to add a relatively small number of compatible oligos to the oligo-pools used in 10X, Drop-Seq, etc (SMART-Seq generates full-length cDNA so should already be capable of generating this information) to capture a panel of bi-allelic coding SNPs as part of the standard single-cell GX workflow.
If production of these targeted oligos is problematic in the split-pool synthesis used to create single-cell oligos it might be possible to produce a polyadenylated locus specific primer that would first amplify the locus and then be amplified by the single-cell 3’mRNA-Seq oligo-dT primers used for droplet sequencing.
Analysis of the SNP targeting reads would allow identification of which sample a cell came from. These cells would also have the standard cell-identification and unique molecular identifier barcodes as well. Assuming this works we would be able to identify cells based on their genotype as well as their expression signature. I believe the impact of this would be significant:
- It could allow confirmation of single-cells per droplet and reduce/remove cell-duplicates as an issue in single-cell experiments
- It could enable population studies buy allowing multiple samples to be run in a single GEMcode reactio
- It could reduce costs by allowing multiple samples to be pooled for single-cell analysis
- It could possibly be used to fingerprint specific somatic variants (e.g. TP53, KRAS, EGFR) to confirm if a single-cell were mutant or wild type allowing much closer inspection of tumours and their normal stroma and immune cell infiltrates
Better experiments, and cost-savings too
The ability to multiplex samples in single-cell RNA-Seq would have a dramatic impact on the quality and cost of experiments. Quality would be increased by removing many of the confounding effects in scRNA-Seq experiments. Hicks et al describe the problem in their 2015 BioRxiv paper and suggest replication as the main mitigating factor because “standard balanced experimental designs are not possible”. By combining the multiplexing of samples in a run and replication of the whole experiment it should be possible to get much more robust scRNA-Seq data.
Comparison of Fluidigm InDrop and 10X single cell systems
The cost savings could be very significant too. As an example I considered an 8 sample experiment which on 10X Genomics today would cost £9,600 (£1200 per sample) for the library prep to capture 1000 cells with a 1% doublet rate. By multiplexing all 8 samples into a single library we’d decrease costs 8-fold. It should also be possible to capture higher numbers of cells by using the SNP fingerprinting to remove doublets. Because doublets are essentially random there should be a higher probability of doublets forming from two different samples, than the same sample. As such it should be possible to load the maximum number of cells and still achieve an effective doublet rate of 1%. This would allow a cheaper experiment to achieve better cell coverage. Sounds almost too good to be true!