What are Unique Molecular Identifiers (UMIs)?
Unique Molecular Identifiers (UMIs) are short random nucleotide sequences that are added during library generation (prior to PCR). UMIs act as tags that allow the accurate identification of PCR duplicates in sequencing data.
PCR duplicates can arise during library amplification if too many cycles are used and can therefore generate bias in sequencing read counts for downstream analysis of gene expression levels. Removing PCR duplicates can therefore remove any such biases.
When RNA-Seq libraries are prepared from more restricted regions of the transcript (e.g., in amplicon sequencing, and also in QuantSeq – which generates a 3' tag per transcript), there is a higher chance that identical priming occurs on unique transcripts or first strand cDNA molecules, than for whole-transcriptome RNA-Seq protocols involving RNA fragmentation.
This results in sequencing reads originating from unique transcripts having identical mapping coordinates and sequences. Without UMIs, these reads would be wrongly classified as PCR duplicates and collapsed, resulting in inaccurate read count data.
Including UMIs during library generation clearly distinguishes unique priming events from PCR duplicates, and allows for accurate de-duplication of sequencing reads.
For QuantSeq FWD library prep, UMIs are added during second strand synthesis. With a length of 6 nucleotides, a total of 46, or 4,096, UMIs are possible.