Doublet Detection Exercise
Here, we will explore how to identify and remove doublets (two cells captured in one droplet) from single-cell RNA-seq data using computational methods.
Dataset
We will work with the 10x Genomics 10k PBMCs from a Healthy Donor dataset (v3 chemistry). Because the doublet rate rises with the number of cells captured per channel, a 10k-cell run has a relatively high expected doublet rate (~8%).
Part 1: Initial Data Processing
Step 1: Load and explore the dataset
- Import the 10k PBMC dataset
- Check initial cell and gene counts
- Note: with ~10,000 cells recovered, a doublet rate of 6-8% corresponds to roughly 600-800 doublets
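The 600-800 figure is plain arithmetic: 600-800 doublets out of ~10,000 cells is a rate of 6-8%. A minimal sketch of that calculation follows; the function name `expected_doublets` is made up for this illustration and is not part of scanpy or Seurat.

```python
def expected_doublets(n_cells: int, doublet_rate: float) -> int:
    """Return the expected number of doublets, rounded to the nearest cell."""
    if not 0.0 <= doublet_rate <= 1.0:
        raise ValueError("doublet_rate must be a fraction between 0 and 1")
    return round(n_cells * doublet_rate)


# ~10,000 cells at the 6% and 8% rates discussed in this exercise
for rate in (0.06, 0.08):
    print(f"rate={rate:.0%}: ~{expected_doublets(10_000, rate)} doublets")
```

Once the dataset is loaded, substitute the actual cell count to get a number to compare against the tool's predictions later.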
Step 2: Standard preprocessing
- Calculate QC metrics (% mitochondrial, nGenes, nUMIs)
- Create QC plots:
  - Scatter plot: nUMIs vs. nGenes (colored by % mitochondrial)
  - Histogram of nUMIs per cell
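To make the three QC metrics concrete, here is a small pure-Python sketch that computes them for a single cell. It assumes human gene symbols, where mitochondrial genes start with "MT-"; the helper `qc_metrics` and the toy counts are invented for illustration. In practice, scanpy's `sc.pp.calculate_qc_metrics` or Seurat's `PercentageFeatureSet` compute these across all cells at once.

```python
def qc_metrics(counts: dict, mito_prefix: str = "MT-") -> dict:
    """Compute nUMIs, nGenes and % mitochondrial for one cell.

    counts maps gene symbol -> UMI count for a single cell.
    """
    n_umis = sum(counts.values())
    # A gene counts as detected if it has at least one UMI
    n_genes = sum(1 for c in counts.values() if c > 0)
    mito = sum(c for gene, c in counts.items() if gene.startswith(mito_prefix))
    pct_mito = 100.0 * mito / n_umis if n_umis else 0.0
    return {"nUMIs": n_umis, "nGenes": n_genes, "pct_mito": pct_mito}


# Toy cell (made-up counts, not taken from the PBMC dataset)
cell = {"CD3D": 5, "MT-CO1": 2, "MT-ND1": 3, "CD79A": 0, "ACTB": 10}
print(qc_metrics(cell))
```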
Step 3: Process without doublet removal
- Filter cells using standard thresholds
- Normalize, find variable genes, scale
- Run PCA → UMAP → clustering
- Save this object as “pre_doublet_removal”
Part 2: Doublet Detection
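Step 1: Run a doublet detection tool
Both tools below expect raw (unnormalized) counts, so run them on the count matrix rather than on normalized or scaled data.
For Seurat users:
Install scDblFinder (Bioconductor) and use it according to the package documentation. scDblFinder works on a SingleCellExperiment object or a count matrix, so convert your Seurat object first (for example, with as.SingleCellExperiment).
For scanpy users:
Use scrublet via the built-in scanpy wrapper (newer scanpy releases also expose this function as scanpy.pp.scrublet):
scanpy.external.pp.scrublet(adata, expected_doublet_rate=0.08)
Step 2: Analyze doublet predictions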
- Add doublet scores/classifications to your object
- Create visualizations:
  - UMAP colored by doublet score (continuous)
  - UMAP colored by doublet classification (binary)
  - Violin plot of nUMIs split by singlet/doublet
  - Histogram of doublet scores with threshold line
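The binary classification is the continuous score cut at a threshold, and doublets usually carry more UMIs than singlets because they contain RNA from two cells. The sketch below shows both ideas on toy numbers. The threshold of 0.25 and all values are made up; in a real run the scores and calls come from the tool (for the scanpy wrapper, the `doublet_score` and `predicted_doublet` columns of `adata.obs`).

```python
from statistics import median


def classify_doublets(scores, threshold):
    """Call a cell a doublet when its score exceeds the threshold."""
    return [s > threshold for s in scores]


def median_umis_by_class(n_umis, is_doublet):
    """Median nUMIs per class, i.e. the centre line of each violin."""
    groups = {"singlet": [], "doublet": []}
    for umis, dbl in zip(n_umis, is_doublet):
        groups["doublet" if dbl else "singlet"].append(umis)
    # Skip a class that has no cells, since median() fails on empty input
    return {name: median(vals) for name, vals in groups.items() if vals}


# Toy values (made up for illustration)
scores = [0.05, 0.10, 0.62, 0.08, 0.71]
n_umis = [4000, 5200, 9800, 4500, 11000]
calls = classify_doublets(scores, threshold=0.25)
print(calls)
print(median_umis_by_class(n_umis, calls))
```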
Part 3: Biological Validation
Step 1: Check for heterotypic doublets
Look for cells expressing mutually exclusive markers:
- T cell + B cell markers: CD3D + CD79A
- T cell + monocyte markers: CD3D + CD14
- B cell + NK cell markers: CD79A + NKG7
Create feature plots:
- Plot these marker pairs
- Overlay predicted doublets
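Besides plotting, the marker-pair check can be done by counting cells in which both genes of a pair are detected. A minimal sketch, assuming each cell is represented as a gene-to-count mapping: the helper `coexpressing_cells`, the `min_count` cutoff of 1 UMI, and the toy cells are all illustrative choices. With real data, a higher cutoff may be needed because ambient RNA can produce low-level co-expression in singlets.

```python
# Marker pairs from the list above
MARKER_PAIRS = [("CD3D", "CD79A"), ("CD3D", "CD14"), ("CD79A", "NKG7")]


def coexpressing_cells(cells, pairs=MARKER_PAIRS, min_count=1):
    """Map each marker pair to the IDs of cells expressing both genes.

    cells maps cell ID -> {gene symbol: UMI count}.
    """
    hits = {pair: [] for pair in pairs}
    for cell_id, counts in cells.items():
        for a, b in pairs:
            if counts.get(a, 0) >= min_count and counts.get(b, 0) >= min_count:
                hits[(a, b)].append(cell_id)
    return hits


# Toy cells (made-up counts)
cells = {
    "cell_1": {"CD3D": 4},                # T cell only
    "cell_2": {"CD79A": 6},               # B cell only
    "cell_3": {"CD3D": 3, "CD79A": 5},    # candidate T + B doublet
    "cell_4": {"CD3D": 2, "CD14": 7},     # candidate T + monocyte doublet
}
for pair, ids in coexpressing_cells(cells).items():
    print(pair, ids)
```

Comparing the flagged cell IDs with the predicted doublets shows how many heterotypic candidates the tool caught.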
Step 2: Examine clustering patterns
- Which clusters have the highest proportion of predicted doublets?
- Do any clusters consist mainly of doublets?
- Calculate doublet percentage per cluster
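The per-cluster percentage is the number of predicted doublets in a cluster divided by the cluster size. A short sketch, assuming two parallel lists (cluster label and doublet call per cell); the function name and the toy labels are invented for illustration.

```python
from collections import defaultdict


def doublet_pct_per_cluster(clusters, is_doublet):
    """Return {cluster label: % of its cells predicted as doublets}."""
    totals = defaultdict(int)
    doublets = defaultdict(int)
    for cluster, dbl in zip(clusters, is_doublet):
        totals[cluster] += 1
        doublets[cluster] += int(dbl)
    return {c: 100.0 * doublets[c] / totals[c] for c in totals}


# Toy labels (made up): cluster "2" is mostly doublets
clusters = ["0", "0", "0", "0", "1", "1", "2", "2", "2", "2"]
is_doublet = [False, False, False, True, False, False, True, True, True, False]
pct = doublet_pct_per_cluster(clusters, is_doublet)
for cluster in sorted(pct, key=pct.get, reverse=True):
    print(f"cluster {cluster}: {pct[cluster]:.1f}% doublets")
```

Sorting by percentage puts the clusters that may consist mainly of doublets at the top.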
Part 4: Remove Doublets and Re-analyze
Step 1: Filter doublets
- Remove cells classified as doublets
- Document how many cells were removed
Step 2: Re-process cleaned data
- Re-run the full pipeline (normalize → HVG → PCA → UMAP → cluster)
- Save this object as “post_doublet_removal”
Step 3: Compare before and after
Create side-by-side comparisons:
- Number of cells retained
- UMAP plots (before/after)
- Number of clusters
- Marker gene expression clarity (how cleanly marker genes separate clusters)
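The numeric parts of this comparison (cells retained, number of clusters) can be collected in one small summary. The sketch below assumes the cluster labels of the two saved objects are available as plain lists; the helper names and the toy labels are illustrative.

```python
def summarize(label, clusters):
    """Basic size summary for one clustering result."""
    return {
        "object": label,
        "n_cells": len(clusters),
        "n_clusters": len(set(clusters)),
    }


def compare(before_clusters, after_clusters):
    """Summarize both objects and how many cells were removed."""
    before = summarize("pre_doublet_removal", before_clusters)
    after = summarize("post_doublet_removal", after_clusters)
    removed = before["n_cells"] - after["n_cells"]
    pct = 100.0 * removed / before["n_cells"] if before["n_cells"] else 0.0
    return {
        "before": before,
        "after": after,
        "cells_removed": removed,
        "pct_removed": pct,
    }


# Toy labels (made up): a small doublet-driven cluster disappears
before = ["0", "0", "0", "0", "1", "1", "1", "1", "2", "2"]
after = ["0", "0", "0", "0", "1", "1", "1", "1"]
print(compare(before, after))
```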
Part 5: Critical Evaluation
Discuss with a neighbor:
- What percentage of cells were identified as doublets?
- Were doublets randomly distributed or concentrated in specific areas?
- Did doublet removal reveal any new cell populations?
- Did any clusters disappear after doublet removal?
- How did doublet scores correlate with QC metrics (nUMIs, nGenes)?
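To put a number on the last question, one option is a Pearson correlation between doublet scores and a QC metric. This is a minimal hand-written sketch and only one possible choice; a rank-based correlation may suit skewed scores better. The toy values are made up.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sd_x * sd_y)


# Toy values (made up): doublet score and nUMIs per cell
scores = [0.05, 0.10, 0.62, 0.08, 0.71]
n_umis = [4000, 5200, 9800, 4500, 11000]
print(f"r(score, nUMIs) = {pearson(scores, n_umis):.2f}")
```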