Dataset Integration Exercise
We will integrate two PBMC datasets from different 10x Chromium chemistry versions to observe and correct batch effects.
Part 1: Data Acquisition
- Navigate to the 10x Genomics dataset page
- Download these two PBMC datasets (Note: you need to enter some information including an email address prior to being able to download data):
- Download the “Feature / cell matrix (filtered)” for each dataset
Part 2: Initial Processing Without Integration
Step 1: Load and combine datasets
- Import both datasets into your environment
- Label each dataset with its chemistry version (add metadata column: “batch” or “chemistry”)
- Filter cells (min genes, max genes, % mitochondrial)
- Merge/concatenate into a single object
Step 2: Standard preprocessings
- Normalize and log-transform
- Find highly variable genes
- Run PCA
Step 3: Clustering and visualization
- Compute nearest neighbors
- Run Leiden/Louvain clustering
- Generate UMAP
- Create visualizations:
- UMAP colored by batch/chemistry version
- UMAP colored by clusters
- Feature plots for key PBMC markers:
CD3D (T cells), CD14 (Monocytes), CD79A (B cells), NKG7 (NK cells)
Part 3: Batch Correction
For Seurat users:
- Use
IntegrateData() with default parameters
- Follow the standard Seurat integration workflow
For scanpy users:
- Use
scanpy.external.pp.harmony_integrate()
- Run on the PCA space you computed earlier
Part 4: Post-Integration Analysis
Step 1: Re-process integrated data
- Re-run clustering on integrated data
- Generate new UMAP from integrated embeddings
- Create same visualizations as before:
- UMAP colored by batch
- UMAP colored by clusters
- Feature plots for the same marker genes
Step 2: Quantitative evaluation
Calculate the integration Local Inverse Simpson’s Index (iLISI):
- For R users: Use the
lisi package
# Install: devtools::install_github("immunogenomics/lisi")
library(lisi)
res <- compute_lisi(embedding_matrix, metadata_df, "batch")
- For python users: Copy the function below and use it to calculate iLISI. You may need to first install
sklearn using pip install sklearn.
def calculate_lisi(adata, batch_key, embedding_key='X_pca', k=30):
from sklearn.neighbors import NearestNeighbors
import numpy as np
# Subsample to smallest batch size
batch_counts = adata.obs[batch_key].value_counts()
print(batch_counts)
min_size = batch_counts.min()
balanced_cells = []
for batch in batch_counts.index:
batch_cells = adata.obs[adata.obs[batch_key] == batch].index
sampled = np.random.choice(batch_cells, min_size, replace=False)
balanced_cells.extend(sampled)
# Use balanced subset
adata_subset = adata[balanced_cells]
X = adata_subset.obsm[embedding_key]
batch_labels = adata_subset.obs[batch_key].values
nbrs = NearestNeighbors(n_neighbors=k+1).fit(X)
distances, indices = nbrs.kneighbors(X)
lisi_scores = []
for i in range(len(X)):
neighbor_batches = batch_labels[indices[i][1:]] # exclude self
unique_batches, counts = np.unique(neighbor_batches, return_counts=True)
proportions = counts / len(neighbor_batches)
simpson_index = np.sum(proportions ** 2)
lisi_scores.append(1 / simpson_index)
return np.mean(np.array(lisi_scores))
- iLISI interpretation:
- Values close to 1 = poor mixing (batch effect remains)
- Values close to 2 = good mixing (successful integration)
Part 5: Comparison & Discussion
Discuss with a neighbor:
- How did cells cluster before vs. after integration?
- Are cell types now mixed across batches?
- What is your iLISI score? Does it support your visual assessment?
- Can you identify the major PBMC cell types after integration?
Bonus Challenges
- Try different integration parameters (e.g., different numbers of integration features)
- Compare with another integration method if time permits
- Calculate additional metrics like the k-nearest neighbor Batch Effect Test (kBET)