· Single-Cell Data Analysis Course @Charité

! currently experimental, not fully tested

Trajectory Inference Exercise

Objective

Learn to infer and analyze cellular differentiation trajectories using pseudotime analysis to understand developmental processes and cell fate decisions.

Dataset

10x Genomics Mouse Bone Marrow Mononuclear Cells

Download link: Human Bone Marrow Mononuclear Cells
Download the “Feature / cell matrix (filtered)”
This dataset contains hematopoietic progenitors and cells at various differentiation stages with clear developmental trajectories

Part 1: Data Preparation

Step 1: Load and preprocess

Import the bone marrow dataset
Perform standard QC filtering (min genes, max genes, % mitochondrial)
Normalize and log-transform
Find highly variable genes
Scale data
Run PCA (use 30-50 components)

Step 2: Initial clustering and visualization

Compute neighbors (k=30)
Run Leiden/Louvain clustering
Generate UMAP
Identify major lineages using key markers:
Myeloid: LYZ, CSF1R, CD14, CD68
Erythroid: GATA1, KLF1, HBB, GYPA
B cells: CD79A, CD79B, MS4A1, CD19
T cells: CD3D, CD3E, CD4, CD8A
Stem/Progenitors: CD34, KIT, PROM1, THY1

Step 3: Subset data for trajectory analysis

Focus on myeloid differentiation (monocyte/macrophage lineage)
Select clusters expressing LYZ and early markers like KIT
Re-run PCA on subset (20-30 components)
Re-compute neighbors and UMAP

Part 2: Trajectory Inference

For Python/scanpy users (Diffusion Pseudotime):

Compute diffusion map and pseudotime:

Run sc.tl.diffmap(adata_subset) to compute diffusion map
Identify root cell with high CD34/KIT and low LYZ/CD14
Set root: adata.uns['iroot'] = root_cell_index
Compute DPT: sc.tl.dpt(adata_subset)
Visualize: sc.pl.umap(adata_subset, color=['dpt_pseudotime'])

For R/Seurat users (Monocle3):

Install and load required packages:

Install: BiocManager::install('monocle3')
Install: remotes::install_github('satijalab/seurat-wrappers')
Load libraries: library(SeuratWrappers) and library(monocle3)

Convert and prepare data:

Convert to Monocle3: cds <- as.cell_data_set(seurat_subset)
Preprocess: cds <- preprocess_cds(cds, num_dim = 30)
Reduce dimensions: cds <- reduce_dimension(cds)
Cluster cells: cds <- cluster_cells(cds)
Learn trajectory: cds <- learn_graph(cds)

Part 3: Ordering Cells in Pseudotime

Step 1: Select root cells

Identify starting point of trajectory:

Look for cells with high stem/progenitor markers (CD34, KIT, PROM1)
Low expression of differentiated markers (LYZ, CD14, CD68)

For Python/scanpy users:

Find cells with high CD34 expression: check with sc.pl.umap(adata, color='Kit')
Identify appropriate root cell index
Set root: adata.uns['iroot'] = selected_index

For R/Monocle3 users:

Interactive selection: cds <- order_cells(cds) (opens plot for clicking)
Or programmatic: identify high CD34 cells and use order_cells(cds, root_cells = high_kit_cells)

Step 2: Visualize pseudotime

Create visualizations:

UMAP colored by pseudotime
UMAP colored by cluster
Pseudotime distribution histogram
Original markers overlaid on trajectory

Part 4: Gene Dynamics Along Pseudotime

Select key genes and create visualizations:

Early markers (should decrease): CD34, KIT, PROM1, FLT3
Mid markers (should peak midway): CEBPA, SPI1 (PU.1), MPO
Late markers (should increase): LYZ, CD14, CD68, CSF1R, ITGAM (CD11b)

Depending on your method/environment, check what other options for visualizations are available and try them out.

Part 5: Biological Interpretation

Validate trajectory makes biological sense:

Check marker progression:
- Do stem markers decrease along pseudotime?
- Do differentiation markers increase?
- Are intermediate states present?
Compare with known biology:
- Does trajectory match known myeloid development?
- Are transcription factors expressed at expected times?
Discuss to what extent it makes sense in systems like this to divide cells into discrete populations. What are the advantages and disadvantages?